Version: 1.0.7

Analyze Text with SynapseML and Azure AI Language

Azure AI Language is a cloud-based service that provides Natural Language Processing (NLP) features for understanding and analyzing text. Use this service to help build intelligent applications using the web-based Language Studio, REST APIs, and client libraries. You can use SynapseML with Azure AI Language for named entity recognition, language detection, entity linking, key phrase extraction, PII entity recognition, and sentiment analysis.

from synapse.ml.services.language import AnalyzeText
from synapse.ml.core.platform import find_secret

ai_service_key = find_secret(
    secret_name="ai-services-api-key", keyvault="mmlspark-build-keys"
)
ai_service_location = "eastus"

Named Entity Recognition

Named Entity Recognition (NER) is one of the features offered by Azure AI Language, a collection of machine learning and AI algorithms in the cloud for developing intelligent applications that involve written language. The NER feature can identify and categorize entities in unstructured text, such as people, places, organizations, and quantities. Refer to this article for the full list of supported languages.

df = spark.createDataFrame(
    data=[
        ["en", "Dr. Smith has a very modern medical office, and she has great staff."],
        ["en", "I had a wonderful trip to Seattle last week."],
    ],
    schema=["language", "text"],
)

entity_recognition = (
    AnalyzeText()
    .setKind("EntityRecognition")
    .setLocation(ai_service_location)
    .setSubscriptionKey(ai_service_key)
    .setTextCol("text")
    .setOutputCol("entities")
    .setErrorCol("error")
    .setLanguageCol("language")
)

df_results = entity_recognition.transform(df)
display(df_results.select("language", "text", "entities.documents.entities"))

This cell should yield a result that looks like:

language	text	entities
en	Dr. Smith has a very modern medical office, and she has great staff.	[{"category": "Person", "confidenceScore": 0.98, "length": 5, "offset": 4, "subcategory": null, "text": "Smith"}, {"category": "Location", "confidenceScore": 0.79, "length": 14, "offset": 28, "subcategory": "Structural", "text": "medical office"}, {"category": "PersonType", "confidenceScore": 0.85, "length": 5, "offset": 62, "subcategory": null, "text": "staff"}]
en	I had a wonderful trip to Seattle last week.	[{"category": "Event", "confidenceScore": 0.74, "length": 4, "offset": 18, "subcategory": null, "text": "trip"}, {"category": "Location", "confidenceScore": 1, "length": 7, "offset": 26, "subcategory": "GPE", "text": "Seattle"}, {"category": "DateTime", "confidenceScore": 0.8, "length": 9, "offset": 34, "subcategory": "DateRange", "text": "last week"}]
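The `offset` and `length` fields of each entity locate the match inside the input text. The following is a minimal sketch, not part of SynapseML, that assumes one result row has been collected to the driver as plain Python dictionaries; the helper name `entity_spans` is hypothetical. For ASCII text such as this sample, slicing the string reproduces each entity's `text` field. For non-ASCII text, the service's offsets may not line up with Python's code-point indexing.

```python
# First sample row from the expected output above, trimmed to the fields used here.
text = "Dr. Smith has a very modern medical office, and she has great staff."
entities = [
    {"category": "Person", "offset": 4, "length": 5, "text": "Smith"},
    {"category": "Location", "offset": 28, "length": 14, "text": "medical office"},
    {"category": "PersonType", "offset": 62, "length": 5, "text": "staff"},
]


def entity_spans(text, entities):
    """Return (category, substring) pairs recovered from offset and length."""
    return [
        (e["category"], text[e["offset"] : e["offset"] + e["length"]])
        for e in entities
    ]


spans = entity_spans(text, entities)
for category, span in spans:
    print(f"{category}: {span}")
```

Recovering spans this way is useful when you want to highlight or replace entities in the original string instead of relying only on the returned `text` values.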

LanguageDetection

Language detection can detect the language a document is written in. It returns a language code for a wide range of languages, variants, dialects, and some regional/cultural languages. Refer to this article for the full list of supported languages.

df = spark.createDataFrame(
    data=[
        ["This is a document written in English."],
        ["这是一份用中文写的文件"],
    ],
    schema=["text"],
)

language_detection = (
    AnalyzeText()
    .setKind("LanguageDetection")
    .setLocation(ai_service_location)
    .setSubscriptionKey(ai_service_key)
    .setTextCol("text")
    .setOutputCol("detected_language")
    .setErrorCol("error")
)

df_results = language_detection.transform(df)
display(df_results.select("text", "detected_language.documents.detectedLanguage"))

This cell should yield a result that looks like:

text	detectedLanguage
This is a document written in English.	{"name": "English", "iso6391Name": "en", "confidenceScore": 0.99}
这是一份用中文写的文件	{"name": "Chinese_Simplified", "iso6391Name": "zh_chs", "confidenceScore": 1}

EntityLinking

Entity linking identifies and disambiguates the identity of entities found in text. For example, in the sentence "We went to Seattle last week.", the word "Seattle" would be identified, with a link to more information on Wikipedia. English and Spanish are supported.

df = spark.createDataFrame(
    data=[
        ["Microsoft was founded by Bill Gates and Paul Allen on April 4, 1975."],
        ["We went to Seattle last week."],
    ],
    schema=["text"],
)

entity_linking = (
    AnalyzeText()
    .setKind("EntityLinking")
    .setLocation(ai_service_location)
    .setSubscriptionKey(ai_service_key)
    .setTextCol("text")
    .setOutputCol("entity_linking")
    .setErrorCol("error")
)

df_results = entity_linking.transform(df)
display(df_results.select("text", "entity_linking.documents.entities"))

This cell should yield a result that looks like:

text	entities
Microsoft was founded by Bill Gates and Paul Allen on April 4, 1975.	[{"bingId": "a093e9b9-90f5-a3d5-c4b8-5855e1b01f85", "dataSource": "Wikipedia", "id": "Microsoft", "language": "en", "matches": [{"confidenceScore": 0.48, "length": 9, "offset": 0, "text": "Microsoft"}], "name": "Microsoft", "url": "https://en.wikipedia.org/wiki/Microsoft"}, {"bingId": "0d47c987-0042-5576-15e8-97af601614fa", "dataSource": "Wikipedia", "id": "Bill Gates", "language": "en", "matches": [{"confidenceScore": 0.52, "length": 10, "offset": 25, "text": "Bill Gates"}], "name": "Bill Gates", "url": "https://en.wikipedia.org/wiki/Bill_Gates"}, {"bingId": "df2c4376-9923-6a54-893f-2ee5a5badbc7", "dataSource": "Wikipedia", "id": "Paul Allen", "language": "en", "matches": [{"confidenceScore": 0.54, "length": 10, "offset": 40, "text": "Paul Allen"}], "name": "Paul Allen", "url": "https://en.wikipedia.org/wiki/Paul_Allen"}, {"bingId": "52535f87-235e-b513-54fe-c03e4233ac6e", "dataSource": "Wikipedia", "id": "April 4", "language": "en", "matches": [{"confidenceScore": 0.38, "length": 7, "offset": 54, "text": "April 4"}], "name": "April 4", "url": "https://en.wikipedia.org/wiki/April_4"}]
We went to Seattle last week.	[{"bingId": "5fbba6b8-85e1-4d41-9444-d9055436e473", "dataSource": "Wikipedia", "id": "Seattle", "language": "en", "matches": [{"confidenceScore": 0.17, "length": 7, "offset": 11, "text": "Seattle"}], "name": "Seattle", "url": "https://en.wikipedia.org/wiki/Seattle"}]
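Each linked entity carries a `url` and a list of `matches`, and each match has its own `confidenceScore`. As a rough sketch, assuming the entities for one row have been collected as Python dictionaries, the hypothetical helper `links_by_mention` below maps each matched mention to its URL and drops matches under a chosen threshold. The threshold of 0.4 is an arbitrary value for illustration, not a recommended setting.

```python
# Two entities from the first sample row above, trimmed to the fields used here.
linked_entities = [
    {
        "name": "Microsoft",
        "url": "https://en.wikipedia.org/wiki/Microsoft",
        "matches": [{"confidenceScore": 0.48, "text": "Microsoft"}],
    },
    {
        "name": "April 4",
        "url": "https://en.wikipedia.org/wiki/April_4",
        "matches": [{"confidenceScore": 0.38, "text": "April 4"}],
    },
]


def links_by_mention(entities, min_confidence=0.0):
    """Map each matched mention to its URL, skipping low-confidence matches."""
    links = {}
    for entity in entities:
        for match in entity["matches"]:
            if match["confidenceScore"] >= min_confidence:
                links[match["text"]] = entity["url"]
    return links


all_links = links_by_mention(linked_entities)
confident_links = links_by_mention(linked_entities, min_confidence=0.4)
print(confident_links)
```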


KeyPhraseExtraction

Key phrase extraction is one of the features offered by Azure AI Language, a collection of machine learning and AI algorithms in the cloud for developing intelligent applications that involve written language. Use key phrase extraction to quickly identify the main concepts in text. For example, in the text "The food was delicious and the staff were wonderful.", key phrase extraction will return the main topics: "food" and "wonderful staff". Refer to this article for the full list of supported languages.

df = spark.createDataFrame(
    data=[
        ["Microsoft was founded by Bill Gates and Paul Allen on April 4, 1975."],
        ["Dr. Smith has a very modern medical office, and she has great staff."],
    ],
    schema=["text"],
)

key_phrase_extraction = (
    AnalyzeText()
    .setKind("KeyPhraseExtraction")
    .setLocation(ai_service_location)
    .setSubscriptionKey(ai_service_key)
    .setTextCol("text")
    .setOutputCol("key_phrase_extraction")
    .setErrorCol("error")
)

df_results = key_phrase_extraction.transform(df)
display(df_results.select("text", "key_phrase_extraction.documents.keyPhrases"))

This cell should yield a result that looks like:

text	keyPhrases
Microsoft was founded by Bill Gates and Paul Allen on April 4, 1975.	["Bill Gates", "Paul Allen", "Microsoft", "April"]
Dr. Smith has a very modern medical office, and she has great staff.	["modern medical office", "Dr. Smith", "great staff"]

PiiEntityRecognition

The PII detection feature can identify, categorize, and redact sensitive information in unstructured text, such as phone numbers, email addresses, and forms of identification. The method for using PII detection in conversations differs from other use cases, and articles for that use are kept separate. Refer to this article for the full list of supported languages.

df = spark.createDataFrame(
    data=[
        ["Call our office at 312-555-1234, or send an email to support@contoso.com"],
        ["Dr. Smith has a very modern medical office, and she has great staff."],
    ],
    schema=["text"],
)

pii_entity_recognition = (
    AnalyzeText()
    .setKind("PiiEntityRecognition")
    .setLocation(ai_service_location)
    .setSubscriptionKey(ai_service_key)
    .setTextCol("text")
    .setOutputCol("pii_entity_recognition")
    .setErrorCol("error")
)

df_results = pii_entity_recognition.transform(df)
display(df_results.select("text", "pii_entity_recognition.documents.entities"))

This cell should yield a result that looks like:

text	entities
Call our office at 312-555-1234, or send an email to support@contoso.com	[{"category": "PhoneNumber", "confidenceScore": 0.8, "length": 12, "offset": 19, "subcategory": null, "text": "312-555-1234"}, {"category": "Email", "confidenceScore": 0.8, "length": 19, "offset": 53, "subcategory": null, "text": "support@contoso.com"}]
Dr. Smith has a very modern medical office, and she has great staff.	[{"category": "Person", "confidenceScore": 0.93, "length": 5, "offset": 4, "subcategory": null, "text": "Smith"}]
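The output above reports where each PII entity sits but leaves the input text unchanged. The sketch below shows one way to mask those spans locally from `offset` and `length`, assuming the row has been collected as Python dictionaries. The helper `mask_entities` is hypothetical and is not a SynapseML or Azure AI Language API. It also assumes ASCII input, where the service's offsets coincide with Python string indices.

```python
# First sample row from the expected output above, trimmed to the fields used here.
text = "Call our office at 312-555-1234, or send an email to support@contoso.com"
entities = [
    {"category": "PhoneNumber", "offset": 19, "length": 12},
    {"category": "Email", "offset": 53, "length": 19},
]


def mask_entities(text, entities, mask_char="*"):
    """Replace every entity span with mask characters of the same length."""
    chars = list(text)
    for e in entities:
        start, end = e["offset"], e["offset"] + e["length"]
        chars[start:end] = mask_char * (end - start)
    return "".join(chars)


masked = mask_entities(text, entities)
print(masked)
```

Masking with same-length runs keeps all other offsets valid, so the remaining entity positions still point at the right characters after each replacement.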

SentimentAnalysis

Sentiment analysis and opinion mining are features offered by the Language service, a collection of machine learning and AI algorithms in the cloud for developing intelligent applications that involve written language. These features help you find out what people think of your brand or topic by mining text for clues about positive or negative sentiment, and can associate them with specific aspects of the text. Refer to this article for the full list of supported languages.

df = spark.createDataFrame(
    data=[
        ["The food and service were unacceptable. The concierge was nice, however."],
        ["It tastes great."],
    ],
    schema=["text"],
)

sentiment_analysis = (
    AnalyzeText()
    .setKind("SentimentAnalysis")
    .setLocation(ai_service_location)
    .setSubscriptionKey(ai_service_key)
    .setTextCol("text")
    .setOutputCol("sentiment_analysis")
    .setErrorCol("error")
)

df_results = sentiment_analysis.transform(df)
display(df_results.select("text", "sentiment_analysis.documents.sentiment"))

This cell should yield a result that looks like:

text	sentiment
The food and service were unacceptable. The concierge was nice, however.	mixed
It tastes great.	positive

Analyze Text with TextAnalyze

TextAnalyze is deprecated. Use AnalyzeText instead.

df = spark.createDataFrame(
    data=[
        ["en", "Hello Seattle"],
        ["en", "There once was a dog who lived in London and thought she was a human"],
    ],
    schema=["language", "text"],
)

from synapse.ml.services import *

text_analyze = (
    TextAnalyze()
    .setLocation(ai_service_location)
    .setSubscriptionKey(ai_service_key)
    .setTextCol("text")
    .setOutputCol("textAnalysis")
    .setErrorCol("error")
    .setLanguageCol("language")
    .setEntityRecognitionParams(
        {"model-version": "latest"}
    )  # Can pass parameters to each model individually
    .setIncludePii(False)  # Users can manually exclude tasks to speed up analysis
    .setIncludeEntityLinking(False)
    .setIncludeSentimentAnalysis(False)
)

df_results = text_analyze.transform(df)

display(df_results)
