promptflow.evals.evaluators module#
- class promptflow.evals.evaluators.BleuScoreEvaluator#
Bases:
object
Evaluator that computes the BLEU Score between two strings.
BLEU (Bilingual Evaluation Understudy) score is commonly used in natural language processing (NLP) and machine translation. It is widely used in text summarization and text generation use cases. It evaluates how closely the generated text matches the reference text. The BLEU score ranges from 0 to 1, with higher scores indicating better quality.
Usage
eval_fn = BleuScoreEvaluator()
result = eval_fn(
    answer="Tokyo is the capital of Japan.",
    ground_truth="The capital of Japan is Tokyo.")
Output format
{ "bleu_score": 0.22 }
- __call__(*, answer: str, ground_truth: str, **kwargs)#
Evaluate the BLEU score between the answer and the ground truth.
- Parameters:
answer (str) – The answer to be evaluated.
ground_truth (str) – The ground truth to be compared against.
- Returns:
The BLEU score.
- Return type:
dict
- class promptflow.evals.evaluators.ChatEvaluator(model_config: Union[AzureOpenAIModelConfiguration, OpenAIModelConfiguration], eval_last_turn: bool = False, parallel: bool = True)#
Bases:
object
Initialize a chat evaluator configured for a specific Azure OpenAI model.
- Parameters:
model_config (Union[AzureOpenAIModelConfiguration, OpenAIModelConfiguration]) – Configuration for the Azure OpenAI model.
eval_last_turn (bool) – Set to True to evaluate only the most recent exchange in the dialogue, focusing on the latest user inquiry and the assistant’s corresponding response. Defaults to False.
parallel (bool) – If True, use parallel execution for evaluators. Else, use sequential execution. Default is True.
- Returns:
A function that evaluates and generates metrics for the “chat” scenario.
- Return type:
Callable
Usage
chat_eval = ChatEvaluator(model_config)
conversation = [
    {"role": "user", "content": "What is the value of 2 + 2?"},
    {"role": "assistant", "content": "2 + 2 = 4", "context": {
        "citations": [
            {"id": "math_doc.md", "content": "Information about additions: 1 + 2 = 3, 2 + 2 = 4"}
        ]
    }}
]
result = chat_eval(conversation=conversation)
Output format
{ "evaluation_per_turn": { "gpt_retrieval": [1.0, 2.0], "gpt_groundedness": [5.0, 2.0], "gpt_relevance": [3.0, 5.0], "gpt_coherence": [1.0, 2.0], "gpt_fluency": [3.0, 5.0] } "gpt_retrieval": 1.5, "gpt_groundedness": 3.5, "gpt_relevance": 4.0, "gpt_coherence": 1.5, "gpt_fluency": 4.0 }
- __call__(*, conversation, **kwargs)#
Evaluates the chat scenario.
- Parameters:
conversation (List[Dict]) – The conversation to be evaluated. Each turn should have “role” and “content” keys. “context” key is optional for assistant’s turn and should have “citations” key with list of citations.
- Returns:
The scores for Chat scenario.
- Return type:
dict
- class promptflow.evals.evaluators.CoherenceEvaluator(model_config: Union[AzureOpenAIModelConfiguration, OpenAIModelConfiguration])#
Bases:
object
Initialize a coherence evaluator configured for a specific Azure OpenAI model.
- Parameters:
model_config (Union[AzureOpenAIModelConfiguration, OpenAIModelConfiguration]) – Configuration for the Azure OpenAI model.
Usage
eval_fn = CoherenceEvaluator(model_config)
result = eval_fn(
    question="What is the capital of Japan?",
    answer="The capital of Japan is Tokyo.")
Output format
{ "gpt_coherence": 1.0 }
- __call__(*, question: str, answer: str, **kwargs)#
Evaluate coherence.
- Parameters:
question (str) – The question to be evaluated.
answer (str) – The answer to be evaluated.
- Returns:
The coherence score.
- Return type:
Dict[str, float]
- class promptflow.evals.evaluators.ContentSafetyChatEvaluator(project_scope: dict, eval_last_turn: bool = False, parallel: bool = True, credential=None)#
Bases:
object
Initialize a content safety chat evaluator configured to evaluate content safety metrics for the chat scenario.
- Parameters:
project_scope (dict) – The scope of the Azure AI project. It contains subscription id, resource group, and project name.
eval_last_turn (bool) – Set to True to evaluate only the most recent exchange in the dialogue, focusing on the latest user inquiry and the assistant’s corresponding response. Defaults to False.
parallel (bool) – If True, use parallel execution for evaluators. Else, use sequential execution. Default is True.
credential (TokenCredential) – The credential for connecting to Azure AI project.
- Returns:
A function that evaluates and generates metrics for the “chat” scenario.
- Return type:
Callable
Usage
eval_fn = ContentSafetyChatEvaluator(project_scope)
conversation = [
    {"role": "user", "content": "What is the value of 2 + 2?"},
    {"role": "assistant", "content": "2 + 2 = 4"}
]
result = eval_fn(conversation=conversation)
Output format
{ "evaluation_per_turn": { "violence": ["High", "Low"], "violence_score": [7.0, 3.0], "violence_reason": "Some reason", "sexual": ["High", "Low"], "sexual_score": [7.0, 3.0], "sexual_reason": "Some reason", "self_harm": ["High", "Low"], "self_harm_score": [7.0, 3.0], "self_harm_reason": "Some reason", "hate_unfairness": ["High", "Low"], "hate_unfairness_score": [7.0, 3.0], "hate_unfairness_reason": "Some reason" }, "violence": "Medium", "violence_score": 5.0, "sexual": "Medium", "sexual_score": 5.0, "self_harm": "Medium", "self_harm_score": 5.0, "hate_unfairness": "Medium", "hate_unfairness_score": 5.0, }
- __call__(*, conversation, **kwargs)#
Evaluates content-safety metrics for the “chat” scenario.
- Parameters:
conversation (List[Dict]) – The conversation to be evaluated. Each turn should have “role” and “content” keys.
- Returns:
The scores for Chat scenario.
- Return type:
dict
- class promptflow.evals.evaluators.ContentSafetyEvaluator(project_scope: dict, parallel: bool = True, credential=None)#
Bases:
object
Initialize a content safety evaluator configured to evaluate content safety metrics for the QA scenario.
- Parameters:
project_scope (dict) – The scope of the Azure AI project. It contains subscription id, resource group, and project name.
parallel (bool) – If True, use parallel execution for evaluators. Else, use sequential execution. Default is True.
credential (TokenCredential) – The credential for connecting to Azure AI project.
- Returns:
A function that evaluates content-safety metrics for the “question-answering” scenario.
- Return type:
Callable
Usage
project_scope = {
    "subscription_id": "<subscription_id>",
    "resource_group_name": "<resource_group_name>",
    "project_name": "<project_name>",
}
eval_fn = ContentSafetyEvaluator(project_scope)
result = eval_fn(
    question="What is the capital of France?",
    answer="Paris.",
)
Output format
{ "violence": "Medium", "violence_score": 5.0, "violence_reason": "Some reason", "sexual": "Medium", "sexual_score": 5.0, "sexual_reason": "Some reason", "self_harm": "Medium", "self_harm_score": 5.0, "self_harm_reason": "Some reason", "hate_unfairness": "Medium", "hate_unfairness_score": 5.0, "hate_unfairness_reason": "Some reason" }
- __call__(*, question: str, answer: str, **kwargs)#
Evaluates content-safety metrics for the “question-answering” scenario.
- Parameters:
question (str) – The question to be evaluated.
answer (str) – The answer to be evaluated.
parallel (bool) – Whether to evaluate in parallel.
- Returns:
The scores for content-safety.
- Return type:
dict
- class promptflow.evals.evaluators.F1ScoreEvaluator#
Bases:
object
Initialize an F1 score evaluator for calculating the F1 score.
Usage
eval_fn = F1ScoreEvaluator()
result = eval_fn(
    answer="The capital of Japan is Tokyo.",
    ground_truth="Tokyo is Japan's capital, known for its blend of traditional culture and technological advancements.")
Output format
{ "f1_score": 0.42 }
- __call__(*, answer: str, ground_truth: str, **kwargs)#
Evaluate F1 score.
- Parameters:
answer (str) – The answer to be evaluated.
ground_truth (str) – The ground truth to be evaluated.
- Returns:
The F1 score.
- Return type:
dict
- class promptflow.evals.evaluators.FluencyEvaluator(model_config: Union[AzureOpenAIModelConfiguration, OpenAIModelConfiguration])#
Bases:
object
Initialize a fluency evaluator configured for a specific Azure OpenAI model.
- Parameters:
model_config (Union[AzureOpenAIModelConfiguration, OpenAIModelConfiguration]) – Configuration for the Azure OpenAI model.
Usage
eval_fn = FluencyEvaluator(model_config)
result = eval_fn(
    question="What is the capital of Japan?",
    answer="The capital of Japan is Tokyo.")
Output format
{ "gpt_fluency": 4.0 }
- __call__(*, question: str, answer: str, **kwargs)#
Evaluate fluency.
- Parameters:
question (str) – The question to be evaluated.
answer (str) – The answer to be evaluated.
- Returns:
The fluency score.
- Return type:
dict
- class promptflow.evals.evaluators.GleuScoreEvaluator#
Bases:
object
Evaluator that computes the GLEU Score between two strings.
The GLEU (Google-BLEU) score evaluator measures the similarity between generated and reference texts by evaluating n-gram overlap, considering both precision and recall. This balanced evaluation, designed for sentence-level assessment, makes it ideal for detailed analysis of translation quality. GLEU is well-suited for use cases such as machine translation, text summarization, and text generation.
Usage
eval_fn = GleuScoreEvaluator()
result = eval_fn(
    answer="Tokyo is the capital of Japan.",
    ground_truth="The capital of Japan is Tokyo.")
Output format
{ "gleu_score": 0.41 }
- __call__(*, ground_truth: str, answer: str, **kwargs)#
Evaluate the GLEU score between the answer and the ground truth.
- Parameters:
answer (str) – The answer to be evaluated.
ground_truth (str) – The ground truth to be compared against.
- Returns:
The GLEU score.
- Return type:
dict
- class promptflow.evals.evaluators.GroundednessEvaluator(model_config: Union[AzureOpenAIModelConfiguration, OpenAIModelConfiguration])#
Bases:
object
Initialize a groundedness evaluator configured for a specific Azure OpenAI model.
- Parameters:
model_config (Union[AzureOpenAIModelConfiguration, OpenAIModelConfiguration]) – Configuration for the Azure OpenAI model.
Usage
eval_fn = GroundednessEvaluator(model_config)
result = eval_fn(
    answer="The capital of Japan is Tokyo.",
    context="Tokyo is Japan's capital, known for its blend of traditional culture and technological advancements.")
Output format
{ "gpt_groundedness": 5 }
- __call__(*, answer: str, context: str, **kwargs)#
Evaluate groundedness of the answer in the context.
- Parameters:
answer (str) – The answer to be evaluated.
context (str) – The context in which the answer is evaluated.
- Returns:
The groundedness score.
- Return type:
dict
- class promptflow.evals.evaluators.HateUnfairnessEvaluator(project_scope: dict, credential=None)#
Bases:
object
Initialize a hate-unfairness evaluator for computing the hate-unfairness score.
- Parameters:
project_scope (dict) – The scope of the Azure AI project. It contains subscription id, resource group, and project name.
credential (TokenCredential) – The credential for connecting to Azure AI project.
Usage
project_scope = {
    "subscription_id": "<subscription_id>",
    "resource_group_name": "<resource_group_name>",
    "project_name": "<project_name>",
}
eval_fn = HateUnfairnessEvaluator(project_scope)
result = eval_fn(question="What is the capital of France?", answer="Paris.")
Output format
{ "hate_unfairness": "High", "hate_unfairness_score": 6.5, "hate_unfairness_reason": "Some reason" }
- __call__(*, question: str, answer: str, **kwargs)#
Evaluates hate unfairness content.
- Parameters:
question (str) – The question to be evaluated.
answer (str) – The answer to be evaluated.
- Returns:
The hate unfairness score.
- Return type:
dict
- class promptflow.evals.evaluators.IndirectAttackEvaluator(project_scope: dict, eval_last_turn: bool = False, credential=None)#
Bases:
object
Initialize an XPIA (cross-domain prompt injected attack) jailbreak evaluator to detect whether cross-domain injected attacks are present in your AI system’s response.
- Parameters:
project_scope (dict) – The scope of the Azure AI project. It contains subscription id, resource group, and project name.
eval_last_turn (bool) – Set to True to evaluate only the most recent exchange in the dialogue, focusing on the latest user inquiry and the assistant’s corresponding response. Defaults to False.
credential (TokenCredential) – The credential for connecting to Azure AI project.
- Returns:
A function that evaluates and generates metrics for the XPIA chat scenario. Metrics include the overall evaluation label and reason for the Q/A pair, as well as sub-labels for manipulated content, intrusion, and information gathering.
- Return type:
Callable
Usage
eval_fn = IndirectAttackEvaluator(project_scope)
result = eval_fn(question="What is the capital of France?", answer="Paris.")
Output format for question-answer pair
{
    "xpia_label": False,
    "xpia_reason": "The conversation does not contain any manipulated content, intrusion or information gathering.",
    "xpia_information_gathering": False,
    "xpia_intrusion": False,
    "xpia_manipulated_content": False
}
- __call__(*, question: Optional[str], answer: Optional[str], **kwargs)#
Evaluates content for the presence of attacks injected into the conversation context that attempt to interrupt normal, expected functionality by eliciting manipulated content, intruding into the system, or gathering information outside the scope of your AI system.
- Parameters:
question (Optional[str]) – The question to be evaluated. Mutually exclusive with ‘conversation’.
answer (Optional[str]) – The answer to be evaluated. Mutually exclusive with ‘conversation’.
- Returns:
The evaluation scores and reasoning.
- Return type:
dict
- class promptflow.evals.evaluators.MeteorScoreEvaluator(alpha: float = 0.9, beta: float = 3.0, gamma: float = 0.5)#
Bases:
object
Evaluator that computes the METEOR Score between two strings.
The METEOR (Metric for Evaluation of Translation with Explicit Ordering) score grader evaluates generated text by comparing it to reference texts, focusing on precision, recall, and content alignment. It addresses limitations of other metrics like BLEU by considering synonyms, stemming, and paraphrasing. METEOR score considers synonyms and word stems to more accurately capture meaning and language variations. In addition to machine translation and text summarization, paraphrase detection is an optimal use case for the METEOR score.
- Parameters:
alpha (float) – The METEOR score alpha parameter. Default is 0.9.
beta (float) – The METEOR score beta parameter. Default is 3.0.
gamma (float) – The METEOR score gamma parameter. Default is 0.5.
Usage
eval_fn = MeteorScoreEvaluator(
    alpha=0.9,
    beta=3.0,
    gamma=0.5
)
result = eval_fn(
    answer="Tokyo is the capital of Japan.",
    ground_truth="The capital of Japan is Tokyo.")
Output format
{ "meteor_score": 0.62 }
- __call__(*, ground_truth: str, answer: str, **kwargs)#
Evaluate the METEOR score between the answer and the ground truth.
- Parameters:
answer (str) – The answer to be evaluated.
ground_truth (str) – The ground truth to be compared against.
- Returns:
The METEOR score.
- Return type:
dict
- class promptflow.evals.evaluators.ProtectedMaterialEvaluator(project_scope: dict, credential=None)#
Bases:
object
Initialize a protected material evaluator to detect whether protected material is present in your AI system’s response. Outputs True or False with AI-generated reasoning.
- Parameters:
project_scope (dict) – The scope of the Azure AI project. It contains subscription id, resource group, and project name.
credential (TokenCredential) – The credential for connecting to Azure AI project.
- Returns:
Whether or not protected material was found in the response, with AI-generated reasoning.
- Return type:
Dict[str, str]
Usage
project_scope = {
    "subscription_id": "<subscription_id>",
    "resource_group_name": "<resource_group_name>",
    "project_name": "<project_name>",
}
eval_fn = ProtectedMaterialEvaluator(project_scope)
result = eval_fn(question="What is the capital of France?", answer="Paris.")
Output format
{ "protected_material_label": "False", "protected_material_reason": "This question does not contain any protected material." }
- __call__(*, question: str, answer: str, **kwargs)#
Evaluates protected material content.
- Parameters:
question (str) – The question to be evaluated.
answer (str) – The answer to be evaluated.
- Returns:
A dictionary containing a boolean label and reasoning.
- Return type:
dict
- class promptflow.evals.evaluators.QAEvaluator(model_config: Union[AzureOpenAIModelConfiguration, OpenAIModelConfiguration], parallel: bool = True)#
Bases:
object
Initialize a question-answer evaluator configured for a specific Azure OpenAI model.
- Parameters:
model_config (Union[AzureOpenAIModelConfiguration, OpenAIModelConfiguration]) – Configuration for the Azure OpenAI model.
parallel (bool) – If True, use parallel execution for evaluators. Else, use sequential execution. Default is True.
- Returns:
A function that evaluates and generates metrics for the “question-answering” scenario.
- Return type:
Callable
Usage
eval_fn = QAEvaluator(model_config)
result = eval_fn(
    question="Tokyo is the capital of which country?",
    answer="Japan",
    context="Tokyo is the capital of Japan.",
    ground_truth="Japan"
)
Output format
{ "gpt_groundedness": 3.5, "gpt_relevance": 4.0, "gpt_coherence": 1.5, "gpt_fluency": 4.0, "gpt_similarity": 3.0, "f1_score": 0.42 }
- __call__(*, question: str, answer: str, context: str, ground_truth: str, **kwargs)#
Evaluates the question-answering scenario.
- Parameters:
question (str) – The question to be evaluated.
answer (str) – The answer to be evaluated.
context (str) – The context to be evaluated.
ground_truth (str) – The ground truth to be evaluated.
parallel (bool) – Whether to evaluate in parallel. Defaults to True.
- Returns:
The scores for QA scenario.
- Return type:
dict
- class promptflow.evals.evaluators.RelevanceEvaluator(model_config: Union[AzureOpenAIModelConfiguration, OpenAIModelConfiguration])#
Bases:
object
Initialize a relevance evaluator configured for a specific Azure OpenAI model.
- Parameters:
model_config (Union[AzureOpenAIModelConfiguration, OpenAIModelConfiguration]) – Configuration for the Azure OpenAI model.
Usage
eval_fn = RelevanceEvaluator(model_config)
result = eval_fn(
    question="What is the capital of Japan?",
    answer="The capital of Japan is Tokyo.",
    context="Tokyo is Japan's capital, known for its blend of traditional culture and technological advancements.")
Output format
{ "gpt_relevance": 3.0 }
- __call__(*, question: str, answer: str, context: str, **kwargs)#
Evaluate relevance.
- Parameters:
question (str) – The question to be evaluated.
answer (str) – The answer to be evaluated.
context (str) – The context to be evaluated.
- Returns:
The relevance score.
- Return type:
dict
- class promptflow.evals.evaluators.RougeScoreEvaluator(rouge_type: RougeType)#
Bases:
object
Evaluator that computes the ROUGE scores between two strings.
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a set of metrics used to evaluate automatic summarization and machine translation. It measures the overlap between generated text and reference summaries. ROUGE focuses on recall-oriented measures to assess how well the generated text covers the reference text. Text summarization and document comparison are among optimal use cases for ROUGE, particularly in scenarios where text coherence and relevance are critical.
Usage
eval_fn = RougeScoreEvaluator(rouge_type=RougeType.ROUGE_1)
result = eval_fn(
    answer="Tokyo is the capital of Japan.",
    ground_truth="The capital of Japan is Tokyo.")
Output format
{ "rouge_precision": 1.0, "rouge_recall": 1.0, "rouge_f1_score": 1.0 }
- __call__(*, ground_truth: str, answer: str, **kwargs)#
Evaluate the ROUGE score between the answer and the ground truth.
- Parameters:
answer (str) – The answer to be evaluated.
ground_truth (str) – The ground truth to be compared against.
- Returns:
The ROUGE score.
- Return type:
dict
- class promptflow.evals.evaluators.RougeType(value)#
Bases:
str, Enum
Enumeration of ROUGE (Recall-Oriented Understudy for Gisting Evaluation) types.
- ROUGE_1 = 'rouge1'#
Overlap of unigrams (single words) between generated and reference text.
- ROUGE_2 = 'rouge2'#
Overlap of bigrams (two consecutive words) between generated and reference text.
- ROUGE_3 = 'rouge3'#
Overlap of trigrams (three consecutive words) between generated and reference text.
- ROUGE_4 = 'rouge4'#
Overlap of four-grams (four consecutive words) between generated and reference text.
- ROUGE_5 = 'rouge5'#
Overlap of five-grams (five consecutive words) between generated and reference text.
- ROUGE_L = 'rougeL'#
Overlap based on the longest common subsequence (LCS) between generated and reference text.
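The enum members above can be passed straight to RougeScoreEvaluator. A short sketch that scores the same answer/ground-truth pair under several ROUGE variants:
from promptflow.evals.evaluators import RougeScoreEvaluator, RougeType

# Compare how different ROUGE variants score the same pair.
for rouge_type in (RougeType.ROUGE_1, RougeType.ROUGE_2, RougeType.ROUGE_L):
    eval_fn = RougeScoreEvaluator(rouge_type=rouge_type)
    result = eval_fn(
        answer="Tokyo is the capital of Japan.",
        ground_truth="The capital of Japan is Tokyo.")
    print(rouge_type.value, result)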
- class promptflow.evals.evaluators.SelfHarmEvaluator(project_scope: dict, credential=None)#
Bases:
object
Initialize a self-harm evaluator for computing the self-harm score.
- Parameters:
project_scope (dict) – The scope of the Azure AI project. It contains subscription id, resource group, and project name.
credential (TokenCredential) – The credential for connecting to Azure AI project.
Usage
project_scope = {
    "subscription_id": "<subscription_id>",
    "resource_group_name": "<resource_group_name>",
    "project_name": "<project_name>",
}
eval_fn = SelfHarmEvaluator(project_scope)
result = eval_fn(question="What is the capital of France?", answer="Paris.")
Output format
{ "self_harm": "High", "self_harm_score": 6.5, "self_harm_reason": "Some reason" }
- __call__(*, question: str, answer: str, **kwargs)#
Evaluates self harm content.
- Parameters:
question (str) – The question to be evaluated.
answer (str) – The answer to be evaluated.
- Returns:
The self harm score.
- Return type:
dict
- class promptflow.evals.evaluators.SexualEvaluator(project_scope: dict, credential=None)#
Bases:
object
Initialize a sexual content evaluator for computing the sexual content score.
- Parameters:
project_scope (dict) – The scope of the Azure AI project. It contains subscription id, resource group, and project name.
credential (TokenCredential) – The credential for connecting to Azure AI project.
Usage
project_scope = {
    "subscription_id": "<subscription_id>",
    "resource_group_name": "<resource_group_name>",
    "project_name": "<project_name>",
}
eval_fn = SexualEvaluator(project_scope)
result = eval_fn(question="What is the capital of France?", answer="Paris.")
Output format
{ "sexual": "High", "sexual_score": 6.5, "sexual_reason": "Some reason" }
- __call__(*, question: str, answer: str, **kwargs)#
Evaluates sexual content.
- Parameters:
question (str) – The question to be evaluated.
answer (str) – The answer to be evaluated.
- Returns:
The sexual score.
- Return type:
dict
- class promptflow.evals.evaluators.SimilarityEvaluator(model_config: Union[AzureOpenAIModelConfiguration, OpenAIModelConfiguration])#
Bases:
object
Initialize a similarity evaluator configured for a specific Azure OpenAI model.
- Parameters:
model_config (Union[AzureOpenAIModelConfiguration, OpenAIModelConfiguration]) – Configuration for the Azure OpenAI model.
Usage
eval_fn = SimilarityEvaluator(model_config)
result = eval_fn(
    question="What is the capital of Japan?",
    answer="The capital of Japan is Tokyo.",
    ground_truth="Tokyo is Japan's capital.")
Output format
{ "gpt_similarity": 3.0 }
- __call__(*, question: str, answer: str, ground_truth: str, **kwargs)#
Evaluate similarity.
- Parameters:
question (str) – The question to be evaluated.
answer (str) – The answer to be evaluated.
ground_truth (str) – The ground truth to be evaluated.
- Returns:
The similarity score.
- Return type:
dict
- class promptflow.evals.evaluators.ViolenceEvaluator(project_scope: dict, credential=None)#
Bases:
object
Initialize a violence evaluator for computing the violence score.
- Parameters:
project_scope (dict) – The scope of the Azure AI project. It contains subscription id, resource group, and project name.
credential (TokenCredential) – The credential for connecting to Azure AI project.
Usage
project_scope = {
    "subscription_id": "<subscription_id>",
    "resource_group_name": "<resource_group_name>",
    "project_name": "<project_name>",
}
eval_fn = ViolenceEvaluator(project_scope)
result = eval_fn(question="What is the capital of France?", answer="Paris.")
Output format
{ "violence": "High", "violence_score": 6.5, "violence_reason": "Some reason" }
- __call__(*, question: str, answer: str, **kwargs)#
Evaluates violence content.
- Parameters:
question (str) – The question to be evaluated.
answer (str) – The answer to be evaluated.
- Returns:
The violence score.
- Return type:
dict