promptflow.evals.evaluators module#

class promptflow.evals.evaluators.ChatEvaluator(model_config: AzureOpenAIModelConfiguration, eval_last_turn: bool = False, parallel: bool = True)#

Bases: object

Initialize a chat evaluator configured for a specific Azure OpenAI model.

Parameters:
  • model_config (AzureOpenAIModelConfiguration) – Configuration for the Azure OpenAI model.

  • eval_last_turn (bool) – Set to True to evaluate only the most recent exchange in the dialogue, focusing on the latest user inquiry and the assistant’s corresponding response. Defaults to False.

  • parallel (bool) – If True, use parallel execution for evaluators. Else, use sequential execution. Default is True.

Returns:

A function that evaluates and generates metrics for the “chat” scenario.

Return type:

function

Usage

chat_eval = ChatEvaluator(model_config)
conversation = [
    {"role": "user", "content": "What is the value of 2 + 2?"},
    {"role": "assistant", "content": "2 + 2 = 4", "context": {
        "citations": [
                {"id": "math_doc.md", "content": "Information about additions: 1 + 2 = 3, 2 + 2 = 4"}
                ]
        }
    }
]
result = chat_eval(conversation=conversation)
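
The example assumes model_config has already been constructed. A minimal sketch, assuming AzureOpenAIModelConfiguration is imported from promptflow.core and using placeholder resource values:

from promptflow.core import AzureOpenAIModelConfiguration

# Placeholder values; substitute your own Azure OpenAI resource details.
model_config = AzureOpenAIModelConfiguration(
    azure_endpoint="https://<account_name>.openai.azure.com",
    api_key="<api_key>",
    azure_deployment="<deployment_name>",
)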

Output format

{
    "evaluation_per_turn": {
        "gpt_retrieval": [1.0, 2.0],
        "gpt_groundedness": [5.0, 2.0],
        "gpt_relevance": [3.0, 5.0],
        "gpt_coherence": [1.0, 2.0],
        "gpt_fluency": [3.0, 5.0]
    },
    "gpt_retrieval": 1.5,
    "gpt_groundedness": 3.5,
    "gpt_relevance": 4.0,
    "gpt_coherence": 1.5,
    "gpt_fluency": 4.0
}
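
The per-turn lists in evaluation_per_turn line up with turn order, so individual turns can be inspected with plain Python; a small sketch based on the output shape above:

# Pair each turn with its relevance score from the documented output shape.
per_turn = result["evaluation_per_turn"]
for turn_index, score in enumerate(per_turn["gpt_relevance"]):
    print(f"Turn {turn_index}: gpt_relevance={score}")

# Aggregate scores are available at the top level, e.g. result["gpt_relevance"].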
__call__(*, conversation, **kwargs)#

Evaluates the chat scenario.

Parameters:

conversation (List[Dict]) – The conversation to be evaluated. Each turn should have “role” and “content” keys. The “context” key is optional for the assistant’s turns; if present, it should contain a “citations” key with a list of citations.

Returns:

The scores for the chat scenario.

Return type:

dict

class promptflow.evals.evaluators.CoherenceEvaluator(model_config: AzureOpenAIModelConfiguration)#

Bases: object

Initialize a coherence evaluator configured for a specific Azure OpenAI model.

Parameters:

model_config (AzureOpenAIModelConfiguration) – Configuration for the Azure OpenAI model.

Usage

eval_fn = CoherenceEvaluator(model_config)
result = eval_fn(
    question="What is the capital of Japan?",
    answer="The capital of Japan is Tokyo.")

Output format

{
    "gpt_coherence": 1.0
}
__call__(*, question: str, answer: str, **kwargs)#

Evaluate coherence.

Parameters:
  • question (str) – The question to be evaluated.

  • answer (str) – The answer to be evaluated.

Returns:

The coherence score.

Return type:

dict

class promptflow.evals.evaluators.ContentSafetyChatEvaluator(project_scope: dict, eval_last_turn: bool = False, parallel: bool = True, credential=None)#

Bases: object

Initialize a content safety chat evaluator configured to evaluate content safety metrics for the chat scenario.

Parameters:
  • project_scope (dict) – The scope of the Azure AI project. It contains subscription id, resource group, and project name.

  • eval_last_turn (bool) – Set to True to evaluate only the most recent exchange in the dialogue, focusing on the latest user inquiry and the assistant’s corresponding response. Defaults to False.

  • parallel (bool) – If True, use parallel execution for evaluators. Else, use sequential execution. Default is True.

  • credential (TokenCredential) – The credential for connecting to Azure AI project.

Returns:

A function that evaluates and generates metrics for the “chat” scenario.

Return type:

function

Usage

project_scope = {
    "subscription_id": "<subscription_id>",
    "resource_group_name": "<resource_group_name>",
    "project_name": "<project_name>",
}
eval_fn = ContentSafetyChatEvaluator(project_scope)
conversation = [
    {"role": "user", "content": "What is the value of 2 + 2?"},
    {"role": "assistant", "content": "2 + 2 = 4"}
]
result = eval_fn(conversation=conversation)

Output format

{
    "evaluation_per_turn": {
        "violence": ["High", "Low"],
        "violence_score": [7.0, 3.0],
        "violence_reason": "Some reason",
        "sexual": ["High", "Low"],
        "sexual_score": [7.0, 3.0],
        "sexual_reason": "Some reason",
        "self_harm": ["High", "Low"],
        "self_harm_score": [7.0, 3.0],
        "self_harm_reason": "Some reason",
        "hate_unfairness": ["High", "Low"],
        "hate_unfairness_score": [7.0, 3.0],
        "hate_unfairness_reason": "Some reason"
    },
    "violence": "Medium",
    "violence_score": 5.0,
    "sexual": "Medium",
    "sexual_score": 5.0,
    "self_harm": "Medium",
    "self_harm_score": 5.0,
    "hate_unfairness": "Medium",
    "hate_unfairness_score": 5.0,
}
__call__(*, conversation, **kwargs)#

Evaluates content-safety metrics for the “chat” scenario.

Parameters:

conversation (List[Dict]) – The conversation to be evaluated. Each turn should have “role” and “content” keys.

Returns:

The scores for the chat scenario.

Return type:

dict

class promptflow.evals.evaluators.ContentSafetyEvaluator(project_scope: dict, parallel: bool = True, credential=None)#

Bases: object

Initialize a content safety evaluator configured to evaluate content safety metrics for the QA scenario.

Parameters:
  • project_scope (dict) – The scope of the Azure AI project. It contains subscription id, resource group, and project name.

  • parallel – If True, use parallel execution for evaluators. Else, use sequential execution. Default is True.

  • credential (TokenCredential) – The credential for connecting to Azure AI project.

Returns:

A function that evaluates content-safety metrics for the “question-answering” scenario.

Return type:

function

Usage

project_scope = {
    "subscription_id": "<subscription_id>",
    "resource_group_name": "<resource_group_name>",
    "project_name": "<project_name>",
}
eval_fn = ContentSafetyEvaluator(project_scope)
result = eval_fn(
    question="What is the capital of France?",
    answer="Paris.",
)
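
The credential parameter accepts a TokenCredential. A hedged sketch using DefaultAzureCredential from the azure-identity package, assuming that package is installed and an Azure login is available in the environment:

from azure.identity import DefaultAzureCredential

# DefaultAzureCredential implements TokenCredential and resolves environment
# variables, managed identity, or an Azure CLI login automatically.
eval_fn = ContentSafetyEvaluator(project_scope, credential=DefaultAzureCredential())
result = eval_fn(question="What is the capital of France?", answer="Paris.")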

Output format

{
    "violence": "Medium",
    "violence_score": 5.0,
    "violence_reason": "Some reason",
    "sexual": "Medium",
    "sexual_score": 5.0,
    "sexual_reason": "Some reason",
    "self_harm": "Medium",
    "self_harm_score": 5.0,
    "self_harm_reason": "Some reason",
    "hate_unfairness": "Medium",
    "hate_unfairness_score": 5.0,
    "hate_unfairness_reason": "Some reason"
}
__call__(*, question: str, answer: str, **kwargs)#

Evaluates content-safety metrics for the “question-answering” scenario.

Parameters:
  • question (str) – The question to be evaluated.

  • answer (str) – The answer to be evaluated.

  • parallel (bool) – Whether to evaluate in parallel.

Returns:

The scores for content-safety.

Return type:

dict

class promptflow.evals.evaluators.F1ScoreEvaluator#

Bases: object

Initialize an F1 score evaluator for calculating the F1 score.

Usage

eval_fn = F1ScoreEvaluator()
result = eval_fn(
    answer="The capital of Japan is Tokyo.",
    ground_truth="Tokyo is Japan's capital, known for its blend of traditional culture                 and technological advancements.")

Output format

{
    "f1_score": 0.42
}
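
For intuition, the metric is a token-overlap F1 between the answer and the ground truth. The sketch below uses naive lowercase whitespace tokenization, which is an assumption and not necessarily the evaluator's exact preprocessing:

from collections import Counter

def token_f1(answer: str, ground_truth: str) -> float:
    # Naive tokenization; the evaluator's own normalization may differ.
    answer_tokens = answer.lower().split()
    truth_tokens = ground_truth.lower().split()
    common = Counter(answer_tokens) & Counter(truth_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(answer_tokens)
    recall = overlap / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)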
__call__(*, answer: str, ground_truth: str, **kwargs)#

Evaluate F1 score.

Parameters:
  • answer (str) – The answer to be evaluated.

  • ground_truth (str) – The ground truth to be evaluated.

Returns:

The F1 score.

Return type:

dict

class promptflow.evals.evaluators.FluencyEvaluator(model_config: AzureOpenAIModelConfiguration)#

Bases: object

Initialize a fluency evaluator configured for a specific Azure OpenAI model.

Parameters:

model_config (AzureOpenAIModelConfiguration) – Configuration for the Azure OpenAI model.

Usage

eval_fn = FluencyEvaluator(model_config)
result = eval_fn(
    question="What is the capital of Japan?",
    answer="The capital of Japan is Tokyo.")

Output format

{
    "gpt_fluency": 4.0
}
__call__(*, question: str, answer: str, **kwargs)#

Evaluate fluency.

Parameters:
  • question (str) – The question to be evaluated.

  • answer (str) – The answer to be evaluated.

Returns:

The fluency score.

Return type:

dict

class promptflow.evals.evaluators.GroundednessEvaluator(model_config: AzureOpenAIModelConfiguration)#

Bases: object

Initialize a groundedness evaluator configured for a specific Azure OpenAI model.

Parameters:

model_config (AzureOpenAIModelConfiguration) – Configuration for the Azure OpenAI model.

Usage

eval_fn = GroundednessEvaluator(model_config)
result = eval_fn(
    answer="The capital of Japan is Tokyo.",
    context="Tokyo is Japan's capital, known for its blend of traditional culture                 and technological advancements.")

Output format

{
    "gpt_groundedness": 5
}
__call__(*, answer: str, context: str, **kwargs)#

Evaluate groundedness of the answer in the context.

Parameters:
  • answer (str) – The answer to be evaluated.

  • context (str) – The context in which the answer is evaluated.

Returns:

The groundedness score.

Return type:

dict

class promptflow.evals.evaluators.HateUnfairnessEvaluator(project_scope: dict, credential=None)#

Bases: object

Initialize a hate-unfairness evaluator for computing a hate-unfairness score.

Parameters:
  • project_scope (dict) – The scope of the Azure AI project. It contains subscription id, resource group, and project name.

  • credential (TokenCredential) – The credential for connecting to Azure AI project.

Usage

project_scope = {
    "subscription_id": "<subscription_id>",
    "resource_group_name": "<resource_group_name>",
    "project_name": "<project_name>",
}
eval_fn = HateUnfairnessEvaluator(project_scope)
result = eval_fn(question="What is the capital of France?", answer="Paris.")

Output format

{
    "hate_unfairness": "High",
    "hate_unfairness_score": 6.5,
    "hate_unfairness_reason": "Some reason"
}
__call__(*, question: str, answer: str, **kwargs)#

Evaluates hate unfairness content.

Parameters:
  • question (str) – The question to be evaluated.

  • answer (str) – The answer to be evaluated.

Returns:

The hate unfairness score.

Return type:

dict

class promptflow.evals.evaluators.QAEvaluator(model_config: AzureOpenAIModelConfiguration, parallel: bool = True)#

Bases: object

Initialize a question-answer evaluator configured for a specific Azure OpenAI model.

Parameters:
  • model_config (AzureOpenAIModelConfiguration) – Configuration for the Azure OpenAI model.

  • parallel (bool) – If True, use parallel execution for evaluators. Else, use sequential execution. Default is True.

Returns:

A function that evaluates and generates metrics for the “question-answering” scenario.

Return type:

function

Usage

eval_fn = QAEvaluator(model_config)
result = eval_fn(
    question="Tokyo is the capital of which country?",
    answer="Japan",
    context="Tokyo is the capital of Japan.",
    ground_truth="Japan"
)

Output format

{
    "gpt_groundedness": 3.5,
    "gpt_relevance": 4.0,
    "gpt_coherence": 1.5,
    "gpt_fluency": 4.0,
    "gpt_similarity": 3.0,
    "f1_score": 0.42
}
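
To score multiple records, the evaluator instance can simply be called in a loop; a minimal sketch over an in-memory list (the records shown are illustrative):

records = [
    {
        "question": "Tokyo is the capital of which country?",
        "answer": "Japan",
        "context": "Tokyo is the capital of Japan.",
        "ground_truth": "Japan",
    },
]
results = [eval_fn(**record) for record in records]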
__call__(*, question: str, answer: str, context: str, ground_truth: str, **kwargs)#

Evaluates the question-answering scenario.

Parameters:
  • question (str) – The question to be evaluated.

  • answer (str) – The answer to be evaluated.

  • context (str) – The context to be evaluated.

  • ground_truth (str) – The ground truth to be evaluated.

  • parallel (bool) – Whether to evaluate in parallel. Defaults to True.

Returns:

The scores for the QA scenario.

Return type:

dict

class promptflow.evals.evaluators.RelevanceEvaluator(model_config: AzureOpenAIModelConfiguration)#

Bases: object

Initialize a relevance evaluator configured for a specific Azure OpenAI model.

Parameters:

model_config (AzureOpenAIModelConfiguration) – Configuration for the Azure OpenAI model.

Usage

eval_fn = RelevanceEvaluator(model_config)
result = eval_fn(
    question="What is the capital of Japan?",
    answer="The capital of Japan is Tokyo.",
    context="Tokyo is Japan's capital, known for its blend of traditional culture                 and technological advancements.")

Output format

{
    "gpt_relevance": 3.0
}
__call__(*, question: str, answer: str, context: str, **kwargs)#

Evaluate relevance.

Parameters:
  • question (str) – The question to be evaluated.

  • answer (str) – The answer to be evaluated.

  • context (str) – The context to be evaluated.

Returns:

The relevance score.

Return type:

dict

class promptflow.evals.evaluators.SelfHarmEvaluator(project_scope: dict, credential=None)#

Bases: object

Initialize a self-harm evaluator for computing a self-harm score.

Parameters:
  • project_scope (dict) – The scope of the Azure AI project. It contains subscription id, resource group, and project name.

  • credential (TokenCredential) – The credential for connecting to Azure AI project.

Usage

project_scope = {
    "subscription_id": "<subscription_id>",
    "resource_group_name": "<resource_group_name>",
    "project_name": "<project_name>",
}
eval_fn = SelfHarmEvaluator(project_scope)
result = eval_fn(question="What is the capital of France?", answer="Paris.")

Output format

{
    "self_harm": "High",
    "self_harm_score": 6.5,
    "self_harm_reason": "Some reason"
}
__call__(*, question: str, answer: str, **kwargs)#

Evaluates self harm content.

Parameters:
  • question (str) – The question to be evaluated.

  • answer (str) – The answer to be evaluated.

Returns:

The self harm score.

Return type:

dict

class promptflow.evals.evaluators.SexualEvaluator(project_scope: dict, credential=None)#

Bases: object

Initialize a sexual content evaluator for computing a sexual content score.

Parameters:
  • project_scope (dict) – The scope of the Azure AI project. It contains subscription id, resource group, and project name.

  • credential (TokenCredential) – The credential for connecting to Azure AI project.

Usage

project_scope = {
    "subscription_id": "<subscription_id>",
    "resource_group_name": "<resource_group_name>",
    "project_name": "<project_name>",
}
eval_fn = SexualEvaluator(project_scope)
result = eval_fn(question="What is the capital of France?", answer="Paris.")

Output format

{
    "sexual": "High",
    "sexual_score": 6.5,
    "sexual_reason": "Some reason"
}
__call__(*, question: str, answer: str, **kwargs)#

Evaluates sexual content.

Parameters:
  • question (str) – The question to be evaluated.

  • answer (str) – The answer to be evaluated.

Returns:

The sexual score.

Return type:

dict

class promptflow.evals.evaluators.SimilarityEvaluator(model_config: AzureOpenAIModelConfiguration)#

Bases: object

Initialize a similarity evaluator configured for a specific Azure OpenAI model.

Parameters:

model_config (AzureOpenAIModelConfiguration) – Configuration for the Azure OpenAI model.

Usage

eval_fn = SimilarityEvaluator(model_config)
result = eval_fn(
    question="What is the capital of Japan?",
    answer="The capital of Japan is Tokyo.",
    ground_truth="Tokyo is Japan's capital.")

Output format

{
    "gpt_similarity": 3.0
}
__call__(*, question: str, answer: str, ground_truth: str, **kwargs)#

Evaluate similarity.

Parameters:
  • question (str) – The question to be evaluated.

  • answer (str) – The answer to be evaluated.

  • ground_truth (str) – The ground truth to be evaluated.

Returns:

The similarity score.

Return type:

dict

class promptflow.evals.evaluators.ViolenceEvaluator(project_scope: dict, credential=None)#

Bases: object

Initialize a violence evaluator for computing a violence score.

Parameters:
  • project_scope (dict) – The scope of the Azure AI project. It contains subscription id, resource group, and project name.

  • credential (TokenCredential) – The credential for connecting to Azure AI project.

Usage

project_scope = {
    "subscription_id": "<subscription_id>",
    "resource_group_name": "<resource_group_name>",
    "project_name": "<project_name>",
}
eval_fn = ViolenceEvaluator(project_scope)
result = eval_fn(question="What is the capital of France?", answer="Paris.")

Output format

{
    "violence": "High",
    "violence_score": 6.5,
    "violence_reason": "Some reason"
}
__call__(*, question: str, answer: str, **kwargs)#

Evaluates violence content.

Parameters:
  • question (str) – The question to be evaluated.

  • answer (str) – The answer to be evaluated.

Returns:

The violence score.

Return type:

dict