promptflow.evals.evaluators module#
- class promptflow.evals.evaluators.ChatEvaluator(model_config: AzureOpenAIModelConfiguration, eval_last_turn: bool = False, parallel: bool = True)#
Bases:
object
Initialize a chat evaluator configured for a specific Azure OpenAI model.
- Parameters:
model_config (AzureOpenAIModelConfiguration) – Configuration for the Azure OpenAI model; a construction sketch follows the return type below.
eval_last_turn (bool) – Set to True to evaluate only the most recent exchange in the dialogue, focusing on the latest user inquiry and the assistant’s corresponding response. Defaults to False.
parallel (bool) – If True, use parallel execution for evaluators. Else, use sequential execution. Default is True.
- Returns:
A function that evaluates and generates metrics for the “chat” scenario.
- Return type:
function
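The model-based evaluators in this module all take an AzureOpenAIModelConfiguration. A minimal sketch of constructing one, assuming the class is importable from promptflow.core and using placeholder values for the resource details:

from promptflow.core import AzureOpenAIModelConfiguration  # assumed import path

# Placeholder values; substitute your own Azure OpenAI resource details.
model_config = AzureOpenAIModelConfiguration(
    azure_endpoint="https://<account_name>.openai.azure.com",
    api_key="<api_key>",
    azure_deployment="<deployment_name>",
    api_version="<api_version>",
)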
Usage
chat_eval = ChatEvaluator(model_config)
conversation = [
    {"role": "user", "content": "What is the value of 2 + 2?"},
    {
        "role": "assistant",
        "content": "2 + 2 = 4",
        "context": {
            "citations": [
                {"id": "math_doc.md", "content": "Information about additions: 1 + 2 = 3, 2 + 2 = 4"}
            ]
        },
    },
]
result = chat_eval(conversation=conversation)
Output format
{ "evaluation_per_turn": { "gpt_retrieval": [1.0, 2.0], "gpt_groundedness": [5.0, 2.0], "gpt_relevance": [3.0, 5.0], "gpt_coherence": [1.0, 2.0], "gpt_fluency": [3.0, 5.0] } "gpt_retrieval": 1.5, "gpt_groundedness": 3.5, "gpt_relevance": 4.0, "gpt_coherence": 1.5, "gpt_fluency": 4.0 }
- __call__(*, conversation, **kwargs)#
Evaluates the chat scenario.
- Parameters:
conversation (List[Dict]) – The conversation to be evaluated. Each turn should have “role” and “content” keys. The “context” key is optional for assistant turns and, if present, should contain a “citations” key with a list of citations.
- Returns:
The scores for Chat scenario.
- Return type:
dict
- class promptflow.evals.evaluators.CoherenceEvaluator(model_config: AzureOpenAIModelConfiguration)#
Bases:
object
Initialize a coherence evaluator configured for a specific Azure OpenAI model.
- Parameters:
model_config (AzureOpenAIModelConfiguration) – Configuration for the Azure OpenAI model.
Usage
eval_fn = CoherenceEvaluator(model_config)
result = eval_fn(
    question="What is the capital of Japan?",
    answer="The capital of Japan is Tokyo.")
Output format
{ "gpt_coherence": 1.0 }
- __call__(*, question: str, answer: str, **kwargs)#
Evaluate coherence.
- Parameters:
question (str) – The question to be evaluated.
answer (str) – The answer to be evaluated.
- Returns:
The coherence score.
- Return type:
dict
- class promptflow.evals.evaluators.ContentSafetyChatEvaluator(project_scope: dict, eval_last_turn: bool = False, parallel: bool = True, credential=None)#
Bases:
object
Initialize a content safety chat evaluator configured to evaluate content safety metrics for the chat scenario.
- Parameters:
project_scope (dict) – The scope of the Azure AI project. It contains subscription id, resource group, and project name.
eval_last_turn (bool) – Set to True to evaluate only the most recent exchange in the dialogue, focusing on the latest user inquiry and the assistant’s corresponding response. Defaults to False.
parallel (bool) – If True, use parallel execution for evaluators. Else, use sequential execution. Default is True.
credential (TokenCredential) – The credential for connecting to the Azure AI project; see the sketch after the output format below.
- Returns:
A function that evaluates and generates metrics for the “chat” scenario.
- Return type:
function
Usage
project_scope = {
    "subscription_id": "<subscription_id>",
    "resource_group_name": "<resource_group_name>",
    "project_name": "<project_name>",
}
eval_fn = ContentSafetyChatEvaluator(project_scope)
conversation = [
    {"role": "user", "content": "What is the value of 2 + 2?"},
    {"role": "assistant", "content": "2 + 2 = 4"}
]
result = eval_fn(conversation=conversation)
Output format
{ "evaluation_per_turn": { "violence": ["High", "Low"], "violence_score": [7.0, 3.0], "violence_reason": "Some reason", "sexual": ["High", "Low"], "sexual_score": [7.0, 3.0], "sexual_reason": "Some reason", "self_harm": ["High", "Low"], "self_harm_score": [7.0, 3.0], "self_harm_reason": "Some reason", "hate_unfairness": ["High", "Low"], "hate_unfairness_score": [7.0, 3.0], "hate_unfairness_reason": "Some reason" }, "violence": "Medium", "violence_score": 5.0, "sexual": "Medium", "sexual_score": 5.0, "self_harm": "Medium", "self_harm_score": 5.0, "hate_unfairness": "Medium", "hate_unfairness_score": 5.0, }
- __call__(*, conversation, **kwargs)#
Evaluates content-safety metrics for the “chat” scenario.
- Parameters:
conversation (List[Dict]) – The conversation to be evaluated. Each turn should have “role” and “content” keys.
- Returns:
The scores for Chat scenario.
- Return type:
dict
- class promptflow.evals.evaluators.ContentSafetyEvaluator(project_scope: dict, parallel: bool = True, credential=None)#
Bases:
object
Initialize a content safety evaluator configured to evaluate content safety metrics for the QA scenario.
- Parameters:
project_scope (dict) – The scope of the Azure AI project. It contains subscription id, resource group, and project name.
parallel (bool) – If True, use parallel execution for evaluators. Else, use sequential execution. Default is True.
credential (TokenCredential) – The credential for connecting to Azure AI project.
- Returns:
A function that evaluates content-safety metrics for the “question-answering” scenario.
- Return type:
function
Usage
project_scope = {
    "subscription_id": "<subscription_id>",
    "resource_group_name": "<resource_group_name>",
    "project_name": "<project_name>",
}
eval_fn = ContentSafetyEvaluator(project_scope)
result = eval_fn(
    question="What is the capital of France?",
    answer="Paris.",
)
Output format
{ "violence": "Medium", "violence_score": 5.0, "violence_reason": "Some reason", "sexual": "Medium", "sexual_score": 5.0, "sexual_reason": "Some reason", "self_harm": "Medium", "self_harm_score": 5.0, "self_harm_reason": "Some reason", "hate_unfairness": "Medium", "hate_unfairness_score": 5.0, "hate_unfairness_reason": "Some reason" }
- __call__(*, question: str, answer: str, **kwargs)#
Evaluates content-safety metrics for the “question-answering” scenario.
- Parameters:
question (str) – The question to be evaluated.
answer (str) – The answer to be evaluated.
parallel (bool) – Whether to evaluate in parallel.
- Returns:
The scores for content-safety.
- Return type:
dict
- class promptflow.evals.evaluators.F1ScoreEvaluator#
Bases:
object
Initialize an F1 score evaluator for calculating the F1 score.
Usage
eval_fn = F1ScoreEvaluator()
result = eval_fn(
    answer="The capital of Japan is Tokyo.",
    ground_truth="Tokyo is Japan's capital, known for its blend of traditional culture and technological advancements.")
Output format
{ "f1_score": 0.42 }
- __call__(*, answer: str, ground_truth: str, **kwargs)#
Evaluate F1 score.
- Parameters:
answer (str) – The answer to be evaluated.
ground_truth (str) – The ground truth to be evaluated.
- Returns:
The F1 score.
- Return type:
dict
- class promptflow.evals.evaluators.FluencyEvaluator(model_config: AzureOpenAIModelConfiguration)#
Bases:
object
Initialize a fluency evaluator configured for a specific Azure OpenAI model.
- Parameters:
model_config (AzureOpenAIModelConfiguration) – Configuration for the Azure OpenAI model.
Usage
eval_fn = FluencyEvaluator(model_config)
result = eval_fn(
    question="What is the capital of Japan?",
    answer="The capital of Japan is Tokyo.")
Output format
{ "gpt_fluency": 4.0 }
- __call__(*, question: str, answer: str, **kwargs)#
Evaluate fluency.
- Parameters:
question (str) – The question to be evaluated.
answer (str) – The answer to be evaluated.
- Returns:
The fluency score.
- Return type:
dict
- class promptflow.evals.evaluators.GroundednessEvaluator(model_config: AzureOpenAIModelConfiguration)#
Bases:
object
Initialize a groundedness evaluator configured for a specific Azure OpenAI model.
- Parameters:
model_config (AzureOpenAIModelConfiguration) – Configuration for the Azure OpenAI model.
Usage
eval_fn = GroundednessEvaluator(model_config)
result = eval_fn(
    answer="The capital of Japan is Tokyo.",
    context="Tokyo is Japan's capital, known for its blend of traditional culture and technological advancements.")
Output format
{ "gpt_groundedness": 5 }
- __call__(*, answer: str, context: str, **kwargs)#
Evaluate groundedness of the answer in the context.
- Parameters:
answer (str) – The answer to be evaluated.
context (str) – The context in which the answer is evaluated.
- Returns:
The groundedness score.
- Return type:
dict
- class promptflow.evals.evaluators.HateUnfairnessEvaluator(project_scope: dict, credential=None)#
Bases:
object
Initialize a hate-unfairness evaluator for computing the hate-unfairness score.
- Parameters:
project_scope (dict) – The scope of the Azure AI project. It contains subscription id, resource group, and project name.
credential (TokenCredential) – The credential for connecting to Azure AI project.
Usage
project_scope = {
    "subscription_id": "<subscription_id>",
    "resource_group_name": "<resource_group_name>",
    "project_name": "<project_name>",
}
eval_fn = HateUnfairnessEvaluator(project_scope)
result = eval_fn(question="What is the capital of France?", answer="Paris.")
Output format
{ "hate_unfairness": "High", "hate_unfairness_score": 6.5, "hate_unfairness_reason": "Some reason" }
- __call__(*, question: str, answer: str, **kwargs)#
Evaluates content for hate and unfairness.
- Parameters:
question (str) – The question to be evaluated.
answer (str) – The answer to be evaluated.
- Returns:
The hate unfairness score.
- Return type:
dict
- class promptflow.evals.evaluators.QAEvaluator(model_config: AzureOpenAIModelConfiguration, parallel: bool = True)#
Bases:
object
Initialize a question-answer evaluator configured for a specific Azure OpenAI model.
- Parameters:
model_config (AzureOpenAIModelConfiguration) – Configuration for the Azure OpenAI model.
parallel (bool) – If True, use parallel execution for evaluators. Else, use sequential execution. Default is True.
- Returns:
A function that evaluates and generates metrics for the “question-answering” scenario.
- Return type:
function
Usage
eval_fn = QAEvaluator(model_config)
result = eval_fn(
    question="Tokyo is the capital of which country?",
    answer="Japan",
    context="Tokyo is the capital of Japan.",
    ground_truth="Japan"
)
Output format
{ "gpt_groundedness": 3.5, "gpt_relevance": 4.0, "gpt_coherence": 1.5, "gpt_fluency": 4.0, "gpt_similarity": 3.0, "f1_score": 0.42 }
- __call__(*, question: str, answer: str, context: str, ground_truth: str, **kwargs)#
Evaluates the question-answering scenario.
- Parameters:
question (str) – The question to be evaluated.
answer (str) – The answer to be evaluated.
context (str) – The context to be evaluated.
ground_truth (str) – The ground truth to be evaluated.
parallel (bool) – Whether to evaluate in parallel. Defaults to True.
- Returns:
The scores for QA scenario.
- Return type:
dict
- class promptflow.evals.evaluators.RelevanceEvaluator(model_config: AzureOpenAIModelConfiguration)#
Bases:
object
Initialize a relevance evaluator configured for a specific Azure OpenAI model.
- Parameters:
model_config (AzureOpenAIModelConfiguration) – Configuration for the Azure OpenAI model.
Usage
eval_fn = RelevanceEvaluator(model_config)
result = eval_fn(
    question="What is the capital of Japan?",
    answer="The capital of Japan is Tokyo.",
    context="Tokyo is Japan's capital, known for its blend of traditional culture and technological advancements.")
Output format
{ "gpt_relevance": 3.0 }
- __call__(*, question: str, answer: str, context: str, **kwargs)#
Evaluate relevance.
- Parameters:
question (str) – The question to be evaluated.
answer (str) – The answer to be evaluated.
context (str) – The context to be evaluated.
- Returns:
The relevance score.
- Return type:
dict
- class promptflow.evals.evaluators.SelfHarmEvaluator(project_scope: dict, credential=None)#
Bases:
object
Initialize a self-harm evaluator for computing the self-harm score.
- Parameters:
project_scope (dict) – The scope of the Azure AI project. It contains subscription id, resource group, and project name.
credential (TokenCredential) – The credential for connecting to Azure AI project.
Usage
project_scope = {
    "subscription_id": "<subscription_id>",
    "resource_group_name": "<resource_group_name>",
    "project_name": "<project_name>",
}
eval_fn = SelfHarmEvaluator(project_scope)
result = eval_fn(question="What is the capital of France?", answer="Paris.")
Output format
{ "self_harm": "High", "self_harm_score": 6.5, "self_harm_reason": "Some reason" }
- __call__(*, question: str, answer: str, **kwargs)#
Evaluates content for self-harm.
- Parameters:
question (str) – The question to be evaluated.
answer (str) – The answer to be evaluated.
- Returns:
The self harm score.
- Return type:
dict
- class promptflow.evals.evaluators.SexualEvaluator(project_scope: dict, credential=None)#
Bases:
object
Initialize a sexual content evaluator for computing the sexual score.
- Parameters:
project_scope (dict) – The scope of the Azure AI project. It contains subscription id, resource group, and project name.
credential (TokenCredential) – The credential for connecting to Azure AI project.
Usage
project_scope = {
    "subscription_id": "<subscription_id>",
    "resource_group_name": "<resource_group_name>",
    "project_name": "<project_name>",
}
eval_fn = SexualEvaluator(project_scope)
result = eval_fn(question="What is the capital of France?", answer="Paris.")
Output format
{ "sexual": "High", "sexual_score": 6.5, "sexual_reason": "Some reason" }
- __call__(*, question: str, answer: str, **kwargs)#
Evaluates sexual content.
- Parameters:
question (str) – The question to be evaluated.
answer (str) – The answer to be evaluated.
- Returns:
The sexual score.
- Return type:
dict
- class promptflow.evals.evaluators.SimilarityEvaluator(model_config: AzureOpenAIModelConfiguration)#
Bases:
object
Initialize a similarity evaluator configured for a specific Azure OpenAI model.
- Parameters:
model_config (AzureOpenAIModelConfiguration) – Configuration for the Azure OpenAI model.
Usage
eval_fn = SimilarityEvaluator(model_config)
result = eval_fn(
    question="What is the capital of Japan?",
    answer="The capital of Japan is Tokyo.",
    ground_truth="Tokyo is Japan's capital.")
Output format
{ "gpt_similarity": 3.0 }
- __call__(*, question: str, answer: str, ground_truth: str, **kwargs)#
Evaluate similarity.
- Parameters:
question (str) – The question to be evaluated.
answer (str) – The answer to be evaluated.
ground_truth (str) – The ground truth to be evaluated.
- Returns:
The similarity score.
- Return type:
dict
- class promptflow.evals.evaluators.ViolenceEvaluator(project_scope: dict, credential=None)#
Bases:
object
Initialize a violence evaluator for computing the violence score.
- Parameters:
project_scope (dict) – The scope of the Azure AI project. It contains subscription id, resource group, and project name.
credential (TokenCredential) – The credential for connecting to Azure AI project.
Usage
project_scope = {
    "subscription_id": "<subscription_id>",
    "resource_group_name": "<resource_group_name>",
    "project_name": "<project_name>",
}
eval_fn = ViolenceEvaluator(project_scope)
result = eval_fn(question="What is the capital of France?", answer="Paris.")
Output format
{ "violence": "High", "violence_score": 6.5, "violence_reason": "Some reason" }
- __call__(*, question: str, answer: str, **kwargs)#
Evaluates content for violence.
- Parameters:
question (str) – The question to be evaluated.
answer (str) – The answer to be evaluated.
- Returns:
The violence score.
- Return type:
dict