agentchat.contrib.agent_eval.agent_eval
generate_criteria
def generate_criteria(llm_config: Optional[Union[Dict, Literal[False]]] = None,
                      task: Task = None,
                      additional_instructions: str = "",
                      max_round=2,
                      use_subcritic: bool = False)
Creates a list of criteria for evaluating the utility of a given task.
Arguments:
llm_config
dict or bool - llm inference configuration.
task
Task - The task to evaluate.
additional_instructions
str - Additional instructions for the criteria agent.
max_round
int - The maximum number of rounds to run the conversation.
use_subcritic
bool - Whether to use the subcritic agent to generate subcriteria.
Returns:
list
- A list of Criterion objects for evaluating the utility of the given task.
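For orientation, a minimal usage sketch (not part of this module's docs): it assumes Task can be imported from the sibling task module and constructed with name, description, successful_response, and failed_response fields, and that Criterion objects expose name and accepted_values attributes. Verify import paths and field names against your AutoGen version.

from autogen.agentchat.contrib.agent_eval.agent_eval import generate_criteria
from autogen.agentchat.contrib.agent_eval.task import Task  # assumed import path

# Hypothetical task; field names follow the AgentEval examples and may differ by version.
task = Task(
    name="Math problem solving",
    description="Solve the given math problem as accurately and concisely as possible.",
    successful_response="The answer is 5, because 2 + 3 = 5.",
    failed_response="The answer is 6.",
)

llm_config = {"config_list": [{"model": "gpt-4"}]}  # standard AutoGen llm_config shape

criteria = generate_criteria(
    llm_config=llm_config,
    task=task,
    additional_instructions="Focus on accuracy and clarity of the solution.",
    max_round=4,
    use_subcritic=False,
)

for criterion in criteria:
    # Assumes Criterion exposes name and accepted_values attributes.
    print(criterion.name, criterion.accepted_values)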
quantify_criteria
def quantify_criteria(llm_config: Optional[Union[Dict, Literal[False]]] = None,
                      criteria: List[Criterion] = None,
                      task: Task = None,
                      test_case: str = "",
                      ground_truth: str = "")
Quantifies the performance of a system using the provided criteria.
Arguments:
llm_config
dict or bool - llm inference configuration.
criteria
[Criterion] - A list of criteria for evaluating the utility of a given task.
task
Task - The task to evaluate.
test_case
str - The test case to evaluate.
ground_truth
str - The ground truth for the test case.
Returns:
dict
- A dictionary where the keys are the criteria and the values are the assessed performance based on the accepted values for each criterion.
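A companion sketch for quantify_criteria, continuing from the generate_criteria example above; the test case string below is illustrative (the signature only requires a plain string), and the exact shape of the returned dictionary may vary across versions.

from autogen.agentchat.contrib.agent_eval.agent_eval import quantify_criteria

# Reuses task, criteria, and llm_config from the generate_criteria sketch above.
# Hypothetical transcript of the system answering one problem, passed as a string.
test_case = "Problem: What is 12 * 7?\nSystem response: 12 * 7 = 84."

assessment = quantify_criteria(
    llm_config=llm_config,
    criteria=criteria,
    task=task,
    test_case=test_case,
    ground_truth="84",
)

# Per the docstring above, the result maps each criterion to its assessed performance.
print(assessment)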