EvaluationAgent 🧐
The objective of the `EvaluationAgent` is to evaluate whether a `Session` or `Round` has been successfully completed. The `EvaluationAgent` assesses the performance of the `HostAgent` and `AppAgent` in fulfilling the request. You can configure whether to enable the `EvaluationAgent` in the `config_dev.yaml` file; the detailed documentation can be found here.
Note
The `EvaluationAgent` is fully LLM-driven and conducts evaluations based on the action trajectories and screenshots. It may not be 100% accurate, since the LLM may make mistakes.
Configuration
To enable the `EvaluationAgent`, configure the following parameters in the `config_dev.yaml` file to evaluate the task completion status at different levels:
| Configuration Option | Description | Type | Default Value |
| --- | --- | --- | --- |
| `EVA_SESSION` | Whether to include the session in the evaluation. | Boolean | True |
| `EVA_ROUND` | Whether to include the round in the evaluation. | Boolean | False |
| `EVA_ALL_SCREENSHOTS` | Whether to include all the screenshots in the evaluation. | Boolean | True |
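For example, a minimal sketch of the evaluation-related entries in `config_dev.yaml` could look like the following. The option names and default values are taken from the table above; the real file contains many other options that are omitted here.

```yaml
# Evaluation-related options in config_dev.yaml (other options omitted)
EVA_SESSION: True          # Evaluate completion at the session level
EVA_ROUND: False           # Skip per-round evaluation
EVA_ALL_SCREENSHOTS: True  # Use every captured screenshot, not just the first and last
```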
Evaluation Inputs
The `EvaluationAgent` takes the following inputs for evaluation:

| Input | Description | Type |
| --- | --- | --- |
| User Request | The user's request to be evaluated. | String |
| APIs Description | The description of the APIs used in the execution. | List of Strings |
| Action Trajectories | The action trajectories executed by the `HostAgent` and `AppAgent`. | List of Strings |
| Screenshots | The screenshots captured during the execution. | List of Images |
For more details on how to construct the inputs, please refer to the `EvaluationAgentPrompter` class in `ufo/prompter/eva_prompter.py`.
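As a rough illustration of how such inputs might be packed into a multimodal prompt for the LLM, the sketch below assembles an OpenAI-style message list from the four inputs above. The function names (`build_evaluation_message`, `encode_image`) and the exact message layout are hypothetical; the actual construction logic lives in `EvaluationAgentPrompter`.

```python
import base64
from typing import Dict, List


def encode_image(path: str) -> str:
    """Encode a screenshot file as a base64 data URL (hypothetical helper)."""
    with open(path, "rb") as f:
        data = base64.b64encode(f.read()).decode("utf-8")
    return f"data:image/png;base64,{data}"


def build_evaluation_message(
    request: str,
    api_descriptions: List[str],
    action_trajectories: List[str],
    screenshot_paths: List[str],
) -> List[Dict]:
    """Assemble a single user message combining text and images (sketch only)."""
    content: List[Dict] = [
        {"type": "text", "text": f"User request: {request}"},
        {"type": "text", "text": "Available APIs:\n" + "\n".join(api_descriptions)},
        {"type": "text", "text": "Action trajectory:\n" + "\n".join(action_trajectories)},
    ]
    # Attach each screenshot as an image entry alongside the text inputs.
    for path in screenshot_paths:
        content.append({"type": "image_url", "image_url": {"url": encode_image(path)}})
    return [{"role": "user", "content": content}]
```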
Tip
You can configure whether to use all screenshots or only the first and last screenshot for evaluation via the `EVA_ALL_SCREENSHOTS` option in the `config_dev.yaml` file.
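In effect, this option toggles between passing the full screenshot list and only its endpoints. A small sketch of that selection (the function name and guard for short lists are illustrative assumptions):

```python
from typing import List


def select_screenshots(screenshots: List[str], use_all: bool) -> List[str]:
    """Return all screenshots, or only the first and last one."""
    if use_all or len(screenshots) <= 2:
        return screenshots
    return [screenshots[0], screenshots[-1]]
```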
Evaluation Outputs
The `EvaluationAgent` generates the following outputs after evaluation:

| Output | Description | Type |
| --- | --- | --- |
| `reason` | The detailed reason for the judgment, based on the observed screenshot differences and action trajectories. | String |
| `sub_scores` | The sub-scores of the evaluation, decomposing the evaluation into multiple sub-goals. | List of Dictionaries |
| `complete` | The completion status of the evaluation; can be `yes`, `no`, or `unsure`. | String |
Below is an example of the evaluation output:

    {
        "reason": "The agent successfully completed the task of sending 'hello' to Zac on Microsoft Teams. The initial screenshot shows the Microsoft Teams application with the chat window of Chaoyun Zhang open. The agent then focused on the chat window, input the message 'hello', and clicked the Send button. The final screenshot confirms that the message 'hello' was sent to Zac.",
        "sub_scores": {
            "correct application focus": "yes",
            "correct message input": "yes",
            "message sent successfully": "yes"
        },
        "complete": "yes"
    }
Info
The log of the evaluation results will be saved in the `logs/{task_name}/evaluation.log` file.
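If you want to inspect the result programmatically, a minimal sketch for loading the log is shown below. It assumes each record is stored as a JSON object on its own line, matching the example output above; the on-disk format is an assumption here, so adjust the parsing to whatever you find in `evaluation.log`.

```python
import json

# Hypothetical task name; substitute the actual folder under logs/.
log_file = "logs/{task_name}/evaluation.log".format(task_name="example_task")

with open(log_file, "r", encoding="utf-8") as f:
    for line in f:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)  # assumes one JSON object per line
        print(record.get("complete"), record.get("reason"))
```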
The `EvaluationAgent` employs the Chain-of-Thought (CoT) mechanism to first decompose the evaluation into multiple sub-goals and then evaluate each sub-goal separately. The sub-scores are then aggregated to determine the overall completion status of the evaluation.
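One simple way to picture the aggregation step is sketched below. In practice the aggregation is performed by the LLM as part of its judgment, so this deterministic function is only an illustrative approximation.

```python
from typing import Dict


def aggregate_sub_scores(sub_scores: Dict[str, str]) -> str:
    """Derive an overall completion status from per-sub-goal scores (illustrative only)."""
    scores = set(sub_scores.values())
    if scores == {"yes"}:
        return "yes"      # every sub-goal satisfied
    if "no" in scores:
        return "no"       # at least one sub-goal clearly failed
    return "unsure"       # mixed or uncertain sub-goal outcomes


# Example using the sub_scores from the output shown above.
print(aggregate_sub_scores({
    "correct application focus": "yes",
    "correct message input": "yes",
    "message sent successfully": "yes",
}))  # -> "yes"
```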
Reference
Bases: `BasicAgent`

The agent for evaluation.

Initialize the `EvaluationAgent`.

Parameters:
- `agent_type`: The type of the agent.
- `is_visual`: The flag indicating whether the agent is visual or not.

Source code in `agents/agent/evaluation_agent.py`
`status_manager: EvaluatonAgentStatus` (property)

Get the status manager.
`evaluate(request, log_path)`

Evaluate the task completion.
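A hypothetical usage sketch follows; `agent` stands for an already-initialized `EvaluationAgent` instance, and the construction arguments and exact return type are omitted because they are not shown in this reference.

```python
# Hypothetical call; `agent` is an already-initialized EvaluationAgent.
result = agent.evaluate(
    request="Send 'hello' to Zac on Microsoft Teams.",
    log_path="logs/example_task/",
)
print(result)
```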
`get_prompter(is_visual, prompt_template, example_prompt_template, api_prompt_template, root_name=None)`

Get the prompter for the agent.
`message_constructor(log_path, request)`

Construct the message.
`print_response(response_dict)`

Print the response of the evaluation.
`process_comfirmation()`

Confirmation step; currently does nothing.