EvaluationAgent 🧐

The objective of the EvaluationAgent is to evaluate whether a Session or Round has been successfully completed. The EvaluationAgent assesses the performance of the HostAgent and AppAgent in fulfilling the request. You can configure whether to enable the EvaluationAgent in the config_dev.yaml file; the detailed documentation can be found here.

Note

The EvaluationAgent is fully LLM-driven and conducts its evaluation based on the action trajectories and screenshots. It may not be 100% accurate, since the LLM can make mistakes.

Configuration

To enable the EvaluationAgent, you can configure the following parameters in the config_dev.yaml file to evaluate the task completion status at different levels:

| Configuration Option | Description | Type | Default Value |
| --- | --- | --- | --- |
| EVA_SESSION | Whether to include the session in the evaluation. | Boolean | True |
| EVA_ROUND | Whether to include the round in the evaluation. | Boolean | False |
| EVA_ALL_SCREENSHOTS | Whether to include all the screenshots in the evaluation. | Boolean | True |
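For example, below is a minimal sketch of the corresponding entries in config_dev.yaml, assuming the default values listed above:

# Evaluation switches in config_dev.yaml (values shown are the defaults above).
EVA_SESSION: True          # Evaluate task completion at the session level.
EVA_ROUND: False           # Evaluate task completion at the round level.
EVA_ALL_SCREENSHOTS: True  # Include all screenshots, not just the first and last one.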

Evaluation Inputs

The EvaluationAgent takes the following inputs for evaluation:

| Input | Description | Type |
| --- | --- | --- |
| User Request | The user's request to be evaluated. | String |
| APIs Description | The description of the APIs used in the execution. | List of Strings |
| Action Trajectories | The action trajectories executed by the HostAgent and AppAgent. | List of Strings |
| Screenshots | The screenshots captured during the execution. | List of Images |

For more details on how to construct the inputs, please refer to the EvaluationAgentPrompter class in ufo/prompter/eva_prompter.py.
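As a rough illustration, the sketch below bundles these four inputs into a plain Python dictionary. The field names, API entries, and file paths are illustrative assumptions only; the real message is assembled by the EvaluationAgentPrompter.

# Illustrative only: the actual message is built by EvaluationAgentPrompter.
evaluation_inputs = {
    "request": "Send 'hello' to Zac on Microsoft Teams.",
    "api_descriptions": [        # hypothetical API descriptions
        "click_input: click a control with the mouse",
        "keyboard_input: type text into the focused control",
    ],
    "action_trajectories": [     # one entry per executed step
        "Step 1: focus the chat window of Zac.",
        "Step 2: type 'hello' and click the Send button.",
    ],
    "screenshots": [             # paths to the captured screenshots
        "logs/teams_hello_task/action_step1.png",
        "logs/teams_hello_task/action_step2.png",
    ],
}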

Tip

You can configure whether to use all screenshots or only the first and last screenshots for evaluation via the EVA_ALL_SCREENSHOTS option in the config_dev.yaml file.

Evaluation Outputs

The EvaluationAgent generates the following outputs after evaluation:

| Output | Description | Type |
| --- | --- | --- |
| reason | The detailed reason for the judgment, based on the observed screenshot differences and the action trajectories. | String |
| sub_scores | The sub-scores of the evaluation, decomposing the evaluation into multiple sub-goals. | List of Dictionaries |
| complete | The completion status of the evaluation; can be yes, no, or unsure. | String |

Below is an example of the evaluation output:

{
    "reason": "The agent successfully completed the task of sending 'hello' to Zac on Microsoft Teams. The initial screenshot shows the Microsoft Teams application with the chat window of Chaoyun Zhang open. The agent then focused on the chat window, input the message 'hello', and clicked the Send button. The final screenshot confirms that the message 'hello' was sent to Zac.",
    "sub_scores": {
        "correct application focus": "yes",
        "correct message input": "yes",
        "message sent successfully": "yes"
    },
    "complete": "yes"
}

Info

The log of the evaluation results will be saved in the logs/{task_name}/evaluation.log file.

The EvaluationAgent employs a Chain-of-Thought (CoT) mechanism to first decompose the evaluation into multiple sub-goals and then evaluate each sub-goal separately. The sub-scores are then aggregated to determine the overall completion status of the evaluation.
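The aggregation itself is performed by the LLM as part of the same response, but the underlying idea can be sketched as a simple conservative rule: the task only counts as complete if every sub-goal succeeded. The function below is illustrative and is not UFO's implementation.

# Illustrative only: a conservative aggregation rule over sub-scores.
# In UFO, the LLM itself produces the overall "complete" field.
def aggregate_sub_scores(sub_scores: dict) -> str:
    judgments = set(sub_scores.values())
    if judgments == {"yes"}:
        return "yes"      # every sub-goal succeeded
    if "no" in judgments:
        return "no"       # at least one sub-goal clearly failed
    return "unsure"       # mixed or uncertain judgments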

Reference

Bases: BasicAgent

The agent for evaluation.

Initialize the EvaluationAgent. :param name: The name of the agent. :param app_root_name: The root name of the application. :param is_visual: The flag indicating whether the agent is visual or not. :param main_prompt: The main prompt template. :param example_prompt: The example prompt template. :param api_prompt: The API prompt template.

Source code in agents/agent/evaluation_agent.py
def __init__(
    self,
    name: str,
    app_root_name: str,
    is_visual: bool,
    main_prompt: str,
    example_prompt: str,
    api_prompt: str,
):
    """
    Initialize the EvaluationAgent.
    :param name: The name of the agent.
    :param app_root_name: The root name of the application.
    :param is_visual: The flag indicating whether the agent is visual or not.
    :param main_prompt: The main prompt template.
    :param example_prompt: The example prompt template.
    :param api_prompt: The API prompt template.
    """

    super().__init__(name=name)

    self._app_root_name = app_root_name
    self.prompter = self.get_prompter(
        is_visual,
        main_prompt,
        example_prompt,
        api_prompt,
        app_root_name,
    )
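A hedged sketch of constructing the agent directly is shown below; the import path is inferred from the source location above, and the prompt file paths are placeholders rather than UFO's actual configuration values.

# Illustrative only: the import path is inferred from agents/agent/evaluation_agent.py,
# and the prompt paths are placeholders, not UFO's real configuration values.
from ufo.agents.agent.evaluation_agent import EvaluationAgent

eva_agent = EvaluationAgent(
    name="eva_agent",
    app_root_name="Teams.exe",
    is_visual=True,
    main_prompt="path/to/evaluation_prompt.yaml",
    example_prompt="path/to/evaluation_examples.yaml",
    api_prompt="path/to/api_prompt.yaml",
)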

status_manager: EvaluatonAgentStatus property

Get the status manager.

evaluate(request, log_path, eva_all_screenshots=True)

Evaluate the task completion.

Parameters:
  • request (str) –

    The request to be evaluated.

  • log_path (str) –

    The path to the log file.

  • eva_all_screenshots (bool, default: True ) –

    The flag indicating whether to evaluate all screenshots.

Returns:
  • Tuple[Dict[str, str], float]

    The evaluation result and the cost of the LLM call.

Source code in agents/agent/evaluation_agent.py
def evaluate(
    self, request: str, log_path: str, eva_all_screenshots: bool = True
) -> Tuple[Dict[str, str], float]:
    """
    Evaluate the task completion.
    :param request: The request to be evaluated.
    :param log_path: The path to the log file.
    :param eva_all_screenshots: The flag indicating whether to evaluate all screenshots.
    :return: The evaluation result and the cost of LLM.
    """

    message = self.message_constructor(
        log_path=log_path, request=request, eva_all_screenshots=eva_all_screenshots
    )
    result, cost = self.get_response(
        message=message, namescope="app", use_backup_engine=True
    )

    result = json_parser(result)

    return result, cost
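Continuing the sketch above, evaluate can then be invoked on a finished session's log folder; the request string and log path below are placeholders.

# Illustrative only: the log path points to a finished session's log folder.
result, cost = eva_agent.evaluate(
    request="Send 'hello' to Zac on Microsoft Teams.",
    log_path="logs/teams_hello_task/",
    eva_all_screenshots=True,
)
print(result.get("complete"), cost)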

get_prompter(is_visual, prompt_template, example_prompt_template, api_prompt_template, root_name=None)

Get the prompter for the agent.

Source code in agents/agent/evaluation_agent.py
def get_prompter(
    self,
    is_visual,
    prompt_template: str,
    example_prompt_template: str,
    api_prompt_template: str,
    root_name: Optional[str] = None,
) -> EvaluationAgentPrompter:
    """
    Get the prompter for the agent.
    """

    return EvaluationAgentPrompter(
        is_visual=is_visual,
        prompt_template=prompt_template,
        example_prompt_template=example_prompt_template,
        api_prompt_template=api_prompt_template,
        root_name=root_name,
    )

message_constructor(log_path, request, eva_all_screenshots=True)

Construct the message.

Parameters:
  • log_path (str) –

    The path to the log file.

  • request (str) –

    The request.

  • eva_all_screenshots (bool, default: True ) –

    The flag indicating whether to evaluate all screenshots.

Returns:
  • Dict[str, Any]

    The message.

Source code in agents/agent/evaluation_agent.py
def message_constructor(
    self, log_path: str, request: str, eva_all_screenshots: bool = True
) -> Dict[str, Any]:
    """
    Construct the message.
    :param log_path: The path to the log file.
    :param request: The request.
    :param eva_all_screenshots: The flag indicating whether to evaluate all screenshots.
    :return: The message.
    """

    evaagent_prompt_system_message = self.prompter.system_prompt_construction()

    evaagent_prompt_user_message = self.prompter.user_content_construction(
        log_path=log_path, request=request, eva_all_screenshots=eva_all_screenshots
    )

    evaagent_prompt_message = self.prompter.prompt_construction(
        evaagent_prompt_system_message, evaagent_prompt_user_message
    )

    return evaagent_prompt_message
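If you only want to inspect the prompt that would be sent to the LLM, message_constructor can be called on its own with the same arguments as evaluate; the values below are the same placeholders as in the earlier sketch.

# Illustrative only: build the evaluation prompt without calling the LLM.
message = eva_agent.message_constructor(
    log_path="logs/teams_hello_task/",
    request="Send 'hello' to Zac on Microsoft Teams.",
    eva_all_screenshots=True,
)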

print_response(response_dict)

Print the response of the evaluation.

Parameters:
  • response_dict (Dict[str, Any]) –

    The response dictionary.

Source code in agents/agent/evaluation_agent.py
def print_response(self, response_dict: Dict[str, Any]) -> None:
    """
    Print the response of the evaluation.
    :param response_dict: The response dictionary.
    """

    emoji_map = {
        "yes": "✅",
        "no": "❌",
        "maybe": "❓",
    }

    complete = emoji_map.get(
        response_dict.get("complete"), response_dict.get("complete")
    )

    sub_scores = response_dict.get("sub_scores", {})
    reason = response_dict.get("reason", "")

    print_with_color(f"Evaluation result🧐:", "magenta")
    print_with_color(f"[Sub-scores📊:]", "green")

    for score, evaluation in sub_scores.items():
        print_with_color(
            f"{score}: {emoji_map.get(evaluation, evaluation)}", "green"
        )

    print_with_color(
        "[Task is complete💯:] {complete}".format(complete=complete), "cyan"
    )

    print_with_color(f"[Reason🤔:] {reason}".format(reason=reason), "blue")

process_comfirmation()

Confirmation; currently does nothing.

Source code in agents/agent/evaluation_agent.py
def process_comfirmation(self) -> None:
    """
    Confirmation; currently does nothing.
    """
    pass