EvaluationAgent 🧐

The objective of the EvaluationAgent is to evaluate whether a Session or Round has been successfully completed. The EvaluationAgent assesses the performance of the HostAgent and AppAgent in fulfilling the request. You can configure whether to enable the EvaluationAgent in the config_dev.yaml file; the detailed documentation can be found here.

Note

The EvaluationAgent is fully LLM-driven and conducts its evaluation based on the action trajectories and screenshots. It may not be 100% accurate, since the LLM can make mistakes.

Configuration

To enable the EvaluationAgent, you can configure the following parameters in the config_dev.yaml file to evaluate the task completion status at different levels:

| Configuration Option | Description | Type | Default Value |
| --- | --- | --- | --- |
| EVA_SESSION | Whether to include the session in the evaluation. | Boolean | True |
| EVA_ROUND | Whether to include the round in the evaluation. | Boolean | False |
| EVA_ALL_SCREENSHOTS | Whether to include all the screenshots in the evaluation. | Boolean | True |
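For example, below is a minimal sketch of the corresponding entries in config_dev.yaml, assuming the default values listed above:

# Evaluation switches in config_dev.yaml (values shown are the defaults above).
EVA_SESSION: True          # Evaluate task completion at the session level.
EVA_ROUND: False           # Evaluate task completion at the round level.
EVA_ALL_SCREENSHOTS: True  # Include all screenshots, not just the first and last one.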

Evaluation Inputs

The EvaluationAgent takes the following inputs for evaluation:

| Input | Description | Type |
| --- | --- | --- |
| User Request | The user's request to be evaluated. | String |
| APIs Description | The description of the APIs used in the execution. | List of Strings |
| Action Trajectories | The action trajectories executed by the HostAgent and AppAgent. | List of Strings |
| Screenshots | The screenshots captured during the execution. | List of Images |

For more details on how to construct the inputs, please refer to the EvaluationAgentPrompter class in ufo/prompter/eva_prompter.py.
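As a rough illustration, the sketch below bundles these four inputs into a plain Python dictionary. The field names, API entries, and file paths are illustrative assumptions only; the real message is assembled by the EvaluationAgentPrompter.

# Illustrative only: the actual message is built by EvaluationAgentPrompter.
evaluation_inputs = {
    "request": "Send 'hello' to Zac on Microsoft Teams.",
    "api_descriptions": [        # hypothetical API descriptions
        "click_input: click a control with the mouse",
        "keyboard_input: type text into the focused control",
    ],
    "action_trajectories": [     # one entry per executed step
        "Step 1: focus the chat window of Zac.",
        "Step 2: type 'hello' and click the Send button.",
    ],
    "screenshots": [             # paths to the captured screenshots
        "logs/teams_hello_task/action_step1.png",
        "logs/teams_hello_task/action_step2.png",
    ],
}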

Tip

You can configure whether to use all screenshots or only the first and last screenshots for evaluation via the EVA_ALL_SCREENSHOTS option in the config_dev.yaml file.

Evaluation Outputs

The EvaluationAgent generates the following outputs after evaluation:

| Output | Description | Type |
| --- | --- | --- |
| reason | The detailed reason for the judgment, based on the observed screenshot differences and the action trajectories. | String |
| sub_scores | The sub-scores of the evaluation, decomposing the evaluation into multiple sub-goals. | List of Dictionaries |
| complete | The completion status of the evaluation; can be yes, no, or unsure. | String |

Below is an example of the evaluation output:

{
    "reason": "The agent successfully completed the task of sending 'hello' to Zac on Microsoft Teams. The initial screenshot shows the Microsoft Teams application with the chat window of Chaoyun Zhang open. The agent then focused on the chat window, input the message 'hello', and clicked the Send button. The final screenshot confirms that the message 'hello' was sent to Zac.",
    "sub_scores": {
        "correct application focus": "yes",
        "correct message input": "yes",
        "message sent successfully": "yes"
    },
    "complete": "yes"
}

Info

The log of the evaluation results will be saved in the logs/{task_name}/evaluation.log file.

The EvaluationAgent employs a Chain-of-Thought (CoT) mechanism to first decompose the evaluation into multiple sub-goals and then evaluate each sub-goal separately. The sub-scores are then aggregated to determine the overall completion status of the evaluation.
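The aggregation itself is performed by the LLM as part of the same response, but the underlying idea can be sketched as a simple conservative rule: the task only counts as complete if every sub-goal succeeded. The function below is illustrative and is not UFO's implementation.

# Illustrative only: a conservative aggregation rule over sub-scores.
# In UFO, the LLM itself produces the overall "complete" field.
def aggregate_sub_scores(sub_scores: dict) -> str:
    judgments = set(sub_scores.values())
    if judgments == {"yes"}:
        return "yes"      # every sub-goal succeeded
    if "no" in judgments:
        return "no"       # at least one sub-goal clearly failed
    return "unsure"       # mixed or uncertain judgments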

Reference

Bases: BasicAgent

The agent for evaluation.

Initialize the EvaluationAgent. :param name: The name of the agent. :param app_root_name: The root name of the application. :param is_visual: The flag indicating whether the agent is visual or not. :param main_prompt: The main prompt template. :param example_prompt: The example prompt template. :param api_prompt: The API prompt template.

Source code in agents/agent/evaluation_agent.py
def __init__(
    self,
    name: str,
    app_root_name: str,
    is_visual: bool,
    main_prompt: str,
    example_prompt: str,
    api_prompt: str,
):
    """
    Initialize the EvaluationAgent.
    :param name: The name of the agent.
    :param app_root_name: The root name of the application.
    :param is_visual: The flag indicating whether the agent is visual or not.
    :param main_prompt: The main prompt template.
    :param example_prompt: The example prompt template.
    :param api_prompt: The API prompt template.
    """

    super().__init__(name=name)

    self._app_root_name = app_root_name
    self.prompter = self.get_prompter(
        is_visual,
        main_prompt,
        example_prompt,
        api_prompt,
        app_root_name,
    )
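A hedged sketch of constructing the agent directly is shown below; the import path is inferred from the source location above, and the prompt file paths are placeholders rather than UFO's actual configuration values.

# Illustrative only: the import path is inferred from agents/agent/evaluation_agent.py,
# and the prompt paths are placeholders, not UFO's real configuration values.
from ufo.agents.agent.evaluation_agent import EvaluationAgent

eva_agent = EvaluationAgent(
    name="eva_agent",
    app_root_name="Teams.exe",
    is_visual=True,
    main_prompt="path/to/evaluation_prompt.yaml",
    example_prompt="path/to/evaluation_examples.yaml",
    api_prompt="path/to/api_prompt.yaml",
)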

status_manager: EvaluatonAgentStatus property

Get the status manager.

evaluate(request, log_path, eva_all_screenshots=True)

Evaluate the task completion.

Parameters:
  • request (str) –

    The request to be evaluated.

  • log_path (str) –

    The path to the log file.

  • eva_all_screenshots (bool, default: True ) –

    The flag indicating whether to evaluate all screenshots.

Returns:
  • Tuple[Dict[str, str], float]

    The evaluation result and the cost of the LLM call.

Source code in agents/agent/evaluation_agent.py
def evaluate(
    self, request: str, log_path: str, eva_all_screenshots: bool = True
) -> Tuple[Dict[str, str], float]:
    """
    Evaluate the task completion.
    :param request: The request to be evaluated.
    :param log_path: The path to the log file.
    :param eva_all_screenshots: The flag indicating whether to evaluate all screenshots.
    :return: The evaluation result and the cost of LLM.
    """

    message = self.message_constructor(
        log_path=log_path, request=request, eva_all_screenshots=eva_all_screenshots
    )
    result, cost = self.get_response(
        message=message, namescope="app", use_backup_engine=True
    )

    result = json_parser(result)

    return result, cost
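Continuing the sketch above, evaluate can then be invoked on a finished session's log folder; the request string and log path below are placeholders.

# Illustrative only: the log path points to a finished session's log folder.
result, cost = eva_agent.evaluate(
    request="Send 'hello' to Zac on Microsoft Teams.",
    log_path="logs/teams_hello_task/",
    eva_all_screenshots=True,
)
print(result.get("complete"), cost)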

get_prompter(is_visual, prompt_template, example_prompt_template, api_prompt_template, root_name=None)

Get the prompter for the agent.

Source code in agents/agent/evaluation_agent.py
def get_prompter(
    self,
    is_visual,
    prompt_template: str,
    example_prompt_template: str,
    api_prompt_template: str,
    root_name: Optional[str] = None,
) -> EvaluationAgentPrompter:
    """
    Get the prompter for the agent.
    """

    return EvaluationAgentPrompter(
        is_visual=is_visual,
        prompt_template=prompt_template,
        example_prompt_template=example_prompt_template,
        api_prompt_template=api_prompt_template,
        root_name=root_name,
    )

message_constructor(log_path, request, eva_all_screenshots=True)

Construct the message.

Parameters:
  • log_path (str) –

    The path to the log file.

  • request (str) –

    The request.

  • eva_all_screenshots (bool, default: True ) –

    The flag indicating whether to evaluate all screenshots.

Returns:
  • Dict[str, Any]

    The message.

Source code in agents/agent/evaluation_agent.py
def message_constructor(
    self, log_path: str, request: str, eva_all_screenshots: bool = True
) -> Dict[str, Any]:
    """
    Construct the message.
    :param log_path: The path to the log file.
    :param request: The request.
    :param eva_all_screenshots: The flag indicating whether to evaluate all screenshots.
    :return: The message.
    """

    evaagent_prompt_system_message = self.prompter.system_prompt_construction()

    evaagent_prompt_user_message = self.prompter.user_content_construction(
        log_path=log_path, request=request, eva_all_screenshots=eva_all_screenshots
    )

    evaagent_prompt_message = self.prompter.prompt_construction(
        evaagent_prompt_system_message, evaagent_prompt_user_message
    )

    return evaagent_prompt_message
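If you only want to inspect the prompt that would be sent to the LLM, message_constructor can be called on its own with the same arguments as evaluate; the values below are the same placeholders as in the earlier sketch.

# Illustrative only: build the evaluation prompt without calling the LLM.
message = eva_agent.message_constructor(
    log_path="logs/teams_hello_task/",
    request="Send 'hello' to Zac on Microsoft Teams.",
    eva_all_screenshots=True,
)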

print_response(response_dict)

Print the response of the evaluation.

Parameters:
  • response_dict (Dict[str, Any]) –

    The response dictionary.

Source code in agents/agent/evaluation_agent.py
def print_response(self, response_dict: Dict[str, Any]) -> None:
    """
    Print the response of the evaluation.
    :param response_dict: The response dictionary.
    """

    emoji_map = {
        "yes": "✅",
        "no": "❌",
        "maybe": "❓",
    }

    complete = emoji_map.get(
        response_dict.get("complete"), response_dict.get("complete")
    )

    sub_scores = response_dict.get("sub_scores", {})
    reason = response_dict.get("reason", "")

    print_with_color(f"Evaluation result🧐:", "magenta")
    print_with_color(f"[Sub-scores📊:]", "green")

    for score, evaluation in sub_scores.items():
        print_with_color(
            f"{score}: {emoji_map.get(evaluation, evaluation)}", "green"
        )

    print_with_color(
        "[Task is complete💯:] {complete}".format(complete=complete), "cyan"
    )

    print_with_color(f"[Reason🤔:] {reason}".format(reason=reason), "blue")

process_comfirmation()

Confirmation; currently does nothing.

Source code in agents/agent/evaluation_agent.py
def process_comfirmation(self) -> None:
    """
    Confirmation; currently does nothing.
    """
    pass