EvaluationAgent

The EvaluationAgent evaluates whether a Session or Round has been successfully completed by assessing the performance of the HostAgent and AppAgent in fulfilling user requests. Configuration options are available in config/ufo/system.yaml. For more details, refer to the System Configuration Guide.

The EvaluationAgent is fully LLM-driven and conducts evaluations based on action trajectories and screenshots. Since LLM-based evaluation may not be 100% accurate, the results should be used as guidance rather than absolute truth.

Evaluation Process

Configuration

Configure the EvaluationAgent in config/ufo/system.yaml:

| Configuration Option | Description | Type | Default Value |
| --- | --- | --- | --- |
| EVA_SESSION | Whether to evaluate the entire session. | Boolean | True |
| EVA_ROUND | Whether to evaluate each round. | Boolean | False |
| EVA_ALL_SCREENSHOTS | Whether to include all screenshots in the evaluation. If False, only the first and last screenshots are used. | Boolean | True |
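
As a concrete example, the relevant excerpt of config/ufo/system.yaml might look like the snippet below. The keys and defaults mirror the table above; check your own configuration file for the authoritative values.

```yaml
# Excerpt of config/ufo/system.yaml -- values shown are the documented defaults.
EVA_SESSION: True          # Evaluate the whole session after it finishes
EVA_ROUND: False           # Do not evaluate each individual round
EVA_ALL_SCREENSHOTS: True  # Include every captured screenshot in the evaluation
```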

Evaluation Process

The EvaluationAgent uses a Chain-of-Thought (CoT) mechanism to:

  1. Decompose the evaluation into multiple sub-goals based on the user request
  2. Evaluate each sub-goal separately
  3. Aggregate the sub-scores to determine the overall completion status

graph TD
    A[User Request] --> B[EvaluationAgent]
    C[Action Trajectories] --> B
    D[Screenshots] --> B
    E[APIs Description] --> B
    B --> F[CoT: Decompose into Sub-goals]
    F --> G[Evaluate Sub-goal 1]
    F --> H[Evaluate Sub-goal 2]
    F --> I[Evaluate Sub-goal N]
    G --> J[Aggregate Sub-scores]
    H --> J
    I --> J
    J --> K{Overall Completion Status}
    K -->|yes| L[Task Completed]
    K -->|no| M[Task Failed]
    K -->|unsure| N[Uncertain Result]
    B --> O[Generate Detailed Reason]
    O --> P[Evaluation Report]
    J --> P
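
The aggregation itself is performed by the LLM inside the prompt, but the minimal sketch below illustrates the intended all-or-nothing rule over the sub_scores list. The helper name and the exact rule are illustrative assumptions, not part of the UFO codebase.

```python
from typing import Dict, List


def aggregate_sub_scores(sub_scores: List[Dict[str, str]]) -> str:
    """Reduce per-sub-goal evaluations to an overall status (illustrative rule)."""
    evaluations = [score.get("evaluation", "unsure") for score in sub_scores]
    if evaluations and all(e == "yes" for e in evaluations):
        return "yes"
    if any(e == "no" for e in evaluations):
        return "no"
    return "unsure"


# The sub-scores from the example output further below aggregate to "yes".
print(aggregate_sub_scores([
    {"name": "correct application focus", "evaluation": "yes"},
    {"name": "correct message input", "evaluation": "yes"},
    {"name": "message sent successfully", "evaluation": "yes"},
]))
```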

Inputs

The EvaluationAgent takes the following inputs:

| Input | Description | Type |
| --- | --- | --- |
| User Request | The user's request to be evaluated. | String |
| APIs Description | Description of the APIs (tools) used during execution. | String |
| Action Trajectories | Action trajectories executed by the HostAgent and AppAgent, including subtask, step, observation, thought, plan, comment, action, and application. | List of Dictionaries |
| Screenshots | Screenshots captured during execution. | List of Images |

The input construction is handled by the EvaluationAgentPrompter class in ufo/prompter/eva_prompter.py.
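
For orientation, a single action-trajectory entry might look like the dictionary below. The field names come from the table above, while the values (and the format of the action string) are purely illustrative.

```python
# Illustrative trajectory entry -- field names from the table above, values made up.
trajectory_step = {
    "subtask": "Send the message 'hello' to Zac on Microsoft Teams",
    "step": 2,
    "observation": "The Teams chat window with Zac is in focus.",
    "thought": "The message box is available; type the text and send it.",
    "plan": "Type 'hello' into the message box, then click the Send button.",
    "comment": "Proceeding to input the message.",
    "action": "set_edit_text(text='hello')",  # action string format is illustrative
    "application": "Microsoft Teams",
}
```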

Outputs

The EvaluationAgent generates the following outputs:

| Output | Description | Type |
| --- | --- | --- |
| reason | Detailed reasoning for the judgment, based on screenshot analysis and the execution trajectory. | String |
| sub_scores | List of sub-scoring points evaluating different aspects of the task. Each sub-score contains a name and an evaluation result. | List of Dictionaries |
| complete | Overall completion status: yes, no, or unsure. | String |

Example output:

{
    "reason": "The agent successfully completed the task of sending 'hello' to Zac on Microsoft Teams. 
    The initial screenshot shows the Microsoft Teams application with the chat window of Chaoyun Zhang open. 
    The agent then focused on the chat window, input the message 'hello', and clicked the Send button. 
    The final screenshot confirms that the message 'hello' was sent to Zac.", 
    "sub_scores": [
        { "name": "correct application focus", "evaluation": "yes" }, 
        { "name": "correct message input", "evaluation": "yes" }, 
        { "name": "message sent successfully", "evaluation": "yes" }
    ], 
    "complete": "yes"
}

Evaluation logs are saved in logs/{task_name}/evaluation.log.
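
To inspect a result programmatically, a small sketch such as the following can be used. It assumes evaluation.log contains a single JSON object in the format shown above, which may not hold for every UFO version.

```python
import json
from pathlib import Path


def load_evaluation(task_name: str) -> dict:
    """Load the evaluation record for a task (assumes one JSON object per file)."""
    log_file = Path("logs") / task_name / "evaluation.log"
    return json.loads(log_file.read_text(encoding="utf-8"))


# "send_hello_to_zac" is an illustrative task name.
result = load_evaluation("send_hello_to_zac")
print(result["complete"], "-", result["reason"][:80])
```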

See Also

Reference

Bases: BasicAgent

The agent for evaluation.

Initialize the EvaluationAgent. :param name: The name of the agent. :param is_visual: The flag indicating whether the agent is visual or not. :param main_prompt: The main prompt template. :param example_prompt: The example prompt template.

Source code in agents/agent/evaluation_agent.py
def __init__(
    self,
    name: str,
    is_visual: bool,
    main_prompt: str,
    example_prompt: str,
):
    """
    Initialize the FollowAgent.
    :agent_type: The type of the agent.
    :is_visual: The flag indicating whether the agent is visual or not.
    """

    super().__init__(name=name)

    self.prompter = self.get_prompter(
        is_visual,
        main_prompt,
        example_prompt,
    )

    # Initialize presenter for output formatting
    self.presenter = RichPresenter()

status_manager property

Get the status manager.

evaluate(request, log_path, eva_all_screenshots=True, context=None)

Evaluate the task completion.

Parameters:
  • request (str) –

    The user request to evaluate against the execution log.

  • log_path (str) –

    The path to the log file.

  • eva_all_screenshots (bool, default: True ) –

    Whether to include all screenshots in the evaluation.

  • context (Optional[Context], default: None ) –

    The shared context of the session, if available.

Returns:
  • Tuple[Dict[str, str], float]

    The evaluation result and the cost of the LLM call.

Source code in agents/agent/evaluation_agent.py
def evaluate(
    self,
    request: str,
    log_path: str,
    eva_all_screenshots: bool = True,
    context: Optional[Context] = None,
) -> Tuple[Dict[str, str], float]:
    """
    Evaluate the task completion.
    :param log_path: The path to the log file.
    :return: The evaluation result and the cost of LLM.
    """

    self.context_provision(context)

    message = self.message_constructor(
        log_path=log_path, request=request, eva_all_screenshots=eva_all_screenshots
    )
    result, cost = self.get_response(
        message=message, namescope="EVALUATION_AGENT", use_backup_engine=True
    )

    result = json_parser(result)

    return result, cost
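
In normal operation the session and round logic constructs and invokes the EvaluationAgent, but a hedged standalone sketch is shown below. The prompt template paths and the task folder are assumptions made for illustration, not the real configuration values.

```python
# Standalone sketch -- template paths and log folder are illustrative assumptions.
evaluation_agent = EvaluationAgent(
    name="evaluation_agent",
    is_visual=True,
    main_prompt="ufo/prompts/evaluation/visual/evaluate.yaml",
    example_prompt="ufo/prompts/evaluation/visual/example.yaml",
)

result, cost = evaluation_agent.evaluate(
    request="Send the message 'hello' to Zac on Microsoft Teams.",
    log_path="logs/send_hello_to_zac/",
    eva_all_screenshots=True,
)

evaluation_agent.print_response(result)  # pretty-print via RichPresenter
print(f"LLM cost: {cost}")
```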

get_prompter(is_visual, prompt_template, example_prompt_template)

Get the prompter for the agent.

Source code in agents/agent/evaluation_agent.py
def get_prompter(
    self,
    is_visual,
    prompt_template: str,
    example_prompt_template: str,
) -> EvaluationAgentPrompter:
    """
    Get the prompter for the agent.
    """

    return EvaluationAgentPrompter(
        is_visual=is_visual,
        prompt_template=prompt_template,
        example_prompt_template=example_prompt_template,
    )

message_constructor(log_path, request, eva_all_screenshots=True)

Construct the message.

Parameters:
  • log_path (str) –

    The path to the log file.

  • request (str) –

    The user request to evaluate.

  • eva_all_screenshots (bool, default: True ) –

    The flag indicating whether to evaluate all screenshots.

Returns:
  • Dict[str, Any]

    The message.

Source code in agents/agent/evaluation_agent.py
def message_constructor(
    self, log_path: str, request: str, eva_all_screenshots: bool = True
) -> Dict[str, Any]:
    """
    Construct the message.
    :param log_path: The path to the log file.
    :param request: The request.
    :param eva_all_screenshots: The flag indicating whether to evaluate all screenshots.
    :return: The message.
    """

    evaagent_prompt_system_message = self.prompter.system_prompt_construction()

    evaagent_prompt_user_message = self.prompter.user_content_construction(
        log_path=log_path, request=request, eva_all_screenshots=eva_all_screenshots
    )

    evaagent_prompt_message = self.prompter.prompt_construction(
        evaagent_prompt_system_message, evaagent_prompt_user_message
    )

    return evaagent_prompt_message

print_response(response_dict)

Pretty-print the evaluation response using RichPresenter.

Parameters:
  • response_dict (Dict[str, str]) –

    The response dictionary.

Source code in agents/agent/evaluation_agent.py
def print_response(self, response_dict: Dict[str, str]) -> None:
    """
    Pretty-print the evaluation response using RichPresenter.
    :param response_dict: The response dictionary.
    """
    # Convert dict to EvaluationAgentResponse object
    response = EvaluationAgentResponse(**response_dict)

    # Delegate to presenter
    self.presenter.present_evaluation_agent_response(response)

process_confirmation()

Confirmation step; currently does nothing.

Source code in agents/agent/evaluation_agent.py
def process_confirmation(self) -> None:
    """
    Confirmation step; currently does nothing.
    """
    pass