How to evaluate an LLM agent?

· 7 min read

The challenges

Evaluating the performance of an LLM agent is nontrivial. Existing evaluation methods typically treat the LLM agent as a function that maps input data to output data. If the agent is evaluated on a multi-step task, the evaluation process resembles a chain of calls to a stateful function. To judge the agent's output, it is typically compared with a ground truth or a reference output. Because the output is in natural language, the comparison is usually done by matching keywords or phrases in the output against the ground truth.

This evaluation method is limited by its rigidity. Keyword matching is often a poor fit for evaluating the agent's output, especially when the output is long and complex. For example, if the answer is a date or a number, keyword matching may not handle the different formats in which it can be written. Moreover, the evaluation method should act more like a human, who can understand the context and the meaning of the output. For example, when different agents are asked to perform the same task, they may behave differently but still produce correct outputs.
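The date-format limitation can be made concrete with a small sketch. The function names and the list of date formats below are hypothetical and chosen only for illustration; they are not part of TaskWeaver or of any particular evaluation tool.

```python
from datetime import datetime

# Hypothetical, non-exhaustive list of date formats to try.
DATE_FORMATS = ("%Y-%m-%d", "%B %d, %Y", "%d/%m/%Y")


def keyword_match(output: str, expected: str) -> bool:
    """Rigid check: the expected string must appear verbatim in the output."""
    return expected.lower() in output.lower()


def parse_date(text: str):
    """Try each known format and return a date, or None if none of them fit."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    return None


def date_match(output: str, expected: str) -> bool:
    """Compare the parsed dates instead of the raw strings."""
    parsed_output = parse_date(output)
    parsed_expected = parse_date(expected)
    return parsed_output is not None and parsed_output == parsed_expected


ground_truth = "2024-03-05"
agent_answer = "March 5, 2024"

print(keyword_match(agent_answer, ground_truth))  # False: the strings differ
print(date_match(agent_answer, ground_truth))     # True: the dates are the same
```

Normalizing before comparing handles this one case, but every new answer type needs its own normalizer, which is part of why a more human-like judge is attractive.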

Experience selection

· 4 min read

We introduced the motivation for the experience module in Experience and showed how to create a handcrafted experience in Handcrafted Experience. In this blog post, we discuss a more advanced topic for the experience module: experience selection.

Static experience selection

Every role in TaskWeaver can have its own experience directory, which is configured by setting the role_name.experience_dir field in the project configuration file. For the Planner and CodeInterpreter roles, set the planner.experience_dir and code_generator.experience_dir fields respectively. The default experience directory is experience in the project directory.

info

The role name is by default the name of the implementation file (without the extension) of the role unless you have specified the role name by calling _set_name in the implementation file.

By configuring different experience directories for different roles, you can give each role its own experiences in a static way. Taking the Planner role as an example, you can use the following project configuration file to enable experience selection for the Planner role.

{
  "planner.use_experience": true,
  "planner.experience_dir": "planner_exp_dir"
}
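The lookup rule described above (a per-role experience_dir field with a default of experience) can be sketched as follows. This is not TaskWeaver's implementation: resolve_experience_dir is a hypothetical helper, and the sketch assumes that a configured directory is resolved relative to the project directory.

```python
import os

# Default directory name stated in the text.
DEFAULT_EXPERIENCE_DIR = "experience"


def resolve_experience_dir(config: dict, role_name: str, project_dir: str) -> str:
    """Hypothetical helper: read `<role_name>.experience_dir`, else use the default.

    Assumption: the configured value is a path relative to the project directory.
    """
    sub_dir = config.get(f"{role_name}.experience_dir", DEFAULT_EXPERIENCE_DIR)
    return os.path.join(project_dir, sub_dir)


config = {
    "planner.use_experience": True,
    "planner.experience_dir": "planner_exp_dir",
}

# The planner has its own directory; code_generator falls back to the default.
print(resolve_experience_dir(config, "planner", "project"))
print(resolve_experience_dir(config, "code_generator", "project"))
```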

Run TaskWeaver with Locally Deployed Not-that-Large Language Models

· 6 min read

info

The feature introduced in this blog post can cause incompatibility issues with previous versions of TaskWeaver if you have customized the examples for the planner and code interpreter. The issue is easy to fix by updating the examples to the new schema. Please refer to the How we implemented the constrained generation in TaskWeaver section for more details.

Motivation

We've seen many issues raised about the difficulty of running TaskWeaver with locally deployed not-that-large language models (LLMs), such as 7b or 13b models. When we examined these issues, we found that the main problem is that the models fail to generate responses that follow the formatting instructions in the prompt. For instance, we have seen planner responses that do not contain a send_to field, which is required to determine the recipient of the message.

In the past, we tried to address this issue by adding more examples to the prompt, which did not work well, especially for these relatively small models. Another idea was to ask the model to regenerate the response if it did not follow the format, including the format error in the prompt to help the model understand and correct it. However, this approach did not work well either.
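For reference, the regenerate-on-error idea can be sketched roughly as below. This is an illustration and not the TaskWeaver code. It assumes the response is a JSON object, uses a stub in place of a real model, and validate_response, generate_with_retry, and make_stub_model are all hypothetical names.

```python
import json
from typing import Callable, Optional

# Only send_to is named in the text; other fields are not checked here.
REQUIRED_FIELDS = ("send_to",)


def validate_response(raw: str) -> Optional[str]:
    """Return a description of the format error, or None if the response is valid."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return f"response is not valid JSON: {exc.msg}"
    if not isinstance(data, dict):
        return "response must be a JSON object"
    for field in REQUIRED_FIELDS:
        if field not in data:
            return f"missing required field: {field}"
    return None


def generate_with_retry(
    model: Callable[[str], str], prompt: str, max_retries: int = 2
) -> Optional[str]:
    """Call the model, feeding any format error back into the next prompt."""
    current_prompt = prompt
    for _ in range(max_retries + 1):
        raw = model(current_prompt)
        error = validate_response(raw)
        if error is None:
            return raw
        # Include the format error so the model can try to correct it.
        current_prompt = (
            f"{prompt}\n\nYour previous response was invalid ({error}). "
            "Please respond again in the required format."
        )
    return None  # every attempt failed validation


def make_stub_model() -> Callable[[str], str]:
    """Stub model: the first reply lacks send_to, the second one is valid."""
    replies = iter(['{"message": "hi"}', '{"send_to": "User", "message": "hi"}'])
    return lambda prompt: next(replies)


result = generate_with_retry(make_stub_model(), "Plan the task.")
print(result)
```

The stub succeeds on its second attempt. A small model that keeps repeating the same mistake exhausts the retries, which matches the experience described above.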

Plugins In-Depth

· 5 min read

Pre-requisites: Please refer to the Introduction and the Plugin Development pages for a better understanding of the plugin concept and its development process.

Plugin Basics

In TaskWeaver, the plugins are the essential components to extend the functionality of the agent. Specifically, a plugin is a piece of code wrapped in a class that can be called as a function by the agent in the generated code snippets. The following is a simple example of a plugin that generates n random numbers:

from taskweaver.plugin import Plugin, register_plugin

@register_plugin
class RandomGenerator(Plugin):
    def __call__(self, n: int):
        import random
        return [random.randint(1, 100) for _ in range(n)]

In this example, the RandomGenerator class inherits from the Plugin class and implements the __call__ method, which means it can be called as a function. What is the function signature of the plugin? It is defined in the associated YAML file. For example, the YAML file for the RandomGenerator plugin is as follows:

name: random_generator
enabled: true
required: true
description: >-
  This plugin generates n random numbers between 1 and 100.
examples: |-
  result = random_generator(n=5)
parameters:
  - name: n
    type: int
    required: true
    description: >-
      The number of random numbers to generate.

returns:
  - name: result
    type: list
    description: >-
      The list of random numbers.

The YAML file specifies the name, description, parameters, and return values of the plugin. When the LLM generates the code snippets, it uses the information in the YAML file to generate the function signature. We do not check for discrepancies between the function signature in the Python implementation and the one in the YAML file, so it is important to keep them consistent. The examples field provides the LLM with examples of how to use the plugin.
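Since the two signatures are not checked against each other, a small check of your own can catch drift. The sketch below is not part of TaskWeaver: find_signature_mismatches is a hypothetical helper, the plugin class is a stand-in without the TaskWeaver base class, and the YAML parameters are written as a plain list of dictionaries, as they might look after parsing.

```python
import inspect


class RandomGenerator:
    """Stand-in for the plugin above, without the TaskWeaver base class."""

    def __call__(self, n: int):
        import random
        return [random.randint(1, 100) for _ in range(n)]


# The `parameters` section of the YAML file, as it might look after parsing.
yaml_parameters = [{"name": "n", "type": "int", "required": True}]


def find_signature_mismatches(plugin_cls, parameters):
    """Hypothetical check: compare __call__ argument names with the YAML names."""
    signature = inspect.signature(plugin_cls.__call__)
    code_names = {name for name in signature.parameters if name != "self"}
    yaml_names = {param["name"] for param in parameters}
    missing_in_yaml = sorted(code_names - yaml_names)
    missing_in_code = sorted(yaml_names - code_names)
    return missing_in_yaml, missing_in_code


print(find_signature_mismatches(RandomGenerator, yaml_parameters))  # ([], [])
```

This compares parameter names only. Checking types and return values would need more assumptions about how the YAML type names map to Python annotations.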

Roles in TaskWeaver

· 5 min read

We frame TaskWeaver as a code-first agent framework. The term "code-first" means that the agent is designed to convert the user's request into one or more runnable code snippets and then execute them to generate the response. The philosophy behind this design is to treat programming languages as the de facto language for communication in cyber-physical systems, just as natural language is for human communication. Therefore, TaskWeaver translates the user's request from natural language into a programming language, which the system can execute to perform the desired tasks.

Under this design, when the developer needs to extend the agent's capability, they can write a new plugin. A plugin is a piece of code wrapped in a class that can be called as a function by the agent in the generated code snippets. Let's consider an example: the agent is asked to load a CSV file and perform anomaly detection on the data. The workflow of the agent is shown in the diagram below. It is natural to represent the data to be processed as variables and the task as code snippets.
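To give a feel for what such generated code might look like, here is a minimal, self-contained sketch. It is not actual TaskWeaver output: the inline CSV data, the function names, and the z-score threshold are all assumptions made for illustration, and a real agent would more likely call a plugin for the anomaly detection step.

```python
import csv
import io
import statistics

# Hypothetical inline data standing in for the user's CSV file.
CSV_TEXT = """timestamp,value
1,10.0
2,10.5
3,9.8
4,10.2
5,50.0
6,10.1
"""


def load_values(csv_text: str) -> list:
    """Step 1: load the CSV data into a variable."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [float(row["value"]) for row in reader]


def detect_anomalies(values: list, threshold: float = 2.0) -> list:
    """Step 2: return indices whose z-score exceeds the threshold."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []
    return [i for i, v in enumerate(values) if abs(v - mean) / stdev > threshold]


values = load_values(CSV_TEXT)        # the data lives in a variable
anomalies = detect_anomalies(values)  # the task is expressed as code
print(anomalies)  # [4]
```

The intermediate result values stays available as a variable, so a later snippet can reuse it without reloading the file.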