How to evaluate an LLM agent?
The challenges
Evaluating the performance of an LLM agent is nontrivial. Existing evaluation methods typically treat the agent as a function that maps input data to output data. When the agent is evaluated on a multi-step task, the evaluation process becomes a chain of calls to a stateful function. To judge the agent's output, it is typically compared against a ground truth or reference output. Since the output is in natural language, the comparison is usually done by matching keywords or phrases in the output against the ground truth.
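As a concrete illustration, here is a minimal sketch of this conventional setup. The names (`Agent`, `step`, `keyword_match`, `evaluate_multi_step`) are assumptions for illustration, not an actual framework API: the agent is driven as a stateful function over multiple steps, and its final natural-language output is judged by keyword matching against a reference.

```python
from typing import Protocol


class Agent(Protocol):
    """Hypothetical stateful agent interface; real interfaces will differ."""
    def step(self, observation: str) -> str: ...


def keyword_match(output: str, keywords: list[str]) -> bool:
    """Judge the output by checking that every reference keyword appears in it."""
    return all(kw.lower() in output.lower() for kw in keywords)


def evaluate_multi_step(agent: Agent, observations: list[str], keywords: list[str]) -> bool:
    """Drive the agent through a multi-step task as a chain of stateful calls,
    then judge the final output against the reference keywords."""
    output = ""
    for obs in observations:
        output = agent.step(obs)  # the agent keeps its own internal state between calls
    return keyword_match(output, keywords)
```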
This evaluation method is limited by its rigid nature. Keyword matching is often a poor judge of the agent's output, especially when the output is long and complex. For example, if the answer is a date or a number, keyword matching may fail to handle different formats of the same value (see the sketch below). Ideally, the evaluator should act more like a human, who can understand the context and the meaning of the output. For example, when different agents are asked to perform the same task, they may behave differently yet still produce correct outputs.
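The date example can be made concrete. In the hypothetical sketch below (the answers and the reference keyword are made up for illustration), two outputs express the same date, but strict keyword matching accepts only the one whose format happens to match the reference:

```python
def keyword_match(output: str, keywords: list[str]) -> bool:
    return all(kw.lower() in output.lower() for kw in keywords)

# Two agents answer the same question correctly, but in different formats.
answer_a = "The paper was published on 2021-03-05."
answer_b = "The paper was published on March 5, 2021."

reference_keywords = ["2021-03-05"]

print(keyword_match(answer_a, reference_keywords))  # True
print(keyword_match(answer_b, reference_keywords))  # False: same date, different format
```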