LLF-Bench: Benchmark for Interactive Learning from Language Feedback

What is LLF-Bench? LLF-Bench (Learning from Language Feedback Benchmark; pronounced “elf bench”) is a new benchmark for evaluating the ability of AI agents to learn interactively from language feedback alone. In LLF-Bench, the agent interacts with an environment: it takes an action and receives language feedback instead of rewards or expert actions. LLF-Bench consists of 8 diverse problem sets.

# Clone the LLF-Bench code
git clone https://github.com/microsoft/LLF-Bench.git

# Optional but recommended: create a conda environment.
conda create -n LLF-Bench python=3.8 -y
conda activate LLF-Bench

# Install LLF-Bench
cd LLF-Bench
pip install -e .

# Installing Alfworld and Metaworld requires additional resources. See the GitHub repository for details.

How is it different from RL? Reinforcement learning (RL) is another commonly studied interactive learning setting. The key difference is that in RL the agent is trained using rewards, whereas in LLF (the paradigm on which LLF-Bench is based) the agent learns from language feedback instead of rewards.
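
The contrast between the two signals can be illustrated with a toy example. The sketch below is not part of LLF-Bench: the `GuessNumberEnv` class, its `step_rl` and `step_llf` methods, and the feedback wording are all hypothetical, chosen only to show how a scalar reward and a language feedback string differ in the information they carry.

```python
import random


class GuessNumberEnv:
    """Toy number-guessing environment (hypothetical, not an LLF-Bench environment)."""

    def __init__(self, low=1, high=100, seed=0):
        self.low, self.high = low, high
        self.rng = random.Random(seed)
        self.target = None

    def reset(self):
        # Sample a new hidden target and return a text instruction.
        self.target = self.rng.randint(self.low, self.high)
        return f"Guess an integer between {self.low} and {self.high}."

    def step_rl(self, guess):
        # RL-style signal: a scalar reward that only says right or wrong.
        return 1.0 if guess == self.target else 0.0

    def step_llf(self, guess):
        # LLF-style signal: a sentence that also says how to improve.
        if guess == self.target:
            return "Correct."
        if guess < self.target:
            return "Your guess is too low."
        return "Your guess is too high."


def solve_with_language_feedback(env, max_steps=20):
    """Binary search driven only by the text feedback."""
    env.reset()
    low, high = env.low, env.high
    for step in range(1, max_steps + 1):
        guess = (low + high) // 2
        feedback = env.step_llf(guess)
        if feedback.startswith("Correct"):
            return guess, step
        if "too low" in feedback:
            low = guess + 1
        else:
            high = guess - 1
    return None, max_steps


if __name__ == "__main__":
    env = GuessNumberEnv(seed=0)
    guess, steps = solve_with_language_feedback(env)
    print(f"Found {guess} in {steps} steps using language feedback.")
```

With the 0/1 reward alone, an agent in this toy setting has no directional information and may need to try many values, whereas the directional feedback lets it halve the search range at every step.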

Why language feedback? Language feedback has two main advantages over rewards and expert actions, the two most commonly used forms of feedback. First, unlike rewards, language feedback is very expressive and can therefore carry much more information, which can help the agent learn faster; and unlike expert actions, language feedback can be provided more easily by non-expert humans. Second, language feedback is closer to how humans learn, which makes it more natural in many settings.

Can LLF-Bench be used to evaluate LLMs? Yes! In fact, that is one of the main purposes of LLF-Bench: to robustly evaluate LLM-based agents. There are two reasons to prefer LLF-Bench for such evaluation. First, LLF-Bench provides a diverse set of environments, each of which provides sampled verbalizations of the problem, making prompt hacking harder. Second, LLF-Bench includes environments that require learning, so no matter how good an LLM is, it cannot solve those environments zero-shot. LLF-agents must therefore show signs of learning new information to solve them.
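
To make the idea of sampled verbalizations concrete, here is a minimal sketch assuming a simple template-based scheme. The template lists, the `verbalize` helper, and the `door` slot are hypothetical illustrations, not the paraphrases or the API that LLF-Bench actually uses.

```python
import random

# Hypothetical paraphrases of one instruction and one kind of feedback.
INSTRUCTION_TEMPLATES = [
    "Find the room that contains the treasure.",
    "Your goal is to reach the room with the treasure.",
    "Navigate through the doors until you are in the treasure room.",
]

FEEDBACK_TEMPLATES = [
    "You should go through the {door} door.",
    "Taking the {door} door would bring you closer to the treasure.",
    "Try the {door} door next.",
]


def verbalize(templates, rng, **slots):
    """Pick one paraphrase at random and fill in its named slots."""
    return rng.choice(templates).format(**slots)


if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(3):
        # The same underlying task and hint, worded differently each time.
        print(verbalize(INSTRUCTION_TEMPLATES, rng))
        print(verbalize(FEEDBACK_TEMPLATES, rng, door="north"))
```

Because the wording varies from sample to sample while the underlying task stays the same, a prompt tuned to one exact phrasing is less likely to transfer, which is the intuition behind the robustness claim above.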

LLF Tasks

LLF-Bench includes the following set of 8 problems:

  • LLF-bandit is a verbalized version of the classic multi-armed bandit problem, which we implement based on gym-bandits. LLF-bandit tests the agent's learning ability in an unknown environment with a finite number of actions.
  • LLF-poem consists of a set of poem writing tasks, where the agent needs to write a poem satisfying certain syllable and line constraints. These problems test the agent's ability to infer and solve constraint satisfaction problems.
  • LLF-reco-movie simulates a classic recommendation scenario where a user wants movie or TV show recommendations based on some preferences. The user specifies their preferences in text, and any recommendation made by the agent is matched to a movie database for checking whether the preferences are matched correctly.
  • LLF-optimization consists of 8 loss functions (Rosenbrock, Bohachevsky, etc.) and provides an interface to give verbal feedback for the task of optimization on any loss function.
  • LLF-parking extends the Highway gym environment, providing a long-horizon, goal-conditioned continuous control task. The agent must control an ego-vehicle to park in a given location without colliding with any obstacles in the environment.
  • LLF-gridworld evaluates the agent's ability to navigate in a graph-based environment. Each node of the graph is a room and the edges are doors connecting the rooms. The agent's goal is to navigate from the room it starts in to the room with treasure.
  • LLF-alfworld adds a wrapper on top of the Alfworld text-based environment to provide language feedback instead of reward. In this environment, the agent is tasked to solve problems in a text-based house environment. The agent is tested for generalization as each episode can contain a new task in a new house environment.
  • LLF-metaworld is a low-dimensional state-based version of the existing Meta-World v2 benchmark. It comprises 50 simulated robotic manipulation tasks featuring a Sawyer arm and various objects that this arm needs to bring into desired configurations, such as opening doors, placing cubes in boxes, etc.
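
As an illustration of what verbal feedback for an optimization task such as LLF-optimization might look like, the sketch below uses the standard Rosenbrock function and turns the change in loss and the sign of the gradient into a sentence. The `verbal_feedback` function and its wording are assumptions made for this example; they are not the feedback format that LLF-Bench produces.

```python
def rosenbrock(x, y, a=1.0, b=100.0):
    """Standard Rosenbrock function, with its minimum of 0 at (a, a**2)."""
    return (a - x) ** 2 + b * (y - x ** 2) ** 2


def rosenbrock_grad(x, y, a=1.0, b=100.0):
    """Analytic gradient of the Rosenbrock function."""
    dx = -2.0 * (a - x) - 4.0 * b * x * (y - x ** 2)
    dy = 2.0 * b * (y - x ** 2)
    return dx, dy


def verbal_feedback(prev_point, new_point):
    """Describe the latest move in words (hypothetical feedback wording)."""
    prev_loss = rosenbrock(*prev_point)
    new_loss = rosenbrock(*new_point)
    if new_loss < prev_loss:
        trend = f"The loss decreased from {prev_loss:.3f} to {new_loss:.3f}."
    elif new_loss > prev_loss:
        trend = f"The loss increased from {prev_loss:.3f} to {new_loss:.3f}."
    else:
        trend = f"The loss stayed at {new_loss:.3f}."

    # Moving against the gradient reduces the loss locally, so a positive
    # partial derivative suggests a smaller value and a negative one a larger value.
    hints = []
    for name, g in zip(("x", "y"), rosenbrock_grad(*new_point)):
        if g > 0:
            hints.append(f"try a smaller {name}")
        elif g < 0:
            hints.append(f"try a larger {name}")
    if not hints:
        return trend + " You are at a stationary point."
    return trend + " To reduce it further, " + " and ".join(hints) + "."


if __name__ == "__main__":
    print(verbal_feedback((0.0, 0.0), (0.5, 0.5)))
    print(verbal_feedback((0.0, 0.0), (1.0, 1.0)))
```

The same pattern, a numeric quantity computed by the environment and then rendered as text, could be applied to any of the loss functions; only the function and its gradient would change.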


The LLF-Bench team includes the following people, listed alphabetically. We are also thankful to the many people behind the scenes who provided support and feedback in the form of suggestions, GitHub issues, and reviews.