
Abstract

Reasoning is an essential skill for enabling Large Language Models (LLMs) to interact with the world. As tasks become more complex, they demand increasingly sophisticated and diverse reasoning capabilities for sequential decision-making, requiring structured reasoning over the context history to determine the next best action. We introduce TALES, a diverse collection of synthetic and human-written text-adventure games designed to challenge and evaluate a wide range of reasoning capabilities. We present results for a range of open- and closed-weight LLMs, along with a qualitative analysis of the top-performing models. Despite an impressive showing on synthetic games, even the top LLM-driven agents fail to achieve a score of 12% on games designed for human enjoyment.

Figure 1: An example of an agent playing through a text-adventure game (thought traces are fabricated for illustration). Because of their length and the variety of puzzles required for progression, text-adventure games demand a range of reasoning skills from players to solve roadblocks and continue through the game. Moreover, due to the long-range causal dependencies often found in these games, a single mistake at any step can cause gameplay to break down later on.

Scores for all TextWorld games for the top 10 models

tw_allgames chart

Scores for all TextWorld Express games for the top 10 models

twx_allgames chart

Scores for all ALFWorld games for the top 10 models

alfw_allgames chart

Scores for all ScienceWorld games for the top 10 models

sciencew_allgames chart

Scores for all Jericho games for the top 10 models

jericho_allgames chart

Breakdown of scores per framework

fws chart

Please consider citing the original work!

TextWorld

tw image

TextWorld is a framework originally designed for training Reinforcement Learning agents on text-based games. It can generate synthetic text-adventure games of varying complexity. In TALES, we integrate the "CookingWorld" games that were used as part of the NeurIPS 2018 competition. The task involves following a recipe: the player must find the required ingredients and process them as the recipe instructs. We selected one game per difficulty level, ranging from level 1 (a single location and a one-ingredient recipe) to level 10 (12 locations and a three-ingredient recipe). At all difficulties, the player receives 1 point for each completed sub-goal of the task. Difficulty level 1 can be solved in 7 moves with a maximum score of 3, while level 10 requires 44 moves with a maximum score of 11.
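The interaction pattern described above can be sketched as a simple observation-action loop. The toy environment below is entirely invented for illustration (it is not the TextWorld API); it mimics a difficulty-1 CookingWorld task with one room, a one-ingredient recipe, and 1 point per sub-goal:

```python
# A hypothetical sketch of the agent-environment loop used across TALES:
# the agent reads a text observation, issues a text command, and receives
# an updated observation plus a cumulative score. The environment below is
# a made-up stand-in for a difficulty-1 CookingWorld game.

class ToyCookingGame:
    """One room, one ingredient: take the carrot, cook it, eat the meal."""

    SUBGOALS = ["take carrot", "cook carrot", "eat meal"]  # 1 point each

    def __init__(self):
        self.progress = 0
        self.done = False

    def reset(self) -> str:
        self.progress = 0
        self.done = False
        return "You are in a kitchen. The recipe calls for a carrot."

    def step(self, command: str):
        # Sub-goals must be completed in order, as in the real games.
        if not self.done and command == self.SUBGOALS[self.progress]:
            self.progress += 1
            self.done = self.progress == len(self.SUBGOALS)
            obs = f"Done: {command}."
        else:
            obs = "That didn't seem to work."
        return obs, self.progress, self.done


def play(env, policy, max_steps=7):
    obs = env.reset()
    score, done = 0, False
    for _ in range(max_steps):
        obs, score, done = env.step(policy(obs))
        if done:
            break
    return score


# A scripted "agent" standing in for an LLM-driven policy.
script = iter(ToyCookingGame.SUBGOALS)
final_score = play(ToyCookingGame(), lambda obs: next(script))
print(final_score)  # -> 3, matching the level-1 maximum score
```

In the real benchmark, the policy would be an LLM prompted with the full interaction history rather than a fixed script.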

TextWorld Express

twx image
Generated by GPT-4o.

TextWorld Express is a highly optimized re-implementation of many TextWorld game scenarios that runs approximately three orders of magnitude faster than its TextWorld counterparts. Although token throughput is the main speed bottleneck in most LLM-based applications, we opt for TextWorld Express over TextWorld where applicable for this performance improvement. The trade-off for the speed is a stricter parser: TextWorld Express simplifies its parsing for performance and therefore does not support nearest-neighbor matching of action phrases.
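The parser difference can be illustrated with a small stdlib sketch. The action list and function names are invented, and `difflib.get_close_matches` is just one stand-in for the kind of fuzzy matching a more forgiving parser performs:

```python
# Hypothetical contrast between a strict, exact-match parser (TextWorld
# Express style) and a nearest-neighbor parser that maps near-miss
# phrasings to the closest valid action (TextWorld style).
from difflib import get_close_matches
from typing import Optional

VALID_ACTIONS = ["take red apple", "open fridge", "cook red apple with stove"]

def strict_parse(command: str) -> Optional[str]:
    # Strict: the command must match a valid action exactly.
    return command if command in VALID_ACTIONS else None

def fuzzy_parse(command: str) -> Optional[str]:
    # Fuzzy: map a near-miss phrasing to the closest valid action.
    matches = get_close_matches(command, VALID_ACTIONS, n=1, cutoff=0.6)
    return matches[0] if matches else None

print(strict_parse("take apple"))  # None: rejected outright
print(fuzzy_parse("take apple"))   # "take red apple"
```

With a strict parser, an agent that drops an adjective or reorders a phrase wastes a turn, which is one reason the parser choice matters for evaluation.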

ALFWorld

alfworld image
For TALES, we only make use of the text modality.

ALFWorld is a multi-modal framework combining complementary visual and textual observations, in which agents must navigate and perform tasks in a household setting. All tasks provide only a terminal reward of 1 upon completion. For TALES, we use only the textual modality, as this has become the standard practice when evaluating LLMs on ALFWorld.

The ALFWorld environments are unique in their lack of informative feedback. Whereas other environments return a predefined error message indicating the type of failure, whether the parser did not recognize the command or the action was not possible, ALFWorld has only one error message: 'Nothing happened'. In the original ALFWorld framework, the visual component compensates for this sparse text feedback, but the lack of detail significantly increases the difficulty for agents that rely solely on text-based interactions. This difficulty is compounded by the limitation that an agent in ALFWorld can hold only one object at a time.
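The problem can be made concrete with a toy sketch. Both environments below are invented for illustration (neither is the ALFWorld API); the point is that distinct failure modes collapse into one indistinguishable observation:

```python
# Hypothetical illustration of why ALFWorld's single error message is hard
# for text-only agents: a typo and a full-hands failure produce the exact
# same feedback, so the agent cannot tell which mistake it made.

KNOWN_COMMANDS = {"take mug", "go to desk"}

def informative_env(command: str, inventory: list) -> str:
    # Typical text-game feedback distinguishes failure modes.
    if command not in KNOWN_COMMANDS:
        return f"I don't understand '{command}'."
    if command.startswith("take ") and len(inventory) >= 1:
        return "Your hands are full."  # the one-object limit
    return "OK."

def alfworld_style_env(command: str, inventory: list) -> str:
    # ALFWorld-style: every failure collapses into the same message.
    if command not in KNOWN_COMMANDS or (
        command.startswith("take ") and len(inventory) >= 1
    ):
        return "Nothing happened"
    return "OK."

print(informative_env("take mug", ["book"]))     # "Your hands are full."
print(alfworld_style_env("take mug", ["book"]))  # "Nothing happened"
print(alfworld_style_env("takke mug", []))       # also "Nothing happened"
```

A text-only agent in the second environment must guess whether to fix its spelling or drop what it is holding, which is exactly the ambiguity the visual modality resolves in the original framework.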

ScienceWorld

scienceworld image

ScienceWorld is a framework focused on completing elementary-level science curriculum tasks. Notably, for many of its tasks, ScienceWorld emulates an open-world setting in which the player can complete the task in different ways rather than following a single expected trajectory. For example, heating an object can be accomplished with either the oven in the kitchen or the blast furnace in the workshop. ScienceWorld also gives the player the freedom to reset the game on command. This is especially important because a number of ScienceWorld games have dead states from which it is no longer possible to complete the assigned task in that play-through.
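The two properties above, outcome-based goals and recoverable dead states, can be sketched together. The state representation and action names below are invented for illustration, not ScienceWorld's actual interface:

```python
# Hypothetical sketch of ScienceWorld-style open-endedness: the goal
# condition checks the resulting state, not a fixed action sequence, so
# multiple trajectories (oven vs. blast furnace) both succeed, and a
# reset lets the player escape a dead state.

def run(actions):
    state = {"substance": "ice", "dead": False}
    for act in actions:
        if act in ("heat with oven", "heat with blast furnace"):
            state["substance"] = "water"
        elif act == "pour down drain":
            state["dead"] = True  # the task can no longer be completed
        elif act == "reset":
            state = {"substance": "ice", "dead": False}
    return state

def goal_met(state):
    # Checks the outcome, not the path taken to reach it.
    return state["substance"] == "water" and not state["dead"]

print(goal_met(run(["heat with oven"])))           # True
print(goal_met(run(["heat with blast furnace"])))  # True: alternate path
print(goal_met(run(["pour down drain", "heat with oven"])))           # False
print(goal_met(run(["pour down drain", "reset", "heat with oven"])))  # True
```

An agent that recognizes it has entered a dead state and issues a reset can still finish the episode, which is why the reset command matters for evaluation.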

Jericho

jericho image

Jericho is a suite of 55 human-written interactive fiction games. We consider Jericho the most difficult framework due to the length and complexity of many of its games: some can be completed within 17 steps, while others require over 500. These games also cover an extremely wide range of genres and styles, lacking the consistency of the other text-game suites, which were designed specifically for evaluating agents. For example, '9:05' follows the morning of an ordinary office worker, whereas 'Anchorhead' is a Lovecraftian horror story.

Below is an interactive version of Zork I, one of the classic text adventure games available in the Jericho suite:

Try it yourself! Type commands like "look", "inventory", or "go north" to interact with the game.


All LLM scores can be found below. The charts above show only the top 10 models. All scores are displayed as percentages.

Rank Model Organization Model Type TALES Score
1 o3 (medium) OpenAI Reasoning 58.7%
2 o3 (high) OpenAI Reasoning 58.0%
3 o3 (low) OpenAI Reasoning 54.8%
4 claude-3.7-sonnet Anthropic Reasoning 52.5%
5 claude-3.7-sonnet Anthropic Non-reasoning 52.1%
6 claude-3.5-sonnet-latest Anthropic Non-reasoning 50.4%
7 gpt-4.1 OpenAI Non-reasoning 49.9%
8 o1 OpenAI Reasoning 44.2%
9 gpt-4o OpenAI Non-reasoning 40.6%
10 claude-3.5-haiku Anthropic Non-reasoning 39.6%
11 Llama-3.1-405B-Instruct Meta Non-reasoning 36.4%
12 gemini-2.0-flash Google Non-reasoning 35.0%
13 Qwen3-32B Alibaba (Qwen) Reasoning 34.3%
14 Llama-3.3-70B-Instruct Meta Non-reasoning 32.8%
15 Llama-3.1-70B-Instruct Meta Non-reasoning 32.0%
16 Qwen2.5-72B-Instruct Alibaba (Qwen) Non-reasoning 30.7%
17 Mistral-Large-Instruct-2407 Mistral AI Non-reasoning 30.3%
18 gpt-4.1-mini OpenAI Non-reasoning 27.1%
19 gpt-4o-mini OpenAI Non-reasoning 21.8%
20 Llama-4-Scout-17B-16E-Instruct Meta Non-reasoning 19.8%
21 Llama-4-Maverick-17B-128E-Instruct Meta Non-reasoning 15.5%
22 Mistral-Small-Instruct-2409 Mistral AI Non-reasoning 14.8%
23 Llama-3.1-8B-Instruct Meta Non-reasoning 13.9%
24 DeepSeek-R1 DeepSeek Reasoning 12.4%
25 Qwen2.5-7B-Instruct Alibaba (Qwen) Non-reasoning 11.7%
26 Llama-3.2-3B-Instruct Meta Non-reasoning 10.4%
27 phi-4 Microsoft Non-reasoning 10.3%
28 gpt-4.1-nano OpenAI Non-reasoning 10.0%
29 Mistral-Small-24B-Instruct-2501 Mistral AI Non-reasoning 8.8%
30 DeepSeek-R1-Distill-Llama-70B DeepSeek Reasoning 8.4%
31 Ministral-8B-Instruct-2410 Mistral AI Non-reasoning 4.6%
32 Mistral-Small-3.1-24B-Instruct-2503 Mistral AI Non-reasoning 4.5%
33 Mixtral-8x22B-Instruct-v0.1 Mistral AI Non-reasoning 3.7%
34 Llama-3.2-1B-Instruct Meta Non-reasoning 3.3%
35 Phi-3-mini-128k-instruct Microsoft Non-reasoning 2.2%
36 Phi-3.5-MoE-instruct Microsoft Non-reasoning 1.7%
37 Phi-4-mini-instruct Microsoft Non-reasoning 1.5%
38 Mixtral-8x7B-Instruct-v0.1 Mistral AI Non-reasoning 1.3%
39 Phi-3.5-mini-instruct Microsoft Non-reasoning 1.0%
40 Phi-3-medium-128k-instruct Microsoft Non-reasoning 0.7%
Model TextWorld TextWorld Express ALFWorld ScienceWorld Jericho Overall
o3 (medium) 100.0% 91.9% 88.3% 93.0% 15.7% 58.7%
o3 (high) 100.0% 89.6% 81.7% 93.1% 16.1% 58.0%
o3 (low) 99.1% 89.8% 70.0% 88.3% 14.2% 54.8%
claude-3.7-sonnet (thinking) 97.3% 91.3% 83.3% 76.5% 12.5% 52.5%
claude-3.7-sonnet 97.3% 95.8% 81.7% 72.4% 13.0% 52.1%
claude-3.5-sonnet-latest 95.5% 81.6% 75.0% 82.3% 9.6% 50.4%
gpt-4.1 95.3% 92.5% 83.3% 76.1% 6.8% 49.9%
o1 97.8% 70.2% 28.3% 80.1% 10.3% 44.2%
gpt-4o 83.6% 80.6% 56.7% 61.4% 5.6% 40.6%
claude-3.5-haiku 94.9% 79.8% 26.7% 67.3% 5.0% 39.6%
Llama-3.1-405B-Instruct 90.9% 79.2% 31.7% 51.8% 6.1% 36.4%
gemini-2.0-flash 80.8% 76.1% 20.0% 57.1% 5.4% 35.0%
Qwen3-32B 79.5% 68.9% 48.3% 49.8% 4.0% 34.3%
Llama-3.3-70B-Instruct 69.6% 77.2% 15.0% 55.1% 4.5% 32.8%
Llama-3.1-70B-Instruct 65.6% 81.9% 8.3% 51.9% 5.3% 32.0%
Qwen2.5-72B-Instruct 76.5% 83.8% 36.7% 35.0% 2.9% 30.7%
Mistral-Large-Instruct-2407 82.4% 68.3% 6.7% 46.1% 5.8% 30.3%
gpt-4.1-mini 62.1% 74.5% 5.0% 41.9% 3.4% 27.1%
gpt-4o-mini 56.5% 73.6% 0.0% 27.2% 1.8% 21.8%
Llama-4-Scout-17B-16E-Instruct 41.1% 68.4% 0.0% 27.0% 1.8% 19.8%
Llama-4-Maverick-17B-128E-Instruct 43.5% 56.1% 8.3% 11.5% 2.0% 15.5%
Mistral-Small-Instruct-2409 56.1% 27.3% 0.0% 24.4% 1.4% 14.8%
Llama-3.1-8B-Instruct 29.7% 50.3% 0.0% 15.7% 2.3% 13.9%
DeepSeek-R1 37.1% 38.6% 0.0% 15.8% 1.0% 12.4%
Qwen2.5-7B-Instruct 27.7% 45.6% 0.0% 12.6% 0.7% 11.7%
Llama-3.2-3B-Instruct 21.4% 42.0% 0.0% 10.0% 1.5% 10.4%
phi-4 20.8% 43.8% 0.0% 8.9% 1.6% 10.3%
gpt-4.1-nano 12.8% 38.7% 0.0% 9.4% 3.6% 10.0%
Mistral-Small-24B-Instruct-2501 15.8% 23.0% 0.0% 15.8% 1.4% 8.8%
DeepSeek-R1-Distill-Llama-70B 8.7% 39.8% 0.0% 7.7% 1.3% 8.4%
Ministral-8B-Instruct-2410 10.9% 22.8% 0.0% 2.3% 0.4% 4.6%
Mistral-Small-3.1-24B-Instruct-2503 2.5% 10.3% 0.0% 10.5% 0.8% 4.5%
Mixtral-8x22B-Instruct-v0.1 17.1% 8.4% 0.0% 4.0% 0.4% 3.7%
Llama-3.2-1B-Instruct 0.0% 19.0% 0.0% 2.4% 0.6% 3.3%
Phi-3-mini-128k-instruct 2.7% 9.4% 0.0% 2.4% 0.3% 2.2%
Phi-3.5-MoE-instruct 0.0% 7.0% 0.0% 2.3% 0.4% 1.7%
Phi-4-mini-instruct 0.0% 5.5% 0.0% 2.3% 0.5% 1.5%
Mixtral-8x7B-Instruct-v0.1 0.0% 1.6% 0.0% 4.0% 0.3% 1.3%
Phi-3.5-mini-instruct 0.0% 2.0% 0.0% 2.4% 0.5% 1.0%
Phi-3-medium-128k-instruct 0.0% 0.0% 0.0% 2.3% 0.3% 0.7%