
Abstract

Reasoning is an essential skill for enabling Large Language Models (LLMs) to interact with the world. As tasks become more complex, they demand increasingly sophisticated and diverse reasoning capabilities for sequential decision-making, requiring structured reasoning over the context history to determine the next best action. We introduce TALES, a diverse collection of synthetic and human-written text-adventure games designed to challenge and evaluate a wide range of reasoning capabilities. We present results for a range of open- and closed-weight LLMs, along with a qualitative analysis of the top-performing models. Despite an impressive showing on synthetic games, even the top LLM-driven agents fail to achieve a score of 12% on games designed for human enjoyment.

Figure 1: An example of an agent playing through a text-adventure game (thought traces are fabricated for illustration). Because of their length and the variety of puzzles required for progression, text-adventure games demand a range of reasoning skills from players to solve roadblocks and continue through the game. Moreover, due to the long-range causal dependencies often found in these games, a single mistake at any step can cause gameplay to break down later on.

Scores for all TextWorld games for the top 10 models

tw_allgames chart

Scores for all TextWorld Express games for the top 10 models

twx_allgames chart

Scores for all ALFWorld games for the top 10 models

alfw_allgames chart

Scores for all ScienceWorld games for the top 10 models

sciencew_allgames chart

Scores for all Jericho games for the top 10 models

jericho_allgames chart

Breakdown of scores per framework

fws chart

Please consider citing the original work!

TextWorld

tw image

TextWorld is a framework originally designed for training Reinforcement Learning agents on text-based games. It can generate synthetic text-adventure games of varying complexity. In TALES, we integrate the "CookingWorld" games that were used as part of the NeurIPS 2018 competition. The task involves following a recipe: the player must find the required ingredients and process them as the recipe instructs. We selected one game per difficulty level, ranging from level 1 (a single location and a one-ingredient recipe) to level 10 (12 locations and a three-ingredient recipe). At all difficulties, the player receives 1 point for each completed sub-goal of the task. Difficulty level 1 can be solved in 7 moves with a maximum score of 3, while level 10 requires 44 moves with a maximum score of 11.
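The interaction pattern described above can be sketched as a simple observation-action loop. The toy environment below is entirely invented for illustration (it is not the TextWorld API); it mimics a difficulty-1 CookingWorld task with one room, a one-ingredient recipe, and 1 point per sub-goal:

```python
# A hypothetical sketch of the agent-environment loop used across TALES:
# the agent reads a text observation, issues a text command, and receives
# an updated observation plus a cumulative score. The environment below is
# a made-up stand-in for a difficulty-1 CookingWorld game.

class ToyCookingGame:
    """One room, one ingredient: take the carrot, cook it, eat the meal."""

    SUBGOALS = ["take carrot", "cook carrot", "eat meal"]  # 1 point each

    def __init__(self):
        self.progress = 0
        self.done = False

    def reset(self) -> str:
        self.progress = 0
        self.done = False
        return "You are in a kitchen. The recipe calls for a carrot."

    def step(self, command: str):
        # Sub-goals must be completed in order, as in the real games.
        if not self.done and command == self.SUBGOALS[self.progress]:
            self.progress += 1
            self.done = self.progress == len(self.SUBGOALS)
            obs = f"Done: {command}."
        else:
            obs = "That didn't seem to work."
        return obs, self.progress, self.done


def play(env, policy, max_steps=7):
    obs = env.reset()
    score, done = 0, False
    for _ in range(max_steps):
        obs, score, done = env.step(policy(obs))
        if done:
            break
    return score


# A scripted "agent" standing in for an LLM-driven policy.
script = iter(ToyCookingGame.SUBGOALS)
final_score = play(ToyCookingGame(), lambda obs: next(script))
print(final_score)  # -> 3, matching the level-1 maximum score
```

In the real benchmark, the policy would be an LLM prompted with the full interaction history rather than a fixed script.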

TextWorld Express

twx image
Generated by GPT-4o.

TextWorld Express is a highly optimized re-implementation of many TextWorld game scenarios that runs approximately three orders of magnitude faster than its TextWorld counterparts. Although token throughput is the main speed bottleneck in most LLM-based applications, we opt for TextWorld Express over TextWorld where applicable for this performance improvement. The trade-off for the speed is a stricter parser: TextWorld Express simplifies its parsing for performance and therefore does not support nearest-neighbor matching of action phrases.
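The parser difference can be illustrated with a small stdlib sketch. The action list and function names are invented, and `difflib.get_close_matches` is just one stand-in for the kind of fuzzy matching a more forgiving parser performs:

```python
# Hypothetical contrast between a strict, exact-match parser (TextWorld
# Express style) and a nearest-neighbor parser that maps near-miss
# phrasings to the closest valid action (TextWorld style).
from difflib import get_close_matches
from typing import Optional

VALID_ACTIONS = ["take red apple", "open fridge", "cook red apple with stove"]

def strict_parse(command: str) -> Optional[str]:
    # Strict: the command must match a valid action exactly.
    return command if command in VALID_ACTIONS else None

def fuzzy_parse(command: str) -> Optional[str]:
    # Fuzzy: map a near-miss phrasing to the closest valid action.
    matches = get_close_matches(command, VALID_ACTIONS, n=1, cutoff=0.6)
    return matches[0] if matches else None

print(strict_parse("take apple"))  # None: rejected outright
print(fuzzy_parse("take apple"))   # "take red apple"
```

With a strict parser, an agent that drops an adjective or reorders a phrase wastes a turn, which is one reason the parser choice matters for evaluation.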

ALFWorld

alfworld image
For TALES, we only make use of the text modality.

ALFWorld is a multi-modal framework combining complementary visual and textual observations, in which agents must navigate and perform tasks in a household setting. All tasks provide only a terminal reward of 1 upon completion. For TALES, we use only the textual modality, as this has become the standard practice when evaluating LLMs on ALFWorld.

The ALFWorld environments are unique in their lack of informative feedback. Whereas other environments return a predefined error message indicating the type of failure, whether the parser did not recognize the command or the action was not possible, ALFWorld has only one error message: 'Nothing happened'. In the original ALFWorld framework, the visual component compensates for this sparse text feedback, but the lack of detail significantly increases the difficulty for agents that rely solely on text-based interactions. This difficulty is compounded by the limitation that an agent in ALFWorld can hold only one object at a time.
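The problem can be made concrete with a toy sketch. Both environments below are invented for illustration (neither is the ALFWorld API); the point is that distinct failure modes collapse into one indistinguishable observation:

```python
# Hypothetical illustration of why ALFWorld's single error message is hard
# for text-only agents: a typo and a full-hands failure produce the exact
# same feedback, so the agent cannot tell which mistake it made.

KNOWN_COMMANDS = {"take mug", "go to desk"}

def informative_env(command: str, inventory: list) -> str:
    # Typical text-game feedback distinguishes failure modes.
    if command not in KNOWN_COMMANDS:
        return f"I don't understand '{command}'."
    if command.startswith("take ") and len(inventory) >= 1:
        return "Your hands are full."  # the one-object limit
    return "OK."

def alfworld_style_env(command: str, inventory: list) -> str:
    # ALFWorld-style: every failure collapses into the same message.
    if command not in KNOWN_COMMANDS or (
        command.startswith("take ") and len(inventory) >= 1
    ):
        return "Nothing happened"
    return "OK."

print(informative_env("take mug", ["book"]))     # "Your hands are full."
print(alfworld_style_env("take mug", ["book"]))  # "Nothing happened"
print(alfworld_style_env("takke mug", []))       # also "Nothing happened"
```

A text-only agent in the second environment must guess whether to fix its spelling or drop what it is holding, which is exactly the ambiguity the visual modality resolves in the original framework.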

ScienceWorld

scienceworld image

ScienceWorld is a framework focused on completing elementary-level science curriculum tasks. Notably, for many of its tasks, ScienceWorld emulates an open-world setting in which the player can complete the task in different ways rather than following a single expected trajectory. For example, heating an object can be accomplished with either the oven in the kitchen or the blast furnace in the workshop. ScienceWorld also gives the player the freedom to reset the game on command. This is especially important because a number of ScienceWorld games have dead states from which it is no longer possible to complete the assigned task in that play-through.
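The two properties above, outcome-based goals and recoverable dead states, can be sketched together. The state representation and action names below are invented for illustration, not ScienceWorld's actual interface:

```python
# Hypothetical sketch of ScienceWorld-style open-endedness: the goal
# condition checks the resulting state, not a fixed action sequence, so
# multiple trajectories (oven vs. blast furnace) both succeed, and a
# reset lets the player escape a dead state.

def run(actions):
    state = {"substance": "ice", "dead": False}
    for act in actions:
        if act in ("heat with oven", "heat with blast furnace"):
            state["substance"] = "water"
        elif act == "pour down drain":
            state["dead"] = True  # the task can no longer be completed
        elif act == "reset":
            state = {"substance": "ice", "dead": False}
    return state

def goal_met(state):
    # Checks the outcome, not the path taken to reach it.
    return state["substance"] == "water" and not state["dead"]

print(goal_met(run(["heat with oven"])))           # True
print(goal_met(run(["heat with blast furnace"])))  # True: alternate path
print(goal_met(run(["pour down drain", "heat with oven"])))           # False
print(goal_met(run(["pour down drain", "reset", "heat with oven"])))  # True
```

An agent that recognizes it has entered a dead state and issues a reset can still finish the episode, which is why the reset command matters for evaluation.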

Jericho

jericho image

Jericho is a suite of 55 human-written interactive fiction games. We consider Jericho the most difficult framework due to the length and complexity of many of its games: some can be completed within 17 steps, while others require over 500. These games also cover an extremely wide range of genres and styles, lacking the consistency of the other text-game suites, which were designed specifically for evaluating agents. For example, '9:05' follows the morning of an ordinary office worker, whereas 'Anchorhead' is a Lovecraftian horror story.

Below is an interactive version of Zork I, one of the classic text adventure games available in the Jericho suite:

Try it yourself! Type commands like "look", "inventory", or "go north" to interact with the game.


All LLM scores can be found below. The charts above show only the top 10 models. All scores are displayed as percentages.

Rank Model Organization Model Type TALES Score
1 o3 (medium) OpenAI Reasoning 58.7%
2 o3 (high) OpenAI Reasoning 58.0%
3 o3 (low) OpenAI Reasoning 54.8%
4 claude-3.7-sonnet Anthropic Reasoning 52.5%
5 claude-3.7-sonnet Anthropic Non-reasoning 52.1%
6 claude-3.5-sonnet-latest Anthropic Non-reasoning 50.4%
7 gpt-4.1 OpenAI Non-reasoning 49.9%
8 o1 OpenAI Reasoning 44.2%
9 gpt-4o OpenAI Non-reasoning 40.6%
10 claude-3.5-haiku Anthropic Non-reasoning 39.6%
11 Llama-3.1-405B-Instruct Meta Non-reasoning 36.4%
12 gemini-2.0-flash Google Non-reasoning 35.0%
13 Qwen3-32B Alibaba (Qwen) Reasoning 34.3%
14 Llama-3.3-70B-Instruct Meta Non-reasoning 32.8%
15 Llama-3.1-70B-Instruct Meta Non-reasoning 32.0%
16 Qwen2.5-72B-Instruct Alibaba (Qwen) Non-reasoning 30.7%
17 Mistral-Large-Instruct-2407 Mistral AI Non-reasoning 30.3%
18 gpt-4.1-mini OpenAI Non-reasoning 27.1%
19 gpt-4o-mini OpenAI Non-reasoning 21.8%
20 Llama-4-Scout-17B-16E-Instruct Meta Non-reasoning 19.8%
21 Llama-4-Maverick-17B-128E-Instruct Meta Non-reasoning 15.5%
22 Mistral-Small-Instruct-2409 Mistral AI Non-reasoning 14.8%
23 Llama-3.1-8B-Instruct Meta Non-reasoning 13.9%
24 DeepSeek-R1 DeepSeek Reasoning 12.4%
25 Qwen2.5-7B-Instruct Alibaba (Qwen) Non-reasoning 11.7%
26 Llama-3.2-3B-Instruct Meta Non-reasoning 10.4%
27 phi-4 Microsoft Non-reasoning 10.3%
28 gpt-4.1-nano OpenAI Non-reasoning 10.0%
29 Mistral-Small-24B-Instruct-2501 Mistral AI Non-reasoning 8.8%
30 DeepSeek-R1-Distill-Llama-70B DeepSeek Reasoning 8.4%
31 Ministral-8B-Instruct-2410 Mistral AI Non-reasoning 4.6%
32 Mistral-Small-3.1-24B-Instruct-2503 Mistral AI Non-reasoning 4.5%
33 Mixtral-8x22B-Instruct-v0.1 Mistral AI Non-reasoning 3.7%
34 Llama-3.2-1B-Instruct Meta Non-reasoning 3.3%
35 Phi-3-mini-128k-instruct Microsoft Non-reasoning 2.2%
36 Phi-3.5-MoE-instruct Microsoft Non-reasoning 1.7%
37 Phi-4-mini-instruct Microsoft Non-reasoning 1.5%
38 Mixtral-8x7B-Instruct-v0.1 Mistral AI Non-reasoning 1.3%
39 Phi-3.5-mini-instruct Microsoft Non-reasoning 1.0%
40 Phi-3-medium-128k-instruct Microsoft Non-reasoning 0.7%
Model TextWorld TextWorld Express ALFWorld ScienceWorld Jericho Overall
o3 (medium) 100.0% 91.9% 88.3% 93.0% 15.7% 58.7%
o3 (high) 100.0% 89.6% 81.7% 93.1% 16.1% 58.0%
o3 (low) 99.1% 89.8% 70.0% 88.3% 14.2% 54.8%
claude-3.7-sonnet (thinking) 97.3% 91.3% 83.3% 76.5% 12.5% 52.5%
claude-3.7-sonnet 97.3% 95.8% 81.7% 72.4% 13.0% 52.1%
claude-3.5-sonnet-latest 95.5% 81.6% 75.0% 82.3% 9.6% 50.4%
gpt-4.1 95.3% 92.5% 83.3% 76.1% 6.8% 49.9%
o1 97.8% 70.2% 28.3% 80.1% 10.3% 44.2%
gpt-4o 83.6% 80.6% 56.7% 61.4% 5.6% 40.6%
claude-3.5-haiku 94.9% 79.8% 26.7% 67.3% 5.0% 39.6%
Llama-3.1-405B-Instruct 90.9% 79.2% 31.7% 51.8% 6.1% 36.4%
gemini-2.0-flash 80.8% 76.1% 20.0% 57.1% 5.4% 35.0%
Qwen3-32B 79.5% 68.9% 48.3% 49.8% 4.0% 34.3%
Llama-3.3-70B-Instruct 69.6% 77.2% 15.0% 55.1% 4.5% 32.8%
Llama-3.1-70B-Instruct 65.6% 81.9% 8.3% 51.9% 5.3% 32.0%
Qwen2.5-72B-Instruct 76.5% 83.8% 36.7% 35.0% 2.9% 30.7%
Mistral-Large-Instruct-2407 82.4% 68.3% 6.7% 46.1% 5.8% 30.3%
gpt-4.1-mini 62.1% 74.5% 5.0% 41.9% 3.4% 27.1%
gpt-4o-mini 56.5% 73.6% 0.0% 27.2% 1.8% 21.8%
Llama-4-Scout-17B-16E-Instruct 41.1% 68.4% 0.0% 27.0% 1.8% 19.8%
Llama-4-Maverick-17B-128E-Instruct 43.5% 56.1% 8.3% 11.5% 2.0% 15.5%
Mistral-Small-Instruct-2409 56.1% 27.3% 0.0% 24.4% 1.4% 14.8%
Llama-3.1-8B-Instruct 29.7% 50.3% 0.0% 15.7% 2.3% 13.9%
DeepSeek-R1 37.1% 38.6% 0.0% 15.8% 1.0% 12.4%
Qwen2.5-7B-Instruct 27.7% 45.6% 0.0% 12.6% 0.7% 11.7%
Llama-3.2-3B-Instruct 21.4% 42.0% 0.0% 10.0% 1.5% 10.4%
phi-4 20.8% 43.8% 0.0% 8.9% 1.6% 10.3%
gpt-4.1-nano 12.8% 38.7% 0.0% 9.4% 3.6% 10.0%
Mistral-Small-24B-Instruct-2501 15.8% 23.0% 0.0% 15.8% 1.4% 8.8%
DeepSeek-R1-Distill-Llama-70B 8.7% 39.8% 0.0% 7.7% 1.3% 8.4%
Ministral-8B-Instruct-2410 10.9% 22.8% 0.0% 2.3% 0.4% 4.6%
Mistral-Small-3.1-24B-Instruct-2503 2.5% 10.3% 0.0% 10.5% 0.8% 4.5%
Mixtral-8x22B-Instruct-v0.1 17.1% 8.4% 0.0% 4.0% 0.4% 3.7%
Llama-3.2-1B-Instruct 0.0% 19.0% 0.0% 2.4% 0.6% 3.3%
Phi-3-mini-128k-instruct 2.7% 9.4% 0.0% 2.4% 0.3% 2.2%
Phi-3.5-MoE-instruct 0.0% 7.0% 0.0% 2.3% 0.4% 1.7%
Phi-4-mini-instruct 0.0% 5.5% 0.0% 2.3% 0.5% 1.5%
Mixtral-8x7B-Instruct-v0.1 0.0% 1.6% 0.0% 4.0% 0.3% 1.3%
Phi-3.5-mini-instruct 0.0% 2.0% 0.0% 2.4% 0.5% 1.0%
Phi-3-medium-128k-instruct 0.0% 0.0% 0.0% 2.3% 0.3% 0.7%