
All LLM scores can be found below. For the graphs in the other visualizations, we used only the top 9 models (a short sketch for reproducing this selection appears after the tables). All scores are displayed as percentages.

* indicates that we do not yet have all 5 seeds for that LLM; its score will be updated once the remaining runs finish.

| Rank | Model | Organization | Model Type | TALES Score |
|------|-------|--------------|------------|-------------|
| 1 | claude-3.7-sonnet | Anthropic | Reasoning | 52.5% |
| 2 | claude-3.5-sonnet-latest | Anthropic | Non-reasoning | 50.4% |
| 3 | gemini-2.5-pro-preview* | Google | Non-reasoning | 48.8% |
| 4 | o1 | OpenAI | Reasoning | 44.2% |
| 5 | gpt-4o | OpenAI | Non-reasoning | 40.6% |
| 6 | claude-3.5-haiku | Anthropic | Non-reasoning | 39.6% |
| 7 | Llama-3.1-405B-Instruct | Meta | Non-reasoning | 36.4% |
| 8 | gemini-2.0-flash | Google | Non-reasoning | 35.0% |
| 9 | Llama-3.3-70B-Instruct | Meta | Non-reasoning | 32.8% |
| 10 | Llama-3.1-70B-Instruct | Meta | Non-reasoning | 32.0% |
| 11 | Qwen2.5-72B-Instruct | Alibaba | Non-reasoning | 30.7% |
| 12 | Mistral-Large-Instruct-2407 | Mistral AI | Non-reasoning | 30.3% |
| 13 | gpt-4o-mini | OpenAI | Non-reasoning | 21.8% |
| 14 | Llama-4-Scout-17B-16E-Instruct | Meta | Non-reasoning | 19.8% |
| 15 | Llama-4-Maverick-17B-128E-Instruct | Meta | Non-reasoning | 15.5% |
| 16 | Mistral-Small-Instruct-2409 | Mistral AI | Non-reasoning | 14.8% |
| 17 | Llama-3.1-8B-Instruct | Meta | Non-reasoning | 13.9% |
| 18 | DeepSeek-R1 | DeepSeek AI | Reasoning | 12.4% |
| 19 | Qwen2.5-7B-Instruct | Alibaba | Non-reasoning | 11.7% |
| 20 | Llama-3.2-3B-Instruct | Meta | Non-reasoning | 10.4% |
| 21 | phi-4 | Microsoft | Non-reasoning | 10.3% |
| 22 | Mistral-Small-24B-Instruct-2501 | Mistral AI | Non-reasoning | 8.8% |
| 23 | DeepSeek-R1-Distill-Llama-70B | DeepSeek AI | Reasoning | 8.4% |
| 24 | Ministral-8B-Instruct-2410 | Mistral AI | Non-reasoning | 4.6% |
| 25 | Mistral-Small-3.1-24B-Instruct-2503 | Mistral AI | Non-reasoning | 4.5% |
| 26 | Mixtral-8x22B-Instruct-v0.1 | Mistral AI | Non-reasoning | 3.7% |
| 27 | Llama-3.2-1B-Instruct | Meta | Non-reasoning | 3.3% |
| 28 | Phi-3-mini-128k-instruct | Microsoft | Non-reasoning | 2.2% |
| 29 | Phi-3.5-MoE-instruct | Microsoft | Non-reasoning | 1.7% |
| 30 | Phi-4-mini-instruct | Microsoft | Non-reasoning | 1.5% |
| 31 | Mixtral-8x7B-Instruct-v0.1 | Mistral AI | Non-reasoning | 1.3% |
| 32 | Phi-3.5-mini-instruct | Microsoft | Non-reasoning | 1.0% |
| 33 | Phi-3-medium-128k-instruct | Microsoft | Non-reasoning | 0.7% |
Per-environment breakdown of the overall TALES score:

| Model | TextWorld | TextWorldExpress | ALFWorld | ScienceWorld | Jericho | Overall |
|-------|-----------|------------------|----------|--------------|---------|---------|
| claude-3.7-sonnet | 97.3% | 91.3% | 83.3% | 76.5% | 12.5% | 52.5% |
| claude-3.5-sonnet-latest | 95.5% | 81.6% | 75.0% | 82.3% | 9.6% | 50.4% |
| gemini-2.5-pro-preview* | 98.5% | 91.8% | 75.0% | 64.2% | 12.4% | 48.8% |
| o1 | 97.8% | 70.2% | 28.3% | 80.1% | 10.3% | 44.2% |
| gpt-4o | 83.6% | 80.6% | 56.7% | 61.4% | 5.6% | 40.6% |
| claude-3.5-haiku | 94.9% | 79.8% | 26.7% | 67.3% | 5.0% | 39.6% |
| Llama-3.1-405B-Instruct | 90.9% | 79.2% | 31.7% | 51.8% | 6.1% | 36.4% |
| gemini-2.0-flash | 80.8% | 76.1% | 20.0% | 57.1% | 5.4% | 35.0% |
| Llama-3.3-70B-Instruct | 69.6% | 77.2% | 15.0% | 55.1% | 4.5% | 32.8% |
| Llama-3.1-70B-Instruct | 65.6% | 81.9% | 8.3% | 51.9% | 5.3% | 32.0% |
| Qwen2.5-72B-Instruct | 76.5% | 83.8% | 36.7% | 35.0% | 2.9% | 30.7% |
| Mistral-Large-Instruct-2407 | 82.4% | 68.3% | 6.7% | 46.1% | 5.8% | 30.3% |
| gpt-4o-mini | 56.5% | 73.6% | 0.0% | 27.2% | 1.8% | 21.8% |
| Llama-4-Scout-17B-16E-Instruct | 41.1% | 68.4% | 0.0% | 27.0% | 1.8% | 19.8% |
| Llama-4-Maverick-17B-128E-Instruct | 43.5% | 56.1% | 8.3% | 11.5% | 2.0% | 15.5% |
| Mistral-Small-Instruct-2409 | 56.1% | 27.3% | 0.0% | 24.4% | 1.4% | 14.8% |
| Llama-3.1-8B-Instruct | 29.7% | 50.3% | 0.0% | 15.7% | 2.3% | 13.9% |
| DeepSeek-R1 | 37.1% | 38.6% | 0.0% | 15.8% | 1.0% | 12.4% |
| Qwen2.5-7B-Instruct | 27.7% | 45.6% | 0.0% | 12.6% | 0.7% | 11.7% |
| Llama-3.2-3B-Instruct | 21.4% | 42.0% | 0.0% | 10.0% | 1.5% | 10.4% |
| phi-4 | 20.8% | 43.8% | 0.0% | 8.9% | 1.6% | 10.3% |
| Mistral-Small-24B-Instruct-2501 | 15.8% | 23.0% | 0.0% | 15.8% | 1.4% | 8.8% |
| DeepSeek-R1-Distill-Llama-70B | 8.7% | 39.8% | 0.0% | 7.7% | 1.3% | 8.4% |
| Ministral-8B-Instruct-2410 | 10.9% | 22.8% | 0.0% | 2.3% | 0.4% | 4.6% |
| Mistral-Small-3.1-24B-Instruct-2503 | 2.5% | 10.3% | 0.0% | 10.5% | 0.8% | 4.5% |
| Mixtral-8x22B-Instruct-v0.1 | 17.1% | 8.4% | 0.0% | 4.0% | 0.4% | 3.7% |
| Llama-3.2-1B-Instruct | 0.0% | 19.0% | 0.0% | 2.4% | 0.6% | 3.3% |
| Phi-3-mini-128k-instruct | 2.7% | 9.4% | 0.0% | 2.4% | 0.3% | 2.2% |
| Phi-3.5-MoE-instruct | 0.0% | 7.0% | 0.0% | 2.3% | 0.4% | 1.7% |
| Phi-4-mini-instruct | 0.0% | 5.5% | 0.0% | 2.3% | 0.5% | 1.5% |
| Mixtral-8x7B-Instruct-v0.1 | 0.0% | 1.6% | 0.0% | 4.0% | 0.3% | 1.3% |
| Phi-3.5-mini-instruct | 0.0% | 2.0% | 0.0% | 2.4% | 0.5% | 1.0% |
| Phi-3-medium-128k-instruct | 0.0% | 0.0% | 0.0% | 2.3% | 0.3% | 0.7% |
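
As a rough illustration of the top-9 selection used for the graphs, the sketch below sorts a leaderboard table by overall TALES score and keeps the nine highest-scoring models. This is not the official tooling: the file name `tales_leaderboard.csv` and the column names `model` and `tales_score` are assumptions made for the example, and the actual data files may be organized differently.

```python
# Minimal sketch (assumptions, not the official pipeline): select the top 9
# models by overall TALES score, e.g. for plotting. The CSV path and the
# "model"/"tales_score" column names are hypothetical.
import pandas as pd


def top_models(path: str = "tales_leaderboard.csv", n: int = 9) -> pd.DataFrame:
    df = pd.read_csv(path)
    # Scores are stored as percentage strings such as "52.5%"; strip the
    # percent sign and convert to float so they sort numerically.
    df["tales_score"] = df["tales_score"].str.rstrip("%").astype(float)
    return df.sort_values("tales_score", ascending=False).head(n)


if __name__ == "__main__":
    print(top_models())
```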