AIOpsLab
A Holistic Framework to Design, Develop, and Evaluate AI Agents for Enabling Autonomous Clouds

Leaderboard

We showcase key results on the leaderboard below. If you'd like your results to appear, please email us at AIOpsLab@microsoft.com.
In the table, Average is the mean accuracy across the four task types (detection, localization, diagnosis, and mitigation), and Time is the agent's average runtime.

*Radar and bar charts of agent performance across tasks.*
| Agent Name | Average | Detection | Localization | Diagnosis | Mitigation | Time | Organization | Link |
|---|---|---|---|---|---|---|---|---|
| FLASH (GPT-4) | 59.27 | 100 | 46.15 | 36.36 | 54.55 | 102.57 | AIOpsLab | 🔗 |
| REACT (GPT-4) | 53.15 | 76.92 | 53.85 | 45.45 | 36.36 | 44.25 | AIOpsLab | 🔗 |
| DeepSeek-R1 | 50.47 | 78.57 | 58.33 | 35.0 | 30.0 | 343.39 | AIOpsLab | 🔗 |
| GPT-4 w Shell | 49.74 | 69.23 | 61.54 | 40.9 | 27.27 | 30.57 | AIOpsLab | 🔗 |
| FLASH (Llama3-8b) | 33.34 | 80 | 20 | 0 | 33.34 | 63.16 | AIOpsLab | 🔗 |
| GPT-3.5 w Shell | 15.73 | 23.07 | 30.77 | 9.09 | 0 | 12.79 | AIOpsLab | 🔗 |
| REACT (Llama3-8b) | 15 | 60 | 0 | 0 | 0 | 230.74 | AIOpsLab | 🔗 |
| LocaleXpert (Llama3-8b) | - | - | 80 | - | - | 102.08 | AIOpsLab NKU | 🔗 |
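As a sanity check on how the Average column is computed, the sketch below (plain Python, with scores copied from the FLASH (GPT-4) row) takes the mean of the four task accuracies and recovers the reported value up to rounding:

```python
# Sanity check: the leaderboard's Average is the mean of the four
# task accuracies. Scores copied from the FLASH (GPT-4) row above.
flash_gpt4 = {
    "detection": 100.0,
    "localization": 46.15,
    "diagnosis": 36.36,
    "mitigation": 54.55,
}

average = sum(flash_gpt4.values()) / len(flash_gpt4)

# Matches the reported Average of 59.27 to within rounding.
assert abs(average - 59.27) < 0.01
print(f"Average accuracy: {average:.2f}")
```

The same calculation applied to any other row (excluding the Time and Organization columns) reproduces its Average entry.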