OpenRCA: Can Large Language Models Locate the Root Cause of Software Failures?
OpenRCA includes 335 failures from three enterprise software systems, along with over 68 GB of telemetry data (logs, metrics, and traces). Given a failure case and its associated telemetry, the LLM is tasked to identify the root cause of the failure, requiring comprehension of software dependencies and reasoning over heterogeneous, long-context telemetry data.
News
2025/1/23 Our paper has been accepted by ICLR 2025.
2025/1/23 Released OpenRCA dataset with 335 failure cases.
Leaderboard
Method | Model | Correct | Partial | Date
---|---|---|---|---
RCA-Agent | Claude 3.5 Sonnet | 11.34% | 17.31% | 2025/1/23
RCA-Agent | GPT-4o | 8.96% | 17.91% | 2025/1/23
Prompting (Oracle) | Gemini 1.5 Pro | 7.16% | 23.58% | 2025/1/23
Prompting (Balanced) | Gemini 1.5 Pro | 6.27% | 24.18% | 2025/1/23
Prompting (Oracle) | GPT-4o | 6.27% | 15.82% | 2025/1/23
Prompting (Oracle) | Claude 3.5 Sonnet | 5.37% | 17.61% | 2025/1/23
Prompting (Oracle) | Command R+ | 4.78% | 7.46% | 2025/1/23
Prompting (Oracle) | Mistral Large 2 | 4.48% | 10.45% | 2025/1/23
Prompting (Balanced) | Command R+ | 4.18% | 8.96% | 2025/1/23
Prompting (Balanced) | Claude 3.5 Sonnet | 3.88% | 18.81% | 2025/1/23
Prompting (Oracle) | Llama 3.1 Instruct | 3.88% | 14.93% | 2025/1/23
Prompting (Balanced) | Mistral Large 2 | 3.58% | 6.40% | 2025/1/23
Prompting (Balanced) | GPT-4o | 3.28% | 14.33% | 2025/1/23
RCA-Agent | Llama 3.1 Instruct | 3.28% | 5.67% | 2025/1/23
Prompting (Balanced) | Llama 3.1 Instruct | 2.99% | 14.63% | 2025/1/23
RCA-Agent | Gemini 1.5 Pro | 2.69% | 6.87% | 2025/1/23
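The leaderboard percentages appear to correspond to counts out of the 335 OpenRCA failure cases (this mapping is our reading of the table, not a statement from the authors). A minimal sketch of that arithmetic:

```python
# Assumed interpretation: each leaderboard percentage is (cases / 335) * 100,
# rounded to two decimals. E.g., 38 of 335 cases correct gives 11.34%,
# matching the top RCA-Agent + Claude 3.5 Sonnet entry.
TOTAL_CASES = 335

def as_percentage(cases: int, total: int = TOTAL_CASES) -> float:
    """Fraction of cases, expressed as a percentage rounded to 2 decimals."""
    return round(100 * cases / total, 2)

print(as_percentage(38))  # 11.34 -- "Correct" for RCA-Agent + Claude 3.5 Sonnet
print(as_percentage(58))  # 17.31 -- "Partial" for the same entry
```

At this granularity, each additional solved case moves a score by roughly 0.3 percentage points, which explains the tight clustering of entries in the table.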
Is your model or agent up to the challenge? Submit your results here!
Submission Guidelines
To have your results included, please email us with the following:
Name of your method
Inference results in valid format (see GitHub repository)
Accuracy of your method tested in your own environment
(Optional) Link to your repository
(Optional) Execution trajectory of your method
(Optional) Reproduction guidelines of your method
(Optional) Docker image of your method and environment
Note: Inclusion in the leaderboard is handled on a best-effort basis; we cannot guarantee timely processing of requests.
What is the task in OpenRCA?
Identify the root cause of the failure!

Each OpenRCA task is based on a real-world failure case from an enterprise software system, paired with the telemetry (logs, metrics, and traces) recorded around the failure. Given these inputs, the model must identify the root cause of the failure, which requires comprehending software dependencies and reasoning over heterogeneous, long-context telemetry data.
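The task shape above can be sketched as a pair of records. This is a purely illustrative sketch: the field names (`query`, `telemetry`, `component`, `reason`, `time`) are our own placeholders, not the benchmark's actual schema, which is defined in the GitHub repository.

```python
# Hypothetical shape of one OpenRCA task (field names are illustrative only;
# consult the repository for the real input/output format).
failure_case = {
    "query": "...",  # natural-language description of the failure to diagnose
    "telemetry": {   # heterogeneous telemetry recorded around the failure
        "logs": "telemetry/logs/",
        "metrics": "telemetry/metrics/",
        "traces": "telemetry/traces/",
    },
}

# The model's answer identifies the root cause of the failure, e.g. which
# component failed, why, and when (assumed elements, for illustration).
prediction = {
    "component": "...",
    "reason": "...",
    "time": "...",
}
```

The benchmark's difficulty comes from the inputs, not the output format: the 68 GB of telemetry far exceeds any model context window, so a method must decide which slices of logs, metrics, and traces to inspect before it can produce an answer like the one sketched above.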
Check out our paper for more details!
OpenRCA: Can Large Language Models Locate the Root Cause of Software Failures?
Junjielong Xu1,2, Qinan Zhang1, Zhiqing Zhong1, Shilin He2, Chaoyun Zhang2, Qingwei Lin2, Dan Pei3, Pinjia He1, Dongmei Zhang2, Qi Zhang2
1School of Data Science, The Chinese University of Hong Kong, Shenzhen 2Microsoft 3Tsinghua University
If you have any remaining questions, please feel free to contact us at openrcanon@gmail.com
Citing this work
If you use this benchmark, please cite:
@inproceedings{xu2025openrca,
title={OpenRCA: Can Large Language Models Locate the Root Cause of Software Failures?},
author={Junjielong Xu and Qinan Zhang and Zhiqing Zhong and Shilin He and Chaoyun Zhang and Qingwei Lin and
Dan Pei and Pinjia He and Dongmei Zhang and Qi Zhang},
booktitle={The Thirteenth International Conference on Learning Representations},
year={2025},
url={https://openreview.net/forum?id=M4qNIzQYpd}
}