OpenRCA: Can Large Language Models Locate the Root Cause of Software Failures?
OpenRCA includes 335 failures from three enterprise software systems, along with over 68 GB of telemetry data (logs, metrics, and traces). Given a failure case and its associated telemetry, the LLM is tasked to identify the root cause of the failure, requiring comprehension of software dependencies and reasoning over heterogeneous, long-context telemetry data.
News
2025/1/23 Our paper has been accepted by ICLR 2025.
2025/1/23 Released OpenRCA dataset with 335 failure cases.
Leaderboard
Method | Model | Correct | Partial | Date
---|---|---|---|---
RCA-Agent | Claude 3.5 Sonnet | 11.34% | 17.31% | 2025/1/23
RCA-Agent | GPT-4o | 8.96% | 17.91% | 2025/1/23
Prompting (Oracle) | Gemini 1.5 Pro | 7.16% | 23.58% | 2025/1/23
Prompting (Balanced) | Gemini 1.5 Pro | 6.27% | 24.18% | 2025/1/23
Prompting (Oracle) | GPT-4o | 6.27% | 15.82% | 2025/1/23
Prompting (Oracle) | Claude 3.5 Sonnet | 5.37% | 17.61% | 2025/1/23
Prompting (Oracle) | Command R+ | 4.78% | 7.46% | 2025/1/23
Prompting (Oracle) | Mistral Large 2 | 4.48% | 10.45% | 2025/1/23
Prompting (Balanced) | Command R+ | 4.18% | 8.96% | 2025/1/23
Prompting (Balanced) | Claude 3.5 Sonnet | 3.88% | 18.81% | 2025/1/23
Prompting (Oracle) | Llama 3.1 Instruct | 3.88% | 14.93% | 2025/1/23
Prompting (Balanced) | Mistral Large 2 | 3.58% | 6.40% | 2025/1/23
Prompting (Balanced) | GPT-4o | 3.28% | 14.33% | 2025/1/23
RCA-Agent | Llama 3.1 Instruct | 3.28% | 5.67% | 2025/1/23
Prompting (Balanced) | Llama 3.1 Instruct | 2.99% | 14.63% | 2025/1/23
RCA-Agent | Gemini 1.5 Pro | 2.69% | 6.87% | 2025/1/23
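The leaderboard percentages appear to correspond to counts out of the 335 OpenRCA failure cases (this mapping is our reading of the table, not a statement from the authors). A minimal sketch of that arithmetic:

```python
# Assumed interpretation: each leaderboard percentage is (cases / 335) * 100,
# rounded to two decimals. E.g., 38 of 335 cases correct gives 11.34%,
# matching the top RCA-Agent + Claude 3.5 Sonnet entry.
TOTAL_CASES = 335

def as_percentage(cases: int, total: int = TOTAL_CASES) -> float:
    """Fraction of cases, expressed as a percentage rounded to 2 decimals."""
    return round(100 * cases / total, 2)

print(as_percentage(38))  # 11.34 -- "Correct" for RCA-Agent + Claude 3.5 Sonnet
print(as_percentage(58))  # 17.31 -- "Partial" for the same entry
```

At this granularity, each additional solved case moves a score by roughly 0.3 percentage points, which explains the tight clustering of entries in the table.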
Is your model or agent up to the challenge? Submit your results here!
Submission Guidelines
To have your results included, please email us with the following:
Name of your method
Inference results in valid format (see GitHub repository)
Accuracy of your method tested in your own environment
(Optional) Link to your repository
(Optional) Execution trajectory of your method
(Optional) Reproduction guidelines of your method
(Optional) Docker image of your method and environment
Note: Inclusion in the leaderboard is handled on a best-effort basis; we cannot guarantee timely processing of requests.
What is the task in OpenRCA?
Identify the root cause of the failure!

Each OpenRCA task is based on a real-world failure case from an enterprise software system, paired with the telemetry (logs, metrics, and traces) recorded around the failure. Given these inputs, the model must identify the root cause of the failure, which requires comprehending software dependencies and reasoning over heterogeneous, long-context telemetry data.
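The task shape above can be sketched as a pair of records. This is a purely illustrative sketch: the field names (`query`, `telemetry`, `component`, `reason`, `time`) are our own placeholders, not the benchmark's actual schema, which is defined in the GitHub repository.

```python
# Hypothetical shape of one OpenRCA task (field names are illustrative only;
# consult the repository for the real input/output format).
failure_case = {
    "query": "...",  # natural-language description of the failure to diagnose
    "telemetry": {   # heterogeneous telemetry recorded around the failure
        "logs": "telemetry/logs/",
        "metrics": "telemetry/metrics/",
        "traces": "telemetry/traces/",
    },
}

# The model's answer identifies the root cause of the failure, e.g. which
# component failed, why, and when (assumed elements, for illustration).
prediction = {
    "component": "...",
    "reason": "...",
    "time": "...",
}
```

The benchmark's difficulty comes from the inputs, not the output format: the 68 GB of telemetry far exceeds any model context window, so a method must decide which slices of logs, metrics, and traces to inspect before it can produce an answer like the one sketched above.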
Check out our paper for more details!
OpenRCA: Can Large Language Models Locate the Root Cause of Software Failures?
Junjielong Xu1,2, Qinan Zhang1, Zhiqing Zhong1, Shilin He2, Chaoyun Zhang2, Qingwei Lin2, Dan Pei3, Pinjia He1, Dongmei Zhang2, Qi Zhang2
1School of Data Science, The Chinese University of Hong Kong, Shenzhen 2Microsoft 3Tsinghua University
If you have any remaining questions, please feel free to contact us at openrcanon@gmail.com
Citing this work
If you use this benchmark, please cite:
@inproceedings{xu2025openrca,
title={OpenRCA: Can Large Language Models Locate the Root Cause of Software Failures?},
author={Junjielong Xu and Qinan Zhang and Zhiqing Zhong and Shilin He and Chaoyun Zhang and Qingwei Lin and
Dan Pei and Pinjia He and Dongmei Zhang and Qi Zhang},
booktitle={The Thirteenth International Conference on Learning Representations},
year={2025},
url={https://openreview.net/forum?id=M4qNIzQYpd}
}