Root Cause Analysis (RCA) Template

This template provides a structured format for post-incident documentation. It is inspired by industry practice, including the "Postmortem Culture" chapter and the "Example Postmortem" appendix of Google's Site Reliability Engineering book.

Template

# Incident Report: {Title}

## Summary

- **Incident ID**: INC-YYYY-MMDD-NNN
- **Date**: {Date}
- **Duration**: {Start} to {End} ({total time})
- **Severity**: {1-4}
- **Services Affected**: {list}
- **Incident Commander**: {Name}

## Executive Summary

{2-3 sentence summary of what happened, impact, and resolution}

## Timeline

All times in UTC.

| Time  | Event                         |
|-------|-------------------------------|
| HH:MM | {First symptom detected}      |
| HH:MM | {Incident declared}           |
| HH:MM | {Key investigation milestone} |
| HH:MM | {Mitigation applied}          |
| HH:MM | {Service restored}            |
| HH:MM | {Incident resolved}           |

## Impact

- **Users affected**: {count or percentage}
- **Transactions impacted**: {count}
- **Revenue impact**: {if applicable}
- **SLA impact**: {if applicable}
- **Data loss**: {Yes/No, details if applicable}

## Root Cause

{Detailed technical explanation of what caused the incident. Be specific and factual.}

## Contributing Factors

- {Factor 1: e.g., Missing monitoring for specific failure mode}
- {Factor 2: e.g., Documentation gap in runbooks}
- {Factor 3: e.g., Insufficient testing coverage}

## Trigger

{What specific event triggered the incident? Deployment, configuration change, traffic spike, external dependency failure, etc.}

## Resolution

{What was done to resolve the incident? Include specific commands, rollbacks, or configuration changes.}

## Detection

- **How was the incident detected?** {Monitoring alert / Customer report / Manual discovery}
- **Time to detect (TTD)**: {minutes from incident start to detection}
- **Could detection be improved?** {Yes/No, how}

## Response

- **Time to engage (TTE)**: {minutes from detection to first responder}
- **Time to mitigate (TTM)**: {minutes from engagement to mitigation}
- **Time to resolve (TTR)**: {minutes from incident start to full resolution}

## Five Whys Analysis

1. **Why** did the service fail?
   → {Answer}

2. **Why** did that happen?
   → {Answer}

3. **Why** was that the case?
   → {Answer}

4. **Why** wasn't this prevented?
   → {Answer}

5. **Why** wasn't this detected earlier?
   → {Answer}

## Action Items

| ID | Priority | Action                                | Owner  | Due Date | Status |
|----|----------|---------------------------------------|--------|----------|--------|
| 1  | P1       | {Immediate fix to prevent recurrence} | {Name} | {Date}   | Open   |
| 2  | P2       | {Improve monitoring/alerting}         | {Name} | {Date}   | Open   |
| 3  | P2       | {Update documentation/runbooks}       | {Name} | {Date}   | Open   |
| 4  | P3       | {Long-term systemic improvement}      | {Name} | {Date}   | Open   |

## Lessons Learned

### What went well

- {e.g., Quick detection due to recent monitoring improvements}
- {e.g., Effective communication during incident}

### What went poorly

- {e.g., Runbook was outdated}
- {e.g., Escalation path unclear}

### Where we got lucky

- {e.g., Incident occurred during low-traffic period}
- {e.g., Expert happened to be available}

## Supporting Information

- **Related incidents**: {links to similar past incidents}
- **Monitoring dashboards**: {links}
- **Relevant logs/queries**: {links or references}
- **Slack/Teams thread**: {link to incident channel}
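
The Detection and Response fields in the template are plain timestamp arithmetic, so they can be derived from the timeline rather than estimated. The following is a minimal Python sketch, assuming five timezone-aware UTC timestamps are available. The function names and the sample times are hypothetical and not part of the template; the intervals follow the definitions given in the Detection and Response sections.

```python
from datetime import datetime, timezone


def minutes_between(start: datetime, end: datetime) -> int:
    """Whole minutes from start to end; rejects out-of-order timestamps."""
    if end < start:
        raise ValueError("end precedes start")
    return int((end - start).total_seconds() // 60)


def response_metrics(started, detected, engaged, mitigated, resolved):
    """Compute the four timing fields as defined in the template."""
    return {
        "TTD": minutes_between(started, detected),    # incident start -> detection
        "TTE": minutes_between(detected, engaged),    # detection -> first responder
        "TTM": minutes_between(engaged, mitigated),   # engagement -> mitigation
        "TTR": minutes_between(started, resolved),    # incident start -> full resolution
    }


def utc(hour: int, minute: int) -> datetime:
    """Hypothetical helper: a UTC timestamp on an arbitrary sample date."""
    return datetime(2024, 1, 15, hour, minute, tzinfo=timezone.utc)


# Sample values for illustration only.
metrics = response_metrics(
    started=utc(14, 2),
    detected=utc(14, 9),
    engaged=utc(14, 14),
    mitigated=utc(14, 40),
    resolved=utc(15, 32),
)
print(metrics)  # {'TTD': 7, 'TTE': 5, 'TTM': 26, 'TTR': 90}
```

Keeping every timestamp in UTC, as the Timeline section requires, avoids off-by-an-hour errors when responders sit in different time zones.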

Usage Guidelines

- Start the document immediately when an incident is declared.
- Update continuously during the incident; don't rely on memory afterward.
- Be blameless: focus on systems and processes, not individuals.
- Be thorough: future responders will thank you.
- Track action items: incidents without follow-through will repeat.
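
Tracking action items can be partly automated. The following is a rough Python sketch of one way to flag open items that are past their due date, assuming the rows of the Action Items table have already been loaded into objects. The `ActionItem` class, the `overdue_items` function, the `"Done"` status value, and the sample rows are all hypothetical; the template itself only shows the `Open` status.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ActionItem:
    id: int
    priority: str  # "P1", "P2", or "P3", as in the Action Items table
    action: str
    owner: str
    due: date
    status: str    # "Open" comes from the template; other values are assumed


def overdue_items(items: list[ActionItem], today: date) -> list[ActionItem]:
    """Return open items past their due date, highest priority first."""
    late = [item for item in items if item.status == "Open" and item.due < today]
    # "P1" < "P2" < "P3" as strings, so a plain sort puts P1 first.
    return sorted(late, key=lambda item: (item.priority, item.due))


# Sample rows for illustration only.
items = [
    ActionItem(1, "P1", "Add rollback guard", "Alex", date(2024, 2, 1), "Done"),
    ActionItem(2, "P2", "Add latency alert", "Sam", date(2024, 2, 10), "Open"),
    ActionItem(3, "P2", "Update runbook", "Kim", date(2024, 3, 30), "Open"),
    ActionItem(4, "P1", "Fix retry storm", "Lee", date(2024, 2, 15), "Open"),
]

for item in overdue_items(items, today=date(2024, 3, 1)):
    print(item.id, item.priority, item.owner, item.due.isoformat())
```

A check like this could run on a schedule and post its output to the incident channel, so that follow-through does not depend on someone remembering to reopen the report.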
