Reliability review

When reviewing the Reliability pillar, you typically start with the business requirements of the application. Try to ask the following questions:

Level 1 - Composite SLA, SPOF and HA

The first thing is to identify the type of architecture you are looking at, for example:

With the architecture diagram at hand, identify the single points of failure (SPOF) the architecture may have, and then calculate the composite SLA for the whole solution.
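The composite SLA calculation can be sketched as follows. This is a minimal illustration, assuming component failures are independent. The availability figures and the helper names `serial_sla` and `redundant_sla` are hypothetical and not taken from any published SLA.

```python
from math import prod


def serial_sla(availabilities):
    """Composite availability when every component in the chain is required."""
    return prod(availabilities)


def redundant_sla(availabilities):
    """Composite availability when any one independent replica is sufficient."""
    return 1 - prod(1 - a for a in availabilities)


# Hypothetical availability figures, for illustration only.
web = 0.9995
database = 0.9999

# The request path needs both the web tier and the database.
single_region = serial_sla([web, database])

# Two independent copies of the whole stack; either one can serve traffic.
two_regions = redundant_sla([single_region, single_region])

print(f"single region: {single_region:.4%}")
print(f"two regions:   {two_regions:.6%}")
```

Chaining components lowers the composite figure below the weakest component, while redundancy raises it. In a real review, whatever routes traffic between the redundant copies is itself a required component and belongs in the serial chain.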

To audit SPOFs, see how the architecture responds to failures through load balancing, scale sets, stateless services, and so on. Check whether application resiliency is tested using chaos engineering techniques.

Discuss the high availability (HA) requirements and see how well the application is prepared for HA.

Calculate the cost implications of adding HA, but also the potential cost of application downtime and data loss.
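One way to frame this trade-off is to compare the yearly cost of HA with the downtime cost it is expected to avoid. The sketch below is a rough model only: the cost per hour, the HA cost, and the function names are hypothetical, and it ignores data-loss cost, which would need its own estimate.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def expected_downtime_hours(availability):
    """Hours per year the solution is allowed to be down at this availability."""
    return (1 - availability) * HOURS_PER_YEAR


def annual_downtime_cost(availability, cost_per_hour):
    """Expected yearly cost of downtime at a given availability."""
    return expected_downtime_hours(availability) * cost_per_hour


def ha_is_worth_it(current, improved, cost_per_hour, annual_ha_cost):
    """True when the downtime cost avoided exceeds what the HA setup costs."""
    saved = (annual_downtime_cost(current, cost_per_hour)
             - annual_downtime_cost(improved, cost_per_hour))
    return saved > annual_ha_cost


# Hypothetical inputs: moving from 99.9% to 99.99% availability.
print(ha_is_worth_it(0.999, 0.9999, cost_per_hour=10_000, annual_ha_cost=50_000))
```

The same comparison can be repeated for each HA option under discussion, so that the decision rests on stated assumptions rather than on the SLA figure alone.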

Level 2 - BCDR, Detection & Response

Review the backup strategy and disaster recovery plan, and check if these are tested regularly.

Check what the RPO (recovery point objective) and RTO (recovery time objective) constraints are for this solution:

[Figure: RPO and RTO]
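The RPO is the maximum acceptable window of data loss, and the RTO is the maximum acceptable time to restore service. A simple check of the current backup setup against both objectives might look like the sketch below. The class name, field names, and durations are all hypothetical.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class RecoveryObjectives:
    rpo: timedelta  # maximum acceptable data-loss window
    rto: timedelta  # maximum acceptable time to restore service


def meets_rpo(backup_interval, objectives):
    """Worst-case data loss is one full backup interval."""
    return backup_interval <= objectives.rpo


def meets_rto(measured_restore_time, objectives):
    """Compare the restore time observed in the last DR test with the RTO."""
    return measured_restore_time <= objectives.rto


# Hypothetical objectives and measurements.
objectives = RecoveryObjectives(rpo=timedelta(hours=1), rto=timedelta(hours=4))
print(meets_rpo(timedelta(minutes=15), objectives))  # backups every 15 minutes
print(meets_rto(timedelta(hours=6), objectives))     # last test restore took 6 hours
```

The restore time should come from an actual recovery test, since an untested estimate says little about whether the RTO can be met.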

Discuss what triggers the disaster recovery plan, what health checks are in place, how the system is monitored, and what alerts are configured. Is there any traffic routing in place in case of a general failure?
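As an illustration of a trigger based on health checks, the sketch below fires only after several consecutive failed probes, so that a single transient failure does not start a failover. The function name and the threshold of three are assumptions, not a recommendation from this workshop.

```python
def should_fail_over(probe_results, threshold=3):
    """Return True once `threshold` consecutive health probes have failed.

    `probe_results` is an iterable of booleans, oldest first,
    where True means the probe succeeded.
    """
    consecutive_failures = 0
    for healthy in probe_results:
        # A successful probe resets the streak; a failed one extends it.
        consecutive_failures = 0 if healthy else consecutive_failures + 1
        if consecutive_failures >= threshold:
            return True
    return False


# Hypothetical probe histories.
print(should_fail_over([True, False, False, False]))         # three failures in a row
print(should_fail_over([False, False, True, False, False]))  # streak broken by a success
```

During the review, it is worth asking what the real threshold and probe interval are, because together they set how long a failure goes undetected.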

Level 3 - Design Patterns

Several design patterns can be used to improve the reliability of the workload. Check with the development team whether they are used in the architecture:
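As one example of a reliability pattern (not necessarily one this page lists), a retry with exponential backoff can be sketched as follows. The helper `retry` and its parameters are hypothetical. Production code would normally retry only on errors known to be transient and add jitter to the delays.

```python
import time


def retry(operation, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call `operation`, retrying on failure with exponentially growing delays.

    The delay doubles after each failed attempt: base_delay, 2 * base_delay, ...
    The last failure is re-raised so the caller can handle it.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)


# Hypothetical flaky dependency: fails twice, then succeeds.
calls = {"count": 0}


def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("transient failure")
    return "ok"


delays = []
result = retry(flaky, sleep=delays.append)  # record delays instead of sleeping
print(result, delays)
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting. In a review, the equivalent question is whether retries are bounded and whether they back off rather than hammering a failing dependency.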

Checkpoint

Now you can move to the next pillar: Performance
