
Operational Excellence review

Operational excellence is about having full visibility into how your application is running and ensuring the best experience for your users. The most important topics when reviewing operational excellence are:

  1. Keep DevOps and continuous integration in mind
  2. Use monitoring and analytics to gain operational insights
  3. Apply automation to reduce effort and errors
  4. Test everything

Try to answer these questions:

Level 1 - Monitoring, Organization and Deployment

When we talk about operations, the most important aspect is having the right monitoring in place. Identify the key metrics, which logs you are monitoring, and what kind of dashboards and alerts you are using. A resource dependency map is very helpful too.

Review the resource organization and the naming conventions being used, and how they are followed and/or enforced. Review at the subscription, management group and tenant levels. This information should have come out during the discovery phase, with the subscription walkthrough and the CCO Dashboard.
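As a rough illustration of how a naming convention can be checked automatically, here is a minimal Python sketch. The convention `<type>-<workload>-<env>-<region>-<instance>`, the allowed environment names and the sample resource names are all assumptions made for this example; substitute whatever convention the customer actually uses.

```python
import re

# Hypothetical convention: <type>-<workload>-<env>-<region>-<nnn>,
# for example "vm-shop-prod-weu-001". Adjust the pattern to the real convention.
NAME_PATTERN = re.compile(
    r"(?P<type>[a-z]{2,5})-"
    r"(?P<workload>[a-z0-9]+)-"
    r"(?P<env>dev|test|prod)-"
    r"(?P<region>[a-z0-9]+)-"
    r"(?P<instance>\d{3})"
)


def find_nonconforming(names):
    """Return the names that do not match the assumed convention, in input order."""
    return [name for name in names if NAME_PATTERN.fullmatch(name) is None]


# Made-up resource names, for illustration only.
sample = [
    "vm-shop-prod-weu-001",
    "MyTestVM",
    "st-shop-dev-weu-002",
    "vm-shop-staging-weu-001",
]
print(find_nonconforming(sample))
```

A report like this only shows how well the convention is followed. Enforcing it at deployment time is a separate question to raise in the review.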

Then review the deployment: what the current deployment strategy is and which automation tools are used for it. Does the deployment have a rollback process? Are infrastructure as code and configuration management used?
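To make the rollback question concrete, the following Python sketch shows one possible shape of a deployment step with an automatic rollback. The function name `deploy_with_rollback` and the `apply` and `is_healthy` callbacks are hypothetical stand-ins for whatever deployment tool and post-deployment check are really in use.

```python
from typing import Callable


def deploy_with_rollback(
    new_version: str,
    previous_version: str,
    apply: Callable[[str], None],
    is_healthy: Callable[[], bool],
) -> str:
    """Apply new_version; if the post-deployment check fails, re-apply previous_version.

    Returns the version that is left running.
    """
    apply(new_version)
    if is_healthy():
        return new_version
    # The check failed: go back to the last known good version.
    apply(previous_version)
    return previous_version


# Stand-in callbacks: record what was applied and simulate a failed health check.
applied = []
running = deploy_with_rollback("1.1.0", "1.0.0", applied.append, lambda: False)
print(running, applied)
```

In this simulated run the health check fails, so both versions are applied in turn and the previous one is left running.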

Level 2 - Testing

To ensure that everything runs as expected, you need to test everything before deploying to production. You certainly need unit tests and code coverage metrics during your development cycle, but you also need to run integration tests as early as possible in that cycle.

A main tenet of DevOps practice for achieving system reliability is the shift-left principle.

If your process for developing and deploying an application is depicted as a series of steps listed from left to right, your testing should be shifted as far as possible toward the beginning of the process (i.e. to the left), rather than happening only at the very end (i.e. to the right).

Level 3 - BCDR

This level revisits the business continuity and disaster recovery (BCDR) strategy discussed in the Reliability pillar:

Review the backup strategy and the disaster recovery plan, and check whether they are tested regularly.

Check what the [RPO and RTO][rporto] constraints are for this solution.
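The recovery point objective (RPO) is the maximum acceptable amount of data loss, measured in time, and the recovery time objective (RTO) is the maximum acceptable time to restore service. A quick sanity check of a backup schedule and a tested recovery time against those targets could look like the Python sketch below. The function names and all the numbers are illustrative assumptions, not values from a real solution.

```python
from datetime import timedelta


def meets_rpo(backup_interval: timedelta, rpo: timedelta) -> bool:
    """In the worst case, everything since the last backup is lost,
    so the backup interval must not exceed the RPO."""
    return backup_interval <= rpo


def meets_rto(measured_recovery: timedelta, rto: timedelta) -> bool:
    """The recovery time measured in the last DR test must not exceed the RTO."""
    return measured_recovery <= rto


# Illustrative targets and measurements.
rpo = timedelta(hours=1)
rto = timedelta(hours=4)

print(meets_rpo(timedelta(hours=6), rpo))  # False: backups every 6 h cannot meet a 1 h RPO
print(meets_rto(timedelta(hours=3), rto))  # True: a 3 h tested recovery fits a 4 h RTO
```

The recovery time used here should come from an actual recovery test, which ties back to checking that the plan is tested regularly.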

Discuss what triggers the disaster recovery plan, what health checks are in place, how the system is monitored and what alerts are configured. Is there any traffic routing in place in case of a general failure?
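One common kind of trigger is a run of consecutive failed health checks. The Python sketch below shows that idea only. The function name `should_trigger_failover` and the threshold of three failures are assumptions for illustration, and a real trigger would normally live in the monitoring or traffic routing service rather than in application code.

```python
def should_trigger_failover(check_results, threshold=3):
    """Return True once `threshold` consecutive health checks have failed.

    check_results is an iterable of booleans in time order (True = healthy).
    """
    consecutive_failures = 0
    for healthy in check_results:
        consecutive_failures = 0 if healthy else consecutive_failures + 1
        if consecutive_failures >= threshold:
            return True
    return False


# A short blip does not trigger a failover, but a sustained outage does.
print(should_trigger_failover([True, False, False, True, False]))  # False
print(should_trigger_failover([True, False, False, False]))        # True
```

Requiring several consecutive failures trades a slower reaction for fewer false alarms, which is worth discussing alongside the RTO.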

Resources

Checkpoint
