04 How do you define reliability targets?
Reliability targets are derived through workshop exercises with business stakeholders. The targets are refined through monitoring and testing. With your internal stakeholders, set realistic expectations for workload reliability so that stakeholders can communicate those expectations to customers through contractual agreements.
Power Platform handles most of the infrastructure needed to run its services. That said it is a shared responsibility based on the workloads and strain applied to the platform. Also, as the platform provides extensibility into Azure, the responsibility can shift more towards organizations.
This question will primarily focus on the RE:04 Target Metrics recommendations.
To learn more and get started with defining target metrics, review the official FMA collection below.
Well-Architected Framework Target Metrics
Monitor and optimize your Dynamics 365 environments
Official Microsoft Well-Architected Framework Reliability Failure Mode Analysis Collection
YouTube: Power Platform Well-Architected Framework - Defining a Monitoring Strategy
Reliability targets require metrics like service-level objectives (SLOs), service-level agreements (SLAs), and service-level indicators (SLIs).
Service Level Objective - A target that you set for your workload or application based on the quality of service that your customers expect to receive.
Service Level Indicator - Measurement of a particular aspect of a service’s performance
Service Level Agreement - Agreement between a service provider and a customer
These metrics should come from workshops performed with business and technical stakeholders and should be realistic in nature. SLOs are mandatory for each system and workload.
NOTE - If you do not have an SLO for each workload, do not check this box.
Power Platform Availability targets
Recovery targets correspond to the recovery time objective (RTO), recovery point objective (RPO), mean time to recovery (MTTR), and mean time between failure (MTBF) metrics. In contrast to availability targets, recovery targets for these measurements don’t depend heavily on SLAs.
Recovery Time Objective - The maximum acceptable time that an application can be unavailable after an incident.
Recovery Point Objective - The maximum acceptable duration of data loss during an incident.
Mean Time to Recovery - The time taken to restore a component after a failure is detected.
These metrics rely heavily on your FMA and BC/DR strategy. Organizations will need to work with business stakeholders to discuss aspirations and review architecture.
NOTE - If you have not defined all three of these metrics, do not check this box.
Power Platform Recovery targets
A health model aims to transform your system metrics into data that you can compare against your service-level objectives (SLOs).
The health model should be based on the availability and recovery targets defined above. With a health model, you are performing an operational maturity assessment. Key things to focus on in the model are: “How you detect and respond to issues”, “How you diagnose issues that have happened” and now to “Predict and more importantly prevent issues before they take place”.
Each mission critical workload needs to be analyzed with the health model. Below is an example of a readable health model
NOTE - If you have not analyzed and documented with a health model, do not check this box.
Building a Monitoring and Alerting strategy
Health modeling of mission-critical workloads in Azure
Application Insights Artifacts for Dataverse
Application Insights Artifacts for Power Platform
Agreeing on what constitutes a healthy or degraded operation is crucial for design discussions.
Below are recommended health states for the model including definitions.
Healthy - Operates optimally and meets quality expectations
Degraded - Exhibits less than healthy behavior, which indicates potential problems
Unhealthy - In a critical state and requires immediate attention
NOTE - If you have not defined health states do not check this box
Azure Recommended Health States
You have visualization or reporting in place that informs stakeholders about the overall state of the workload. Dashboards can also be reports by email, instant messaging, or wiki that notify business stakeholders when the health state changes.
Visualizations can come in many forms. Dashboards, Azure workbooks, reports, etc. are all viable options. Based on the stakeholder and responsible parties, consider the access needed to underlying data and any licenses needed to review.
The following example shows an Azure Data Explorer dashboard visualizing Dataverse telemetry.
NOTE - If you haven’t created a dashboard showcasing telemetry for Dataverse and Power Platform services, do not check this box.