Task 03 - Identifying bottlenecks, stress tests, and resilience testing (30 minutes)

Introduction

At this point, you should have load tests running against your environment. Although our load tests are automated now, we need to make sure we review the tests to see if we can identify any bottlenecks or changes in behavior. From there, we can share our findings and work towards remediating the issues. If your application is hosted in Azure, you can leverage the synergy Azure Load Testing has with the other Azure services.

In addition to load testing, we may also wish to perform stress testing. Stress testing focuses on overloading the system until it breaks. A stress test determines how stable a system is and its ability to withstand extreme increases in load. It does this by testing the maximum number of requests that a system can handle at a given time before performance is compromised and fails. Find this maximum to understand what kind of load the current environment can adequately support.

We will also want to perform a third type of testing, called resilience testing. The resiliency of an application isn’t solely decided by how quickly it can scale up or out, but also how it handles failures. This means planning for a failure of every application component: a single container, cluster, VM, database, or region.

Description

The team at Munson’s Pickles and Preserves is excited about the possibilities of automated load testing of their applications and they are interested in learning more about the types of testing we can perform in Azure to gauge application quality and resiliency.

Review your load test results to see where there are potential bottlenecks in your application. If needed, perform additional tests with different settings to help identity any bottlenecks. Once you have identified one or more bottlenecks, see how you can resolve the bottleneck and show the new performance profile after your fixes are in place.
Update your existing load test script to convert it to a stress test.
Identify the bottlenecks and breaking points of the application under stress.
Decide whether you should add stress test remediation steps as GitHub Issues.
Use Azure Chaos Studio to design a Chaos experiment. Note that you will need to register the Microsoft.Chaos provider for your subscription if you have not already. You can register this in the Resource providers menu item under the Settings menu for the subscription.
After ensuring that you have a recent baseline test, run a load test.
During the load test, start the Chaos experiment and note the failures. Note that there is, at present, no GitHub Action for Chaos experiments, so you will need to run this manually, outside of a GitHub Actions workflow.
Prioritize the failure points arising from the Chaos experiment and build an action plan to remediate.

Success Criteria

Demonstrate how you found any bottlenecks in the Team Messaging System application.
Verify that your bottleneck has been resolved.
Show a side-by-side comparison of the load tests results and the stress test results.
Show, based on these tests, where the bottlenecks and breakpoints are in the application.
Show the failure points that occurred during your resilience testing, as well as your remediation plan.

Learning Resources

Solution

Expand this section to view the solution

The following high-level notes cover step 1.
- For a primer on identifying bottlenecks using Azure Load Testing, this article covers the process.
- Because there is no database associated with this application, identifying areas of slowdown will require code analysis.
- There are two operations which consistently lag behind everything else in terms of performance: initial loading of messages and posting a new message. In both cases, the root cause is a thread sleep loop. After removing these “speed loops,” you should notice a drastic performance improvement when performing those operations.
The following high-level notes cover steps 2-4. There is also a sample JMeter script in the solution directory.
- The key difference between load testing and stress testing, in this case, is in the amount of load we will scale up. The goal is to increase the load up to a point where the application fails or performance degrades.
- In the load test script, we perform 20 loops of 30 threads. In the stress test script, we increase this to 100 loops of 500 threads. We do increase the ramp time somewhat, but we are performing significantly more operations on the same server load.
The following high-level notes cover steps 5-8.
- Chaos Studio does not offer a GitHub Action at this time, so you will not be able to integrate this into your CI/CD pipelines.
- Azure Chaos studio only supports a limited number of services, such as VMs, AKS, App Services, Cosmos DB, and networking. Because our sample app only contains App Services, we will use App Service faults.
- Prior to creating a chaos experiment, you will want to scale out the App Service Plan to multiple instances, such as 3.
- Create Azure Chaos Studio/Experiment
  - Register Chaos Studio Provider
    - Go to your subscription
    - On the left-hand side, select Resource provider
    - In the list of providers, search for “Microsoft.Chaos”
    - Click on the provider and select Register
  - Go to Azure Chaos Studio
  - Navigate to the Targets menu on the left-hand side and then select the App Service you will test. From there, select Enable Targets then Enable service-direct targets to complete enablement.
  - On the left-hand side menu for Chaos Studio, select the Experiments option.
  - Select + Create and then choose New experiment from the dropdown.
  - Fill in your subscription, resource group, location, and name for this experiment. Keep track of the experiment name as a managed user will be created for you.
  - Go to the experiment designer on the next tab. Change the name of the step or branch if you wish.
  - Select Add Action and then Add Fault to create a new fault.
  - Select “Stop App Service” as the Fault. You can choose the duration but the minimum duration is 5 minutes and that should be enough. From the target resources tab, choose the App Service you wish to test and then select Add to complete the addition.
  - Save your experiment by selecting Review + create and then choosing the Create option.
- Update App Service Permissions
  - In the appropriate App Service, select Access control (IAM) from the left-hand menu.
  - Select Add followed by Add Role Assignment
  - Select the Contributor role, then select the Members” tab. Choose the + Select members link.
  - Search the name of the experiment from the earlier step and then choose Select.
  - Review and assign the permissions which will grant the role to the experiment.
- Run load test + experiment
  - Run the load test first, then while the load test is running kick off the chaos experiment. You should notice that the application returns a 403 error and the web app is stopped.