Test failover and liveness

Test failover and liveness in Coyote actors

Wikipedia provides this definition: “Failover is switching to a redundant or standby computer server, system, hardware component or network upon the failure or abnormal termination of the previously active application, server, system, hardware component, or network. Systems designers usually provide failover capability in servers, systems or networks requiring near-continuous availability and a high degree of reliability.”

This sample applies the failover concept to the firmware of an automated espresso machine using the Coyote asynchronous actors programming model. Imagine what would happen if the tiny CPU running the machine rebooted in the middle of making a coffee. What bad things might happen? Can we design a state machine that can handle this scenario and provide a more fault tolerant coffee machine?

An identical version of this tutorial is available that uses regular C# tasks.

The following diagram shows how Coyote can be used to test this scenario and help you design more reliable software.

FailoverCoffeeMachine

The CoffeeMachine is modeled as an asynchronous state machine. This example is not providing real firmware, instead it mocks the hardware sensor platform built into the machine. This is done in MockSensors.cs where you will find three actors that model various hardware components: MockDoorSensor, MockWaterTank and MockCoffeeGrinder. This actor provides async ways of reading sensor values like Water temperature and water levels, or turning on and off the coffee grinder and so on. The CoffeeMachine does not know the sensors are mocks, all it knows is the public interface defined in SensorEvents.cs. In this way the CoffeeMachine is production code, while the mocks are only for testing.

The reason we are using an asynchronous model is that even in the smallest of devices, often times there is a message passing architecture where different hardware components are connected via some sort of bus, whether it is a simple serial port, or something more sophisticated like a CAN bus.

We will test that we can kill the CoffeeMachine and restart it without anything bad happening. This test is setup by the FailoverDriver. The FailoverDriver lets the first CoffeeMachine instance run for a bit then it randomly kills it by using the HaltEvent, then it starts a new CoffeeMachine. The new CoffeeMachine instance needs to figure out the state of the sensors such that when a MakeCoffeeEvent arrives, it doesn’t do something silly that breaks the machine. The mock sensors are not killed so that it acts as a persistent store for sensor state across all instances of the CoffeeMachine.

Some safety Asserts are placed in the code that verify certain important things, including: - do not turn on heater if there is no water - do not turn on grinder if there are no beans in the hopper - do not turn on shot maker if there is no water - do not do anything if the door is open

There is also a correctness assert in the CoffeeMachine to make sure the correct number of espresso shots are made and there is a LivenessMonitor that monitors the CoffeeMachine to make sure it never gets stuck, i.e., it always finishes the job it was given or it goes to an error state if the machine needs to be fixed. See Liveness Checking.

A number of excellent bugs were found by Coyote during the development of this sample, and this illustrates the fact that Coyote can be applied to any type of asynchronous software, not just cloud services. There is still one bug remaining in the code which you can find using coyote test, and it happens after failover just to prove the usefulness of this testing methodology.

What you will need

To run the CoffeeMachine example, you will need to:

Install Visual Studio 2022.
Install the .NET 8.0 version of the coyote tool.
Be familiar with the coyote tool. See using Coyote.
Clone the Coyote git repo.

Build the sample

You can build the sample by following the instructions here.

Run the failover coffee machine application

Now you can run the CoffeeMachine application:

./Samples/bin/net8.0/CoffeeMachineActors.exe

The coffee machine

There are many different types of coffee machines. This example is based on the following machine which can automatically heat water, grind beans, and make an espresso shot all with the press of a button:

The following diagram shows the states and actions in our example implementation in the CoffeeMachine class:

When you run the executable without using coyote test (this is called running in production mode), you will see the following console output. Notice in the output below that the FailoverDriver forces the termination of the CoffeeMachine right in the middle of making a coffee. Then when the CoffeeMachine is restarted, the FailoverDriver requests another coffee and the CoffeeMachine is able to continue on, the water is already warm, and it dumps the old grinds so you have the freshest possible coffee each time.

<FailoverDriver> starting new CoffeeMachine.
<CoffeeMachine> initializing...
<CoffeeMachine> checking initial state of sensors...
<CoffeeMachine> Water level is 60 %
<CoffeeMachine> Hopper level is 93 %
<CoffeeMachine> Warming the water to 100 degrees
<CoffeeMachine> Turning on the water heater
<CoffeeMachine> Coffee machine is warming up (64 degrees)...
<CoffeeMachine> Coffee machine is warming up (74 degrees)...
<CoffeeMachine> Coffee machine is warming up (84 degrees)...
<CoffeeMachine> Coffee machine is warming up (94 degrees)...
<CoffeeMachine> Coffee machine water temperature is now 100
<CoffeeMachine> Turning off the water heater
<CoffeeMachine> Coffee machine is ready to make coffee (green light is on)
<CoffeeMachine> Coffee requested, shots=1
<CoffeeMachine> Grinding beans...
<CoffeeMachine> PortaFilter is 10 % full
<CoffeeMachine> PortaFilter is 20 % full
<CoffeeMachine> PortaFilter is 30 % full
<CoffeeMachine> PortaFilter is 40 % full
<CoffeeMachine> PortaFilter is 50 % full
<CoffeeMachine> PortaFilter is 60 % full
<CoffeeMachine> PortaFilter is 70 % full
<CoffeeMachine> PortaFilter is 80 % full
<CoffeeMachine> PortaFilter is 90 % full
<CoffeeMachine> PortaFilter is full
<CoffeeMachine> Making shots...
<FailoverDriver> forcing termination of CoffeeMachine.
<CoffeeMachine> Coffee Machine Terminating...
<CoffeeMachine> #################################################################
<CoffeeMachine> # Coffee Machine Halted                                         #
<CoffeeMachine> #################################################################

<FailoverDriver> starting new CoffeeMachine.
<CoffeeMachine> initializing...
<CoffeeMachine> checking initial state of sensors...
<CoffeeMachine> Water level is 60 %
<CoffeeMachine> Hopper level is 83 %
<CoffeeMachine> Dumping old smelly grinds!
<CoffeeMachine> Warming the water to 100 degrees
<CoffeeMachine> Coffee machine water temperature is now 100
<CoffeeMachine> Coffee machine is ready to make coffee (green light is on)
<CoffeeMachine> Coffee requested, shots=2
<CoffeeMachine> Grinding beans...
<CoffeeMachine> PortaFilter is 10 % full
<CoffeeMachine> PortaFilter is 20 % full
<CoffeeMachine> PortaFilter is 30 % full
<CoffeeMachine> PortaFilter is 40 % full
<CoffeeMachine> PortaFilter is 50 % full
<CoffeeMachine> PortaFilter is 60 % full
<CoffeeMachine> PortaFilter is 70 % full
<CoffeeMachine> PortaFilter is 80 % full
<CoffeeMachine> PortaFilter is 90 % full
<CoffeeMachine> PortaFilter is full
<CoffeeMachine> Making shots...
<CoffeeMachine> Shot count is 1
<CoffeeMachine> 2 shots completed and 2 shots requested!
<CoffeeMachine> Dumping the grinds!
<CoffeeMachine> Coffee machine is ready to make coffee (green light is on)
<FailoverDriver> CoffeeMachine completed the job.
...

The test will continue on making coffee until it runs out of either water or coffee beans and the FailoverDriver halts each CoffeeMachine instance at random times until the machine is out of resources, at which point the test is complete. The mock sensors also randomly choose some error conditions, so instead of the above you may see some errors like:

<CoffeeMachine> Cannot safely operate coffee machine with the door open!
<CoffeeMachine> Coffee machine needs manual refilling of water and/or coffee beans!

If you see these errors, press ENTER to terminate the program and run it again. These random start conditions help the test cover more cases.

Each halted machine is terminated and discarded, then a new CoffeeMachine instance is started that must figure out what is happening with the sensors and make the next coffee without incident. Eventually a CoffeeMachine will report there is no more water or coffee beans and then it will stop with an error message saying the machine needs to be manually refilled.

Coyote testing

You can now use coyote test to exercise the code and see if any bugs can be found. From the samples directory:

coyote test ./Samples/bin/net8.0/CoffeeMachineActors.dll -i 100 -ms 2000 -s prioritization -sv 10 --actor-graph

Chances are this will find a bug quickly, one of the safety assertions will fire and you will see that a test output log and a DGML diagram are produced, like this:

.\Samples\bin\net8.0\Output\CoffeeMachineActors.exe\CoyoteOutput\CoffeeMachine_0_0.txt
.\Samples\bin\net8.0\Output\CoffeeMachineActors.exe\CoyoteOutput\CoffeeMachine_0_0.dgml

This log can be pretty big, a couple thousand lines where each line represents one async operation. This log contains only the one iteration that failed, and towards the end you will see something like this:

<ActionLog> Microsoft.Coyote.Samples.CoffeeMachineActors.MockCoffeeGrinder(3) invoked action 'OnGrinderButton'.
<ErrorLog> Please do not turn on grinder if there are no beans in the hopper

So the CoffeeMachine accidentally tried to grind beans when the hopper was empty. If you look at the resulting DGML diagram you will see exactly what happened:

The Timer machines were removed from this diagram just for simplicity. The FailoverDriver started the first CoffeeMachine on the left which ran to completion but it ran low on coffee beans. Then this first machine was halted. The FailoverDriver then started a new CoffeeMachine, which made it all the way to GrindingBeans where it tripped the safety assertion in MockCoffeeGrinder. So the bug here is that somehow, the second CoffeeMachine instance missed the fact that it was low on coffee beans. A bug exists in the code somewhere. Can you find it?

It is not a trivial bug because the CheckSensors state is clearly checking the coffee level by sending the ReadHopperLevelEvent to the MockCoffeeGrinder actor and CheckInitialState does not advance to the HeatingWater state until this reading is returned. So what happened?

Hint: if you search backwards in the output log you will find the following situation reported in CheckState:

<CoffeeMachine> Hopper level is -5 %

The first CoffeeMachine instance left the grinder running a bit too long, and the sensor got confused thinking the coffee level is negative. The new CoffeeMachine instance never thought about this situation and checked only:

if ((int)this.HopperLevel.Value == 0)
...

And so it missed the fact it might be negative. The fix is easy, just change this condition to <= and the bug goes away. The fact that such a bug was found shows the usefulness of the failover testing strategy.

Testing the scheduling of highly asynchronous operations

This raises a bigger design question, how did the coffee level become negative? In firmware it is common to poll sensor readings and do something based on that. In this case we are polling a PortaFilterCoffeeLevelEvent in a tight loop while in the GrindingBeans state. Meanwhile the MockCoffeeGrinder class has a timer running and when HandleTimer calls MonitorGrinder it decreases the coffee level by 10 percent during every time interval. So we have an asynchronous operation going on here. Coffee level is decreasing based on a timer, and the CoffeeMachine is monitoring that coffee level using async events. This all seems to work perfectly in the production code where we see this output:

<CoffeeMachine> Grinding beans...
<CoffeeMachine> PortaFilter is 10 % full
<CoffeeMachine> PortaFilter is 20 % full
<CoffeeMachine> PortaFilter is 30 % full
<CoffeeMachine> PortaFilter is 40 % full
<CoffeeMachine> PortaFilter is 50 % full
<CoffeeMachine> PortaFilter is 60 % full
<CoffeeMachine> PortaFilter is 70 % full
<CoffeeMachine> PortaFilter is 80 % full
<CoffeeMachine> PortaFilter is 90 % full
<CoffeeMachine> PortaFilter is full

And the reason it works is because your Operating System is scheduling both of these async threads in a way that is relatively fair meaning one does not run for a long time without the other being scheduled also. But what if these two systems were running in a distributed world and one of them hangs for a long time? This is the kind of thread scheduling that coyote test is testing where one machine can run way ahead of another.

You need to take this into account when using this kind of timer based async events. One way to improve the design in a firmware based system like a coffee machine is to switch from a polling based system to an interrupt based system where the MockCoffeeGrinder can send important events to the CoffeeMachine. This style of interrupt based eventing is used to model the ShotCompleteEvent, WaterHotEvent, WaterEmptyEvent and HopperEmptyEvent.

This shows how Coyote can help find actual design flaws in your code so you can design a system that is more robust in the face of unexpected faults. The coyote test engine provides several different scheduling strategies that test different kinds of fairness algorithms. These are designed to find different kinds of bugs.

You can find out how much testing was actually done during testing by setting the --coverage flag. The coverage report summarizes how many of the possible events were covered.

Liveness monitor

As described in the documentation on Liveness Checking the CoffeeMachine must also eventually finish what it is doing. It must either make a coffee when requested and return to the Ready state, or it must find a problem and go to the Error state or the RefillRequired state. This “liveness” property can be enforced using a very simple LivenessMonitor as shown below:

internal class LivenessMonitor : Monitor
{
    public class BusyEvent : Event { }

    public class IdleEvent : Event { }

    [Start]
    [Cold]
    [OnEventGotoState(typeof(BusyEvent), typeof(Busy))]
    [IgnoreEvents(typeof(IdleEvent))]
    private class Idle : State { }

    [Hot]
    [OnEventGotoState(typeof(IdleEvent), typeof(Idle))]
    [IgnoreEvents(typeof(BusyEvent))]
    private class Busy : State { }
}

This type of Monitor is also a kind of state machine. The CoffeeMachine can send events to this monitor to tell it when it has switched into Busy state or Idle state. When the CoffeeMachine starts heating water, or making coffee it sends this event:

this.Monitor<LivenessMonitor>(new LivenessMonitor.BusyEvent());

and when the CoffeeMachine is done making coffee or it has moved to an error state it sends this event:

this.Monitor<LivenessMonitor>(new LivenessMonitor.IdleEvent());

The Busy state is marked as a [Hot] state and the Idle state is marked as a [Cold] state. During testing if coyote test finds the LivenessMonitor to be stuck in the [Hot] state too long it raises an exception and the test fails.

Reliable termination handshake

You may notice in the code that when the FailoverDriver wants to stop the first CoffeeMachine it sends a CoffeeMachine.TerminateEvent and waits for a CoffeeMachine.HaltedEvent before it starts a new CoffeeMachine by running this.RaiseGotoStateEvent<Test>().

This may seem a bit convoluted compared to just this.SendEvent(this.CoffeeMachineId, HaltEvent.Instance) followed by this.RaiseGotoStateEvent<Test>(). The reason a direct halt event was not used in this case is because a HaltEvent is processed asynchronously, which means the RaiseGotoStateEvent would end up creating the new CoffeeMachine instance before the old one was fully halted. This can lead to confusion in the mock sensors which are written to expect one and only one client CoffeeMachine at a time. The TerminateEvent handshake solves that problem.

Since the TerminateEvent could be sent to the CoffeeMachine at any time we need an easy way to handle this event at any time in CoffeeMachine, hopefully without having to decorate every single state in the machine with the custom attribute:

[OnEventDoAction(typeof(TerminateEvent), nameof(OnTerminate))]

The solution is to promote this OnEventDoAction to the class level. Class level handlers are handled like a fall back mechanism so that no matter what state the CoffeeMachine is in the class level handler can be invoked, unless the current state overrides that handler.

Summary

Failover testing is simple to achieve using Coyote and yields many interesting bugs in your code, including some thought-provoking design bugs. The technique of halting your “production” actors/state-machines, and recreating them by reading from a “persistent” mock (which is not halted during testing) can be generalized to many other scenarios (e.g. cloud services) where someone needs to test failover logic of production actors using Coyote.

In this tutorial you learned:

How to do failover testing using Coyote FailoverDriver state machines.
How to use Coyote in a firmware/sensor scenario.
How to use --strategy portfolio testing to find tricky bugs more quickly.
How Assert helps find violations of safety properties during testing.
How to ensure full termination of one state machine before creating a new one.
How to use class level event handlers in a StateMachine to define an event handler in one place that is invoked no matter what state the machine is in.
How to write a LivenessMonitor.