Disaster Recovery¶
For unexpected reasons, a significant number [1] of CCF nodes may become unavailable. In this catastrophic scenario, operators and members can recover transactions that were committed on the crashed service by starting a new network.
The disaster recovery procedure is costly (e.g. the Service Identity certificate will need to be re-distributed to clients) and should only be staged once operators are confident that the service will not heal by itself. In other words, the recovery procedure should only be staged once a majority of nodes do not consistently report one of them as their primary node.
Tip
See tests/infra/health_watcher.py for an example of how a network can be monitored to detect a disaster recovery scenario.
Note
From 4.0.9/5.0.0-dev2 onwards, the secret sharing scheme used for ledger recovery relies on a much simpler implementation that requires no external dependencies. While the code still accepts shares generated by the old implementation, it only generates shares with the new one. As a result, a DR attempt that downgrades the code to a version pre-dating this change, after having previously run with it, will not succeed if a reshare has already taken place.
Overview¶
The recovery procedure consists of two phases:
1. Operators should retrieve one of the ledgers of the previous service and re-start one or several nodes in recover mode. The public transactions of the previous network are restored and the new network established.
2. After agreeing that the configuration of the new network is suitable, members should vote to accept the recovery of the network and, once this is done, submit their recovery shares to initiate the end of the recovery procedure. See here for more details.
Note
Before attempting to recover a network, it is recommended to make a copy of all available ledger and snapshot files.
Tip
See Sandbox recovery for an example of the recovery procedure using the CCF sandbox.
Establishing a Recovered Public Network¶
To initiate the first phase of the recovery procedure, one or several nodes should be started with the Recover command in their config file (see also the sample recovery configuration file recover_config.json):
$ cat /path/to/config/file
...
"command": {
"type": "Recover",
...
"recover": {
"initial_service_certificate_validity_days": 1
}
...
$ /opt/ccf/bin/js_generic --config /path/to/config/file
Each node will then immediately restore the public entries of its ledger (ledger.directory and ledger.read_only_ledger_dir configuration entries). Because deserialising the public entries present in the ledger may take some time, operators can query the progress of the public recovery by calling GET /node/state which returns the version of the last signed recovered ledger entry. Once the public ledger is fully recovered, the recovered node automatically becomes part of the public network, allowing other nodes to join the network.
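For example, recovery progress can be polled with curl; the node address below is a placeholder and the response is abridged to the fields shown in the summary diagram further below (this assumes the newly generated service certificate has been saved as service_cert.pem):
$ curl https://<node-rpc-address>/node/state --cacert service_cert.pem
{"last_signed_seqno": 50, "state": "readingPublicLedger"}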
The recovery procedure can be accelerated by specifying a valid snapshot file created by the previous service in the directory specified via the snapshots.directory configuration entry. If specified, the recover node will automatically recover the snapshot and only the ledger entries following that snapshot, which in practice takes a fraction of the time required to recover the entire historical ledger.
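As a sketch, a recovery configuration pointing at the ledger and snapshot directories of the previous service might look as follows (paths are placeholders and unrelated entries are elided):
$ cat /path/to/config/file
...
"command": {
  "type": "Recover",
  ...
},
"ledger": {
  "directory": "/path/to/previous/ledger"
},
"snapshots": {
  "directory": "/path/to/previous/snapshots"
}
...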
The state machine for the recover node is as follows:
graph LR;
Uninitialized-- config -->Initialized;
Initialized-- recovery -->ReadingPublicLedger;
ReadingPublicLedger-->PartOfPublicNetwork;
PartOfPublicNetwork-- member shares reassembly -->ReadingPrivateLedger;
ReadingPrivateLedger-->PartOfNetwork;
Note
It is possible that the lengths of the ledgers of each node differ slightly, since some transactions may not have been fully replicated. It is preferable to use the ledger of the node that was primary before the service crashed. If the latest primary node of the defunct service is not known, it is recommended to concurrently start as many nodes as previously existed in recover mode, each recovering the ledger of one defunct node. Once all nodes have completed the public recovery procedure, operators can query the highest recovered signed seqno (as per the response to the GET /node/state endpoint) and select this ledger to recover the service. The other nodes should be shut down and new nodes started with the join option.
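For instance, assuming two nodes were restarted in recover mode and returned the service certificates service_cert_0.pem and service_cert_1.pem (as in the summary diagram further below), and that jq is available, the recovered seqnos can be compared as follows:
$ curl -s https://<node-0-rpc-address>/node/state --cacert service_cert_0.pem | jq '.last_signed_seqno'
243
$ curl -s https://<node-1-rpc-address>/node/state --cacert service_cert_1.pem | jq '.last_signed_seqno'
203
In this case, Node 0 would be selected to recover the service and Node 1 shut down.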
Similarly to the normal join protocol (see Adding a New Node to the Network), other nodes are then able to join the network.
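For illustration, a minimal join configuration could look like the excerpt below; the address is a placeholder and most entries are elided (see Adding a New Node to the Network for the complete set of options):
$ cat /path/to/config/file
...
"command": {
  "type": "Join",
  ...
  "join": {
    "target_rpc_address": "<recovered-node-rpc-address>"
  }
}
...
$ /opt/ccf/bin/js_generic --config /path/to/config/file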
Warning
After recovery, the identity of the network has changed. The new service certificate service_cert.pem must be distributed to all existing and new users.
The state machine for the join node is as follows:
graph LR;
Uninitialized-- config -->Initialized;
Initialized-- join -->Pending;
Pending-- poll status -->Pending;
Pending-- trusted -->PartOfPublicNetwork;
Summary Diagram¶
sequenceDiagram
participant Operators
participant Node 0
participant Node 1
participant Node 2
Operators->>+Node 0: recover
Node 0-->>Operators: Service Certificate 0
Note over Node 0: Reading Public Ledger...
Operators->>+Node 1: recover
Node 1-->>Operators: Service Certificate 1
Note over Node 1: Reading Public Ledger...
Operators->>+Node 0: GET /node/state
Node 0-->>Operators: {"last_signed_seqno": 50, "state": "readingPublicLedger"}
Note over Node 0: Finished Reading Public Ledger, now Part of Public Network
Operators->>Node 0: GET /node/state
Node 0-->>Operators: {"last_signed_seqno": 243, "state": "partOfPublicNetwork"}
Operators->>+Node 1: GET /node/state
Node 1-->>Operators: {"last_signed_seqno": 36, "state": "readingPublicLedger"}
Note over Node 1: Finished Reading Public Ledger, now Part of Public Network
Operators->>Node 1: GET /node/state
Node 1-->>Operators: {"last_signed_seqno": 203, "state": "partOfPublicNetwork"}
Note over Operators, Node 1: Operators select Node 0 to start the new network (243 > 203)
Operators->>+Node 1: shutdown
Operators->>+Node 2: join
Node 2->>+Node 0: Join network (over TLS)
Node 0-->>Node 2: Join network response
Note over Node 2: Part of Public Network
Once operators have established a recovered crash-fault tolerant public network, the existing members of the consortium must vote to accept the recovery of the network and submit their recovery shares.
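As a hedged sketch (the exact action arguments and governance endpoints vary between CCF versions, so the constitution in use should be checked), the acceptance of the recovery is typically proposed via a transition_service_to_open action that references both the previous and the new service identity:
$ cat transition_service_to_open.json
{
  "actions": [
    {
      "name": "transition_service_to_open",
      "args": {
        "previous_service_identity": "<previous service certificate, PEM>",
        "next_service_identity": "<new service certificate, PEM>"
      }
    }
  ]
}
Once the proposal is accepted, each member retrieves their encrypted recovery share, decrypts it with their share encryption key and submits it back to the service to complete the recovery.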
Local Sealing Recovery (Experimental)¶
SNP provides the DERIVED_KEY guest message, which derives a key from the CPU’s VCEK (or VLEK), the TCB version, and the guest’s measurement and host_data (policy). Any change to the CPU, measurement or policy, or a rolled-back TCB version, will therefore prevent the key from being reconstructed. If configured, once transition_to_open is triggered the node will unseal the secrets it previously sealed instead of waiting for recovery shares from members.
If, in config.json, output_files.sealed_ledger_secret_location is set, the node will derive a key and seal versioned ledger secrets to that directory. This capability is noted in public:ccf.gov.node.info[node].will_locally_seal_ledger_secrets, to allow it to be audited.
Then, if command.recover.previous_sealed_ledger_secret_location is set in the config.json, when the node recovers and receives the transition_to_open transaction it will try to unseal the latest ledger secret and use that to recover the ledger. If this is unsuccessful, it falls back to waiting for recovery shares. Which of these two paths was taken is recorded in public:ccf.internal.last_recovery_type.
$ cat /path/to/config/file
...
"command": {
"type": "Recover",
...
"recover": {
...
"previous_sealed_ledger_secret_location": "/path/to/previous/secret"
}
}
"output_files": {
...
"sealed_ledger_secret_location": "/path/to/new/secret"
}
...
$ /opt/ccf/bin/js_generic --config /path/to/config/file
Self-Healing-Open Recovery (Experimental)¶
In environments with limited orchestration or limited operator access, it is desirable to allow an automated disaster recovery without operator intervention. At a high level, Self-Healing-Open recovery allows recovering replicas to discover which node has the most up-to-date ledger and automatically recover the network using that ledger. The protocol completes with a node choosing to transition-to-open, and so requires another mechanism to recover the private ledger. If it is likely that the nodes will restart on the same hardware, local sealing recovery (see above) can be used to recover the private ledger automatically, and bring the service fully online.
There are two paths: an election path and a very-high-availability failover path. The election path ensures that if all nodes restart and have full network connectivity, a majority of nodes’ on-disk ledgers contain every committed transaction, and no timeouts trigger, then there will be only one recovered network and all committed transactions will be persisted. However, the election path can become stuck, in which case the failover path is designed to ensure progress.
In the election path, nodes first gossip with each other, learning of the ledgers of other nodes. Once they have heard from every node, they vote for the node with the best ledger. If a node receives votes from a majority of nodes, it invokes transition-to-open and notifies the other nodes to restart and join it. This path is illustrated below, and is guaranteed to succeed if all nodes can communicate and no timeouts trigger.
sequenceDiagram
participant N1
participant N2
participant N3
Note over N1, N3: Gossip
N1 ->> N2: Gossip(Tx=1)
N1 ->> N3: Gossip(Tx=1)
N2 ->> N3: Gossip(Tx=2)
N3 ->> N2: Gossip(Tx=3)
Note over N1, N3: Vote
N2 ->> N3: Vote
N3 ->> N3: Vote
Note over N1, N3: Open/Join
N3 ->> N1: IAmOpen
N3 ->> N2: IAmOpen
Note over N1, N2: Restart
Note over N3: Transition-to-open
Note over N3: Local unsealing
Note over N3: Open
N1 ->> N3: Join
N2 ->> N3: Join
In the failover path, each phase has a timeout to skip to the next phase if a failure has occurred. For example, the election path requires all nodes to communicate to advance from the gossip phase to the vote phase. However, if any node fails to recover, the election path is stuck. In this case, after a timeout, nodes will advance to the vote phase regardless of whether they have heard from all nodes, and vote for the best ledger they have heard of at that point.
Unfortunately, this can lead to multiple forks of the service if different nodes cannot communicate with each other and time out. Hence, we recommend setting the timeout substantially higher than the highest expected recovery time, to minimise the chance of this happening. Whether timeouts were used to open the service is tracked in the public:ccf.gov.selfhealingopen.failover_open table, so that this can be audited.
This failover path is illustrated below.
sequenceDiagram
participant N1
participant N2
participant N3
Note over N1, N3: Gossip
N2 ->> N3: Gossip(Tx=2)
N3 ->> N2: Gossip(Tx=3)
Note over N1: Timeout
Note over N3: Timeout
Note over N1, N3: Vote
N1 ->> N1: Vote
N3 ->> N3: Vote
N2 ->> N3: Vote
Note over N1, N3: Open/Join
Note over N1: Transition-to-open
Note over N3: Transition-to-open
If the network fails during reconfiguration, each node will use its latest known configuration to recover. Since reconfiguration requires votes from a majority of nodes, the latest configuration should recover using the election path; however, nodes in the previous configuration may recover using the failover path.
Notes¶
Operators can track the number of times a given service has undergone the disaster recovery procedure via the GET /node/network endpoint (recovery_count field).
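For example (the node address is a placeholder and the response is abridged):
$ curl https://<node-rpc-address>/node/network --cacert service_cert.pem
{
  "recovery_count": 1,
  "service_status": "Open",
  ...
}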
Footnotes