Resiliency in Azure Cosmos DB Garnet Cache

Azure Cosmos DB Garnet Cache is designed for high availability and resilience, providing enterprise-grade uptime, data persistence, and automatic recovery capabilities. The service offers multiple layers of protection to ensure your cache remains available and your data is preserved even during infrastructure failures, maintenance events, or unexpected outages.

Architecture Overview

Azure Cosmos DB Garnet Cache implements a distributed architecture with built-in redundancy and data persistence, deployed in a single Azure region.

Primary-Replica Architecture: Each cache cluster consists of one or more shards with primary node and replica nodes
Asynchronous Replication: Data is asynchronously replicated from primaries
AOF Persistence: Append-Only File snapshots provide durable data storage to disk
Automatic Failover: Fast failover for minimal service interruption
Automatic Recovery: Failed nodes recover using AOF snapshots and replication

Architecture

Data Distribution and Sharding

Azure Cosmos DB Garnet Cache uses a consistent hashing approach to distribute data across the cluster for optimal availability and load distribution. The cluster's key space is divided into 16,384 hash slots, with each slot owned by a single primary node. Any given key maps to exactly one slot, ensuring predictable data placement and efficient retrieval.

When you have multiple shards in your cluster, hash slots are evenly distributed across all primary nodes. This distribution ensures that data and load are balanced across the cluster, preventing hotspots and maximizing resource utilization. Replica nodes serve read-only requests for the keys hashing to slots owned by their corresponding primary nodes, enabling read scaling while maintaining data consistency.

This sharding strategy provides several resiliency benefits: if a primary node fails, only the keys in its assigned slots are affected, while the rest of the cluster continues operating normally. Automatic slot reassignment ensures that when nodes become unavailable, their hash slots can be redistributed to healthy nodes, maintaining cluster availability even during node failures.

High Availability

Replication

Azure Cosmos DB Garnet Cache supports configurable replication factors to balance availability requirements with cost. You can configure replication during cluster provisioning and it can't be changed on existing clusters.

Configuration	Replication Factor	Total Nodes Per Shard	Use Cases
No Replication	1x	1, primary only	Development, testing, cost-optimized production
High Availability	2x	2, one replica per primary	Mission-critical applications
Optimized Read Performance	3x-5x*	3-5, two to four replicas per primary	Mission-critical applications requiring high read throughput

*Reach out to CosmosGarnetCache@service.microsoft.com if you need a higher replication factor.

The replication process is as follows

Write Operations: All writes go to the primary node first
Asynchronous Replication: The write is acknowledged once written in the primary and is asynchronously replicated to all replicas in the shard
Consistency: Eventual consistency between replicas
Read Distribution: Read operations can be distributed across primary and replica nodes

Automatic Failovers

A failover can be either planned, such as system updates or management operations, or it can be unplanned, such as hardware failure or unexpected outages. The Azure Cosmos DB Garnet Cache automatically handles failovers for you and will promote a replica to primary after detecting one of these events.

Multi Availability Zone Deployment

Azure Cosmos DB Garnet Cache can be configured with availability zones during provisioning in supported Azure regions where there is capacity for your chosen SKU. See the list of Azure regions with availability zone support.

If enabled, nodes are automatically distributed across multiple availability zones within the region. Primary and replica nodes are not guaranteed to be in different availability zones. If a zone goes down and all replicas for a given shard are unhealthy, the hash slots assigned to it will automatically be reassigned to healthy shards.

Data Persistence

Azure Cosmos DB Garnet Cache uses append only file (AOF) persistence to ensure data durability. Every write operation is appended to a persistent log file stored on an attached locally redundant Premium Managed Disk. Checkpoints are stored every second. The disk size is automatically provisioned for each node based on 2x the total amount of memory for the SKU. See the disks provisioned for each SKU.

Data on nodes is restored from the latest data checkpoint upon restart, including from automatic failovers. It is possible to experience data loss after a failover for the portion of data that hasn't been replicated or wasn't included in the latest checkpoint.

Architecture Overview​

Data Distribution and Sharding​

High Availability​

Replication​

Automatic Failovers​

Multi Availability Zone Deployment​

Data Persistence​

Learn More​