Resiliency recommendations for Azure Cosmos DB for NoSQL

This article contains recommendations for achieving resiliency for Azure Cosmos DB for NoSQL. Many of the recommendations contain supporting Azure Resource Graph (ARG) queries to help identify non-compliant resources.

Resiliency recommendations impact matrix

Each recommendation is marked in accordance with the following impact matrix:

Impact    Description
High      Immediate fix needed.
Medium    Fix within 3-6 months.
Low       Needs to be reviewed.

Resiliency recommendations summary

Category                  Recommendation
Availability              Configure at least two regions for high availability
Disaster recovery         Enable service-managed failover for multi-region accounts with single write region
Disaster recovery         Evaluate multi-region write capability
Disaster recovery         Choose appropriate consistency mode reflecting data durability requirements
Disaster recovery         Configure continuous backup mode
System efficiency         Ensure query results are fully drained
System efficiency         Maintain singleton pattern in your client
Application resilience    Implement retry logic in your client
Monitoring                Monitor Cosmos DB health and set up alerts

Availability

Configure at least two regions for high availability

Enable a secondary region on your Cosmos DB account to achieve a higher SLA. Adding a region doesn't incur any downtime and is as easy as selecting a pin on a map. Accounts that use Strong consistency need at least three regions to retain write availability if one region fails.

Potential benefits: Enhances SLA and resilience.

Learn more: Reliability (High availability) in Cosmos DB for NoSQL

Resources
| where type =~ 'Microsoft.DocumentDb/databaseAccounts'
| where
    array_length(properties.locations) < 2 or
    (array_length(properties.locations) < 3 and properties.consistencyPolicy.defaultConsistencyLevel == 'Strong')
| project recommendationId='cosmos-1', name, id, tags

Disaster recovery

Enable service-managed failover for multi-region accounts with single write region

Cosmos DB is designed for high uptime and resiliency, but regional issues can still occur. With service-managed failover enabled, if a region goes down, Cosmos DB automatically fails over to the next available region without any user action.

Resources
| where type =~ 'Microsoft.DocumentDb/databaseAccounts'
| where
    array_length(properties.locations) > 1 and
    tobool(properties.enableAutomaticFailover) == false and
    tobool(properties.enableMultipleWriteLocations) == false
| project recommendationId='cosmos-2', name, id, tags

Evaluate multi-region write capability

Multi-region write capability allows for designing applications that are highly available across multiple regions, though it demands careful attention to consistency requirements and conflict resolution. Improper setup may decrease availability and cause data corruption due to unhandled conflicts.

Potential benefits: Enhances high availability.

Learn more:

Resources
| where type =~ 'Microsoft.DocumentDb/databaseAccounts'
| where
    array_length(properties.locations) > 1 and
    properties.enableMultipleWriteLocations == false
| project recommendationId='cosmos-3', name, id, tags

Choose appropriate consistency mode reflecting data durability requirements

In a globally distributed database, the consistency level affects data durability during regional outages. Understand your tolerance for data loss when planning for recovery. Use Session consistency unless a stronger level is required; stronger consistency comes at the cost of higher write latencies and the possibility that an outage in a read-only region affects the write region.

Potential benefits: Enhances data durability and recovery.

Learn more: Consistency levels in Azure Cosmos DB

Configure continuous backup mode

Cosmos DB's backup is always on, offering protection against data mishaps. Continuous mode lets you restore, on a self-serve basis, to a point in time before the mishap. Periodic mode requires contacting Azure support, which leads to longer restore times.

Potential benefits: Faster self-serve data restore.

Learn more: Continuous backup with point in time restore feature in Azure Cosmos DB

Resources
| where type =~ 'Microsoft.DocumentDb/databaseAccounts'
| where
    properties.backupPolicy.type == 'Periodic' and
    properties.enableMultipleWriteLocations == false and
    properties.enableAnalyticalStorage == false
| project recommendationId='cosmos-5', name, id, tags

System efficiency

Ensure query results are fully drained

Cosmos DB limits a single response to 4 MB, so queries that return a large amount of data or that span multiple partitions return paginated results. Each page indicates whether more results are available and provides a continuation token for the next page. Your code needs a while loop that traverses all pages until no more results remain.

Potential benefits: Maximizes data retrieval efficiency.

Learn more: Pagination in Azure Cosmos DB for NoSQL.
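
The drain loop described above can be sketched without reference to any particular SDK. In this minimal Python sketch, `fetch_page` is a hypothetical callable standing in for whatever single-page query call your SDK exposes. It is assumed to accept a continuation token (`None` for the first page) and to return the page's items together with the next token, where `None` means no more results. `drain_query` and `make_fake_fetcher` are illustrative names, not part of the Cosmos DB SDKs.

```python
from typing import Any, Callable, Iterator, List, Optional, Tuple

# Hypothetical shape of a single-page query call: takes a continuation token
# (None for the first page) and returns (items, next_token).
PageFetcher = Callable[[Optional[str]], Tuple[List[Any], Optional[str]]]


def drain_query(fetch_page: PageFetcher) -> Iterator[Any]:
    """Yield every item from every page until no continuation token remains."""
    token: Optional[str] = None
    while True:
        items, token = fetch_page(token)
        yield from items
        if token is None:  # the last page carries no continuation token
            break


def make_fake_fetcher(data: List[Any], page_size: int) -> PageFetcher:
    """In-memory stand-in for a paginated backend, for illustration only."""
    def fetch_page(token: Optional[str]) -> Tuple[List[Any], Optional[str]]:
        start = int(token) if token is not None else 0
        end = start + page_size
        next_token = str(end) if end < len(data) else None
        return data[start:end], next_token
    return fetch_page
```

Stopping after the first page would silently truncate the result set, which is why the loop checks only the continuation token and not the number of items returned.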

Maintain singleton pattern in your client

Use a single instance of the SDK client for each account and application, because connections are tied to the client. Compute environments limit the number of open connections, and exceeding that limit affects connectivity.

Potential benefits: Optimizes connections and efficiency.

Learn more: Designing resilient applications with Azure Cosmos DB SDKs.
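
One way to keep a single client per account is a small registry that creates the client on first use and returns the same instance afterwards. The Python sketch below is illustrative only: `ClientRegistry` is a hypothetical helper, and `factory` is an assumed callable that builds your SDK client from an account endpoint. Neither is part of the Cosmos DB SDKs.

```python
import threading
from typing import Callable, Dict, Generic, TypeVar

T = TypeVar("T")


class ClientRegistry(Generic[T]):
    """Create at most one client per account endpoint and reuse it."""

    def __init__(self, factory: Callable[[str], T]) -> None:
        self._factory = factory  # hypothetical: builds an SDK client
        self._clients: Dict[str, T] = {}
        self._lock = threading.Lock()

    def get(self, endpoint: str) -> T:
        # The lock keeps concurrent callers from creating duplicate clients.
        with self._lock:
            if endpoint not in self._clients:
                self._clients[endpoint] = self._factory(endpoint)
            return self._clients[endpoint]
```

Creating the registry once at application startup and passing it to request handlers means every handler shares the same connections instead of opening its own.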

Application resilience

Implement retry logic in your client

Cosmos DB SDKs automatically retry many transient errors. Even so, applications should implement additional retry policies for the specific cases that the SDKs can't address generically, which makes error handling more robust.

Potential benefits: Enhances error handling resilience.

Learn more: Designing resilient applications with Azure Cosmos DB SDKs.
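
An application-level retry policy can be sketched as a wrapper that retries selected errors with capped exponential backoff and jitter. This is a general technique, not the SDKs' built-in retry behavior. In the Python sketch below, `TransientError` and `call_with_retry` are hypothetical names, and the default attempt count and delays are assumptions chosen for illustration.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for failures the application deems safe to retry."""


def call_with_retry(
    operation: Callable[[], T],
    retryable: Tuple[Type[BaseException], ...] = (TransientError,),
    max_attempts: int = 4,      # assumed default, tune for your workload
    base_delay: float = 0.5,    # seconds, assumed default
    max_delay: float = 8.0,     # seconds, assumed default
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run operation, retrying retryable errors with capped exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except retryable:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error to the caller
            # Full jitter: wait a random time up to the capped exponential delay.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, delay))
    raise AssertionError("unreachable")
```

Which errors are safe to retry depends on the operation and on whether it can be repeated without side effects, so the `retryable` tuple should be decided per call site using the SDK guidance linked above.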

Monitoring

Monitor Cosmos DB health and set up alerts

Monitoring the availability and responsiveness of Azure Cosmos DB resources and having alerts set up for your workload is a good practice. This ensures you stay proactive in handling unforeseen events.

Potential benefits: Proactive issue management.

Learn more: Create alerts for Azure Cosmos DB using Azure Monitor