Resiliency recommendations for Azure Cosmos DB for NoSQL
This article contains recommendations for achieving resiliency for Azure Cosmos DB for NoSQL. Many of the recommendations contain supporting Azure Resource Graph (ARG) queries to help identify non-compliant resources.
Resiliency recommendations impact matrix
Each recommendation is marked in accordance with the following impact matrix:
| Impact | Description |
|---|---|
| High | Immediate fix needed. |
| Medium | Fix within 3-6 months. |
| Low | Needs to be reviewed. |
Resiliency recommendations summary
Availability
Configure at least two regions for high availability
Enable a secondary region on your Cosmos DB account to qualify for a higher SLA. Adding a region doesn't incur any downtime, and it's as easy as selecting a pin on the map. Accounts that use Strong consistency need at least three regions to retain write availability if one region fails.
Potential benefits: Enhances SLA and resilience.
Learn more: Reliability (High availability) in Cosmos DB for NoSQL
Resources
| where type =~ 'Microsoft.DocumentDb/databaseAccounts'
| where
array_length(properties.locations) < 2 or
(array_length(properties.locations) < 3 and properties.consistencyPolicy.defaultConsistencyLevel == 'Strong')
| project recommendationId='cosmos-1', name, id, tags
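The condition that the query above checks can be restated outside of Azure Resource Graph. The following is a minimal sketch that mirrors the same rule in Python; the function name `needs_more_regions` and its parameters are illustrative assumptions, not part of any Azure SDK.

```python
def needs_more_regions(region_count: int, default_consistency: str) -> bool:
    """Mirror the ARG rule: flag accounts with fewer than two regions,
    or fewer than three regions when the default consistency is Strong."""
    if region_count < 2:
        return True
    # The comparison is case-sensitive, like the `==` operator in the query.
    return default_consistency == "Strong" and region_count < 3


# Hypothetical accounts, described only by region count and consistency level.
flagged_single_region = needs_more_regions(1, "Session")
flagged_two_region_strong = needs_more_regions(2, "Strong")
flagged_three_region_strong = needs_more_regions(3, "Strong")
```

An account is reported by the query only when this rule evaluates to true, so a two-region account with Session consistency is not flagged.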
Disaster recovery
Enable service-managed failover for multi-region accounts with single write region
Cosmos DB is designed for high uptime and resiliency, but regional outages can still occur. With service-managed failover, if a region is down, Cosmos DB automatically switches to the next available region, requiring no user action.
Resources
| where type =~ 'Microsoft.DocumentDb/databaseAccounts'
| where
array_length(properties.locations) > 1 and
tobool(properties.enableAutomaticFailover) == false and
tobool(properties.enableMultipleWriteLocations) == false
| project recommendationId='cosmos-2', name, id, tags
Evaluate multi-region write capability
Multi-region write capability allows for designing applications that are highly available across multiple regions, though it demands careful attention to consistency requirements and conflict resolution. Improper setup may decrease availability and cause data corruption due to unhandled conflicts.
Potential benefits: Enhances high availability.
Learn more:
- Distribute your data globally with Azure Cosmos DB
- Conflict types and resolution policies when using multiple write regions
Resources
| where type =~ 'Microsoft.DocumentDb/databaseAccounts'
| where
array_length(properties.locations) > 1 and
properties.enableMultipleWriteLocations == false
| project recommendationId='cosmos-3', name, id, tags
Choose appropriate consistency mode reflecting data durability requirements
In a globally distributed database, the consistency level affects data durability during regional outages. Understand how much data loss your workload can tolerate when you plan for recovery. Use Session consistency unless a stronger level is needed. Stronger levels come with higher write latencies, and an outage in a read-only region can affect the write region.
Potential benefits: Enhances data durability and recovery.
Learn more: Consistency levels in Azure Cosmos DB
Configure continuous backup mode
Cosmos DB's backup is always on, offering protection against accidental data changes or loss. Continuous mode allows self-serve restoration to a point in time before the incident. Periodic mode requires contacting Azure support, which leads to longer restore times.
Potential benefits: Faster self-serve data restore.
Learn more: Continuous backup with point in time restore feature in Azure Cosmos DB
Resources
| where type =~ 'Microsoft.DocumentDb/databaseAccounts'
| where
properties.backupPolicy.type == 'Periodic' and
properties.enableMultipleWriteLocations == false and
properties.enableAnalyticalStorage == false
| project recommendationId='cosmos-5', name, id, tags
System efficiency
Ensure query results are fully drained
Cosmos DB limits each query response to 4 MB, so large or partition-spanning queries return their results in pages. Each page indicates whether more results are available and provides a continuation token for retrieving the next page. Application code needs a loop that keeps requesting pages until no more results remain.
Potential benefits: Maximizes data retrieval efficiency.
Learn more: Pagination in Azure Cosmos DB for NoSQL.
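The draining loop described above can be sketched as follows. This is a minimal illustration of the pattern only, not the Cosmos DB SDK's API: `drain_query`, `fake_fetch_page`, and the in-memory `_PAGES` data are hypothetical stand-ins for a real paged query call that returns items plus a continuation token.

```python
from typing import Any, Callable, Dict, Iterator, List, Optional, Tuple

# A page is a list of items plus the continuation token for the next page
# (None when there are no more results).
Page = Tuple[List[Dict[str, Any]], Optional[str]]


def drain_query(fetch_page: Callable[[Optional[str]], Page]) -> Iterator[Dict[str, Any]]:
    """Yield every item from every page by following continuation tokens."""
    token: Optional[str] = None
    while True:
        items, token = fetch_page(token)
        yield from items
        # Stop only when no continuation token is returned; an empty page
        # alone is not treated as the end of the results.
        if token is None:
            break


# Hypothetical in-memory stand-in for a paged query endpoint.
_PAGES: Dict[Optional[str], Page] = {
    None: ([{"id": "1"}, {"id": "2"}], "t1"),
    "t1": ([], "t2"),  # simulated empty page that still has a continuation
    "t2": ([{"id": "3"}], None),
}


def fake_fetch_page(token: Optional[str]) -> Page:
    return _PAGES[token]


all_items = list(drain_query(fake_fetch_page))
```

The loop condition is the presence of a continuation token rather than the size of the page, so a query is considered complete only when the service stops returning a token.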
Maintain singleton pattern in your client
Using a single instance of the SDK client for each account and application is crucial as connections are tied to the client. Compute environments have a limit on open connections, affecting connectivity when exceeded.
Potential benefits: Optimizes connections and efficiency.
Learn more: Designing resilient applications with Azure Cosmos DB SDKs.
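One way to keep a single client per account is a small, thread-safe registry that creates the client on first use and returns the same instance afterwards. The sketch below assumes a hypothetical `FakeClient` class in place of the real SDK client; `get_client`, the endpoint URL, and the key value are also placeholders.

```python
import threading
from typing import Dict


class FakeClient:
    """Hypothetical stand-in for an SDK client that owns its connections."""

    instances_created = 0

    def __init__(self, endpoint: str, key: str) -> None:
        FakeClient.instances_created += 1
        self.endpoint = endpoint


_clients: Dict[str, FakeClient] = {}
_lock = threading.Lock()


def get_client(endpoint: str, key: str) -> FakeClient:
    """Return the shared client for this account, creating it on first use."""
    with _lock:  # guard against two threads creating a client at once
        client = _clients.get(endpoint)
        if client is None:
            client = FakeClient(endpoint, key)
            _clients[endpoint] = client
        return client


# Both calls target the same (placeholder) account, so they share one client.
first = get_client("https://example-account.documents.azure.com:443/", "placeholder-key")
second = get_client("https://example-account.documents.azure.com:443/", "placeholder-key")
```

Keying the registry by account endpoint keeps one client per account for the lifetime of the application, which matches the recommendation above.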
Application resilience
Implement retry logic in your client
Cosmos DB SDKs automatically manage many transient errors through retries. Despite this, it's crucial for applications to implement additional retry policies targeting specific cases that the SDKs can't generically address, ensuring more robust error handling.
Potential benefits: Enhances error handling resilience.
Learn more: Designing resilient applications with Azure Cosmos DB SDKs.
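An application-level retry policy can be sketched as a wrapper that retries a failed operation with exponential backoff and jitter. This is an illustrative pattern under stated assumptions, not the behavior of the Cosmos DB SDKs: `TransientError`, `with_retries`, and the default attempt and delay values are hypothetical and would need to be mapped to the specific errors your application decides to retry.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical error type for failures the application chooses to retry."""


def with_retries(
    operation: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 0.1,
    max_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run operation, retrying on TransientError with capped exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    attempt = 0
    while True:
        attempt += 1
        try:
            return operation()
        except TransientError:
            if attempt >= max_attempts:
                raise  # retries exhausted: surface the error to the caller
            # Delay doubles with each attempt, is capped, and is randomized
            # (full jitter) so that many clients do not retry in lockstep.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, delay))


# Usage: a simulated operation that fails twice and then succeeds.
calls = {"count": 0}


def flaky_write() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise TransientError("simulated transient failure")
    return "ok"


result = with_retries(flaky_write, sleep=lambda seconds: None)
```

Only the error type chosen as retryable is caught, so non-transient failures propagate immediately instead of being retried.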
Monitoring
Monitor Cosmos DB health and set up alerts
Monitor the availability and responsiveness of your Azure Cosmos DB resources and set up alerts for your workload, so that you can respond proactively to unforeseen events.
Potential benefits: Proactive issue management.
Learn more: Create alerts for Azure Cosmos DB using Azure Monitor