High availability with Azure Cosmos DB

Azure Cosmos DB transparently replicates your data across all the Azure regions associated with your Cosmos account. Cosmos DB employs multiple layers of redundancy for your data:


  • The data within Cosmos containers is horizontally partitioned.

  • Within each region, every partition is protected by a replica set, with all writes replicated and durably committed by a majority of replicas. Replicas are distributed across as many as 10 to 20 fault domains.

  • Each partition is replicated across all the regions. Each region contains all the data partitions of a Cosmos container and can accept writes and serve reads.

If your Cosmos account is distributed across N Azure regions, there are at least N x 4 copies of all your data. Besides providing low-latency data access and scaling write and read throughput across the regions associated with your Cosmos account, having more regions (a higher N) further improves availability.
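
The N x 4 rule above can be written as a one-line calculation. This is only an illustrative sketch: the function name is hypothetical, and the factor of 4 is taken directly from the formula in the text, treated here as the minimum number of copies per region.

```python
# Minimum copies per region, taken from the "N x 4" formula in the text.
COPIES_PER_REGION = 4


def minimum_copies(region_count: int) -> int:
    """Return the minimum number of copies of the data for an account
    distributed across `region_count` Azure regions."""
    if region_count < 1:
        raise ValueError("an account spans at least one region")
    return region_count * COPIES_PER_REGION


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(f"{n} region(s): at least {minimum_copies(n)} copies")
```

For example, an account spanning three regions holds at least 12 copies of every item.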

SLAs for availability

As a multi-region distributed database, Cosmos DB provides comprehensive SLAs that cover throughput, latency at the 99th percentile, consistency, and high availability. The table below shows the high-availability guarantees that Cosmos DB provides for single-region and multi-region accounts. For high availability, always configure your Cosmos accounts to have multiple write regions.

Operation type | Single region | Multi-region (single-region writes) | Multi-region (multi-region writes)
Writes         | 99.99%        | 99.99%                              | 99.999%
Reads          | 99.99%        | 99.999%                             | 99.999%


In practice, the actual write availability for the bounded staleness, session, consistent prefix, and eventual consistency models is significantly higher than the published SLAs. The actual read availability for all consistency levels is significantly higher than the published SLAs.
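
To make the SLA percentages in the table more concrete, they can be converted into a downtime budget. The sketch below is plain arithmetic and assumes, purely for illustration, that the budget is spread over a 365-day year; the function name is hypothetical, and the actual measurement period is defined by the SLA itself.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: a 365-day year


def downtime_budget_minutes(sla_percent: float,
                            period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Minutes of unavailability permitted by an availability SLA
    over the given period."""
    if not 0 <= sla_percent <= 100:
        raise ValueError("SLA must be a percentage between 0 and 100")
    return period_minutes * (1 - sla_percent / 100)


if __name__ == "__main__":
    for sla in (99.99, 99.999):
        budget = downtime_budget_minutes(sla)
        print(f"{sla}% -> {budget:.2f} minutes per year")
    # prints roughly 52.56 minutes for 99.99% and 5.26 minutes for 99.999%
```

Moving from 99.99% to 99.999% therefore shrinks the yearly budget by a factor of ten, which is the practical difference between the single-region and multi-region write columns.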

High availability with Cosmos DB in the event of regional outages

Regional outages aren't uncommon, and Azure Cosmos DB makes sure your database stays highly available. The following details describe Cosmos DB behavior during an outage, depending on your Cosmos account configuration:

  • With Cosmos DB, before a write operation is acknowledged to the client, the data is durably committed by a quorum of replicas within the region that accepts the write operation.

  • Multi-region accounts configured with multiple write regions are highly available for both writes and reads. Regional failovers are instantaneous and don't require any changes in the application.

  • Single-region accounts may lose availability following a regional outage. Set up at least two regions (preferably at least two write regions) with your Cosmos account to ensure high availability at all times.

  • Multi-region accounts with a single write region (write region outage):

    • During a write region outage, the Cosmos account automatically promotes a secondary region to be the new primary write region if automatic failover is enabled on the Azure Cosmos account. When enabled, the failover occurs to another region in the order of the region priority you've specified.
    • Customers may also choose to use manual failover and monitor their Cosmos write endpoint URLs themselves by using an agent they build. For customers with complex and sophisticated health monitoring needs, this can reduce the recovery time objective (RTO) if a failure occurs in the write region.
    • When the previously impacted region is back online, any write data that was unreplicated when the region failed is made available through the conflicts feed. Applications can read the conflicts feed, resolve the conflicts based on application-specific logic, and write the updated data back to the Azure Cosmos container as appropriate.
    • Once the previously impacted write region recovers, it automatically becomes available as a read region. You can switch back to the recovered region as the write region by using Azure CLI or the Azure portal. There is no data or availability loss before, during, or after you switch the write region, and your application continues to be highly available.
  • Multi-region accounts with a single write region (read region outage):

    • During a read region outage, these accounts remain highly available for reads and writes.
    • The impacted region is automatically disconnected and marked offline. The Azure Cosmos DB SDKs redirect read calls to the next available region in the preferred region list.
    • If none of the regions in the preferred region list is available, calls automatically fall back to the current write region.
    • No changes are required in your application code to handle a read region outage. When the impacted region is back online, the previously impacted read region automatically syncs with the current write region and becomes available again to serve read requests.
    • Subsequent reads are redirected to the recovered region without requiring any changes to your application code. During both failover and rejoining of a previously failed region, Cosmos DB continues to honor read consistency guarantees.
  • Even in the rare and unfortunate event that an Azure region is permanently irrecoverable, there is no data loss if your multi-region Cosmos account is configured with strong consistency. If a write region is permanently irrecoverable, then for a multi-region Cosmos account configured with bounded staleness consistency, the potential data loss window is restricted to the staleness window (K or T), where K = 100,000 updates and T = 5 minutes. For the session, consistent prefix, and eventual consistency levels, the potential data loss window is restricted to a maximum of 15 minutes. For more information on RTO and RPO targets for Azure Cosmos DB, see Consistency levels and data durability.
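
The region-selection behavior described in the list above can be sketched as two small functions. This is a simplified model for illustration only, not the SDK's or the service's actual implementation: the function names are hypothetical, regions are plain strings, and region health is represented as a set of currently available region names.

```python
from typing import Optional, Sequence, Set


def pick_read_region(preferred: Sequence[str],
                     available: Set[str],
                     write_region: str) -> Optional[str]:
    """Read routing: use the first available region in the preferred
    region list, otherwise fall back to the current write region."""
    for region in preferred:
        if region in available:
            return region
    return write_region if write_region in available else None


def promote_write_region(failover_priority: Sequence[str],
                         available: Set[str]) -> Optional[str]:
    """Automatic failover: the first available region in the configured
    priority order becomes the new write region."""
    for region in failover_priority:
        if region in available:
            return region
    return None


if __name__ == "__main__":
    # Read region outage: "West US" is offline, so reads move to "East US".
    print(pick_read_region(["West US", "East US"],
                           {"East US", "North Europe"},
                           write_region="North Europe"))
    # No preferred region is available: reads fall back to the write region.
    print(pick_read_region(["West US", "East US"],
                           {"North Europe"},
                           write_region="North Europe"))
    # Write region outage: "North Europe" is offline, the next priority wins.
    print(promote_write_region(["North Europe", "East US", "West US"],
                               {"East US", "West US"}))
```

Both functions walk an ordered list and stop at the first healthy region, which is why the order of the preferred region list and of the failover priorities matters when you configure the account.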

Building highly available applications

  • To ensure high write and read availability, configure your Cosmos account to span at least two regions with multiple write regions. This configuration provides the highest availability, lowest latency, and best scalability for both reads and writes, backed by SLAs. To learn more, see how to configure your Cosmos account with multiple write regions.

  • For multi-region Cosmos accounts that are configured with a single write region, enable automatic failover by using Azure CLI or the Azure portal. After you enable automatic failover, Cosmos DB automatically fails over your account whenever there is a regional disaster.

  • Even if your Cosmos account is highly available, your application may not be correctly designed to remain highly available. To test the end-to-end high availability of your application as part of application testing or disaster recovery (DR) drills, temporarily disable automatic failover for the account, invoke a manual failover by using Azure CLI or the Azure portal, and then monitor how your application handles the failover. Once complete, you can fail back to the primary region and restore automatic failover for the account.

  • Within a multi-region distributed database environment, there is a direct relationship between the consistency level and data durability in the presence of a region-wide outage. As you develop your business continuity plan, you need to understand the maximum acceptable time before the application fully recovers after a disruptive event. The time required for an application to fully recover is known as the recovery time objective (RTO). You also need to understand the maximum period of recent data updates that the application can tolerate losing when recovering after a disruptive event. The time period of updates that you can afford to lose is known as the recovery point objective (RPO). To see the RPO and RTO for Azure Cosmos DB, see Consistency levels and data durability.
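
The relationship between consistency level and RPO can be expressed as a small lookup, using the data loss windows stated earlier in this article for a multi-region account whose write region is permanently lost. This is a hedged sketch for planning purposes: the dictionary keys and function name are hypothetical, and bounded staleness is represented only by its time bound T = 5 minutes (the update-count bound K = 100,000 is not modeled).

```python
from datetime import timedelta
from typing import List

# Maximum potential data loss window per consistency level, as stated in
# this article for multi-region accounts. Keys are illustrative labels.
MAX_DATA_LOSS = {
    "strong": timedelta(0),
    "bounded_staleness": timedelta(minutes=5),  # T only; K is not modeled
    "session": timedelta(minutes=15),
    "consistent_prefix": timedelta(minutes=15),
    "eventual": timedelta(minutes=15),
}


def levels_meeting_rpo(required_rpo: timedelta) -> List[str]:
    """Return the consistency levels whose maximum data loss window
    does not exceed the required RPO."""
    return [level for level, window in MAX_DATA_LOSS.items()
            if window <= required_rpo]


if __name__ == "__main__":
    print(levels_meeting_rpo(timedelta(minutes=5)))
    print(levels_meeting_rpo(timedelta(0)))
```

With an RPO of five minutes, only strong and bounded staleness qualify; an RPO of zero leaves strong consistency as the only option.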

Next steps

Next, you can read the following articles: