High availability with Azure Cosmos DB

Azure Cosmos DB transparently replicates your data across all the Azure regions associated with your Azure Cosmos account. It employs multiple layers of redundancy for your data, as shown in the following image:

[Figure: Physical partitioning]

  • The data within Azure Cosmos containers is horizontally partitioned.

  • A partition-set is a collection of multiple replica-sets. Within each region, every partition is protected by a replica-set, with all writes replicated and durably committed by a majority of replicas. Replicas are distributed across as many as 10-20 fault domains.

  • Each partition is replicated across all the regions. Each region contains all the data partitions of an Azure Cosmos container and can accept writes and serve reads.

If your Azure Cosmos account is distributed across N Azure regions, there will be at least N x 4 copies of all your data. Generally, having an Azure Cosmos account in more than two regions improves the availability of your application and provides low latency across the associated regions.
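The replica arithmetic above can be sketched in a few lines. This is only an illustration: the function names are hypothetical, and treating each regional replica-set as exactly four replicas is an assumption read off the "N x 4" lower bound, not a documented constant.

```python
# Assumption: four replicas per partition per region, taken from the
# "at least N x 4 copies" statement in the text.
REPLICAS_PER_REGION = 4


def minimum_copies(region_count: int) -> int:
    """Lower bound on the number of copies of the data across all regions."""
    if region_count < 1:
        raise ValueError("an account spans at least one region")
    return region_count * REPLICAS_PER_REGION


def majority(replica_count: int) -> int:
    """Smallest number of replicas that forms a majority of a replica-set."""
    return replica_count // 2 + 1


if __name__ == "__main__":
    for regions in (1, 2, 3):
        print(regions, "region(s):", minimum_copies(regions), "copies at minimum")
    print("majority of a 4-replica set:", majority(REPLICAS_PER_REGION))
```

Under this assumption, a write in a given region is durable once three of that region's four replicas have committed it.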

SLAs for availability

As a multi-region distributed database, Azure Cosmos DB provides comprehensive SLAs that encompass throughput, latency at the 99th percentile, consistency, and high availability. The table below shows the guarantees for high availability provided by Azure Cosmos DB for single-region and multi-region accounts. For high availability, always configure your Azure Cosmos accounts to have multiple write regions (also called multi-master).

Operation type   Single region   Multi-region (single-region writes)   Multi-region (multi-region writes)
Writes           99.99%          99.99%                                99.999%
Reads            99.99%          99.999%                               99.999%

Note

In practice, the actual write availability for the bounded staleness, session, consistent prefix, and eventual consistency models is significantly higher than the published SLAs. The actual read availability for all consistency levels is significantly higher than the published SLAs.
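To make the SLA figures concrete, an availability percentage can be converted into a downtime budget. The sketch below is plain arithmetic and not part of any Azure SDK; the function name and the choice of a 365-day year as the measurement window are assumptions made for illustration.

```python
# Assumption: a 365-day year is used as the measurement window.
MINUTES_PER_YEAR = 365 * 24 * 60


def max_downtime_minutes_per_year(availability_percent: float) -> float:
    """Minutes per year that may be unavailable at the given availability."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be a percentage between 0 and 100")
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for sla in (99.99, 99.999):
        budget = max_downtime_minutes_per_year(sla)
        print(f"{sla}% availability allows about {budget:.2f} minutes of downtime per year")
```

Moving from 99.99% to 99.999% shrinks the yearly budget by a factor of ten, which is the practical difference between the single-region and multi-region write columns.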

High availability with Azure Cosmos DB in the event of regional outages

For the rare cases of regional outage, Azure Cosmos DB makes sure your database is always highly available. The following details describe Azure Cosmos DB behavior during an outage, depending on your Azure Cosmos account configuration:

  • With Azure Cosmos DB, before a write operation is acknowledged to the client, the data is durably committed by a quorum of replicas within the region that accepts the write operation.

  • Multi-region accounts configured with multiple write regions (multi-master) will be highly available for both writes and reads. Regional failovers are instantaneous and don't require any changes from the application.

  • Single-region accounts may lose availability following a regional outage. It's always recommended to set up at least two regions (preferably, at least two write regions) with your Azure Cosmos account to ensure high availability at all times.

Multi-region accounts with a single-write region (write region outage)

  • During a write region outage, the Azure Cosmos account automatically promotes a secondary region to be the new primary write region, provided that "enable automatic failover" is configured on the account. When it is enabled, the failover occurs to another region in the order of the region priority you've specified.
  • When the previously impacted region is back online, any write data that was not replicated when the region failed is made available through the conflicts feed. Applications can read the conflicts feed, resolve the conflicts based on application-specific logic, and write the updated data back to the Azure Cosmos container as appropriate.
  • Once the previously impacted write region recovers, it automatically becomes available as a read region. You can switch back to the recovered region as the write region by using PowerShell, Azure CLI, or the Azure portal. There is no data or availability loss before, during, or after you switch the write region, and your application continues to be highly available.

Important

It is strongly recommended that you configure the Azure Cosmos accounts used for production workloads to enable automatic failover. Manual failover requires connectivity between the secondary and primary write regions to complete a consistency check that ensures there is no data loss during the failover. If the primary region is unavailable, this consistency check cannot complete and the manual failover will not succeed, resulting in loss of write availability.
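The priority-ordered promotion described in this section can be pictured with a small simulation. It is a sketch of the selection rule only, not of how the service implements failover; the function name, the region names, and the convention that a lower number means a higher priority are all assumptions made for illustration.

```python
from typing import Dict, Optional, Set


def pick_new_write_region(
    failover_priorities: Dict[str, int],
    unavailable_regions: Set[str],
) -> Optional[str]:
    """Return the reachable region with the best priority, or None if none is left.

    Assumption: a lower number means a higher failover priority.
    """
    candidates = [
        region for region in failover_priorities
        if region not in unavailable_regions
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda region: failover_priorities[region])


if __name__ == "__main__":
    # Hypothetical account: West US is the write region (priority 0).
    priorities = {"West US": 0, "East US": 1, "North Europe": 2}
    print(pick_new_write_region(priorities, unavailable_regions={"West US"}))
```

With the write region down, the next region in priority order is promoted; if that region were also unavailable, the rule would continue down the list.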

Multi-region accounts with a single-write region (read region outage)

  • During a read region outage, Azure Cosmos accounts using any consistency level other than strong, or strong consistency with three or more read regions, will remain highly available for reads and writes.
  • Azure Cosmos accounts using strong consistency with two or fewer read regions (a count that includes the read and write region) will lose write availability during a read region outage but will maintain read availability for the remaining regions.
  • The impacted region is automatically disconnected and marked offline. The Azure Cosmos DB SDKs redirect read calls to the next available region in the preferred region list.
  • If none of the regions in the preferred region list is available, calls automatically fall back to the current write region.
  • No changes are required in your application code to handle a read region outage. When the impacted read region is back online, it automatically syncs with the current write region and becomes available again to serve read requests.
  • Subsequent reads are redirected to the recovered region without requiring any changes to your application code. During both failover and rejoining of a previously failed region, Azure Cosmos DB continues to honor its read consistency guarantees.
  • Even in the rare and unfortunate event that an Azure region is permanently irrecoverable, there is no data loss if your multi-region Azure Cosmos account is configured with strong consistency. If the write region is permanently irrecoverable and the multi-region account is configured with bounded-staleness consistency, the potential data loss window is restricted to the staleness window (K or T), where K = 100,000 updates and T = 5 minutes. For the session, consistent-prefix, and eventual consistency levels, the potential data loss window is restricted to a maximum of 15 minutes. For more information on RTO and RPO targets for Azure Cosmos DB, see Consistency levels and data durability.
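The read-routing behavior in the list above (try the preferred regions in order, then fall back to the write region) can be sketched as follows. This is a simplified model of the rule as the text states it, not the SDK's actual code; the function and region names are hypothetical.

```python
from typing import Iterable, Sequence


def pick_read_region(
    preferred_regions: Sequence[str],
    available_regions: Iterable[str],
    write_region: str,
) -> str:
    """Choose the region that should serve a read request."""
    available = set(available_regions)
    # Walk the preferred list in order and take the first region that is up.
    for region in preferred_regions:
        if region in available:
            return region
    # Nothing in the preferred list is reachable: fall back to the write region.
    return write_region


if __name__ == "__main__":
    preferred = ["East US", "North Europe"]
    # East US is offline, so the next preferred region serves the read.
    print(pick_read_region(preferred, ["North Europe", "West US"], "West US"))
    # Both preferred regions are offline, so the write region serves the read.
    print(pick_read_region(preferred, ["West US"], "West US"))
```

Because the choice is re-evaluated per request in this model, a recovered region that reappears in the available set is selected again without any change to the calling code.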

Building highly available applications

  • To ensure high write and read availability, configure your Azure Cosmos account to span at least two regions with multiple write regions. This configuration provides the highest availability, lowest latency, and best scalability for both reads and writes, backed by SLAs. To learn more, see how to configure your Azure Cosmos account with multiple write regions.

  • For multi-region Azure Cosmos accounts that are configured with a single write region, enable automatic failover by using Azure CLI or the Azure portal. After you enable automatic failover, Cosmos DB will automatically fail over your account whenever there is a regional disaster.

  • Even if your Azure Cosmos account is highly available, your application may not be correctly designed to remain highly available. To test the end-to-end high availability of your application, as a part of your application testing or disaster recovery (DR) drills, temporarily disable automatic failover for the account, invoke a manual failover by using PowerShell, Azure CLI, or the Azure portal, and then monitor your application's failover. Once complete, you can fail back to the primary region and restore automatic failover for the account.

  • Within a multi-region distributed database environment, there is a direct relationship between the consistency level and data durability in the presence of a region-wide outage. As you develop your business continuity plan, you need to understand the maximum acceptable time before the application fully recovers after a disruptive event. The time required for an application to fully recover is known as the recovery time objective (RTO). You also need to understand the maximum period of recent data updates the application can tolerate losing when recovering after a disruptive event. The time period of updates that you can afford to lose is known as the recovery point objective (RPO). To see the RPO and RTO for Azure Cosmos DB, see Consistency levels and data durability.
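As a planning aid, the data-loss windows this article quotes for multi-region accounts can be put in a lookup table and compared with an application's RPO. The sketch below is illustrative only: the type, the dictionary keys, and the `meets_rpo` helper are hypothetical names, and comparing only the time bound (T) for bounded staleness, while ignoring the update-count bound (K), is a simplifying assumption.

```python
from typing import Dict, NamedTuple, Optional


class LossWindow(NamedTuple):
    max_minutes: int            # upper bound on lost updates, in minutes
    max_updates: Optional[int]  # update-count bound, if the text states one


# Values as quoted in this article for multi-region accounts.
LOSS_WINDOWS: Dict[str, LossWindow] = {
    "strong": LossWindow(0, None),                # no data loss
    "bounded_staleness": LossWindow(5, 100_000),  # T = 5 minutes, K = 100,000 updates
    "session": LossWindow(15, None),
    "consistent_prefix": LossWindow(15, None),
    "eventual": LossWindow(15, None),
}


def meets_rpo(consistency_level: str, tolerated_minutes: float) -> bool:
    """True if the quoted time window fits within the tolerated RPO.

    Assumption: only the time bound is compared, not the update-count bound.
    """
    return LOSS_WINDOWS[consistency_level].max_minutes <= tolerated_minutes


if __name__ == "__main__":
    for level in LOSS_WINDOWS:
        print(level, "fits a 10-minute RPO:", meets_rpo(level, 10))
```

A check like this makes the trade-off explicit: a tighter RPO narrows the set of consistency levels that qualify.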

Next steps

Next, you can read the following articles: