Azure Cosmos DB 多区域分配数据 - 揭秘Multiple-region data distribution with Azure Cosmos DB - under the hood

Azure Cosmos DB 是 Azure 的一个基础服务,因此它将部署在中国境内的所有 Azure 中国区域中。Azure Cosmos DB is a foundational service in Azure, so it's deployed across all Azure China regions around China. 在数据中心内,我们会在大量的计算机阵列上部署和管理 Azure Cosmos DB,其中的每个服务都有专用的本地存储。Within a data center, we deploy and manage the Azure Cosmos DB on massive stamps of machines, each with dedicated local storage. 在数据中心内,Azure Cosmos DB 部署在许多群集之间,每个群集可能运行多代硬件。Within a data center, Azure Cosmos DB is deployed across many clusters, each potentially running multiple generations of hardware. 群集中的计算机通常分散在 10 到 20 个容错域之间,以便在一个区域中实现高可用性。Machines within a cluster are typically spread across 10-20 fault domains for high availability within a region. 下图显示了 Cosmos DB 多区域分布系统拓扑:The following image shows the Cosmos DB multiple-region distribution system topology:


Azure Cosmos DB 中的多区域分布是统包式: 随时可以点击几下鼠标或者使用单个 API 调用以编程方式来添加或删除与 Cosmos 数据库关联的地理区域。Multiple-region distribution in Azure Cosmos DB is turnkey: At any time, with a few clicks or programmatically with a single API call, you can add or remove the geographical regions associated with your Cosmos database. 而 Cosmos 数据库包含一组 Cosmos 容器。A Cosmos database, in turn, consists of a set of Cosmos containers. 在 Cosmos DB 中,容器充当逻辑性的分布和缩放单元。In Cosmos DB, containers serve as the logical units of distribution and scalability. 创建的集合、表和图形(在内部)只是 Cosmos 容器。The collections, tables, and graphs you create are (internally) just Cosmos containers. 容器对架构完全不可知,它提供查询范围。Containers are completely schema-agnostic and provide a scope for a query. Cosmos 容器中的数据在引入时会自动编制索引。Data in a Cosmos container is automatically indexed upon ingestion. 自动编制索引使用户无需进行繁琐的架构或索引管理(尤其是在多区域分布式设置中)就能查询数据。Automatic indexing enables users to query the data without the hassles of schema or index management, especially in a multiple-regionally distributed setup.

  • 在给定的区域中,可以使用分区键来分布容器中的数据。分区键由你提供,并由基础物理分区以透明方式进行管理(本地分布)。 In a given region, data within a container is distributed by using a partition-key, which you provide and is transparently managed by the underlying physical partitions (local distribution).

  • 每个物理分区还会跨地理区域进行复制(多区域分布)。 Each physical partition is also replicated across geographical regions (multiple-region distribution).

当使用 Cosmos DB 的应用以弹性方式在 Cosmos 容器中缩放吞吐量或使用其他存储时,Cosmos DB 将以透明方式处理所有区域中的分区管理操作(拆分、克隆、删除)。When an app using Cosmos DB elastically scales throughput on a Cosmos container or consumes more storage, Cosmos DB transparently handles partition management operations (split, clone, delete) across all the regions. 无论缩放、分布或故障情况如何,Cosmos DB 都会在分布于任意个区域之间的容器中提供单个数据系统映像。Independent of the scale, distribution, or failures, Cosmos DB continues to provide a single system image of the data within the containers, which are multiple-regionally distributed across any number of regions.

如下图所示,容器中的数据朝两个维分布 - 在某个区域内部,或者在中国境内的各个区域:As shown in the following image, the data within a container is distributed along two dimensions - within a region and across regions, around China:


物理分区是通过一组副本(称作副本集)实现的。 A physical partition is implemented by a group of replicas, called a replica-set. 每台计算机托管数百个副本,这些副本对应于一组固定进程中的各个物理分区,如上图所示。Each machine hosts hundreds of replicas that correspond to various physical partitions within a fixed set of processes as shown in the image above. 对应于物理分区的副本在区域中群集与数据中心内的计算机之间进行动态定位和负载均衡。Replicas corresponding to the physical partitions are dynamically placed and load balanced across the machines within a cluster and data centers within a region.

一个副本专属于一个 Azure Cosmos DB 租户。A replica uniquely belongs to an Azure Cosmos DB tenant. 每个副本托管 Cosmos DB 数据库引擎的实例,该实例管理资源以及关联的索引。Each replica hosts an instance of Cosmos DB's database engine, which manages the resources as well as the associated indexes. Cosmos 数据库引擎在基于原子记录序列 (ARS) 的系统上运行。The Cosmos database engine operates on an atom-record-sequence (ARS) based type system. 该引擎对架构概念不可知,并将记录的结构与实例值之间的边界模糊化。The engine is agnostic to the concept of a schema, blurring the boundary between the structure and instance values of records. Cosmos DB 通过在引入时为所有内容自动编制索引来有效实现架构的完全不可知性,使用户无需处理架构或索引管理,即可查询其多区域分布式数据。Cosmos DB achieves full schema agnosticism by automatically indexing everything upon ingestion in an efficient manner, which allows users to query their multiple-regionally distributed data without having to deal with schema or index management.

Cosmos 数据库引擎包含的组件可以实现多个协调基元、语言运行时、查询处理器,并包括分别负责事务存储和数据索引的存储子系统与索引子系统。The Cosmos database engine consists of components including implementation of several coordination primitives, language runtimes, the query processor, and the storage and indexing subsystems responsible for transactional storage and indexing of data, respectively. 为了提供持久性和高可用性,数据库引擎将在 SSD 中保存其数据和索引,并相应地将其复制到副本集中的数据库引擎实例之间。To provide durability and high availability, the database engine persists its data and index on SSDs and replicates it among the database engine instances within the replica-set(s) respectively. 大型租户的吞吐量和存储规模较高,并且数据和索引的副本数也较多。Larger tenants correspond to higher scale of throughput and storage and have either bigger or more replicas or both. 该系统的每个组件是完全异步的 – 永远不会出现线程阻塞,每个线程执行生存期较短的工作,可避免任何不必要的线程切换。Every component of the system is fully asynchronous - no thread ever blocks, and each thread does short-lived work without incurring any unnecessary thread switches. 速率限制和反压在从许可控制到所有 I/O 路径的整个堆栈中传播。Rate-limiting and back-pressure are plumbed across the entire stack from the admission control to all I/O paths. Cosmos 数据库引擎旨在利用精细并发性和提供高吞吐量,同时可在有限的系统资源中运行。Cosmos database engine is designed to exploit fine-grained concurrency and to deliver high throughput while operating within frugal amounts of system resources.

Cosmos DB 的多区域分布依赖于两个关键抽象 – 副本集和分区集。 Cosmos DB's multiple-region distribution relies on two key abstractions - replica-sets and partition-sets. 副本集是用于协调的模块化 Lego 块,分区集是一个或多个地理分散式物理分区的动态叠加层。A replica-set is a modular Lego block for coordination, and a partition-set is a dynamic overlay of one or more geographically distributed physical partitions. 若要了解多区域分布的工作原理,需要了解这两个关键抽象。To understand how multiple-region distribution works, we need to understand these two key abstractions.


物理分区具体化为一组分散在多个容错域之间的、名为“副本集”的自我托管动态负载均衡副本。A physical partition is materialized as a self-managed and dynamically load-balanced group of replicas spread across multiple fault domains, called a replica-set. 此集统一实现复制的状态机协议,使物理分区中的数据保持高度可用、持久且一致。This set collectively implements the replicated state machine protocol to make the data within the physical partition highly available, durable, and consistent. 副本集成员身份 N 是动态的 - 它根据故障、管理操作以及重新生成/恢复有故障副本所需的时间,在 NMinNMax 之间波动。The replica-set membership N is dynamic - it keeps fluctuating between NMin and NMax based on the failures, administrative operations, and the time for failed replicas to regenerate/recover. 复制协议还会根据成员身份的变化来重新配置读取和写入仲裁的大小。Based on the membership changes, the replication protocol also reconfigures the size of read and write quorums. 为了均匀分布分配给指定物理分区的吞吐量,我们采用了两种思路:To uniformly distribute the throughput that is assigned to a given physical partition, we employ two ideas:

  • 首先,处理领先者写入请求的开销高于对后继者应用更新的开销。First, the cost of processing the write requests on the leader is higher than the cost of applying the updates on the follower. 相应地,为领先者预算的系统资源比后继者更多。Correspondingly, the leader is budgeted more system resources than the followers.

  • 其次,确保给定一致性级别的读取仲裁尽可能地专门由后继者副本组成。Secondly, as far as possible, the read quorum for a given consistency level is composed exclusively of the follower replicas. 除非有必要,否则我们会避免访问领先者来为读取提供服务。We avoid contacting the leader for serving reads unless required. 在基于仲裁的系统中针对 Cosmos DB 支持的五个一致性模型执行负载和容量关系研究后,我们采用了多种思路。We employ a number of ideas from the research done on the relationship of load and capacity in the quorum-based systems for the five consistency models that Cosmos DB supports.


一组物理分区,其中的每个分区配置了 Cosmos 数据库区域,旨在管理跨所有已配置分区复制的相同键集。A group of physical partitions, one from each of the configured with the Cosmos database regions, is composed to manage the same set of keys replicated across all the configured regions. 这种更高的协调基元称为分区集 - 管理给定键集的物理分区的地理分布式动态叠加层。 This higher coordination primitive is called a partition-set - a geographically distributed dynamic overlay of physical partitions managing a given set of keys. 给定的物理分区(副本集)限定在群集范围内,而分区集可跨群集、数据中心和地理区域,如下图所示:While a given physical partition (a replica-set) is scoped within a cluster, a partition-set can span clusters, data centers, and geographical regions as shown in the image below:


可将分区集视为地理分散的“超副本集”,它由多个拥有相同键集的副本集构成。You can think of a partition-set as a geographically dispersed "super replica-set", which is composed of multiple replica-sets owning the same set of keys. 类似于副本集,分区集的成员身份也是动态的 – 它会根据在给定分区集中添加/删除新分区的隐式物理分区管理操作(例如,在容器中扩展吞吐量、在 Cosmos 数据库中添加/删除区域,或发生故障时)进行波动。Similar to a replica-set, a partition-set's membership is also dynamic - it fluctuates based on implicit physical partition management operations to add/remove new partitions to/from a given partition-set (for instance, when you scale out throughput on a container, add/remove a region to your Cosmos database, or when failures occur). 由于(分区集的)每个分区可在其自身的副本集中管理分区集成员身份,因此,成员身份是完全分散且高度可用的。By virtue of having each of the partitions (of a partition-set) manage the partition-set membership within its own replica-set, the membership is fully decentralized and highly available. 在重新配置分区集期间,还会建立物理分区之间的叠加层拓扑。During the reconfiguration of a partition-set, the topology of the overlay between physical partitions is also established. 该拓扑是根据源与目标物理分区之间的一致性级别、地理距离和可用网络带宽动态选择的。The topology is dynamically selected based on the consistency level, geographical distance, and available network bandwidth between the source and the target physical partitions.

该服务允许为 Cosmos 数据库配置一个或多个写入区域,根据所做的选择,分区集将配置为接受写入正好一个或所有区域。The service allows you to configure your Cosmos databases with either a single write region or multiple write regions, and depending on the choice, partition-sets are configured to accept writes in exactly one or all regions. 该系统采用两级别嵌套共识协议 – 一个级别在接受写入的物理分区副本集的副本内运行,另一个级别在分区集的级别运行,以针对分区中提交的所有写入操作提供完全有序的保证。The system employs a two-level, nested consensus protocol - one level operates within the replicas of a replica-set of a physical partition accepting the writes, and the other operates at the level of a partition-set to provide complete ordering guarantees for all the committed writes within the partition-set. 此多层嵌套共识协议对于实现严格的高可用性 SLA,以及实现 Cosmos DB 为客户提供的一致性模型至关重要。This multi-layered, nested consensus is critical for the implementation of our stringent SLAs for high availability, as well as the implementation of the consistency models, which Cosmos DB offers to its customers.

冲突解决Conflict resolution

我们在更新传播、冲突解决和因果关系跟踪的设计灵感来源于以往 Epidemic 算法Bayou 系统工作的启发。Our design for the update propagation, conflict resolution, and causality tracking is inspired from the prior work on epidemic algorithms and the Bayou system. 尽管这些思路的核心得以留存,并为传达 Cosmos DB 的系统设计提供方便的参考框架,但我们在将其应用于 Cosmos DB 系统时,还是对其做了重大改造。While the kernels of the ideas have survived and provide a convenient frame of reference for communicating the Cosmos DB's system design, they have also undergone significant transformation as we applied them to the Cosmos DB system. 之所以需要这样做,是因为以前的系统既没有设计资源调控,也不具备运行 Cosmos DB 所需的规模,无法提供 Cosmos DB 向其客户承诺的功能(例如有限过期一致性)和严格且全面的 SLA。This was needed, because the previous systems were designed neither with the resource governance nor with the scale at which Cosmos DB needs to operate, nor to provide the capabilities (for example, bounded staleness consistency) and the stringent and comprehensive SLAs that Cosmos DB delivers to its customers.

前面提到,分区集分布在多个区域之间,并遵循 Cosmos DB(多主数据库)复制协议在包含给定分区集的物理分区之间复制数据。Recall that a partition-set is distributed across multiple regions and follows Cosmos DBs (multi-master) replication protocol to replicate the data among the physical partitions comprising a given partition-set. (分区集的)每个物理分区通常接受写入到该区域本地的客户端,并为读取操作提供服务。Each physical partition (of a partition-set) accepts writes and serves reads typically to the clients that are local to that region. 区域中物理分区接受的写入操作在由客户端确认之前,将以持久方式进行提交并在物理分区中保持高可用性。Writes accepted by a physical partition within a region are durably committed and made highly available within the physical partition before they are acknowledged to the client. 这些写入是试探性的,将使用反熵通道传播到分区集中的其他物理分区。These are tentative writes and are propagated to other physical partitions within the partition-set using an anti-entropy channel. 客户端可以通过传递请求标头来请求试探性写入或提交的写入。Clients can request either tentative or committed writes by passing a request header. 反熵传播(包括传播频率)是动态的,基于分区集的拓扑、物理分区的区域邻近性,以及配置的一致性级别。The anti-entropy propagation (including the frequency of propagation) is dynamic, based on the topology of the partition-set, regional proximity of the physical partitions, and the consistency level configured. 在分区集中,Cosmos DB 遵循采用动态选定仲裁器分区的主要提交方案。Within a partition-set, Cosmos DB follows a primary commit scheme with a dynamically selected arbiter partition. 仲裁器的选择是动态的,在基于叠加层拓扑重新配置分区集时,它是不可或缺的一部分。The arbiter selection is dynamic and is an integral part of the reconfiguration of the partition-set based on the topology of the overlay. 可保证提交的写入(包括多行/批处理更新)的顺序。The committed writes (including multi-row/batched updates) are guaranteed to be ordered.

我们采用编码向量时钟(包含分别对应于每个副本集和分区集共识级别的区域 ID 和逻辑时钟)进行因果关系跟踪,并采用版本向量来检测和解决更新冲突。We employ encoded vector clocks (containing region ID and logical clocks corresponding to each level of consensus at the replica-set and partition-set, respectively) for causality tracking and version vectors to detect and resolve update conflicts. 拓扑和对等选择算法旨在确保版本向量的存储和网络开销固定保持在最低水平。The topology and the peer selection algorithm are designed to ensure fixed and minimal storage and minimal network overhead of version vectors. 该算法可保证严格的融合属性。The algorithm guarantees the strict convergence property.

对于配置了多个写入区域的 Cosmos 数据库,该系统提供多种灵活的自动冲突解决策略供开发人员选择,包括:For the Cosmos databases configured with multiple write regions, the system offers a number of flexible automatic conflict resolution policies for the developers to choose from, including:

  • 最后写入优先 (LWW) :默认使用系统定义的时间戳属性(基于时间同步时钟协议)。Last-Write-Wins (LWW), which, by default, uses a system-defined timestamp property (which is based on the time-synchronization clock protocol). Cosmos DB 还允许指定其他任何自定义数字属性用于解决冲突。Cosmos DB also allows you to specify any other custom numerical property to be used for conflict resolution.
  • 应用程序定义的自定义冲突解决策略(通过合并过程表示):旨在使用应用程序定义的语义来调解冲突。Application-defined (Custom) conflict resolution policy (expressed via merge procedures), which is designed for application-defined semantics reconciliation of conflicts. 在服务器端检测到数据库事务造成的写入-写入冲突时,将会调用这些过程。These procedures get invoked upon detection of the write-write conflicts under the auspices of a database transaction on the server side. 在执行提交协议过程中,该系统可保证正好执行合并过程一次。The system provides exactly once guarantee for the execution of a merge procedure as a part of the commitment protocol. 我们提供了多个冲突解决方法示例供你演练。There are several conflict resolution samples available for you to play with.

一致性模型Consistency Models

不管为 Cosmos 数据库配置了一个还是多个写入区域,都可以从五个妥善定义的一致性模型中进行选择。Whether you configure your Cosmos database with a single or multiple write regions, you can choose from the five well-defined consistency models. 借助多个写入区域,一致性级别获得了以下明显改善:With multiple write regions, the following are a few notable aspects of the consistency levels:

有限过期一致性可保证所有读取操作将保留在任意区域中最新写入操作的 K 前缀或 T 秒范围内。The bounded staleness consistency guarantees that all reads will be within K prefixes or T seconds from the latest write in any of the regions. 此外,可保证具有有限过期一致性的读取操作是单调的,且附带一致前缀保证。Furthermore, reads with bounded staleness consistency are guaranteed to be monotonic and with consistent prefix guarantees. 反熵协议在运行时受到速率限制,确保前缀不会执行累加,并且不需要对写入应用反压。The anti-entropy protocol operates in a rate-limited manner and ensures that the prefixes do not accumulate and the backpressure on the writes does not have to be applied. 会话一致性可保证单调读取、单调写入、读取自己的写入、写入后读取,以及在整个中国的一致前缀保证。Session consistency guarantees monotonic read, monotonic write, read your own writes, write follows read, and consistent prefix guarantees, around China. 对于配置了非常一致性的数据库,由于跨区域的同步复制,多个写入区域的优势(低写入延迟、高写入可用性)不适用。For the databases configured with strong consistency, the benefits (low write latency, high write availability) of multiple write regions does not apply, because of synchronous replication across regions.

此处介绍了 Cosmos DB 中五个一致性模型的语义,此处使用高级 TLA+ 规范对其数学原理做了描述。The semantics of the five consistency models in Cosmos DB are described here, and mathematically described using a high-level TLA+ specifications here.

后续步骤Next steps

接下来,请通过以下文章了解如何配置多区域分布:Next learn how to configure multiple-region distribution by using the following articles: