Using a partitioned graph in Azure Cosmos DB

One of the key features of the Gremlin API in Azure Cosmos DB is the ability to handle large-scale graphs through horizontal scaling. Containers can scale independently in terms of storage and throughput. You can create containers in Azure Cosmos DB that scale automatically to store graph data. The data is automatically balanced based on the specified partition key.

Partitioning is required if the container is expected to store more than 20 GB or if you want to allocate more than 10,000 request units per second (RU/s). The general principles of the Azure Cosmos DB partitioning mechanism still apply, with a few graph-specific optimizations described below.
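
The two thresholds above can be written as a simple check. This is only a sketch in Python: the function and constant names are hypothetical and are not part of any Azure SDK, and the limits are the ones stated in this article.

```python
# Thresholds as stated in this article; names are hypothetical.
STORAGE_LIMIT_GB = 20
THROUGHPUT_LIMIT_RUS = 10_000


def requires_partitioning(expected_storage_gb: float, expected_rus: int) -> bool:
    """Return True when either expected value exceeds the stated limit."""
    return (
        expected_storage_gb > STORAGE_LIMIT_GB
        or expected_rus > THROUGHPUT_LIMIT_RUS
    )
```

For example, a graph expected to hold 50 GB needs a partitioned container even if its throughput stays well below 10,000 RU/s.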

Graph partitioning mechanism

The following guidelines describe how the partitioning strategy in Azure Cosmos DB operates:

  • Both vertices and edges are stored as JSON documents.

  • Vertices require a partition key. A hashing algorithm applied to this key determines the partition in which the vertex is stored. The partition key property name is defined when a new container is created, and it has the format /partitioning-key-name.

  • Edges are stored with their source vertex. In other words, the partition key of each vertex defines where that vertex and its outgoing edges are stored. This optimization avoids cross-partition queries when the out() step is used in graph queries.

  • Edges contain references to the vertices they point to. Every edge is stored with the partition key and ID of the vertex it points to. Because this information is stored on the edge, every out() direction query is a scoped partitioned query rather than a blind cross-partition query.

  • Graph queries need to specify a partition key. To take full advantage of horizontal partitioning in Azure Cosmos DB, specify the partition key whenever possible when selecting a single vertex. Note that /id and /label are not supported as partition keys for a container in the Gremlin API. The following queries select one or more vertices in a partitioned graph:

    • Selecting a vertex by ID, then using the .has() step to specify the partition key property:

      g.V('vertex_id').has('partitionKey', 'partitionKey_value')
      
    • Selecting a vertex by specifying a tuple that includes the partition key value and ID:

      g.V(['partitionKey_value', 'vertex_id'])
      
    • Specifying an array of tuples of partition key values and IDs:

      g.V(['partitionKey_value0', 'vertex_id0'], ['partitionKey_value1', 'vertex_id1'], ...)
      
    • Selecting a set of vertices by their IDs and specifying a list of partition key values:

      g.V('vertex_id0', 'vertex_id1', 'vertex_id2', …).has('partitionKey', within('partitionKey_value0', 'partitionKey_value1', 'partitionKey_value2', …))
      
    • Using the Partition strategy at the beginning of a query and specifying a partition for the scope of the rest of the Gremlin query:

      g.withStrategies(PartitionStrategy.build().partitionKey('partitionKey').readPartitions('partitionKey_value').create()).V()
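
The storage rules described in this section can be illustrated with a toy in-memory model. The Python sketch below is not how Azure Cosmos DB is implemented: the CRC32 hash, the fixed partition count, and every class and function name are assumptions made for illustration. It only shows why storing an edge with its source vertex keeps out() lookups inside one partition, while in() lookups have to check every partition.

```python
import zlib

NUM_PARTITIONS = 4  # hypothetical partition count, chosen for illustration


def partition_for(pk_value: str) -> int:
    """Map a partition key value to a partition index.

    Illustrative hash only; not the algorithm Azure Cosmos DB uses.
    """
    return zlib.crc32(pk_value.encode("utf-8")) % NUM_PARTITIONS


class ToyPartitionedGraph:
    """Toy model: vertices and edges are plain dicts grouped by partition."""

    def __init__(self) -> None:
        self.partitions = [
            {"vertices": {}, "edges": []} for _ in range(NUM_PARTITIONS)
        ]

    def add_vertex(self, pk: str, vertex_id: str) -> None:
        # The vertex lands in the partition selected by its partition key.
        doc = {"id": vertex_id, "partitionKey": pk}
        self.partitions[partition_for(pk)]["vertices"][(pk, vertex_id)] = doc

    def add_edge(self, label: str, source: tuple, target: tuple) -> None:
        # source and target are (partition key value, vertex ID) tuples.
        # The edge is stored in the SOURCE vertex's partition and keeps a
        # reference to the target's partition key and ID.
        src_pk, _ = source
        edge = {"label": label, "source": source, "target": target}
        self.partitions[partition_for(src_pk)]["edges"].append(edge)

    def out(self, source: tuple, label: str):
        """Follow outgoing edges; reads only the source vertex's partition."""
        src_pk, _ = source
        partition = self.partitions[partition_for(src_pk)]
        targets = [
            e["target"]
            for e in partition["edges"]
            if e["source"] == source and e["label"] == label
        ]
        return targets, 1  # one partition read

    def in_(self, target: tuple, label: str):
        """Follow incoming edges; must fan out across every partition."""
        sources = []
        for partition in self.partitions:
            sources.extend(
                e["source"]
                for e in partition["edges"]
                if e["target"] == target and e["label"] == label
            )
        return sources, len(self.partitions)


graph = ToyPartitionedGraph()
graph.add_vertex("seattle", "alice")
graph.add_vertex("london", "bob")
graph.add_edge("knows", ("seattle", "alice"), ("london", "bob"))

out_targets, out_reads = graph.out(("seattle", "alice"), "knows")
in_sources, in_reads = graph.in_(("london", "bob"), "knows")
```

In this model the out() call reads a single partition and returns the target's partition key together with its ID, so a follow-up lookup of the target vertex is also scoped. The in_() call reads all partitions because nothing stored in the target's partition records which vertices point to it.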
      

Best practices when using a partitioned graph

Use the following guidelines to ensure performance and scalability when using partitioned graphs with unlimited containers:

  • Always specify the partition key value when querying a vertex. Retrieving a vertex from a known partition is the most direct way to get good performance. All subsequent adjacency operations are then scoped to a partition, because edges contain the reference ID and partition key of their target vertices.

  • Use the outgoing direction when querying edges whenever possible. As mentioned above, edges are stored with their source vertices in the outgoing direction, so designing the data and queries around this pattern minimizes the chance of resorting to cross-partition queries. By contrast, an in() query is always an expensive fan-out query.

  • Choose a partition key that evenly distributes data across partitions. This decision depends heavily on the data model of the solution. Read more about creating an appropriate partition key in Partitioning and scale in Azure Cosmos DB.

  • Optimize queries to obtain data within the boundaries of a partition. An optimal partitioning strategy is aligned with the querying patterns. Queries that obtain data from a single partition provide the best possible performance.
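
One rough way to compare candidate partition keys for evenness is to count how many vertices share each key value in a sample of the data. The Python sketch below is only a sketch: the function name, the sample documents, and the property names country and userId are hypothetical. It also measures the spread across partition key values only, not how Azure Cosmos DB places those values on physical partitions.

```python
from collections import Counter


def partition_key_skew(documents, key_name):
    """Return (counts per key value, skew ratio) for a candidate key.

    The skew ratio is the largest group divided by the ideal even share;
    1.0 means the sample is spread perfectly evenly across key values.
    """
    if not documents:
        raise ValueError("documents must not be empty")
    counts = Counter(doc[key_name] for doc in documents)
    ideal_share = len(documents) / len(counts)
    return counts, max(counts.values()) / ideal_share


# Hypothetical sample of vertex documents.
sample_vertices = [
    {"id": "v1", "country": "US", "userId": "u1"},
    {"id": "v2", "country": "US", "userId": "u2"},
    {"id": "v3", "country": "US", "userId": "u3"},
    {"id": "v4", "country": "US", "userId": "u4"},
    {"id": "v5", "country": "US", "userId": "u5"},
    {"id": "v6", "country": "UK", "userId": "u6"},
]

country_counts, country_skew = partition_key_skew(sample_vertices, "country")
user_counts, user_skew = partition_key_skew(sample_vertices, "userId")
```

A lower ratio indicates a more even spread, but an even key is not automatically the best choice: as the last guideline notes, the key should also match the query patterns so that most queries stay within one partition.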

Next steps

Next, you can proceed to read the following articles: