Graph data modeling for Azure Cosmos DB Gremlin API

This document provides graph data modeling recommendations. Data modeling is vital to ensuring the scalability and performance of a graph database system as the data evolves. An efficient data model is especially important with large-scale graphs.

Requirements

The process outlined in this guide is based on the following assumptions:

  • The entities in the problem space are identified. These entities are meant to be consumed atomically for each request. In other words, the database system isn't designed to retrieve a single entity's data in multiple query requests.
  • The read and write requirements for the database system are understood. These requirements will guide the optimizations needed for the graph data model.
  • The principles of the Apache TinkerPop property graph standard are well understood.

When do I need a graph database?

A graph database solution is a good fit if the entities and relationships in a data domain have any of the following characteristics:

  • The entities are highly connected through descriptive relationships. The benefit in this scenario is that the relationships are persisted in storage.
  • There are cyclic relationships or self-referenced entities. This pattern is often a challenge when using relational or document databases.
  • There are dynamically evolving relationships between entities. This pattern is especially applicable to hierarchical or tree-structured data with many levels.
  • There are many-to-many relationships between entities.
  • There are write and read requirements on both entities and relationships.

If the above criteria are satisfied, a graph database approach will likely provide advantages for query complexity, data model scalability, and query performance.

The next step is to determine whether the graph will be used for analytic or transactional purposes. If the graph is intended for heavy computation and data processing workloads, it's worth exploring the Cosmos DB Spark connector and the GraphX library.

How to use graph objects

The Apache TinkerPop property graph standard defines two types of objects: vertices and edges.

The following are best practices for the properties in the graph objects:

Object | Property | Type | Notes
Vertex | ID | String | Uniquely enforced per partition. If a value isn't supplied upon insertion, an auto-generated GUID will be stored.
Vertex | label | String | Defines the type of entity that the vertex represents. If a value isn't supplied, the default value "vertex" will be used.
Vertex | properties | String, Boolean, Numeric | A list of separate properties stored as key-value pairs in each vertex.
Vertex | partition key | String, Boolean, Numeric | Defines where the vertex and its outgoing edges will be stored. Read more about graph partitioning.
Edge | ID | String | Uniquely enforced per partition. Auto-generated by default. Edges usually don't need to be uniquely retrieved by an ID.
Edge | label | String | Defines the type of relationship that two vertices have.
Edge | properties | String, Boolean, Numeric | A list of separate properties stored as key-value pairs in each edge.

Note

Edges don't require a partition key value, since it's automatically assigned based on the edge's source vertex. Learn more in the graph partitioning article.
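The defaults described in the table and note above can be sketched as a small in-memory model. This is only an illustration under stated assumptions, not the Cosmos DB implementation: the `Vertex` and `Edge` classes and their field names are hypothetical, and the example values ("seattle", "alice", "owns") are made up.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, Union

# Property values are limited to the types listed in the table above.
PropertyValue = Union[str, bool, int, float]


@dataclass
class Vertex:
    partition_key: PropertyValue
    # If no ID is supplied, a GUID is generated.
    id: str = field(default_factory=lambda: str(uuid.uuid4()))
    # If no label is supplied, the default value "vertex" is used.
    label: str = "vertex"
    properties: Dict[str, PropertyValue] = field(default_factory=dict)


@dataclass
class Edge:
    label: str
    source: Vertex
    target: Vertex
    id: str = field(default_factory=lambda: str(uuid.uuid4()))
    properties: Dict[str, PropertyValue] = field(default_factory=dict)

    @property
    def partition_key(self) -> PropertyValue:
        # An edge takes its partition key from its source vertex.
        return self.source.partition_key


alice = Vertex(partition_key="seattle", id="alice", label="person")
device = Vertex(partition_key="devices")  # id and label fall back to defaults
owns = Edge(label="owns", source=alice, target=device)
```

Here `owns.partition_key` evaluates to "seattle" without ever being set on the edge, which mirrors the note above.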

Entity and relationship modeling guidelines

The following guidelines describe how to approach data modeling for an Azure Cosmos DB Gremlin API graph database. They assume that a definition of the data domain, and of the queries against it, already exists.

Note

The steps outlined below are presented as recommendations. The final model should be evaluated and tested before it's considered production-ready. Additionally, the recommendations below are specific to Azure Cosmos DB's Gremlin API implementation.

Modeling vertices and properties

The first step in building a graph data model is to map every identified entity to a vertex object. A one-to-one mapping of all entities to vertices should be the initial step, and it is subject to change.

One common pitfall is to map properties of a single entity as separate vertices. Consider the example below, where the same entity is represented in two different ways:

  • Vertex-based properties: In this approach, the entity uses three separate vertices and two edges to describe its properties. While this approach might reduce redundancy, it increases model complexity. An increase in model complexity can result in added latency, query complexity, and computation cost. This model can also present challenges in partitioning.

Image: Entity model with vertices for properties.

  • Property-embedded vertices: This approach takes advantage of the key-value pair list to represent all the properties of the entity inside a vertex. It reduces model complexity, which leads to simpler queries and more cost-efficient traversals.

Image: Entity model with properties embedded in the vertex.

Note

The examples above show a simplified graph model, intended only to compare the two ways of dividing entity properties.

The property-embedded vertices pattern generally provides a more performant and scalable approach. The default approach to a new graph data model should gravitate towards this pattern.
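To make the comparison concrete, the sketch below lists the Gremlin statements each approach might need for one hypothetical entity: a product with a color and a size. The entity, the IDs, the edge labels, and the partition key property name `pk` are all assumptions for illustration. The Python code only holds the query strings and counts them; it doesn't connect to a database.

```python
# Property-embedded: one vertex carries every attribute as a key-value pair.
embedded_writes = [
    "g.addV('product').property('id', 'p1').property('pk', 'catalog')"
    ".property('color', 'red').property('size', 'large')",
]
embedded_read = "g.V('p1').valueMap()"

# Vertex-based: each attribute becomes its own vertex, linked by an edge.
vertex_based_writes = [
    "g.addV('product').property('id', 'p1').property('pk', 'catalog')",
    "g.addV('color').property('id', 'red').property('pk', 'catalog')",
    "g.addV('size').property('id', 'large').property('pk', 'catalog')",
    "g.V('p1').addE('hasColor').to(g.V('red'))",
    "g.V('p1').addE('hasSize').to(g.V('large'))",
]
# Reading the attributes back now needs a traversal over two edges.
vertex_based_read = "g.V('p1').out('hasColor', 'hasSize').id()"


def count_objects(statements):
    """Count the vertices and edges a list of Gremlin statements creates."""
    vertices = sum(1 for s in statements if s.startswith("g.addV("))
    edges = sum(1 for s in statements if ".addE(" in s)
    return vertices, edges


embedded_shape = count_objects(embedded_writes)          # (1, 0)
vertex_based_shape = count_objects(vertex_based_writes)  # (3, 2)
```

The vertex-based variant creates three vertices and two edges for the same information that the embedded variant stores in a single vertex, which is the added model complexity described above.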

However, there are scenarios where referencing a property might provide advantages, for example when the referenced property is updated frequently. Using a separate vertex to represent a property that changes constantly minimizes the number of write operations that each update requires.

Relationship modeling with edge directions

After the vertices are modeled, the edges can be added to denote the relationships between them. The first aspect that needs to be evaluated is the direction of the relationship.

Edge objects have a default direction that a traversal follows when it uses the out() or outE() function. Using this natural direction results in an efficient operation, since all vertices are stored with their outgoing edges.

However, traversing in the opposite direction of an edge, using the in() function, will always result in a cross-partition query. Learn more about graph partitioning. If there's a need to constantly traverse using the in() function, it's recommended to add edges in both directions.
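The difference between the two directions can be illustrated with a toy in-memory model. It is a simplification built on one assumption taken from the text: every edge is stored in the partition of its source vertex. The `PartitionedGraph` class, the vertex IDs, and the labels are hypothetical, and the partition counts describe this toy model only, not the actual Cosmos DB storage engine.

```python
from collections import defaultdict


class PartitionedGraph:
    """Toy store: every edge lives in the partition of its source vertex."""

    def __init__(self):
        self.vertex_partition = {}      # vertex id -> partition key
        self.edges = defaultdict(list)  # partition key -> [(source, label, target)]

    def add_vertex(self, vertex_id, partition_key):
        self.vertex_partition[vertex_id] = partition_key

    def add_edge(self, source, label, target):
        self.edges[self.vertex_partition[source]].append((source, label, target))

    def out(self, vertex_id):
        """Follow outgoing edges: only the vertex's own partition is read."""
        pk = self.vertex_partition[vertex_id]
        targets = sorted(t for s, _, t in self.edges[pk] if s == vertex_id)
        return targets, 1

    def in_(self, vertex_id):
        """Follow incoming edges: any partition may hold a matching edge."""
        partitions = set(self.vertex_partition.values())
        sources = sorted(
            s for pk in partitions for s, _, t in self.edges[pk] if t == vertex_id
        )
        return sources, len(partitions)


graph = PartitionedGraph()
graph.add_vertex("thomas", "hr")
graph.add_vertex("mary", "sales")
graph.add_vertex("team", "org")
graph.add_edge("thomas", "memberOf", "team")
graph.add_edge("mary", "memberOf", "team")

forward = graph.out("thomas")  # (["team"], 1): one partition read
reverse = graph.in_("team")    # (["mary", "thomas"], 3): all partitions read

# Adding edges in the other direction turns the reverse lookup into out().
graph.add_edge("team", "hasMember", "thomas")
graph.add_edge("team", "hasMember", "mary")
members = graph.out("team")    # (["mary", "thomas"], 1)
```

In this model the reverse lookup has to read every partition, while the extra `hasMember` edges let the same question be answered from a single partition at the cost of storing each relationship twice.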

You can set the edge direction by applying the .to() or .from() modulators to the .addE() Gremlin step, or by using the bulk executor library for Gremlin API.
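As a rough sketch, the two helpers below build Gremlin query strings that set the edge direction with `.to()` and `.from()`. The helper names, vertex IDs, and labels are hypothetical, and the helpers assume IDs and labels that contain no quote characters. They only format strings and don't submit anything to a database.

```python
def add_edge_to(source_id, label, target_id):
    """Start at the source vertex and point the new edge at the target."""
    return f"g.V('{source_id}').addE('{label}').to(g.V('{target_id}'))"


def add_edge_from(target_id, label, source_id):
    """Start at the target vertex and declare where the new edge comes from."""
    return f"g.V('{target_id}').addE('{label}').from(g.V('{source_id}'))"


# Both statements describe the same direction: thomas -> mary.
with_to = add_edge_to("thomas", "manages", "mary")
with_from = add_edge_from("mary", "manages", "thomas")

# A pair of edges, one per direction, for graphs that are traversed both ways.
both_directions = [
    add_edge_to("thomas", "manages", "mary"),
    add_edge_to("mary", "managedBy", "thomas"),
]
```

With `.to()`, the traversal starts at the source vertex; with `.from()`, it starts at the target vertex and names the source. Either way, the edge is stored with its source vertex.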

Note

Edge objects have a direction by default.

Relationship labeling

Using descriptive relationship labels can improve the efficiency of edge resolution operations. This pattern can be applied in the following ways:

  • Use non-generic terms to label a relationship.
  • Associate the label of the source vertex to the label of the target vertex with the relationship name.

Image: Relationship labeling examples.

The more specific the label that the traverser uses to filter the edges, the better. This decision can also have a significant impact on query cost. You can evaluate the query cost at any time using the executionProfile step.
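The effect of label specificity can be sketched with a small in-memory filter. The edge data, the labels, and the helper names below are hypothetical, and the example only mimics label filtering. Actual query cost depends on the service and should be measured with the executionProfile step.

```python
# Hypothetical edges leaving one vertex: (source id, label, target id).
generic_edges = [
    ("thomas", "has", "laptop-17"),
    ("thomas", "has", "seattle-office"),
    ("thomas", "has", "mary"),
]
specific_edges = [
    ("thomas", "ownsDevice", "laptop-17"),
    ("thomas", "worksAt", "seattle-office"),
    ("thomas", "manages", "mary"),
]


def targets(edges, source, label):
    """Return the targets reached from `source` over edges with `label`."""
    return [
        target
        for edge_source, edge_label, target in edges
        if edge_source == source and edge_label == label
    ]


def with_profile(query):
    """Append the executionProfile step to a Gremlin query string."""
    return query + ".executionProfile()"


# A generic label matches every edge, so further filtering is still needed.
generic_hits = targets(generic_edges, "thomas", "has")
# A specific label narrows the result to the relationship that was asked for.
specific_hits = targets(specific_edges, "thomas", "ownsDevice")

profiled_query = with_profile("g.V('thomas').out('ownsDevice')")
```

Filtering on `has` returns all three targets, while filtering on `ownsDevice` returns only the device. Submitting `profiled_query` to the service returns an execution profile rather than the traversal results, which is how the cost of the two labeling choices can be compared.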

Next steps: