Execute queries on graph data in Azure Cosmos DB for Apache Gremlin

Azure Cosmos DB for Apache Gremlin supports the Apache TinkerPop Gremlin syntax for queries. This guide walks through common queries that you can run against this service. You can run the queries in this guide using the Gremlin console or your favorite Gremlin driver.

Prerequisites

  • An Azure subscription

    • If you don't have an Azure subscription, create a Trial before you begin.
  • An Azure Cosmos DB for Apache Gremlin account
  • Access to sample data for testing

Count the number of vertices in the graph

Count the total number of vertices in the graph. This operation is useful for understanding the overall size of your graph or validating data loads.

g.V().count()

Count the number of vertices with a specific label in the graph

Count the vertices in the graph that have a specific label. In this example, the label is product.

g.V().hasLabel('product').count()

Filter products by label and property

Retrieve products that match a specific label and property value. This query is helpful for narrowing down results to a subset of interest, such as products with a price greater than $800.

g.V().hasLabel('product').has('price', gt(800))

Project specific properties from products

Return only selected properties from the matched products. This query reduces the amount of data returned and focuses on relevant fields, such as product names.

g.V().hasLabel('product').values('name')

Find related products by traversing the graph. For example, find all products that are replaced by a specific product by following its outgoing 'replaces' edges to the connected product vertices.

g.V(['gear-surf-surfboards', 'bbbbbbbb-1111-2222-3333-cccccccccccc']).outE('replaces').inV().hasLabel('product')

Use this query to find products two hops away in the replacement chain:

g.V(['gear-surf-surfboards', 'bbbbbbbb-1111-2222-3333-cccccccccccc']).outE('replaces').inV().hasLabel('product').outE('replaces').inV().hasLabel('product')
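The two-hop query repeats the same traversal segment twice. As a rough illustration, the following Python sketch builds the query string for any number of hops by repeating that segment. The function name replacement_chain_query and its parameters are hypothetical helpers, not part of any Gremlin driver, and the resulting string still has to be submitted through the Gremlin console or a driver. The sketch interpolates values directly into the string, so it assumes trusted identifiers that contain no quote characters.

```python
def replacement_chain_query(partition_key, vertex_id, hops):
    """Build a Gremlin query string that follows 'replaces' edges `hops` times."""
    if hops < 1:
        raise ValueError("hops must be at least 1")
    # Starting vertex, addressed the same way as in the queries in this guide.
    start = f"g.V(['{partition_key}', '{vertex_id}'])"
    # One hop: follow an outgoing 'replaces' edge to the connected product vertex.
    hop = ".outE('replaces').inV().hasLabel('product')"
    return start + hop * hops


query = replacement_chain_query(
    "gear-surf-surfboards", "bbbbbbbb-1111-2222-3333-cccccccccccc", hops=2
)
print(query)
```

With hops=2, the generated string is the same as the two-hop query shown above.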

Analyze query execution with execution profile

Analyze the performance and execution details of a Gremlin query using the executionProfile() step. This step returns a JSON object with metrics for each step in the query, which helps with troubleshooting and optimization.

g.V(['gear-surf-surfboards', 'bbbbbbbb-1111-2222-3333-cccccccccccc']).out().executionProfile()
[
  {
    "gremlin": "g.V('mary').out().executionProfile()",
    "totalTime": 28,
    "metrics": [
      {
        "name": "GetVertices",
        "time": 24,
        "annotations": { "percentTime": 85.71 },
        "counts": { "resultCount": 2 },
        "storeOps": [ { "fanoutFactor": 1, "count": 2, "size": 696, "time": 0.4 } ]
      },
      {
        "name": "GetEdges",
        "time": 4,
        "annotations": { "percentTime": 14.29 },
        "counts": { "resultCount": 1 },
        "storeOps": [ { "fanoutFactor": 1, "count": 1, "size": 419, "time": 0.67 } ]
      },
      {
        "name": "GetNeighborVertices",
        "time": 0,
        "annotations": { "percentTime": 0 },
        "counts": { "resultCount": 1 }
      },
      {
        "name": "ProjectOperator",
        "time": 0,
        "annotations": { "percentTime": 0 },
        "counts": { "resultCount": 1 }
      }
    ]
  }
]

For more information about the executionProfile() step, see execution profile reference.
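Because the profile is plain JSON, it can be summarized with a few lines of code. The following Python sketch assumes the profile has the shape shown in the sample output above (a list containing one object with a metrics array). The name summarize_profile is a hypothetical helper, and the embedded JSON is an abbreviated copy of that sample.

```python
import json


def summarize_profile(profile_json):
    """Return (step name, time, result count) for each step, slowest first."""
    profile = json.loads(profile_json)[0]
    rows = [
        (metric["name"], metric["time"], metric["counts"]["resultCount"])
        for metric in profile["metrics"]
    ]
    # sorted() is stable, so steps with equal time keep their original order.
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Abbreviated copy of the sample output: only the fields the helper reads.
sample = """
[{"totalTime": 28, "metrics": [
  {"name": "GetVertices", "time": 24, "counts": {"resultCount": 2}},
  {"name": "GetEdges", "time": 4, "counts": {"resultCount": 1}},
  {"name": "GetNeighborVertices", "time": 0, "counts": {"resultCount": 1}},
  {"name": "ProjectOperator", "time": 0, "counts": {"resultCount": 1}}
]}]
"""

for name, time_taken, count in summarize_profile(sample):
    print(f"{name}: time={time_taken}, results={count}")
```

Sorting by time puts the most expensive step first, which is usually the place to start when troubleshooting.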

Tip

The executionProfile() step executes the Gremlin query, including any addV or addE steps, so the changes specified in the query are created and committed. Request units consumed by the Gremlin query are also charged.

Identify blind fan-out query patterns

A blind fan-out occurs when a query accesses more partitions than necessary, often due to missing partition key predicates. This antipattern can increase latency and cost. The execution profile helps identify such patterns by showing a high fanoutFactor.

g.V('aaaaaaaa-0000-1111-2222-bbbbbbbbbbbb').executionProfile()
[
  {
    "gremlin": "g.V('tt0093640').executionProfile()",
    "totalTime": 46,
    "metrics": [
      {
        "name": "GetVertices",
        "time": 46,
        "annotations": { "percentTime": 100 },
        "counts": { "resultCount": 1 },
        "storeOps": [ { "fanoutFactor": 5, "count": 1, "size": 589, "time": 75.61 } ]
      },
      {
        "name": "ProjectOperator",
        "time": 0,
        "annotations": { "percentTime": 0 },
        "counts": { "resultCount": 1 }
      }
    ]
  }
]

Optimizing fan-out queries

A high fanoutFactor (such as 5) indicates that the query accessed multiple partitions. To optimize, include the partition key in the query predicate. The following query addresses the vertex by its partition key value and ID:

g.V(['gear-surf-surfboards', 'aaaaaaaa-0000-1111-2222-bbbbbbbbbbbb'])
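A check for this pattern can be automated. The following Python sketch scans a profile for store operations whose fanoutFactor exceeds a threshold. The name find_fan_out and the default threshold of 1 are illustrative assumptions, and the input is an abbreviated copy of the sample output above.

```python
import json


def find_fan_out(profile_json, threshold=1):
    """Return (step name, fanoutFactor) for store operations above `threshold`."""
    findings = []
    for entry in json.loads(profile_json):
        for metric in entry.get("metrics", []):
            # Steps such as ProjectOperator have no storeOps in the sample output.
            for op in metric.get("storeOps", []):
                if op.get("fanoutFactor", 0) > threshold:
                    findings.append((metric["name"], op["fanoutFactor"]))
    return findings


# Abbreviated copy of the sample output for the blind fan-out query.
sample = """
[{"totalTime": 46, "metrics": [
  {"name": "GetVertices", "time": 46,
   "storeOps": [{"fanoutFactor": 5, "count": 1, "size": 589, "time": 75.61}]},
  {"name": "ProjectOperator", "time": 0}
]}]
"""

print(find_fan_out(sample))  # prints [('GetVertices', 5)]
```

An empty result means no store operation in the profile exceeded the threshold.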

Unfiltered query pattern

Unfiltered queries can process a large initial dataset, which increases cost and latency.

g.V().hasLabel('product').out().executionProfile()
[
  {
    "gremlin": "g.V().hasLabel('tweet').out().executionProfile()",
    "totalTime": 42,
    "metrics": [
      {
        "name": "GetVertices",
        "time": 31,
        "annotations": { "percentTime": 73.81 },
        "counts": { "resultCount": 30 },
        "storeOps": [ { "fanoutFactor": 1, "count": 13, "size": 6819, "time": 1.02 } ]
      },
      {
        "name": "GetEdges",
        "time": 6,
        "annotations": { "percentTime": 14.29 },
        "counts": { "resultCount": 18 },
        "storeOps": [ { "fanoutFactor": 1, "count": 20, "size": 7950, "time": 1.98 } ]
      },
      {
        "name": "GetNeighborVertices",
        "time": 5,
        "annotations": { "percentTime": 11.9 },
        "counts": { "resultCount": 20 },
        "storeOps": [ { "fanoutFactor": 1, "count": 4, "size": 1070, "time": 1.19 } ]
      },
      {
        "name": "ProjectOperator",
        "time": 0,
        "annotations": { "percentTime": 0 },
        "counts": { "resultCount": 20 }
      }
    ]
  }
]

Filtered query pattern

Adding filters before traversals can reduce the working set and improve performance. The execution profile shows the effect of filtering. The filtered query processes fewer vertices, resulting in lower latency and cost.

g.V().hasLabel('product').has('clearance', true).out().executionProfile()
[
  {
    "gremlin": "g.V().hasLabel('tweet').has('lang', 'en').out().executionProfile()",
    "totalTime": 14,
    "metrics": [
      {
        "name": "GetVertices",
        "time": 14,
        "annotations": { "percentTime": 58.33 },
        "counts": { "resultCount": 11 },
        "storeOps": [ { "fanoutFactor": 1, "count": 11, "size": 4807, "time": 1.27 } ]
      },
      {
        "name": "GetEdges",
        "time": 5,
        "annotations": { "percentTime": 20.83 },
        "counts": { "resultCount": 18 },
        "storeOps": [ { "fanoutFactor": 1, "count": 18, "size": 7159, "time": 1.7 } ]
      },
      {
        "name": "GetNeighborVertices",
        "time": 5,
        "annotations": { "percentTime": 20.83 },
        "counts": { "resultCount": 18 },
        "storeOps": [ { "fanoutFactor": 1, "count": 4, "size": 1070, "time": 1.01 } ]
      },
      {
        "name": "ProjectOperator",
        "time": 0,
        "annotations": { "percentTime": 0 },
        "counts": { "resultCount": 18 }
      }
    ]
  }
]
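To make the comparison concrete, the following Python sketch reads the GetVertices step from both profiles and reports how many fewer vertices the filtered query loaded. The name vertices_loaded is a hypothetical helper, and the inputs keep only the fields needed from the two sample outputs above.

```python
import json


def vertices_loaded(profile_json):
    """Return the resultCount of the GetVertices step in a profile."""
    profile = json.loads(profile_json)[0]
    for metric in profile["metrics"]:
        if metric["name"] == "GetVertices":
            return metric["counts"]["resultCount"]
    raise ValueError("profile has no GetVertices step")


# Only the GetVertices counts from the unfiltered and filtered sample outputs.
unfiltered = '[{"metrics": [{"name": "GetVertices", "counts": {"resultCount": 30}}]}]'
filtered = '[{"metrics": [{"name": "GetVertices", "counts": {"resultCount": 11}}]}]'

before = vertices_loaded(unfiltered)
after = vertices_loaded(filtered)
print(f"GetVertices results: {before} unfiltered, {after} filtered ({before - after} fewer)")
```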