Ingest data in bulk in the Azure Cosmos DB for Gremlin by using a bulk executor library
APPLIES TO: Gremlin
Graph databases often need to ingest data in bulk to refresh an entire graph or update a portion of it. Azure Cosmos DB, a distributed database and the backbone of the Azure Cosmos DB for Gremlin, is meant to perform best when the loads are well distributed. Bulk executor libraries in Azure Cosmos DB are designed to exploit this unique capability of Azure Cosmos DB and provide optimal performance. For more information, see Introducing bulk support in the .NET SDK.
In this tutorial, you learn how to use the Azure Cosmos DB bulk executor library to import and update graph objects into an Azure Cosmos DB for Gremlin container. During this process, you use the library to create vertex and edge objects programmatically and then insert multiple objects per network request.
Instead of sending Gremlin queries to a database, where the commands are evaluated and then executed one at a time, you use the bulk executor library to create and validate the objects locally. After the library initializes the graph objects, it allows you to send them to the database service sequentially.
By using this method, you can increase data ingestion speeds as much as a hundredfold, which makes it an ideal way to perform initial data migrations or periodic data movement operations.
The bulk executor library now comes in the following varieties.
.NET
Prerequisites
Before you begin, make sure that you have the following:
Visual Studio 2019 with the Azure development workload. You can get started with the Visual Studio 2019 Community Edition for free.
An Azure subscription. If you don't already have a subscription, you can create a Azure account.
Alternatively, you can create a free Azure Cosmos DB account without an Azure subscription.
An Azure Cosmos DB for Gremlin database with an unlimited collection. To get started, go to Azure Cosmos DB for Gremlin in .NET.
Git. To begin, go to the git downloads page.
Clone
To use this sample, run the following command:
git clone https://github.com/Azure-Samples/azure-cosmos-graph-bulk-executor.git
To get the sample, go to .\azure-cosmos-graph-bulk-executor\dotnet\src\
.
Sample
IGraphBulkExecutor graphBulkExecutor = new GraphBulkExecutor("MyConnectionString", "myDatabase", "myContainer");
List<IGremlinElement> gremlinElements = new List<IGremlinElement>();
gremlinElements.AddRange(Program.GenerateVertices(Program.documentsToInsert));
gremlinElements.AddRange(Program.GenerateEdges(Program.documentsToInsert));
BulkOperationResponse bulkOperationResponse = await graphBulkExecutor.BulkImportAsync(
gremlinElements: gremlinElements,
enableUpsert: true);
Execute
Modify the parameters, as described in the following table:
Parameter | Description |
---|---|
ConnectionString |
Your service connection string, which you'll find in the Keys section of your Azure Cosmos DB for Gremlin account. It's formatted as AccountEndpoint=https://<account-name>.documents.azure.cn:443/;AccountKey=<account-key>; . |
DatabaseName , ContainerName |
The names of the target database and container. |
DocumentsToInsert |
The number of documents to be generated (relevant only to synthetic data). |
PartitionKey |
Ensures that a partition key is specified with each document during data ingestion. |
NumberOfRUs |
Is relevant only if a container doesn't already exist and it needs to be created during execution. |
Download the full sample application in .NET.
Java
Sample usage
The following sample application illustrates how to use the GraphBulkExecutor package. The samples use either the domain object annotations or the POJO (plain old Java object) objects directly. We recommend that you try both approaches to determine which one better meets your implementation and performance demands.
Clone
To use the sample, run the following command:
git clone https://github.com/Azure-Samples/azure-cosmos-graph-bulk-executor.git
To get the sample, go to .\azure-cosmos-graph-bulk-executor\java\
.
Prerequisites
To run this sample, you need to have the following software:
- OpenJDK 11
- Maven
- An Azure Cosmos DB account that's configured to use the Gremlin API
Sample
private static void executeWithPOJO(Stream<GremlinVertex> vertices,
Stream<GremlinEdge> edges,
boolean createDocs) {
results.transitionState("Configure Database");
UploadWithBulkLoader loader = new UploadWithBulkLoader();
results.transitionState("Write Documents");
loader.uploadDocuments(vertices, edges, createDocs);
}
Configuration
To run the sample, refer to the following configuration and modify it as needed.
The /resources/application.properties file defines the data that's required to configure Azure Cosmos DB. The required values are described in the following table:
Property | Description |
---|---|
sample.sql.host |
The value that's provided by Azure Cosmos DB. Ensure that you're using the .NET SDK URI, which you'll find in the Overview section of the Azure Cosmos DB account. |
sample.sql.key |
You can get the primary or secondary key from the Keys section of the Azure Cosmos DB account. |
sample.sql.database.name |
The name of the database within the Azure Cosmos DB account to run the sample against. If the database isn't found, the sample code creates it. |
sample.sql.container.name |
The name of the container within the database to run the sample against. If the container isn't found, the sample code creates it. |
sample.sql.partition.path |
If you need to create the container, use this value to define the partitionKey path. |
sample.sql.allow.throughput |
The container will be updated to use the throughput value that's defined here. If you're exploring various throughput options to meet your performance demands, be sure to reset the throughput on the container when you're done with your exploration. There are costs associated with leaving the container provisioned with a higher throughput. |
Execute
After you've modified the configuration according to your environment, run the following command:
mvn clean package
For added safety, you can also run the integration tests by changing the skipIntegrationTests
value in the pom.xml file to false
.
After you've run the unit tests successfully, you can run the sample code:
java -jar target/azure-cosmos-graph-bulk-executor-1.0-jar-with-dependencies.jar -v 1000 -e 10 -d
Running the preceding command executes the sample with a small batch (1,000 vertices and roughly 5,000 edges). Use the command-line arguments in the following sections to tweak the volumes that are run and which sample version to run.
Command-line arguments
Several command-line arguments are available while you're running this sample, as described in the following table:
Argument | Description |
---|---|
--vertexCount (-v ) |
Tells the application how many person vertices to generate. |
--edgeMax (-e ) |
Tells the application the maximum number of edges to generate for each vertex. The generator randomly selects a number from 1 to the value you provide. |
--domainSample (-d ) |
Tells the application to run the sample by using the person and relationship domain structures instead of the GraphBulkExecutors , GremlinVertex , and GremlinEdge POJOs. |
--createDocuments (-c ) |
Tells the application to use create operations. If the argument isn't present, the application defaults to using upsert operations. |
Detailed sample information
Person vertex
The person class is a simple domain object that's been decorated with several annotations to help a transformation into the GremlinVertex
class, as described in the following table:
Class annotation | Description |
---|---|
GremlinVertex |
Uses the optional label parameter to define all vertices that you create by using this class. |
GremlinId |
Used to define which field will be used as the ID value. The field name on the person class is ID, but it isn't required. |
GremlinProperty |
Used on the email field to change the name of the property when it's stored in the database. |
GremlinPartitionKey |
Used to define which field on the class contains the partition key. The field name you provide should match the value that's defined by the partition path on the container. |
GremlinIgnore |
Used to exclude the isSpecial field from the property that's being written to the database. |
The RelationshipEdge class
The RelationshipEdge
class is a versatile domain object. By using the field level label annotation, you can create a dynamic collection of edge types, as shown in the following table:
Class annotation | Description |
---|---|
GremlinEdge |
The GremlinEdge decoration on the class defines the name of the field for the specified partition key. When you create an edge document, the assigned value comes from the source vertex information. |
GremlinEdgeVertex |
Two instances of GremlinEdgeVertex are defined, one for each side of the edge (source and destination). Our sample has the field's data type as GremlinEdgeVertexInfo . The information provided by the GremlinEdgeVertex class is required for the edge to be created correctly in the database. Another option would be to have the data type of the vertices be a class that has been decorated with the GremlinVertex annotations. |
GremlinLabel |
The sample edge uses a field to define the label value. It allows various labels to be defined, because it uses the same base domain class. |
Output explained
The console finishes its run with a JSON string that describes the run times of the sample. The JSON string contains the following information:
JSON string | Description |
---|---|
startTime | The System.nanoTime() when the process started. |
endTime | The System.nanoTime() when the process finished. |
durationInNanoSeconds | The difference between the endTime and startTime values. |
durationInMinutes | The durationInNanoSeconds value, converted into minutes. The durationInMinutes value is represented as a float number, not a time value. For example, a value of 2.5 represents 2 minutes and 30 seconds. |
vertexCount | The volume of generated vertices, which should match the value that's passed into the command-line execution. |
edgeCount | The volume of generated edges, which isn't static and is built with an element of randomness. |
exception | Populated only if an exception is thrown when you attempt to make the run. |
States array
The states array gives insight into how long each step within the execution takes. The steps are described in the following table:
Execution step | Description |
---|---|
Build sample vertices | The amount of time it takes to fabricate the requested volume of person objects. |
Build sample edges | The amount of time it takes to fabricate the relationship objects. |
Configure database | The amount of time it takes to configure the database, based on the values that are provided in application.properties . |
Write documents | The amount of time it takes to write the documents to the database. |
Each state contains the following values:
State value | Description |
---|---|
stateName |
The name of the state that's being reported. |
startTime |
The System.nanoTime() value when the state started. |
endTime |
The System.nanoTime() value when the state finished. |
durationInNanoSeconds |
The difference between the endTime and startTime values. |
durationInMinutes |
The durationInNanoSeconds value, converted into minutes. The durationInMinutes value is represented as a float number, not a time value. For example, a value of 2.5 represents 2 minutes and 30 seconds. |
Next steps
- For more information about the classes and methods that are defined in this namespace, review the BulkExecutor Java open source documentation.
- See Bulk import data to the Azure Cosmos DB SQL API account by using the .NET SDK article. This bulk mode documentation is part of the .NET V3 SDK.