Best practices for Azure Cosmos DB Java SDK

11/27/2023

APPLIES TO: NoSQL

This article walks through the best practices for using the Azure Cosmos DB Java SDK. Following these practices helps improve latency and availability and boosts overall performance.

Checklist

Checked	Topic	Details/Links
	SDK Version	Always use the latest available version of the Azure Cosmos DB SDK for optimal performance.
	Singleton Client	Use a single instance of `CosmosClient` for the lifetime of your application for better performance.
	Regions	Run your application in the same Azure region as your Azure Cosmos DB account whenever possible to reduce latency. Enable 2-4 regions and replicate your account across them for best availability. For production workloads, enable service-managed failover. Without this configuration, the account loses write availability for the entire duration of a write region outage, because manual failover won't succeed without region connectivity. To learn how to add multiple regions, see the guidance on multi-region configuration with the Java SDK.
	Availability and Failovers	Set `preferredRegions` in the v4 SDK. During failovers, write operations are sent to the current write region and all reads are sent to the first region in your preferred regions list. For more information about regional failover mechanics, see the availability troubleshooting guide.
	CPU	You may run into connectivity/availability issues due to lack of resources on your client machine. Monitor your CPU utilization on nodes running the Azure Cosmos DB client, and scale up/out if usage is very high.
	Hosting	For most production workloads, we highly recommend using VMs with at least 4 cores and 8 GB of memory whenever possible.
	Connectivity Modes	Use Direct mode for the best performance. For instructions on how to do this, see the V4 SDK documentation.
	Networking	If you run your application on a virtual machine, enable Accelerated Networking on the VM to relieve bottlenecks caused by high traffic and to reduce latency and CPU jitter. Also consider using a higher-end virtual machine so that maximum CPU usage stays under 70%.
	Ephemeral Port Exhaustion	For sparse or sporadic connections, we recommend setting `idleEndpointTimeout` to a higher value. The `idleEndpointTimeout` property in `DirectConnectionConfig` controls how long unused connections to an endpoint stay open before they're closed, which reduces the number of unused connections. By default, idle connections to an endpoint are kept open for 1 hour. If no requests are sent to a specific endpoint for the idle endpoint timeout duration, the Direct mode client closes all connections to that endpoint to save resources and I/O cost.
	Use Appropriate Scheduler (Avoid stealing Event loop IO Netty threads)	Avoid blocking calls such as `.block()`. Keep the entire call stack asynchronous so that the application benefits from async API patterns and from appropriate threading and schedulers.
	End-to-End Timeouts	To get end-to-end timeouts, implement the end-to-end timeout policy in the Java SDK. For more details, see the Azure Cosmos DB guidance on timeouts.
	Retry Logic	A transient error is an error whose underlying cause soon resolves itself. Applications that connect to your database should be built to expect these transient errors. To handle them, implement retry logic in your code instead of surfacing them to users as application errors. The SDK has built-in logic to handle these transient failures on retryable requests like read or query operations. The SDK won't retry writes on transient failures, because writes aren't idempotent. The SDK does allow users to configure retry logic for throttles. For details on which errors to retry, see the Azure Cosmos DB guidance on retries.
	Caching database/collection names	Retrieve the names of your databases and containers from configuration or cache them on start. Calls like `CosmosAsyncDatabase#read()` or `CosmosAsyncContainer#read()` will result in metadata calls to the service, which consume from the system-reserved RU limit. `createDatabaseIfNotExists()` should also only be used once for setting up the database. Overall, these operations should be performed infrequently.
	Parallel Queries	The Azure Cosmos DB SDK supports running queries in parallel for better latency and throughput. We recommend setting the `maxDegreeOfParallelism` property within `CosmosQueryRequestOptions` to the number of partitions you have. If you don't know the number of partitions, set the value to `-1`, which gives you the best latency. Also, set `maxBufferedItemCount` to the expected number of results to limit the number of pre-fetched results.
	Performance Testing Backoffs	When performance testing your application, implement backoffs at `RetryAfter` intervals. Respecting the backoff helps ensure that you spend a minimal amount of time waiting between retries.
	Indexing	The Azure Cosmos DB indexing policy lets you specify which document paths to include in or exclude from indexing by using `IndexingPolicy#getIncludedPaths()` and `IndexingPolicy#getExcludedPaths()`. Exclude unused paths from indexing for faster writes. For a sample, see the guidance on creating indexes using the SDK.
	Document Size	The request charge of a specified operation correlates directly to the size of the document. We recommend reducing the size of your documents as operations on large documents cost more than operations on smaller documents.
	Page Size	By default, query results are returned in chunks of 100 items or 4 MB, whichever limit is hit first. If a query will return more than 100 items, increase the page size to reduce the number of round trips required. Memory consumption will increase as page size increases.
	Enabling Query Metrics	For additional logging of your backend query executions, follow instructions on how to capture SQL Query Metrics using Java SDK
	SDK Logging	Use SDK logging to capture additional diagnostics information and troubleshoot latency issues. Log the `CosmosDiagnostics` in the Java SDK for more detailed Azure Cosmos DB diagnostic information about the current request to the service. As an example use case, capture diagnostics on any exception, and on completed operations if `CosmosDiagnostics#getDuration()` is greater than a designated threshold value (for example, if you have an SLA of 10 seconds, capture diagnostics when `getDuration()` > 10 seconds). It's advised to only use these diagnostics during performance testing. For more information, see how to capture diagnostics with the Java SDK.
	Avoid using any special characters in identifiers	Some characters are restricted and cannot be used in some identifiers: '/', '\', '?', '#'. The general recommendation is to not use any special characters in identifiers like database name, collection name, item id, or partition key to avoid any unexpected behavior.
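
The last checklist item can be enforced at the point where identifiers are created. The following is a minimal sketch that uses only the Java standard library, not an SDK feature. The class name `CosmosIdentifiers` and both method names are hypothetical, and the allowlist in `isConservative` (letters, digits, hyphen, underscore) is an assumption chosen to illustrate the "no special characters" recommendation, not a rule defined by the service.

```java
public class CosmosIdentifiers {
    // The four characters the checklist names as restricted in some identifiers.
    private static final String RESTRICTED = "/\\?#";

    /** Returns true if the identifier is non-empty and contains none of the restricted characters. */
    public static boolean hasNoRestrictedCharacters(String identifier) {
        if (identifier == null || identifier.isEmpty()) {
            return false;
        }
        for (int i = 0; i < identifier.length(); i++) {
            if (RESTRICTED.indexOf(identifier.charAt(i)) >= 0) {
                return false;
            }
        }
        return true;
    }

    /**
     * Stricter check for the general recommendation to avoid special characters.
     * Assumption: only ASCII letters, digits, hyphen, and underscore are accepted.
     */
    public static boolean isConservative(String identifier) {
        return identifier != null && identifier.matches("[A-Za-z0-9_-]+");
    }
}
```

An application could call `hasNoRestrictedCharacters` before using a value as a database name, container name, item id, or partition key, and reject or encode values that fail the check.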

Best practices when using Gateway mode

Azure Cosmos DB requests are made over HTTPS/REST when you use Gateway mode. They're subject to the default connection limit per hostname or IP address. You might need to set `maxConnectionPoolSize` to a different value (from 100 through 1,000) so that the client library can use multiple simultaneous connections to Azure Cosmos DB. In the Java v4 SDK, the default value for `GatewayConnectionConfig#maxConnectionPoolSize` is 1000. To change it, set `GatewayConnectionConfig#maxConnectionPoolSize` on the configuration you pass to the client builder.
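
As a minimal sketch of this setting, assuming the `azure-cosmos` v4 library is on the classpath: the class `GatewayModeConfig` and its method names are hypothetical, the endpoint and key are placeholders supplied by the caller, and the pool size used in the usage note is an arbitrary illustration rather than a recommended value.

```java
import com.azure.cosmos.CosmosClientBuilder;
import com.azure.cosmos.GatewayConnectionConfig;

public class GatewayModeConfig {
    /** Starts from the SDK defaults and overrides only the connection pool size. */
    public static GatewayConnectionConfig gatewayConfig(int maxConnectionPoolSize) {
        GatewayConnectionConfig config = GatewayConnectionConfig.getDefaultConfig();
        config.setMaxConnectionPoolSize(maxConnectionPoolSize);
        return config;
    }

    /**
     * Returns a builder configured for Gateway mode. No client is created here,
     * so no network call is made until the caller builds the client.
     */
    public static CosmosClientBuilder gatewayBuilder(String endpoint, String key, int poolSize) {
        return new CosmosClientBuilder()
            .endpoint(endpoint)   // placeholder: your account endpoint
            .key(key)             // placeholder: your account key
            .gatewayMode(gatewayConfig(poolSize));
    }
}
```

For example, `GatewayModeConfig.gatewayBuilder(endpoint, key, 500).buildAsyncClient()` would create a Gateway mode client with a pool of up to 500 connections. Build it once and reuse it for the lifetime of the application, in line with the singleton client recommendation in the checklist.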

Best practices for write-heavy workloads

For workloads that have heavy create payloads, set `CosmosClientBuilder#contentResponseOnWriteEnabled()` to `false`. The service will no longer return the created or updated resource to the SDK. Normally, because the application has the object that's being created, it doesn't need the service to return it. The header values, such as the request charge, are still accessible. Disabling the content response can help improve performance, because the SDK no longer needs to allocate memory or serialize the body of the response. It also reduces network bandwidth usage, which further helps performance.
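
A minimal sketch of this option, assuming the `azure-cosmos` v4 library is on the classpath. The class `WriteHeavyClientConfig` and the method `writeHeavyBuilder` are hypothetical names, and the endpoint and key are placeholders supplied by the caller.

```java
import com.azure.cosmos.CosmosClientBuilder;

public class WriteHeavyClientConfig {
    /**
     * Returns a builder whose clients ask the service not to echo the
     * written resource back in create and update responses.
     * No client is created here, so no network call is made.
     */
    public static CosmosClientBuilder writeHeavyBuilder(String endpoint, String key) {
        return new CosmosClientBuilder()
            .endpoint(endpoint)   // placeholder: your account endpoint
            .key(key)             // placeholder: your account key
            .contentResponseOnWriteEnabled(false);
    }
}
```

With a client built from this builder, the response to a write no longer carries the item body, so code that reads the written item from the response should use the object it already holds instead. Response headers such as the request charge remain available.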

Next steps

To learn more about performance tips for Java SDK, see Performance tips for Azure Cosmos DB Java SDK v4.

To learn more about designing your application for scale and high performance, see Partitioning and scaling in Azure Cosmos DB.

Trying to do capacity planning for a migration to Azure Cosmos DB? You can use information about your existing database cluster for capacity planning.

If all you know is the number of vCores and servers in your existing database cluster, read about estimating request units using vCores or vCPUs
If you know typical request rates for your current database workload, read about estimating request units using Azure Cosmos DB capacity planner
