Diagnose and troubleshoot Azure Cosmos DB Java v4 SDK request timeout exceptions

Applies to: SQL API

The HTTP 408 error occurs when the SDK can't complete the request before the timeout limit is reached.

Troubleshooting steps

The following list contains known causes of and solutions for request timeout exceptions.

Existing issues

If requests get stuck for longer or time out more frequently, upgrade the Java v4 SDK to the latest version. Note: We strongly recommend version 4.7.0 or later. See the Java v4 SDK release notes for more details.

High CPU utilization

High CPU utilization is the most common case. For optimal latency, CPU usage should be roughly 40 percent. Use 10 seconds as the interval to monitor maximum (not average) CPU utilization. CPU spikes are more common with cross-partition queries, where the SDK might open multiple connections for a single query.

Solution:

Scale the client application that uses the SDK up or out.
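The monitoring advice above (maximum, not average, over a 10-second interval) can be sketched with the JDK alone. This is a minimal illustration, not part of the SDK: the class name `CpuPeakMonitor` is hypothetical, the cast to `com.sun.management.OperatingSystemMXBean` assumes a HotSpot/OpenJDK runtime, and `getProcessCpuLoad()` reports the load of the JVM process only, so host-level monitoring tools may still be the better source.

```java
import java.lang.management.ManagementFactory;
import java.util.Arrays;

// Hypothetical helper: samples CPU load once per second and reports the
// maximum (not the average) over a 10-second window.
final class CpuPeakMonitor {

    // Returns the highest valid sample; negative samples mean "not available".
    static double peak(double[] samples) {
        return Arrays.stream(samples).filter(s -> s >= 0).max().orElse(Double.NaN);
    }

    public static void main(String[] args) throws InterruptedException {
        // Assumption: the runtime exposes the com.sun.management extension bean.
        com.sun.management.OperatingSystemMXBean os =
            (com.sun.management.OperatingSystemMXBean)
                ManagementFactory.getOperatingSystemMXBean();

        double[] window = new double[10];
        for (int i = 0; i < window.length; i++) {
            window[i] = os.getProcessCpuLoad(); // 0.0 to 1.0, negative if unavailable
            Thread.sleep(1000);
        }
        System.out.printf("max CPU over 10 s: %.0f%%%n", peak(window) * 100);
    }
}
```

Tracking the peak rather than the mean matters here because a short spike can stall in-flight requests even when the average looks healthy.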

Connection throttling

Connection throttling can happen because of either a connection limit on a host machine or Azure SNAT (PAT) port exhaustion.

Connection limit on a host machine

Some Linux systems, such as Red Hat, have an upper limit on the total number of open files. Sockets in Linux are implemented as files, so this limit also caps the total number of connections. Run the following command to view the current limits:

ulimit -a

Solution:

The maximum number of open files, identified as "nofile," needs to be at least 10,000. For more information, see the Azure Cosmos DB Java SDK v4 performance tips.
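The same limit can also be checked from inside the JVM, which is useful when the application runs under a different user or service manager than your shell. The following is a rough sketch: the class name `OpenFileLimitCheck` is hypothetical, and `com.sun.management.UnixOperatingSystemMXBean` is only available on Unix-like platforms with a HotSpot/OpenJDK runtime.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

// Hypothetical helper: compares the process's file descriptor limit
// against the 10,000 minimum recommended above.
final class OpenFileLimitCheck {
    static final long MIN_NOFILE = 10_000;

    static boolean isSufficient(long maxFileDescriptors) {
        return maxFileDescriptors >= MIN_NOFILE;
    }

    public static void main(String[] args) {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        if (os instanceof com.sun.management.UnixOperatingSystemMXBean) {
            com.sun.management.UnixOperatingSystemMXBean unix =
                (com.sun.management.UnixOperatingSystemMXBean) os;
            long max = unix.getMaxFileDescriptorCount();   // effective "nofile" limit
            long open = unix.getOpenFileDescriptorCount(); // descriptors in use now
            System.out.println("open=" + open + " max=" + max
                + " sufficient=" + isSufficient(max));
        } else {
            System.out.println("File descriptor counts are not exposed on this platform.");
        }
    }
}
```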

Socket or port availability might be low

When running in Azure, clients that use the Java SDK can hit Azure SNAT (PAT) port exhaustion.

Solution 1:

If you're running on Azure VMs, follow the SNAT port exhaustion guide.

Solution 2:

If you're running on Azure App Service, follow the connection errors troubleshooting guide and use App Service diagnostics.

Solution 3:

If you're running on Azure Functions, verify that you're following the Azure Functions recommendation of maintaining singleton or static clients for all of the involved services (including Azure Cosmos DB). Check the service limits based on the type and size of your Function App hosting.

Solution 4:

If you use an HTTP proxy, make sure it can support the number of connections configured in the SDK GatewayConnectionConfig. Otherwise, you'll face connection issues.

Create multiple client instances

Creating multiple client instances might lead to connection contention and timeout issues.

Solution 1:

Follow the performance tips, and use a single CosmosClient instance across an entire application.
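One common way to guarantee a single instance per JVM is the initialization-on-demand holder idiom. The sketch below uses only the JDK: `SharedClient` and its nested `Client` class are hypothetical stand-ins, and in a real application the stand-in would be replaced by a `CosmosClient` built once with your own configuration.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of a process-wide shared client.
final class SharedClient {

    // Stand-in for an expensive, connection-owning client such as CosmosClient.
    static final class Client {
        static final AtomicInteger CREATED = new AtomicInteger();
        Client() { CREATED.incrementAndGet(); }
    }

    private SharedClient() {}

    // The JVM initializes Holder lazily and exactly once, in a thread-safe way.
    private static final class Holder {
        static final Client INSTANCE = new Client();
    }

    static Client get() {
        return Holder.INSTANCE;
    }

    public static void main(String[] args) {
        Client a = get();
        Client b = get();
        System.out.println("same instance: " + (a == b));
        System.out.println("clients created: " + Client.CREATED.get());
    }
}
```

Every caller goes through `SharedClient.get()`, so no code path can accidentally construct a second client and its own connection pool.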

Solution 2:

If a singleton CosmosClient isn't possible in your application, we recommend sharing connections across multiple Cosmos clients by setting connectionSharingAcrossClientsEnabled(true) on the CosmosClient. When multiple Cosmos client instances in the same JVM interact with multiple Cosmos accounts, enabling this option allows the instances to share connections in Direct mode where possible. Note that when this option is set, the connection configuration (for example, socket timeout and idle timeout) of the first instantiated client is used for all other client instances.

Hot partition key

Azure Cosmos DB distributes the overall provisioned throughput evenly across physical partitions. When there's a hot partition, one or more logical partition keys on a physical partition consume all of that physical partition's Request Units per second (RU/s). At the same time, the RU/s on other physical partitions go unused. As a symptom, the total RU/s consumed is less than the overall provisioned RU/s at the database or container, but you still see throttling (429s) on requests against the hot logical partition key.

Solution:

Choose a good partition key that evenly distributes request volume and storage. Learn how to change your partition key.
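The symptom described above can be made concrete with a little arithmetic. The numbers below are invented for illustration (10,000 RU/s provisioned, five physical partitions, one skewed workload), and the model is deliberately simplified: each partition serves at most its even share and rejects the rest.

```java
// Hypothetical, simplified model of even throughput distribution.
final class HotPartitionExample {

    // Even split of provisioned throughput across physical partitions.
    static double perPartitionRu(double provisionedRu, int physicalPartitions) {
        return provisionedRu / physicalPartitions;
    }

    // Total RU/s actually served when each partition is capped at its budget.
    static double totalServed(double[] demandPerPartition, double budget) {
        double served = 0;
        for (double demand : demandPerPartition) {
            served += Math.min(demand, budget);
        }
        return served;
    }

    public static void main(String[] args) {
        double provisioned = 10_000;                    // assumed container throughput
        double[] demand = {2_500, 300, 200, 100, 100};  // assumed RU/s requested per partition
        double budget = perPartitionRu(provisioned, demand.length);

        for (int i = 0; i < demand.length; i++) {
            if (demand[i] > budget) {
                System.out.println("partition " + i + " is throttled (429): wants "
                    + demand[i] + " RU/s, budget " + budget);
            }
        }
        System.out.println("served " + totalServed(demand, budget)
            + " of " + provisioned + " provisioned RU/s");
    }
}
```

In this model the first partition is throttled even though well under a third of the provisioned throughput is in use, which is the pattern to look for in your metrics.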

High degree of concurrency

The application runs with a high level of concurrency, which can lead to contention on the channel.

Solution:

Scale the client application that uses the SDK up or out.

Large requests or responses

Large requests or responses can lead to head-of-line blocking on the channel and exacerbate contention, even with a relatively low degree of concurrency.

Solution:

Scale the client application that uses the SDK up or out.

Failure rate is within the Azure Cosmos DB SLA

The application should be able to handle transient failures and retry when necessary. 408 exceptions aren't retried because, on create paths, it's impossible to know whether the service created the item. Sending the same item again for create causes a conflict exception. An application's business logic might include custom conflict handling that resolves the ambiguity between a pre-existing item and a conflict caused by a create retry.
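One possible shape for that custom logic is sketched below using plain HTTP-style status codes. Everything here is hypothetical: `CreateOperation` is an illustrative abstraction rather than an SDK type, and treating a 409 on the retry as success is only safe if no other writer could have created an item with the same ID. When that isn't guaranteed, the application would need to read the item back and compare it.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: retry a create once after a 408 and interpret the outcome.
final class CreateAfterTimeout {
    static final int CREATED = 201;
    static final int REQUEST_TIMEOUT = 408;
    static final int CONFLICT = 409;

    // Illustrative abstraction over a create call; returns an HTTP-style status.
    interface CreateOperation {
        int create(String id);
    }

    // Returns true if the item is known (or assumed) to exist because of this call.
    static boolean createWithRetry(CreateOperation op, String id) {
        int status = op.create(id);
        if (status == CREATED) {
            return true;
        }
        if (status != REQUEST_TIMEOUT) {
            return false; // includes a genuine conflict on the first attempt
        }
        // The first attempt timed out, so the item may or may not exist.
        int retry = op.create(id);
        // Assumption: a conflict now means the timed-out attempt actually succeeded.
        return retry == CREATED || retry == CONFLICT;
    }

    public static void main(String[] args) {
        Set<String> store = new HashSet<>();
        boolean[] firstCall = {true};
        // Fake service: the first call stores the item but reports a timeout.
        CreateOperation flaky = id -> {
            if (!store.add(id)) {
                return CONFLICT;
            }
            if (firstCall[0]) {
                firstCall[0] = false;
                return REQUEST_TIMEOUT;
            }
            return CREATED;
        };
        System.out.println("created: " + createWithRetry(flaky, "item-1"));
    }
}
```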

Failure rate violates the Azure Cosmos DB SLA

Contact Azure Support.

Next steps