Troubleshoot issues when you use Azure Cosmos DB Java SDK v4 with SQL API accounts

Important

This article covers troubleshooting for Azure Cosmos DB Java SDK v4 only. For more information, see the Azure Cosmos DB Java SDK v4 release notes, Maven repository, and performance tips. If you currently use a version older than v4, see the Migrate to Azure Cosmos DB Java SDK v4 guide for help with upgrading.

This article covers common issues, workarounds, diagnostic steps, and tools for using Azure Cosmos DB Java SDK v4 with Azure Cosmos DB SQL API accounts. The SDK provides a client-side logical representation for accessing the Azure Cosmos DB SQL API. The tools and approaches described here can help if you run into issues.

Start with this list:

  • Take a look at the Common issues and workarounds section in this article.
  • Look at the Java SDK in the Azure Cosmos DB central repo, which is available as open source on GitHub. It has an issues section that's actively monitored. Check whether a similar issue with a workaround has already been filed. A helpful tip is to filter issues by the cosmos:v4-item tag.
  • Review the performance tips for Azure Cosmos DB Java SDK v4, and follow the suggested practices.
  • Read the rest of this article. If you don't find a solution, file a GitHub issue. If there is an option to add tags to your GitHub issue, add the cosmos:v4-item tag.

Common issues and workarounds

Network issues, Netty read timeout failure, low throughput, high latency

General suggestions

For best performance:

  • Make sure the app is running in the same region as your Azure Cosmos DB account.
  • Check the CPU usage on the host where the app is running. If CPU usage is 50 percent or more, run your app on a host with a higher configuration, or distribute the load across more machines.

Connection throttling

Connection throttling can happen because of either a connection limit on a host machine or Azure SNAT (PAT) port exhaustion.

Connection limit on a host machine

Some Linux systems, such as Red Hat, have an upper limit on the total number of open files. Sockets in Linux are implemented as files, so this limit also caps the total number of connections. Run the following command:

ulimit -a

The maximum number of open files allowed, identified as "nofile", needs to be at least double your connection pool size. For more information, see the Azure Cosmos DB Java SDK v4 performance tips.
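The same check can also be run from inside the JVM. The following is a minimal sketch, not part of the SDK: the class name `OpenFileLimitCheck` and the pool size of 1000 are illustrative assumptions, and the limit is read through the JDK's `com.sun.management.UnixOperatingSystemMXBean`, which is only available on Unix-like platforms.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

// Hypothetical helper (not an SDK class) that applies the "at least double" rule.
class OpenFileLimitCheck {

    /** True when the open-file limit is at least double the connection pool size. */
    static boolean isSufficient(long maxOpenFiles, int connectionPoolSize) {
        return maxOpenFiles >= 2L * connectionPoolSize;
    }

    /** Returns this process's open-file limit, or -1 if the JVM does not expose it. */
    static long currentLimit() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        if (os instanceof com.sun.management.UnixOperatingSystemMXBean) {
            return ((com.sun.management.UnixOperatingSystemMXBean) os).getMaxFileDescriptorCount();
        }
        return -1; // for example, on Windows
    }

    public static void main(String[] args) {
        int poolSize = 1000; // assumed value; substitute your configured pool size
        long limit = currentLimit();
        if (limit < 0) {
            System.out.println("Open-file limit is not available on this platform.");
        } else {
            System.out.println("nofile=" + limit + ", sufficient=" + isSufficient(limit, poolSize));
        }
    }
}
```

If the check reports an insufficient limit, raise "nofile" for the user that runs the app or reduce the connection pool size.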

Azure SNAT (PAT) port exhaustion

If your app is deployed on Azure Virtual Machines without a public IP address, by default Azure SNAT ports are used to establish connections to any endpoint outside of your VM. The number of connections allowed from the VM to the Azure Cosmos DB endpoint is limited by the Azure SNAT configuration.

Azure SNAT ports are used only when your VM has a private IP address and a process from the VM tries to connect to a public IP address. There are two workarounds to avoid the Azure SNAT limitation:

  • Add your Azure Cosmos DB service endpoint to the subnet of your Azure Virtual Machines virtual network. For more information, see Azure Virtual Network service endpoints.

    When the service endpoint is enabled, requests are no longer sent from a public IP to Azure Cosmos DB. Instead, the virtual network and subnet identity are sent. This change might result in firewall drops if only public IPs are allowed. If you use a firewall, add the subnet to the firewall by using Virtual Network ACLs when you enable the service endpoint.

  • Assign a public IP to your Azure VM.

Can't reach the service - firewall

ConnectTimeoutException indicates that the SDK can't reach the service. In direct mode, you might see a failure similar to the following:

GoneException{error=null, resourceAddress='https://cdb-ms-prod-chinanorth-fd4.documents.azure.cn:14940/apps/e41242a5-2d71-5acb-2e00-5e5f744b12de/services/d8aa21a5-340b-21d4-b1a2-4a5333e7ed8a/partitions/ed028254-b613-4c2a-bf3c-14bd5eb64500/replicas/131298754052060051p//', statusCode=410, message=Message: The requested resource is no longer available at the server., getCauseInfo=[class: class io.netty.channel.ConnectTimeoutException, message: connection timed out: cdb-ms-prod-chinanorth-fd4.documents.azure.cn/101.13.12.5:14940]

If a firewall is running on your app machine, open the port range 10,000 to 20,000, which direct mode uses. Also follow the guidance in Connection limit on a host machine.
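To check whether a firewall is blocking a given port, a plain TCP connection attempt from the app machine is often enough. The sketch below uses only JDK sockets; the class `PortProbe` and its methods are illustrative names, not SDK APIs. A successful connect shows only that the port is reachable, not that the service behind it is healthy.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;

// Hypothetical helper (not an SDK class) for probing TCP reachability.
class PortProbe {

    /** Returns true if a TCP connection to host:port succeeds within the timeout. */
    static boolean canConnect(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Covers connection timeouts, refusals, and unresolved host names.
            return false;
        }
    }

    /** Self-check against a temporary local listener; needs no external network. */
    static boolean selfCheck() {
        try (ServerSocket listener = new ServerSocket(0)) {
            return canConnect("127.0.0.1", listener.getLocalPort(), 1000);
        } catch (IOException e) {
            return false;
        }
    }
}
```

In practice, you would call `PortProbe.canConnect(host, port, 3000)` with the host name and port taken from the `resourceAddress` in the failure message.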

HTTP proxy

If you use an HTTP proxy, make sure it can support the number of connections configured in the SDK ConnectionPolicy. Otherwise, you'll face connection issues.

Invalid coding pattern: Blocking Netty IO thread

The SDK uses the Netty IO library to communicate with Azure Cosmos DB. The SDK has an Async API and uses the non-blocking IO APIs of Netty. The SDK's IO work is performed on Netty IO threads. The number of Netty IO threads is configured to be the same as the number of CPU cores on the app machine.

The Netty IO threads are meant to be used only for non-blocking Netty IO work. The SDK returns the API invocation result to the app's code on one of the Netty IO threads. If the app performs a long-lasting operation after it receives results on the Netty thread, the SDK might not have enough IO threads to perform its internal IO work. Such app code might result in low throughput, high latency, and io.netty.handler.timeout.ReadTimeoutException failures. The workaround is to switch threads when you know the operation takes time.

For example, consider the following code snippet, which adds items to a container (setting up the database and container is covered in separate guidance). You might perform long-lasting work, taking more than a few milliseconds, on the Netty thread. If so, you can eventually reach a state where no Netty IO thread is available to process IO work. As a result, you get a ReadTimeoutException failure.

Java SDK V4 (Maven com.azure:azure-cosmos) Async API


// Bad code: blocks the Netty IO thread and causes a ReadTimeoutException

int requestTimeoutInSeconds = 10;

/* ... */

AtomicInteger failureCount = new AtomicInteger();
// Max number of concurrent item inserts is # CPU cores + 1
Flux<Family> familyPub =
        Flux.just(Families.getAndersenFamilyItem(), Families.getAndersenFamilyItem(), Families.getJohnsonFamilyItem());
familyPub.flatMap(family -> {
    return container.createItem(family);
}).flatMap(r -> {
    try {
        // Time-consuming work is, for example,
        // writing to a file, computationally heavy work, or just sleep.
        // Basically, it's anything that takes more than a few milliseconds.
        // Doing such operations on the IO Netty thread
        // without a proper scheduler will cause problems.
        // The subscriber will get a ReadTimeoutException failure.
        TimeUnit.SECONDS.sleep(2 * requestTimeoutInSeconds);
    } catch (Exception e) {
    }
    return Mono.empty();
}).doOnError(Exception.class, exception -> {
    failureCount.incrementAndGet();
}).blockLast();
assert(failureCount.get() > 0);

The workaround is to change the thread on which you perform time-consuming work. Define a singleton instance of the scheduler for your app.

Java SDK V4 (Maven com.azure:azure-cosmos) Async API

// Have a singleton instance of an executor and a scheduler.
ExecutorService ex = Executors.newFixedThreadPool(30);
Scheduler customScheduler = Schedulers.fromExecutor(ex);

You might need to do work that takes time, for example, computationally heavy work or blocking IO. In this case, switch the thread to a worker provided by your customScheduler by using the .publishOn(customScheduler) API.

Java SDK V4 (Maven com.azure:azure-cosmos) Async API

container.createItem(family)
        .publishOn(customScheduler) // Switches the thread.
        .subscribe(
                // ...
        );

By using publishOn(customScheduler), you release the Netty IO thread and switch to your own thread provided by the custom scheduler. This modification solves the problem. You won't get an io.netty.handler.timeout.ReadTimeoutException failure anymore.
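The thread hand-off itself can be tried without the SDK. The following sketch is a JDK-only analogy and does not use the Reactor or SDK APIs: a `CompletableFuture` completed on a thread named `fake-io` stands in for an SDK response, and `thenApplyAsync` with a separate executor plays the role that `publishOn(customScheduler)` plays above. All class, method, and thread names are illustrative assumptions.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// JDK-only analogy of moving slow work off an IO thread (not SDK code).
class OffloadSketch {

    /** Runs a slow step on a worker pool and returns the name of the thread that ran it. */
    static String slowStepThreadName() {
        // Stand-in for the Netty IO thread that delivers the response.
        ExecutorService ioPool = Executors.newSingleThreadExecutor(r -> new Thread(r, "fake-io"));
        // Stand-in for the app's custom scheduler.
        ExecutorService workers = Executors.newFixedThreadPool(2, r -> new Thread(r, "app-worker"));
        try {
            CompletableFuture<String> response =
                    CompletableFuture.supplyAsync(() -> "item-created", ioPool);
            return response
                    // The slow continuation runs on the worker pool, not on the IO thread.
                    .thenApplyAsync(result -> {
                        sleepQuietly(50); // simulated time-consuming work
                        return Thread.currentThread().getName();
                    }, workers)
                    .get(5, TimeUnit.SECONDS);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            ioPool.shutdown();
            workers.shutdown();
        }
    }

    private static void sleepQuietly(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

Because the slow step runs on a worker thread, the thread that delivered the response is free to keep servicing IO. That is the property the `publishOn` change relies on.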

Request rate too large

This failure is a server-side failure. It indicates that you've consumed your provisioned throughput. Retry later. If you get this failure often, consider increasing the collection throughput.

  • Implement backoff at getRetryAfterInMilliseconds intervals

    During performance testing, you should increase load until a small fraction of requests gets throttled. If throttled, the client application should back off for the server-specified retry interval. Respecting the backoff ensures that you spend the minimum amount of time waiting between retries.
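The retry loop described above might look like the following minimal sketch. It is not SDK code: `ThrottledException` is a hypothetical stand-in for the SDK's throttling error, and its `retryAfter` field stands in for the server-specified retry interval. A real application would read that interval from the exception the SDK throws.

```java
import java.time.Duration;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Illustrative backoff loop that honors a server-specified retry interval.
class BackoffSketch {

    /** Hypothetical stand-in for the SDK's throttling error; not an SDK class. */
    static class ThrottledException extends RuntimeException {
        final Duration retryAfter;

        ThrottledException(Duration retryAfter) {
            super("Request rate too large");
            this.retryAfter = retryAfter;
        }
    }

    /** Calls the operation, waiting for the server-specified interval after each throttle. */
    static <T> T callWithBackoff(Supplier<T> operation, int maxAttempts) {
        for (int attempt = 1; ; attempt++) {
            try {
                return operation.get();
            } catch (ThrottledException e) {
                if (attempt >= maxAttempts) {
                    throw e; // give up and surface the throttling error
                }
                sleep(e.retryAfter); // wait exactly as long as the server asked
            }
        }
    }

    private static void sleep(Duration duration) {
        try {
            Thread.sleep(duration.toMillis());
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(ie);
        }
    }

    /** Simulates two throttled calls followed by a success; returns the total attempts. */
    static int demoAttempts() {
        AtomicInteger calls = new AtomicInteger();
        callWithBackoff(() -> {
            if (calls.incrementAndGet() <= 2) {
                throw new ThrottledException(Duration.ofMillis(10));
            }
            return "ok";
        }, 5);
        return calls.get();
    }
}
```

Waiting for the server-specified interval, rather than a fixed or guessed delay, avoids both retrying too early (and being throttled again) and waiting longer than necessary.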

Failure connecting to Azure Cosmos DB emulator

The Azure Cosmos DB emulator HTTPS certificate is self-signed. For the SDK to work with the emulator, import the emulator certificate into a Java TrustStore. For more information, see Export Azure Cosmos DB emulator certificates.

Dependency conflict issues

The Azure Cosmos DB Java SDK pulls in a number of dependencies. If your project's dependency tree includes an older version of an artifact that the SDK depends on, you might see unexpected errors when you run your application. When you debug an unexpected exception, double-check that your dependency tree isn't accidentally pulling in an older version of one or more of the SDK's dependencies.

The workaround is to identify which of your project dependencies brings in the old version, exclude the transitive dependency on that older version, and allow the Azure Cosmos DB Java SDK to bring in the newer version.

To identify which of your project dependencies brings in an older version of something that the Azure Cosmos DB Java SDK depends on, run the following command against your project pom.xml file:

mvn dependency:tree

For more information, see the Maven dependency tree guide.

Once you know which of your project's dependencies depends on the older version, modify the dependency on that library in your pom file and exclude the transitive dependency, as in the following example (which assumes that reactor-core is the outdated dependency):

<dependency>
  <groupId>${groupid-of-lib-which-brings-in-reactor}</groupId>
  <artifactId>${artifactId-of-lib-which-brings-in-reactor}</artifactId>
  <version>${version-of-lib-which-brings-in-reactor}</version>
  <exclusions>
    <exclusion>
      <groupId>io.projectreactor</groupId>
      <artifactId>reactor-core</artifactId>
    </exclusion>
  </exclusions>
</dependency>

For more information, see the exclude transitive dependency guide.
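To confirm at runtime which JAR a class was actually loaded from, you can ask the JVM directly. The sketch below uses only standard JDK calls (`Class.forName`, `getProtectionDomain`, `getCodeSource`); the helper class `ClassOrigin` is an illustrative name and not part of the SDK.

```java
import java.security.CodeSource;

// Hypothetical diagnostic helper (not an SDK class).
class ClassOrigin {

    /** Describes where the named class was loaded from, without initializing it. */
    static String describe(String className) {
        try {
            Class<?> type = Class.forName(className, false, ClassOrigin.class.getClassLoader());
            CodeSource source = type.getProtectionDomain().getCodeSource();
            String location = (source == null || source.getLocation() == null)
                    ? "the JDK runtime (no code source)"
                    : source.getLocation().toString();
            Package pkg = type.getPackage();
            // The version comes from the JAR manifest and can be absent.
            String version = (pkg == null) ? null : pkg.getImplementationVersion();
            return className + " loaded from " + location
                    + (version == null ? "" : ", version " + version);
        } catch (ClassNotFoundException | LinkageError e) {
            return className + " not found on classpath";
        }
    }

    public static void main(String[] args) {
        // Pass the class to inspect, for example reactor.core.publisher.Flux.
        String name = args.length > 0 ? args[0] : "java.lang.String";
        System.out.println(describe(name));
    }
}
```

Running this inside your application with a class from the suspected artifact shows the JAR path, which usually includes the version that won the conflict.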

Enable client SDK logging

Azure Cosmos DB Java SDK v4 uses SLF4J as its logging facade, which supports logging to popular logging frameworks such as log4j and logback.

For example, if you want to use log4j as the logging framework, add the following libraries to your Java classpath.

<dependency>
  <groupId>org.slf4j</groupId>
  <artifactId>slf4j-log4j12</artifactId>
  <version>${slf4j.version}</version>
</dependency>
<dependency>
  <groupId>log4j</groupId>
  <artifactId>log4j</artifactId>
  <version>${log4j.version}</version>
</dependency>

Also add a log4j config.

# this is a sample log4j configuration

# Set root logger level to INFO and its only appender to A1.
log4j.rootLogger=INFO, A1

# Java SDK v4 classes live under the com.azure.cosmos package.
log4j.category.com.azure.cosmos=DEBUG
#log4j.category.io.netty=INFO
#log4j.category.io.reactivex=INFO
# A1 is set to be a ConsoleAppender.
log4j.appender.A1=org.apache.log4j.ConsoleAppender

# A1 uses PatternLayout.
log4j.appender.A1.layout=org.apache.log4j.PatternLayout
log4j.appender.A1.layout.ConversionPattern=%d %5X{pid} [%t] %-5p %c - %m%n

For more information, see the SLF4J logging manual.

OS network statistics

Run the netstat command to get a sense of how many connections are in states such as ESTABLISHED and CLOSE_WAIT.

On Linux, you can run the following command.

netstat -nap

On Windows, you can run the same command with different argument flags:

netstat -abn

Filter the result to show only connections to the Azure Cosmos DB endpoint.

The number of connections to the Azure Cosmos DB endpoint in the ESTABLISHED state shouldn't exceed your configured connection pool size.

Many connections to the Azure Cosmos DB endpoint might be in the CLOSE_WAIT state. There might be more than 1,000. A number that high indicates that connections are established and torn down quickly, which can cause problems. For more information, see the Common issues and workarounds section.
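The filtering and counting can be automated once the netstat output is captured. The sketch below is one possible approach under simple assumptions: it treats each line as whitespace-separated columns, keeps lines containing a given address fragment anywhere in the line, and counts the first recognized TCP state on each. The class name, the sample lines, and the IP addresses are made up for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical helper (not an SDK class) that summarizes captured netstat output.
class NetstatSummary {

    private static final Set<String> STATES = new HashSet<>(Arrays.asList(
            "ESTABLISHED", "CLOSE_WAIT", "TIME_WAIT", "SYN_SENT", "FIN_WAIT1", "FIN_WAIT2"));

    /** Counts connection states on lines that contain the given address fragment. */
    static Map<String, Integer> countStates(List<String> netstatLines, String addressFilter) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : netstatLines) {
            if (!line.contains(addressFilter)) {
                continue; // not a connection to the endpoint of interest
            }
            for (String token : line.trim().split("\\s+")) {
                if (STATES.contains(token)) {
                    counts.merge(token, 1, Integer::sum);
                    break;
                }
            }
        }
        return counts;
    }

    /** Runs the counter over made-up sample lines in Linux "netstat -nap" layout. */
    static Map<String, Integer> demo() {
        List<String> sample = Arrays.asList(
                "tcp 0 0 10.0.0.4:51234 203.0.113.7:443 ESTABLISHED 4242/java",
                "tcp 0 0 10.0.0.4:51235 203.0.113.7:443 CLOSE_WAIT 4242/java",
                "tcp 0 0 10.0.0.4:51236 203.0.113.7:443 CLOSE_WAIT 4242/java",
                "tcp 0 0 10.0.0.4:51300 198.51.100.9:22 ESTABLISHED 900/sshd");
        return countStates(sample, "203.0.113.7:");
    }
}
```

Comparing the ESTABLISHED count with the configured pool size, and watching the CLOSE_WAIT count over time, gives a quick view of whether connections are churning.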