Diagnose and troubleshoot issues when using the Azure Cosmos DB .NET SDK

Applies to: SQL API

This article covers common issues, workarounds, diagnostic steps, and tools for using the .NET SDK with Azure Cosmos DB SQL API accounts. The .NET SDK provides a client-side logical representation for accessing the Azure Cosmos DB SQL API. The tools and approaches described here can help when you run into issues.

Checklist for troubleshooting issues

Consider the following checklist before you move your application to production. Following it prevents several common issues and helps you diagnose quickly when an issue does occur:

  • Use the latest SDK. Preview SDKs should not be used in production. Staying current avoids known issues that have already been fixed.
  • Review the performance tips and follow the suggested practices. This helps prevent scaling, latency, and other performance issues.
  • Enable SDK logging to help you troubleshoot an issue. Logging can affect performance, so it's best to enable it only while troubleshooting. You can enable the following logs:
    • Log metrics by using the Azure portal. Portal metrics show the Azure Cosmos DB telemetry, which helps determine whether the issue lies with Azure Cosmos DB or with the client.
    • Log the diagnostics string (V2 SDK) or diagnostics (V3 SDK) from the point operation responses.
    • Log the SQL query metrics from all the query responses.
    • Follow the setup for SDK logging.
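
As an illustration of logging diagnostics from a point operation, here is a minimal sketch for the V3 SDK (the Microsoft.Azure.Cosmos package). The class name, the method name, and the 300 ms threshold are assumptions made for this example, not part of the SDK; the sketch logs diagnostics only for slow or failed reads because logging can affect performance.

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;

public static class DiagnosticsLogging
{
    // Hypothetical threshold; tune it to your own latency expectations.
    private static readonly TimeSpan SlowRequestThreshold = TimeSpan.FromMilliseconds(300);

    public static async Task<T> ReadWithDiagnosticsAsync<T>(
        Container container, string id, string partitionKeyValue)
    {
        try
        {
            ItemResponse<T> response =
                await container.ReadItemAsync<T>(id, new PartitionKey(partitionKeyValue));

            // Only log the (potentially large) diagnostics for slow requests.
            if (response.Diagnostics.GetClientElapsedTime() > SlowRequestThreshold)
            {
                Console.WriteLine(
                    $"Slow read ({response.RequestCharge} RU): {response.Diagnostics}");
            }

            return response.Resource;
        }
        catch (CosmosException ex)
        {
            // Failed operations expose diagnostics on the exception.
            Console.WriteLine($"Read failed with status {(int)ex.StatusCode}: {ex.Diagnostics}");
            throw;
        }
    }
}
```

In a real application, Console.WriteLine would typically be replaced by whatever logging framework the application already uses.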

See the Common issues and workarounds section in this article.

Check the GitHub issues section, which is actively monitored, to see whether a similar issue with a workaround has already been filed. If you don't find a solution, file a GitHub issue. For urgent issues, you can open a support ticket.

Common issues and workarounds

General suggestions

  • Run your app in the same Azure region as your Azure Cosmos DB account whenever possible.
  • You may run into connectivity or availability issues because of a lack of resources on your client machine. Monitor CPU utilization on the nodes running the Azure Cosmos DB client, and scale up or out if they're running at high load.

Check the portal metrics

Checking the portal metrics helps determine whether the issue is on the client side or in the service. For example, if the metrics show a high rate of rate-limited requests (HTTP status code 429, which means requests are being throttled), check the Request rate too large section.

Retry logic

On any I/O failure, the Cosmos DB SDK attempts to retry the failed operation if a retry in the SDK is feasible. Having a retry in place for any failure is good practice, but handling and retrying write failures in particular is a must. Use the latest SDK, because the retry logic is continuously being improved.

  1. Read and query I/O failures are retried by the SDK without surfacing them to the end user.
  2. Writes (Create, Upsert, Replace, Delete) are not idempotent, so the SDK cannot always blindly retry failed write operations. The application logic must handle the failure and retry.
  3. The article on troubleshooting SDK availability explains retries for multi-region Cosmos DB accounts.
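
The application-side handling described in point 2 might look like the following sketch for the V3 SDK. The helper name, the choice of which status codes to treat as transient, the attempt count, and the backoff values are all assumptions made for illustration; the right policy depends on the application.

```csharp
using System;
using System.Net;
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;

public static class WriteRetry
{
    // Assumption for this sketch: these status codes are treated as transient.
    private static bool IsTransient(HttpStatusCode status) =>
        status == HttpStatusCode.RequestTimeout          // 408
        || status == HttpStatusCode.Gone                 // 410
        || status == (HttpStatusCode)449                 // 449
        || status == HttpStatusCode.ServiceUnavailable;  // 503

    // Runs the supplied write and retries it on transient failures.
    public static async Task<TResult> ExecuteWithRetryAsync<TResult>(
        Func<Task<TResult>> write, int maxAttempts = 3)
    {
        for (int attempt = 1; ; attempt++)
        {
            try
            {
                return await write();
            }
            catch (CosmosException ex)
                when (IsTransient(ex.StatusCode) && attempt < maxAttempts)
            {
                // Simple exponential backoff: 200 ms, then 400 ms, and so on.
                await Task.Delay(TimeSpan.FromMilliseconds(100 * Math.Pow(2, attempt)));
            }
        }
    }
}
```

A call could look like `await WriteRetry.ExecuteWithRetryAsync(() => container.UpsertItemAsync(order, new PartitionKey(order.CustomerId)));`, where `order` and `CustomerId` are hypothetical. Whether repeating a given write is safe remains an application decision: for example, a create that is re-sent after a timeout can come back with a 409 conflict if the first attempt actually succeeded.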

Common error status codes

Status code   Description
400           Bad request (depends on the error message).
401           Not authorized.
404           Resource is not found.
408           Request timed out.
409           Conflict: the ID provided for a resource on a write operation is already taken by an existing resource. Use another ID for the resource to resolve this issue, because the ID must be unique among all documents with the same partition key value.
410           Gone exceptions (transient failure that should not violate the SLA).
412           Precondition failure: the operation specified an eTag that differs from the version available at the server. It's an optimistic concurrency error. Retry the request after reading the latest version of the resource and updating the eTag on the request.
413           Request entity too large.
429           Too many requests.
449           Transient error that occurs only on write operations and is safe to retry.
500           The operation failed due to an unexpected service error. Contact support. See Filing an Azure support issue.
503           Service unavailable.
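
The recovery described for status code 412 (re-read the resource, update the eTag, retry) can be sketched with the V3 SDK as follows. The class and method names, the `applyChange` callback, and the attempt limit are assumptions made for this example.

```csharp
using System;
using System.Net;
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;

public static class OptimisticConcurrency
{
    public static async Task<T> UpdateWithETagAsync<T>(
        Container container, string id, PartitionKey partitionKey,
        Action<T> applyChange, int maxAttempts = 5) where T : class
    {
        for (int attempt = 1; ; attempt++)
        {
            // Read the latest version; the response carries its current ETag.
            ItemResponse<T> current = await container.ReadItemAsync<T>(id, partitionKey);
            T item = current.Resource;
            applyChange(item);

            try
            {
                // The replace succeeds only if the server-side ETag still matches.
                ItemResponse<T> replaced = await container.ReplaceItemAsync(
                    item, id, partitionKey,
                    new ItemRequestOptions { IfMatchEtag = current.ETag });
                return replaced.Resource;
            }
            catch (CosmosException ex)
                when (ex.StatusCode == HttpStatusCode.PreconditionFailed
                      && attempt < maxAttempts)
            {
                // Another writer changed the item; loop to re-read and try again.
            }
        }
    }
}
```

Once the attempt limit is reached, the 412 exception propagates to the caller, which can then decide whether to give up or surface the conflict.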

Azure SNAT (PAT) port exhaustion

If your app is deployed on Azure Virtual Machines without a public IP address, by default Azure SNAT ports are used to establish connections to any endpoint outside of your VM. The number of connections allowed from the VM to the Azure Cosmos DB endpoint is limited by the Azure SNAT configuration. This situation can lead to connection throttling, connection closure, or the request timeouts mentioned above.

Azure SNAT ports are used only when your VM has a private IP address and is connecting to a public IP address. There are two workarounds to avoid the Azure SNAT limitation (provided you are already using a single client instance across the entire application):

  • Add your Azure Cosmos DB service endpoint to the subnet of your Azure Virtual Machines virtual network. For more information, see Azure Virtual Network service endpoints.

    When the service endpoint is enabled, requests are no longer sent from a public IP to Azure Cosmos DB. Instead, the virtual network and subnet identity are sent. This change might result in firewall drops if only public IPs are allowed. If you use a firewall, add the subnet to the firewall by using Virtual Network ACLs when you enable the service endpoint.

  • Assign a public IP to your Azure VM.
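
Both workarounds presuppose a single client instance for the whole application. One way to arrange that with the V3 SDK is sketched below; the holder class and the environment variable names COSMOS_ENDPOINT and COSMOS_KEY are placeholders chosen for this example, and applications that use dependency injection would more likely register the client as a singleton service instead.

```csharp
using System;
using Microsoft.Azure.Cosmos;

public static class CosmosClientHolder
{
    // Lazy<T> creates the client once, on first use, in a thread-safe way.
    private static readonly Lazy<CosmosClient> LazyClient = new Lazy<CosmosClient>(() =>
        new CosmosClient(
            // Hypothetical configuration source for the account endpoint and key.
            Environment.GetEnvironmentVariable("COSMOS_ENDPOINT"),
            Environment.GetEnvironmentVariable("COSMOS_KEY")));

    // Every caller shares this one instance and its connections.
    public static CosmosClient Client => LazyClient.Value;
}
```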

High network latency

High network latency can be identified by using the diagnostics string in the V2 SDK or diagnostics in the V3 SDK.

If no timeouts are present, look in the diagnostics for single requests where high latency is evident from the difference between ResponseTime and RequestStartTime, as in this example (more than 300 milliseconds):

RequestStartTime: 2020-03-09T22:44:49.5373624Z, RequestEndTime: 2020-03-09T22:44:49.9279906Z,  Number of regions attempted:1
ResponseTime: 2020-03-09T22:44:49.9279906Z, StoreResult: StorePhysicalAddress: rntbd://..., ...
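
To make the arithmetic concrete, the following self-contained sketch computes the difference between the two timestamps shown above. The helper names are made up for this example, and it assumes the timestamp strings have already been extracted from the diagnostics text.

```csharp
using System;
using System.Globalization;

public static class DiagnosticsLatency
{
    // Returns the time elapsed between RequestStartTime and ResponseTime.
    public static TimeSpan Between(string requestStartTime, string responseTime)
    {
        DateTimeOffset start = DateTimeOffset.Parse(requestStartTime, CultureInfo.InvariantCulture);
        DateTimeOffset end = DateTimeOffset.Parse(responseTime, CultureInfo.InvariantCulture);
        return end - start;
    }

    // True when the latency exceeds the given threshold.
    public static bool IsSlow(TimeSpan latency, TimeSpan threshold) => latency > threshold;
}
```

For the sample above, `DiagnosticsLatency.Between("2020-03-09T22:44:49.5373624Z", "2020-03-09T22:44:49.9279906Z")` yields roughly 390.6 milliseconds, which exceeds the 300-millisecond mark used in this example.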

This latency can have multiple causes.

Common query issues

The query metrics help determine where the query spends most of its time. From the query metrics, you can see how much time is spent on the back end versus the client. Learn more about troubleshooting query performance.

  • If the back-end query returns quickly but a large amount of time is spent on the client, check the load on the machine. It's likely that there aren't enough resources and the SDK is waiting for resources to become available to handle the response.

  • If the back-end query is slow, try optimizing the query and reviewing the current indexing policy.


    For improved performance, we recommend Windows 64-bit host processing. The SQL SDK includes a native ServiceInterop.dll to parse and optimize queries locally. ServiceInterop.dll is supported only on the Windows x64 platform. On Linux and other unsupported platforms where ServiceInterop.dll isn't available, an additional network call is made to the gateway to get the optimized query.

If you encounter the error "Unable to load DLL 'Microsoft.Azure.Cosmos.ServiceInterop.dll' or one of its dependencies:" and are using Windows, upgrade to the latest Windows version.
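
To see where a query spends its time, one option with the V3 SDK is to log the request charge and diagnostics of every result page, as in the sketch below. The helper name is an assumption made for this example, and exactly which query metrics appear inside the diagnostics text depends on the SDK version in use.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;

public static class QueryDiagnostics
{
    public static async Task<List<T>> QueryWithDiagnosticsAsync<T>(Container container, string sql)
    {
        var results = new List<T>();

        using (FeedIterator<T> iterator =
            container.GetItemQueryIterator<T>(new QueryDefinition(sql)))
        {
            // Each ReadNextAsync call fetches one page of results.
            while (iterator.HasMoreResults)
            {
                FeedResponse<T> page = await iterator.ReadNextAsync();

                double elapsedMs = page.Diagnostics.GetClientElapsedTime().TotalMilliseconds;
                Console.WriteLine(
                    $"Page with {page.Count} items: {page.RequestCharge} RU, {elapsedMs} ms on the client");
                Console.WriteLine(page.Diagnostics.ToString());

                results.AddRange(page);
            }
        }

        return results;
    }
}
```

Because logging can affect performance, a loop like this is best enabled only while investigating a slow query.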

Next steps