Troubleshooting guide for Azure Event Hubs

This article lists some of the .NET exceptions generated by the Event Hubs .NET Framework APIs and offers other tips for troubleshooting issues.

Event Hubs messaging exceptions - .NET

This section lists the .NET exceptions generated by .NET Framework APIs.

Exception categories

The Event Hubs .NET APIs generate exceptions that fall into the following categories. Each category has a general action you can take to try to fix the problem.

  1. User coding error: System.ArgumentException, System.InvalidOperationException, System.OperationCanceledException, System.Runtime.Serialization.SerializationException. General action: try to fix the code before proceeding.
  2. Setup/configuration error: Microsoft.ServiceBus.Messaging.MessagingEntityNotFoundException, Microsoft.Azure.EventHubs.MessagingEntityNotFoundException, System.UnauthorizedAccessException. General action: review your configuration and change it if necessary.
  3. Transient exceptions: Microsoft.ServiceBus.Messaging.MessagingException, Microsoft.ServiceBus.Messaging.ServerBusyException, Microsoft.Azure.EventHubs.ServerBusyException, Microsoft.ServiceBus.Messaging.MessagingCommunicationException. General action: retry the operation or notify users.
  4. Other exceptions: System.Transactions.TransactionException, System.TimeoutException, Microsoft.ServiceBus.Messaging.MessageLockLostException, Microsoft.ServiceBus.Messaging.SessionLockLostException. General action: specific to the exception type; refer to the table in the following section.
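The four categories above can be captured in a small lookup table, which is sometimes handy in logging or alerting code that receives exception type names as strings. The following is a minimal sketch in Python rather than an SDK feature: the exception names and general actions are copied from the list, while the `CATEGORIES` layout and the `classify` helper are hypothetical names introduced for illustration.

```python
from typing import Tuple

# Exception names and general actions are copied from the list above.
# The dictionary layout and classify() are hypothetical, not SDK APIs.
CATEGORIES = {
    "user coding error": (
        "try to fix the code before proceeding",
        {
            "System.ArgumentException",
            "System.InvalidOperationException",
            "System.OperationCanceledException",
            "System.Runtime.Serialization.SerializationException",
        },
    ),
    "setup/configuration error": (
        "review your configuration and change it if necessary",
        {
            "Microsoft.ServiceBus.Messaging.MessagingEntityNotFoundException",
            "Microsoft.Azure.EventHubs.MessagingEntityNotFoundException",
            "System.UnauthorizedAccessException",
        },
    ),
    "transient exception": (
        "retry the operation or notify users",
        {
            "Microsoft.ServiceBus.Messaging.MessagingException",
            "Microsoft.ServiceBus.Messaging.ServerBusyException",
            "Microsoft.Azure.EventHubs.ServerBusyException",
            "Microsoft.ServiceBus.Messaging.MessagingCommunicationException",
        },
    ),
    "other exception": (
        "specific to the exception type; see the exception types table",
        {
            "System.Transactions.TransactionException",
            "System.TimeoutException",
            "Microsoft.ServiceBus.Messaging.MessageLockLostException",
            "Microsoft.ServiceBus.Messaging.SessionLockLostException",
        },
    ),
}


def classify(exception_type_name: str) -> Tuple[str, str]:
    """Return (category, general action) for a fully qualified exception name."""
    for category, (action, names) in CATEGORIES.items():
        if exception_type_name in names:
            return category, action
    # Names not in the list fall through to a neutral default.
    return "unknown", "see the exception message for details"
```

For example, `classify("System.TimeoutException")` falls in the "other exception" category, which points to the per-exception guidance in the next section.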

Exception types

The following table lists messaging exception types and their causes, and notes the suggested action you can take.

| Exception type | Description/Cause/Examples | Suggested action | Note on automatic/immediate retry |
| --- | --- | --- | --- |
| TimeoutException | The server didn't respond to the requested operation within the specified time, which is controlled by OperationTimeout. The server may have completed the requested operation. This exception can happen because of network or other infrastructure delays. | Check the system state for consistency and retry if necessary. See TimeoutException. | Retry might help in some cases; add retry logic to code. |
| InvalidOperationException | The requested user operation isn't allowed within the server or service. See the exception message for details. For example, Complete generates this exception if the message was received in ReceiveAndDelete mode. | Check the code and the documentation. Make sure the requested operation is valid. | Retry won't help. |
| OperationCanceledException | An attempt is made to invoke an operation on an object that has already been closed, aborted, or disposed. In rare cases, the ambient transaction is already disposed. | Check the code and make sure it doesn't invoke operations on a disposed object. | Retry won't help. |
| UnauthorizedAccessException | The TokenProvider object couldn't acquire a token, the token is invalid, or the token doesn't contain the claims required to do the operation. | Make sure the token provider is created with the correct values. Check the configuration of the Access Control Service. | Retry might help in some cases; add retry logic to code. |
| ArgumentException, ArgumentNullException, ArgumentOutOfRangeException | One or more arguments supplied to the method are invalid. The URI supplied to NamespaceManager or Create contains path segments. The URI scheme supplied to NamespaceManager or Create is invalid. The property value is larger than 32 KB. | Check the calling code and make sure the arguments are correct. | Retry won't help. |
| Microsoft.ServiceBus.Messaging.MessagingEntityNotFoundException, Microsoft.Azure.EventHubs.MessagingEntityNotFoundException | The entity associated with the operation doesn't exist or has been deleted. | Make sure the entity exists. | Retry won't help. |
| MessagingCommunicationException | The client can't establish a connection to the event hub. | Make sure the supplied host name is correct and the host is reachable. | Retry might help if there are intermittent connectivity issues. |
| Microsoft.ServiceBus.Messaging.ServerBusyException, Microsoft.Azure.EventHubs.ServerBusyException | The service can't process the request at this time. | The client can wait for a period of time, then retry the operation. See ServerBusyException. | The client may retry after a certain interval. If a retry results in a different exception, check the retry behavior of that exception. |
| MessagingException | Generic messaging exception that may be thrown in the following cases: an attempt is made to create a QueueClient using a name or path that belongs to a different entity type (for example, a topic); an attempt is made to send a message larger than 1 MB; the server or service encountered an error while processing the request. See the exception message for details. This exception is usually transient. | Check the code and ensure that only serializable objects are used for the message body (or use a custom serializer). Check the documentation for the supported value types of the properties and only use supported types. Check the IsTransient property. If it's true, you can retry the operation. | Retry behavior is undefined and might not help. |
| MessagingEntityAlreadyExistsException | An attempt is made to create an entity with a name that is already used by another entity in that service namespace. | Delete the existing entity or choose a different name for the entity to be created. | Retry won't help. |
| QuotaExceededException | The messaging entity has reached its maximum allowable size. This exception can happen if the maximum number of receivers (which is 5) has already been opened on a per-consumer group level. | Create space in the entity by receiving messages from the entity or its subqueues. See QuotaExceededException. | Retry might help if messages have been removed in the meantime. |
| MessagingEntityDisabledException | A runtime operation is requested on a disabled entity. | Activate the entity. | Retry might help if the entity has been activated in the interim. |
| Microsoft.ServiceBus.Messaging.MessageSizeExceededException, Microsoft.Azure.EventHubs.MessageSizeExceededException | A message payload exceeds the 1-MB limit. This 1-MB limit is for the total message, which can include system properties and any .NET overhead. | Reduce the size of the message payload, then retry the operation. | Retrying the unchanged message won't help. |
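The retry column of the table above lends itself to a simple decision helper. The sketch below, written in Python for illustration, encodes that column by short exception name. `RETRY_GUIDANCE` and `should_retry` are hypothetical names, and the `is_transient` argument stands in for the IsTransient property mentioned in the MessagingException row; none of this is an SDK API.

```python
from typing import Optional

# Transcribed from the retry column of the table above:
# True  = retry might help, False = retry won't help,
# None  = undefined; decide from the IsTransient property instead.
RETRY_GUIDANCE = {
    "TimeoutException": True,
    "InvalidOperationException": False,
    "OperationCanceledException": False,
    "UnauthorizedAccessException": True,
    "ArgumentException": False,
    "ArgumentNullException": False,
    "ArgumentOutOfRangeException": False,
    "MessagingEntityNotFoundException": False,
    "MessagingCommunicationException": True,
    "ServerBusyException": True,
    "MessagingException": None,
    "MessagingEntityAlreadyExistsException": False,
    "QuotaExceededException": True,
    "MessagingEntityDisabledException": True,
    "MessageSizeExceededException": False,
}


def should_retry(exception_name: str, is_transient: Optional[bool] = None) -> bool:
    """Decide whether a retry is worth attempting for the given exception."""
    guidance = RETRY_GUIDANCE.get(exception_name)
    if guidance is None:
        # MessagingException or an unlisted exception: rely on the
        # transient flag, and do not retry when it is unknown.
        return bool(is_transient)
    return guidance
```

A `True` result only means a retry might help; the conditions in the table (for example, messages being removed before a QuotaExceededException retry) still apply.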

QuotaExceededException

QuotaExceededException indicates that a quota for a specific entity has been exceeded.

This exception can happen if the maximum number of receivers (5) has already been opened on a per-consumer group level.

Event Hubs

Event Hubs has a limit of 20 consumer groups per event hub. When you attempt to create more, you receive a QuotaExceededException.

TimeoutException

A TimeoutException indicates that a user-initiated operation is taking longer than the operation timeout.

For Event Hubs, the timeout is specified either as part of the connection string or through ServiceBusConnectionStringBuilder. The error message itself might vary, but it always contains the timeout value specified for the current operation.

Common causes

There are two common causes for this error: incorrect configuration, or a transient service error.

  1. Incorrect configuration: The operation timeout might be too small for the operating conditions. The default value for the operation timeout in the client SDK is 60 seconds. Check whether your code sets the value too low. Network conditions and CPU usage affect the time a particular operation takes to complete, so don't set the operation timeout to a small value.
  2. Transient service error: Sometimes the Event Hubs service experiences delays in processing requests, for example during periods of high traffic. In such cases, retry the operation after a delay until it succeeds. If the same operation still fails after multiple attempts, visit the Azure service status site to see if there are any known service outages.
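Retrying after a delay, as suggested for transient service errors, is commonly done with a bounded number of attempts and a growing delay. The following Python sketch shows one way to do that under stated assumptions: it uses Python's built-in `TimeoutError` as a stand-in for the .NET TimeoutException, and `retry_with_backoff`, `max_attempts`, `base_delay`, and `max_delay` are illustrative names and defaults, not SDK settings.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation(); on a timeout, wait and try again with a doubled delay."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    attempt = 1
    while True:
        try:
            return operation()
        except TimeoutError:
            if attempt >= max_attempts:
                # Give up: the caller can now check the service status page.
                raise
            # Delay doubles each attempt (1s, 2s, 4s, ...) up to max_delay.
            sleep(min(max_delay, base_delay * 2 ** (attempt - 1)))
            attempt += 1
```

Capping the number of attempts keeps a persistent outage from turning into an endless loop, which matches the advice above to check for known service outages after repeated failures.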

ServerBusyException

A Microsoft.ServiceBus.Messaging.ServerBusyException or Microsoft.Azure.EventHubs.ServerBusyException indicates that a server is overloaded. There are two relevant error codes for this exception.

Error code 50002

This error can occur for one of two reasons:

  1. The load isn't evenly distributed across all partitions on the event hub, and one partition hits the local throughput unit limitation.

    Resolution: Revising the partition distribution strategy or trying EventHubClient.Send(eventDataWithOutPartitionKey) might help.

  2. The Event Hubs namespace doesn't have sufficient throughput units. To confirm, check the Metrics screen in the Event Hubs namespace window in the Azure portal. The portal shows information aggregated over 1 minute, but throughput is measured in real time, so the portal figure is only an estimate.

    Resolution: Increasing the throughput units on the namespace can help. You can do this on the portal, in the Scale window of the Event Hubs namespace screen. Alternatively, you can use Auto-inflate.
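To see why the first cause produces a hot partition, it can help to simulate how events land on partitions. The Python sketch below is a simplified model, not the service's actual algorithm: it assumes events with a partition key are placed by hashing the key (SHA-256 is used here purely as a stand-in) and that events without a key are spread round-robin. `assign_partitions` is a hypothetical helper.

```python
import hashlib
from collections import Counter
from itertools import cycle
from typing import Iterable, Optional


def assign_partitions(keys: Iterable[Optional[str]], partition_count: int) -> Counter:
    """Count events per partition; a key of None means 'no partition key'."""
    counts = Counter({p: 0 for p in range(partition_count)})
    round_robin = cycle(range(partition_count))
    for key in keys:
        if key is None:
            # Assumption: keyless events are distributed evenly in turn.
            partition = next(round_robin)
        else:
            # Assumption: a stable hash of the key selects the partition.
            digest = hashlib.sha256(key.encode("utf-8")).digest()
            partition = int.from_bytes(digest[:8], "big") % partition_count
        counts[partition] += 1
    return counts


# One dominant key sends every event to a single partition...
skewed = assign_partitions(["device-1"] * 1000, partition_count=4)
# ...while keyless events are spread evenly across all four.
even = assign_partitions([None] * 1000, partition_count=4)
```

In this model, `skewed` puts all 1000 events on one partition and `even` puts 250 on each, which illustrates why sending without a partition key can relieve a partition-level throughput limit.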

Error code 50001

This error should rarely occur. It happens when the container running code for your namespace is low on CPU, which should last no more than a few seconds before the Event Hubs load balancer begins.

Limit on calls to the GetRuntimeInformation method

Azure Event Hubs supports up to 50 calls per second to GetRuntimeInfo. Once the limit is reached, you may receive an exception similar to the following one:

ExceptionId: 00000000000-00000-0000-a48a-9c908fbe84f6-ServerBusyException: The request was terminated because the namespace 75248:aaa-default-eventhub-ns-prodb2b is being throttled. Error code : 50001. Please wait 10 seconds and try again.
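A client that hits this limit can read the error code and the suggested wait time out of the message before retrying. The Python sketch below parses the sample message shown above. It assumes the wording stays similar to that sample, which isn't a documented contract, and `ThrottleInfo` and `parse_throttle_message` are hypothetical names.

```python
import re
from typing import NamedTuple, Optional


class ThrottleInfo(NamedTuple):
    error_code: Optional[int]    # for example 50001 or 50002
    wait_seconds: Optional[int]  # suggested wait before retrying


# Patterns based only on the sample message; the format may differ in practice.
_ERROR_CODE = re.compile(r"Error code\s*:\s*(\d+)")
_WAIT = re.compile(r"wait\s+(\d+)\s+seconds", re.IGNORECASE)


def parse_throttle_message(message: str) -> ThrottleInfo:
    """Extract the error code and wait time, or None for parts not found."""
    code = _ERROR_CODE.search(message)
    wait = _WAIT.search(message)
    return ThrottleInfo(
        error_code=int(code.group(1)) if code else None,
        wait_seconds=int(wait.group(1)) if wait else None,
    )
```

Because the message format is an assumption, treat a `None` field as "unknown" and fall back to a fixed delay rather than failing.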

Connectivity, certificate, or timeout issues

The following steps may help you troubleshoot connectivity, certificate, or timeout issues for all services under *.servicebus.chinacloudapi.cn.

  • Browse to or wget https://<yournamespacename>.servicebus.chinacloudapi.cn/. This helps check whether you have IP filtering, virtual network, or certificate chain issues (most common when using the Java SDK).

    An example of a successful message:

    <feed xmlns="http://www.w3.org/2005/Atom"><title type="text">Publicly Listed Services</title><subtitle type="text">This is the list of publicly-listed services currently available.</subtitle><id>uuid:27fcd1e2-3a99-44b1-8f1e-3e92b52f0171;id=30</id><updated>2019-12-27T13:11:47Z</updated><generator>Service Bus 1.1</generator></feed>
    

    An example of a failure error message:

    <Error>
        <Code>400</Code>
        <Detail>
            Bad Request. To know more visit https://aka.ms/sbResourceMgrExceptions. . TrackingId:b786d4d1-cbaf-47a8-a3d1-be689cda2a98_G22, SystemTracker:NoSystemTracker, Timestamp:2019-12-27T13:12:40
        </Detail>
    </Error>
    
  • Run the following command to check whether any port is blocked by the firewall. The ports used are 443 (HTTPS), 5671 (AMQP), and 9093 (Kafka). Depending on the library you use, other ports are also used. Here is a sample command that checks whether port 5671 is blocked.

    tnc <yournamespacename>.servicebus.chinacloudapi.cn -port 5671
    

    On Linux:

    telnet <yournamespacename>.servicebus.chinacloudapi.cn 5671
    
  • When there are intermittent connectivity issues, run the following command to check whether any packets are dropped. This command tries to establish 25 different TCP connections with the service, one every second. You can then check how many of them succeeded or failed and see the TCP connection latency. You can download the psping tool from the Sysinternals site.

    .\psping.exe -n 25 -i 1 -q <yournamespacename>.servicebus.chinacloudapi.cn:5671 -nobanner     
    

    You can use equivalent commands if you're using other tools such as tnc, ping, and so on.

  • If the previous steps don't help, obtain a network trace and analyze it using a tool such as Wireshark. Contact Microsoft Support if needed.
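Where the command-line tools above aren't available, the first three checks can be approximated with the Python standard library. The sketch below is a rough equivalent and not a replacement for wget, tnc, or psping: `classify_response` tells the Atom feed from the `<Error>` document shown in the examples, `connect_latency` opens a single TCP connection, and `probe` repeats that 25 times at 1-second intervals. All three function names and their parameters are hypothetical.

```python
import socket
import time
import xml.etree.ElementTree as ET
from typing import Callable, List, Optional, Tuple


def classify_response(body: str) -> str:
    """Return 'ok' for the Atom feed, 'error' for an <Error> document, else 'unknown'."""
    try:
        root = ET.fromstring(body)
    except ET.ParseError:
        return "unknown"
    tag = root.tag.rsplit("}", 1)[-1]  # drop the XML namespace prefix, if any
    if tag == "feed":
        return "ok"
    if tag == "Error":
        return "error"
    return "unknown"


def connect_latency(host: str, port: int, timeout: float = 5.0) -> Optional[float]:
    """Open one TCP connection; return its latency in seconds, or None on failure."""
    start = time.perf_counter()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.perf_counter() - start
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return None


def probe(
    host: str,
    port: int,
    count: int = 25,
    interval: float = 1.0,
    connect: Callable[[str, int], Optional[float]] = connect_latency,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, List[float]]:
    """Try `count` connections, `interval` seconds apart; return (successes, latencies)."""
    latencies: List[float] = []
    for i in range(count):
        latency = connect(host, port)
        if latency is not None:
            latencies.append(latency)
        if i < count - 1:
            sleep(interval)
    return len(latencies), latencies
```

For example, `probe("<yournamespacename>.servicebus.chinacloudapi.cn", 5671)` returns how many of the 25 attempts connected along with their latencies. If fewer than 25 succeed, the connection is being dropped intermittently and a network trace is the next step.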

Next steps

You can learn more about Event Hubs by visiting the following links: