Troubleshoot connectivity issues - Azure Event Hubs

A client application can fail to connect to an event hub for many reasons. The connectivity issues that you experience may be permanent or transient. If the issue happens all the time (permanent), check the connection string, your organization's firewall settings, IP firewall settings, network security settings (service endpoints, private endpoints, and so on), and more. For transient issues, upgrading to the latest version of the SDK, running commands to check for dropped packets, and obtaining network traces may help with troubleshooting.

This article provides tips for troubleshooting connectivity issues with Azure Event Hubs.

Troubleshoot permanent connectivity issues

If the application can't connect to the event hub at all, follow the steps in this section to troubleshoot the issue.

Check if there is a service outage

Check the Azure service status site for an Azure Event Hubs service outage.

Verify the connection string

Verify that the connection string you are using is correct. See Get connection string for how to obtain it using the Azure portal, CLI, or PowerShell.
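
Before comparing the string against the portal, a quick structural check can catch copy-and-paste damage. The following is a minimal sketch, assuming the usual semicolon-separated `Key=Value` layout with an `Endpoint` that starts with `sb://` and either a `SharedAccessKeyName`/`SharedAccessKey` pair or a `SharedAccessSignature`. The function names are hypothetical, and the check says nothing about whether the key itself is valid.

```python
def parse_connection_string(conn_str: str) -> dict:
    """Split 'Key=Value;Key=Value' pairs into a dict with lower-cased keys."""
    parts = {}
    for segment in conn_str.strip().split(";"):
        if not segment:
            continue  # tolerate a trailing semicolon
        # Split on the first '=' only: base64 keys often end with '='.
        key, sep, value = segment.partition("=")
        if not sep:
            raise ValueError(f"Malformed segment: {segment!r}")
        parts[key.strip().lower()] = value.strip()
    return parts


def check_connection_string(conn_str: str) -> list:
    """Return a list of problems found; an empty list means the shape looks right."""
    try:
        parts = parse_connection_string(conn_str)
    except ValueError as exc:
        return [str(exc)]
    problems = []
    if not parts.get("endpoint", "").startswith("sb://"):
        problems.append("Endpoint is missing or does not start with sb://")
    has_key_pair = "sharedaccesskeyname" in parts and "sharedaccesskey" in parts
    if not has_key_pair and "sharedaccesssignature" not in parts:
        problems.append("No SharedAccessKeyName/SharedAccessKey pair or SharedAccessSignature")
    return problems
```

An empty result only means the string is well formed; a wrong namespace name or a rotated key still has to be checked against the value shown in the portal.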

For Kafka clients, verify that the producer.config or consumer.config files are configured properly. For more information, see Send and receive messages with Kafka in Event Hubs.

Check if the ports required to communicate with Event Hubs are blocked by your organization's firewall

Verify that the ports used to communicate with Azure Event Hubs aren't blocked by your organization's firewall. The following table lists the outbound ports you need to open.

Protocol       Ports            Details
AMQP           5671 and 5672    See AMQP protocol guide
HTTP, HTTPS    80, 443
Kafka          9093             See Use Event Hubs from Kafka applications

The following sample command checks whether port 5671 is blocked (tnc is the PowerShell alias for Test-NetConnection).

tnc <yournamespacename>.servicebus.chinacloudapi.cn -port 5671

On Linux:

telnet <yournamespacename>.servicebus.chinacloudapi.cn 5671
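
The same check can be scripted to cover every port in the table above in one run. This is a minimal cross-platform sketch using only the Python standard library; the names `port_is_reachable` and `check_ports` are hypothetical, and a failed connection shows only that the port could not be reached from this machine, not which device blocked it.

```python
import socket

# Outbound ports taken from the table above.
PORTS = {"AMQP": [5671, 5672], "HTTP/HTTPS": [80, 443], "Kafka": [9093]}


def port_is_reachable(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False


def check_ports(host: str, ports=PORTS) -> dict:
    """Map each (protocol, port) pair to True/False reachability."""
    return {
        (protocol, port): port_is_reachable(host, port)
        for protocol, port_list in ports.items()
        for port in port_list
    }
```

For example, `check_ports("<yournamespacename>.servicebus.chinacloudapi.cn")` returns one entry per port, which makes it easy to see whether only the AMQP ports are affected.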

Verify that IP addresses are allowed in your corporate firewall

When you work with Azure, you sometimes have to allow specific IP address ranges or URLs in your corporate firewall or proxy to access the Azure services you are using or trying to use. Verify that traffic is allowed to the IP addresses used by Event Hubs. For the IP addresses used by Azure Event Hubs, see Azure IP Ranges and Service Tags - China Cloud.

Also, verify that the IP address for your namespace is allowed. To find the right IP addresses to allow for your connections, follow these steps:

  1. Run the following command from a command prompt:

    nslookup <YourNamespaceName>.servicebus.chinacloudapi.cn
    
  2. Note down the IP address returned in Non-authoritative answer. This address changes only if you restore the namespace onto a different cluster.

If you use zone redundancy for your namespace, you need to take a few additional steps:

  1. First, run nslookup on the namespace.

    nslookup <yournamespace>.servicebus.chinacloudapi.cn
    
  2. Note down the name in the non-authoritative answer section, which is in one of the following formats:

    <name>-s1.chinacloudapp.cn
    <name>-s2.chinacloudapp.cn
    <name>-s3.chinacloudapp.cn
    
  3. Run nslookup for each name with the suffixes s1, s2, and s3 to get the IP addresses of all three instances running in the three availability zones.
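
The zone-redundancy lookup above can be sketched in Python with the standard resolver. This assumes the name noted in step 2 follows the `<name>-sN.<domain>` pattern shown; `resolve_ipv4` and `zone_instance_names` are hypothetical helper names, and the addresses returned depend on the DNS server the machine uses.

```python
import socket


def resolve_ipv4(hostname: str) -> list:
    """Return the sorted, de-duplicated IPv4 addresses that DNS reports for hostname."""
    infos = socket.getaddrinfo(hostname, None, family=socket.AF_INET, type=socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the address.
    return sorted({info[4][0] for info in infos})


def zone_instance_names(instance_name: str) -> list:
    """Given one '<name>-sN.<domain>' name, return the s1, s2, and s3 variants."""
    label, _, domain = instance_name.partition(".")
    base, sep, suffix = label.rpartition("-")
    if not sep or not domain or suffix not in ("s1", "s2", "s3"):
        raise ValueError(f"Not a zone instance name: {instance_name!r}")
    return [f"{base}-s{i}.{domain}" for i in (1, 2, 3)]
```

Calling `resolve_ipv4` on each name returned by `zone_instance_names` yields the three sets of addresses to allow in the firewall.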

To troubleshoot network-related issues with Event Hubs, follow these steps:

Browse to https://<yournamespacename>.servicebus.chinacloudapi.cn/ or fetch it with wget. This helps check whether you have IP filtering, virtual network, or certificate chain issues (most common when using the Java SDK).

An example of a successful response:

<feed xmlns="http://www.w3.org/2005/Atom"><title type="text">Publicly Listed Services</title><subtitle type="text">This is the list of publicly-listed services currently available.</subtitle><id>uuid:27fcd1e2-3a99-44b1-8f1e-3e92b52f0171;id=30</id><updated>2019-12-27T13:11:47Z</updated><generator>Service Bus 1.1</generator></feed>

An example of a failure response:

<Error>
    <Code>400</Code>
    <Detail>
        Bad Request. To know more visit https://aka.ms/sbResourceMgrExceptions. . TrackingId:b786d4d1-cbaf-47a8-a3d1-be689cda2a98_G22, SystemTracker:NoSystemTracker, Timestamp:2019-12-27T13:12:40
    </Detail>
</Error>
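
The browse-or-wget check can also be scripted so that the two response shapes shown above are told apart automatically. The sketch below uses only the Python standard library; `classify_response` and `probe_namespace` are hypothetical names, and the classification relies solely on the two sample documents above, so any other body is reported as unknown.

```python
import urllib.error
import urllib.request
import xml.etree.ElementTree as ET

ATOM_FEED_TAG = "{http://www.w3.org/2005/Atom}feed"


def classify_response(body: str) -> str:
    """Return 'ok' for the Atom feed, 'error: ...' for an <Error> document, else 'unknown'."""
    try:
        root = ET.fromstring(body)
    except ET.ParseError:
        return "unknown"
    if root.tag == ATOM_FEED_TAG:
        return "ok"
    if root.tag == "Error":
        code = (root.findtext("Code") or "").strip()
        detail = (root.findtext("Detail") or "").strip()
        return f"error: {code} {detail}"
    return "unknown"


def probe_namespace(namespace: str) -> str:
    """Fetch the namespace root over HTTPS and classify whatever comes back."""
    url = f"https://{namespace}.servicebus.chinacloudapi.cn/"
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            body = resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as exc:
        # Non-2xx responses still carry a body worth inspecting.
        body = exc.read().decode("utf-8", errors="replace")
    except OSError as exc:
        # DNS failures, refused connections, timeouts, and TLS/certificate errors.
        return f"connection failed: {exc}"
    return classify_response(body)
```

A `connection failed` result that mentions certificate verification points at a certificate chain problem, while an `error:` result means the service was reached and rejected the request.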

Troubleshoot transient connectivity issues

If you are experiencing intermittent connectivity issues, go through the following sections for troubleshooting tips.

Use the latest version of the client SDK

Some transient connectivity issues may have been fixed in SDK versions later than the one you are using. Ensure that your applications use the latest version of the client SDKs. The SDKs are continuously improved with new and updated features and bug fixes, so always test with the latest package. Check the release notes for the issues that were fixed and the features that were added or updated.

For information about client SDKs, see the Azure Event Hubs - Client SDKs article.

Run the command to check dropped packets

When there are intermittent connectivity issues, run the following command to check whether any packets are dropped. The command tries to establish 25 TCP connections with the service, one second apart. You can then check how many of them succeeded or failed and see the TCP connection latency. The psping tool is part of the Sysinternals PsTools suite, available for download from Microsoft.

.\psping.exe -n 25 -i 1 -q <yournamespacename>.servicebus.chinacloudapi.cn:5671 -nobanner

If you use other tools such as tnc or ping, you can use equivalent commands.
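
Where psping is not available, for example on Linux or macOS, a rough equivalent can be written with the Python standard library. This is a sketch, not a replacement for psping: `tcp_ping` is a hypothetical name, it measures only the time to complete the TCP handshake, and its defaults mirror the 25 attempts at a 1-second interval used above.

```python
import socket
import time


def tcp_ping(host: str, port: int, count: int = 25, interval: float = 1.0, timeout: float = 5.0) -> dict:
    """Open `count` TCP connections, `interval` seconds apart, and summarise the results."""
    latencies_ms = []
    failures = 0
    for attempt in range(count):
        start = time.perf_counter()
        try:
            with socket.create_connection((host, port), timeout=timeout):
                latencies_ms.append((time.perf_counter() - start) * 1000.0)
        except OSError:
            failures += 1  # refused, timed out, or unresolved
        if attempt < count - 1:
            time.sleep(interval)
    summary = {"sent": count, "succeeded": len(latencies_ms), "failed": failures}
    if latencies_ms:
        summary.update(
            min_ms=min(latencies_ms),
            max_ms=max(latencies_ms),
            avg_ms=sum(latencies_ms) / len(latencies_ms),
        )
    return summary
```

A call such as `tcp_ping("<yournamespacename>.servicebus.chinacloudapi.cn", 5671)` returns the success and failure counts plus minimum, maximum, and average connect latency in milliseconds.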

If the previous steps don't help, obtain a network trace and analyze it using a tool such as Wireshark. Contact Microsoft Support if needed.

Service upgrades/restarts

Transient connectivity issues may occur because of backend service upgrades and restarts. When they occur, you may see the following symptoms:

  • There may be a drop in incoming messages/requests.
  • The log file may contain error messages.
  • The applications may be disconnected from the service for a few seconds.
  • Requests may be momentarily throttled.

If the application code uses the SDK, the retry policy is already built in and active. The application reconnects without significant impact to the application or workflow. Catching these transient errors, backing off, and then retrying the call ensures that your code is resilient to these transient issues.
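
For code paths that do not go through the SDK's built-in retry policy, the catch, back off, and retry pattern can be sketched as follows. This is a generic illustration rather than the SDK's own policy: `retry_with_backoff`, its parameters, and the exponential-backoff-with-jitter schedule are assumptions chosen for the example, and the caller decides which exceptions count as transient.

```python
import random
import time


def retry_with_backoff(operation, is_transient, max_attempts=5,
                       base_delay=0.5, max_delay=30.0, sleep=time.sleep):
    """Call operation(); on a transient error, wait with exponential backoff and retry."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception as exc:
            # Give up on the last attempt or when the error is not transient.
            if attempt == max_attempts or not is_transient(exc):
                raise
            # 0.5 s, 1 s, 2 s, ... capped at max_delay, plus up to 50% random jitter.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(delay + random.uniform(0, delay / 2))
```

The jitter spreads out reconnect attempts from many clients after a service restart, so they do not all retry at the same instant. The `sleep` parameter exists so the delay can be replaced in tests.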

Next steps

See the following articles: