Troubleshoot Azure Virtual Network NAT connectivity

This article helps administrators diagnose and resolve connectivity problems when using Virtual Network NAT.

Problems

To resolve these problems, follow the steps in the following section.

Resolution

SNAT exhaustion

A single NAT gateway resource supports from 64,000 up to 1 million concurrent flows. Each IP address provides 64,000 SNAT ports to the available inventory. You can use up to 16 IP addresses per NAT gateway resource. The SNAT mechanism is described here in more detail.
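The arithmetic behind these limits can be sketched in a few lines of Python. The two constants come from the figures above; the function name `snat_port_inventory` is illustrative and not part of any Azure SDK.

```python
# Figures taken from the paragraph above; the function name is illustrative.
SNAT_PORTS_PER_IP = 64_000
MAX_IPS_PER_NAT_GATEWAY = 16


def snat_port_inventory(ip_count: int) -> int:
    """Return the SNAT port inventory for a NAT gateway with ip_count public IPs."""
    if not 1 <= ip_count <= MAX_IPS_PER_NAT_GATEWAY:
        raise ValueError(f"ip_count must be between 1 and {MAX_IPS_PER_NAT_GATEWAY}")
    return ip_count * SNAT_PORTS_PER_IP


for ips in (1, 4, 16):
    print(f"{ips:>2} IP address(es): {snat_port_inventory(ips):,} SNAT ports")
```

With 16 IP addresses the inventory is 16 x 64,000 = 1,024,000 ports, which is where the "up to 1 million" figure comes from.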

Frequently the root cause of SNAT exhaustion is an anti-pattern in how outbound connectivity is established or managed, or configurable timers that have been changed from their default values. Review this section carefully.

Steps

  1. Check if you have modified the default idle timeout to a value higher than 4 minutes.
  2. Investigate how your application is creating outbound connectivity (for example, through code review or packet capture).
  3. Determine if this activity is expected behavior or whether the application is misbehaving. Use metrics in Azure Monitor to substantiate your findings. Use the "Failed" category for the SNAT Connections metric.
  4. Evaluate if appropriate patterns are followed.
  5. Evaluate if SNAT port exhaustion should be mitigated with additional IP addresses assigned to the NAT gateway resource.

Design patterns

Always take advantage of connection reuse and connection pooling whenever possible. These patterns avoid resource exhaustion problems and result in predictable behavior. Primitives for these patterns can be found in many development libraries and frameworks.
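As a minimal sketch of connection reuse, the following Python example sends several requests over one `http.client` connection instead of opening a new connection per request. The function name `fetch_all` and its parameters are illustrative assumptions, not part of any Azure API, and whether the connection actually stays open also depends on the server honoring HTTP keep-alive.

```python
import http.client


def fetch_all(host: str, paths: list[str], port: int = 443,
              use_tls: bool = True) -> list[int]:
    """Issue one GET per path over a single reused connection; return the status codes."""
    conn_cls = http.client.HTTPSConnection if use_tls else http.client.HTTPConnection
    conn = conn_cls(host, port, timeout=10)
    statuses = []
    try:
        for path in paths:
            conn.request("GET", path)
            response = conn.getresponse()
            response.read()  # drain the body so the connection can be reused
            statuses.append(response.status)
    finally:
        conn.close()  # close explicitly rather than abandoning the flow
    return statuses
```

Each reused connection holds one SNAT port for many requests. Opening a connection per request would instead consume one port per request until the flows are released.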

Solution: Use appropriate patterns and best practices

  • NAT gateway resources have a default TCP idle timeout of 4 minutes. If this setting is changed to a higher value, NAT holds on to flows longer and can put unnecessary pressure on the SNAT port inventory.
  • Atomic requests (one request per connection) are a poor design choice. This anti-pattern limits scale, reduces performance, and decreases reliability. Instead, reuse HTTP/S connections to reduce the number of connections and associated SNAT ports. The application scales better and performs better because of reduced handshakes, overhead, and cryptographic operation cost when using TLS.
  • DNS can introduce many individual flows at volume when the client is not caching the DNS resolver's result. Use caching.
  • UDP flows (for example, DNS lookups) allocate SNAT ports for the duration of the idle timeout. The longer the idle timeout, the higher the pressure on SNAT ports. Use a short idle timeout (for example, 4 minutes).
  • Use connection pools to shape your connection volume.
  • Never silently abandon a TCP flow and rely on TCP timers to clean up the flow. If you don't let TCP explicitly close the connection, state remains allocated at intermediate systems and endpoints and makes SNAT ports unavailable for other connections. This pattern can trigger application failures and SNAT exhaustion.
  • Don't change OS-level TCP close related timer values without expert knowledge of the impact. While the TCP stack will recover, your application performance can be negatively impacted when the endpoints of a connection have mismatched expectations. The desire to change timers is usually a sign of an underlying design problem. Review the following recommendations.

SNAT exhaustion can also be amplified by other anti-patterns in the underlying application. Review these additional patterns and best practices to improve the scale and reliability of your service.

  • Explore the impact of reducing the TCP idle timeout to lower values, including the default idle timeout of 4 minutes, to free up SNAT port inventory earlier.
  • Consider asynchronous polling patterns for long-running operations to free up connection resources for other operations.
  • Long-lived flows (for example, reused TCP connections) should use TCP keepalives or application layer keepalives to avoid intermediate systems timing out. Increasing the idle timeout is a last resort and may not resolve the root cause. A long timeout can cause low-rate failures when the timeout expires and introduce delay and unnecessary failures.
  • Graceful retry patterns should be used to avoid aggressive retries/bursts during transient failure or failure recovery.

Creating a new TCP connection for every HTTP operation (also known as "atomic connections") is an anti-pattern. Atomic connections prevent your application from scaling well and waste resources. Always pipeline multiple operations into the same connection. Your application will benefit in transaction speed and resource costs. When your application uses transport layer encryption (for example, TLS), there's a significant cost associated with the processing of new connections. Review Azure Cloud Design Patterns for additional best practice patterns.
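One common way to implement the graceful retry pattern mentioned above is exponential backoff with jitter. The sketch below is an assumption about how this could look in Python, not a prescribed Azure implementation; the helper name `retry_with_backoff`, its defaults, and the choice of "full jitter" are illustrative.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 30.0,
    retry_on: tuple = (ConnectionError, TimeoutError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation, retrying transient failures with exponential backoff and jitter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return operation()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            # The ceiling doubles per attempt (0.5 s, 1 s, 2 s, ...) up to max_delay.
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            # A random delay below the ceiling spreads retries from many clients
            # so they do not arrive as a synchronized burst.
            sleep(random.uniform(0, ceiling))
    raise AssertionError("unreachable")
```

The randomized delay matters for SNAT: many clients that retry immediately and simultaneously open a burst of new flows exactly when the port inventory is already under pressure.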

Additional possible mitigations

Solution: Scale outbound connectivity as follows:

Scenario | Evidence | Mitigation
You're experiencing contention for SNAT ports and SNAT port exhaustion during periods of high usage. | The "Failed" category for the SNAT Connections metric in Azure Monitor shows transient or persistent failures over time and high connection volume. | Determine if you can add more public IP address resources or public IP prefix resources. This addition allows for up to 16 IP addresses in total on your NAT gateway. It provides more inventory for available SNAT ports (64,000 per IP address) and allows you to scale your scenario further.
You've already assigned 16 IP addresses and are still experiencing SNAT port exhaustion. | An attempt to add another IP address fails. The total number of IP addresses from public IP address resources or public IP prefix resources exceeds 16. | Distribute your application environment across multiple subnets and provide a NAT gateway resource for each subnet. Reevaluate your design patterns to optimize based on the preceding guidance.

Note

It is important to understand why SNAT exhaustion occurs. Make sure you are using the right patterns for scalable and reliable scenarios. Adding more SNAT ports to a scenario without understanding the cause of the demand should be a last resort. If you do not understand why your scenario is applying pressure on the SNAT port inventory, adding more SNAT ports by adding more IP addresses will only delay the same exhaustion failure as your application scales. You may be masking other inefficiencies and anti-patterns.

ICMP ping is failing

Virtual Network NAT supports IPv4 UDP and TCP protocols. ICMP isn't supported and is expected to fail.

Solution: Instead, use TCP connection tests (for example, "TCP ping") and UDP-specific application layer tests to validate end-to-end connectivity.

The following table can be used as a starting point for choosing which tools to use for testing.

Operating system | Generic TCP connection test | TCP application layer test | UDP
Linux | nc (generic connection test) | curl (TCP application layer test) | application specific
Windows | PsPing | PowerShell Invoke-WebRequest | application specific
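Where none of these tools is available, a generic TCP connection test can also be sketched with Python's standard `socket` module. The function name `tcp_ping` is illustrative; it measures only the time to complete the TCP handshake and says nothing about the application layer.

```python
import socket
import time
from typing import Optional


def tcp_ping(host: str, port: int, timeout: float = 5.0) -> Optional[float]:
    """Return the TCP connect time in milliseconds, or None if the connection fails."""
    start = time.perf_counter()
    try:
        # create_connection resolves the name and completes the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return (time.perf_counter() - start) * 1000.0
    except OSError:
        # Covers refused connections, timeouts, and name resolution errors.
        return None
```

For example, `tcp_ping("example.com", 443)` returns a latency in milliseconds when the handshake succeeds and `None` otherwise; the host and port here are placeholders.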

Connectivity failures

Connectivity issues with Virtual Network NAT can be caused by several different issues:

  • permanent failures due to configuration mistakes,
  • transient or persistent SNAT exhaustion of the NAT gateway,
  • transient failures in the Azure infrastructure,
  • transient failures in the path between Azure and the public Internet destination,
  • transient or persistent failures at the public Internet destination.

Use tools like the following to validate connectivity. ICMP ping isn't supported.

Operating system | Generic TCP connection test | TCP application layer test | UDP
Linux | nc (generic connection test) | curl (TCP application layer test) | application specific
Windows | PsPing | PowerShell Invoke-WebRequest | application specific

Configuration

Check your configuration:

  1. Does the NAT gateway resource have at least one public IP resource or one public IP prefix resource? You must have at least one IP address associated with the NAT gateway for it to be able to provide outbound connectivity.
  2. Is the virtual network's subnet configured to use the NAT gateway?
  3. Are you using UDR (user-defined route) and are you overriding the destination? NAT gateway resources become the default route (0/0) on configured subnets.

SNAT exhaustion

Review the section on SNAT exhaustion in this article.

Azure infrastructure

Azure monitors and operates its infrastructure with great care. Transient failures can still occur, and there's no guarantee that transmissions are lossless. Use design patterns that allow for SYN retransmissions for TCP applications. Use connection timeouts large enough to permit TCP SYN retransmission to reduce transient impacts caused by a lost SYN packet.

Solution:

  • Check for SNAT exhaustion.
  • The configuration parameter in a TCP stack that controls the SYN retransmission behavior is called RTO (Retransmission Time-Out). The RTO value is adjustable but is typically 1 second or higher by default, with exponential back-off. If your application's connection timeout is too short (for example, 1 second), you may see sporadic connection timeouts. Increase the application connection timeout.
  • If you observe longer, unexpected timeouts with default application behaviors, open a support case for further troubleshooting.

We don't recommend artificially reducing the TCP connection timeout or tuning the RTO parameter.
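To see why a 1-second connection timeout leaves no room for a SYN retransmission, the back-off schedule can be modeled in a few lines of Python. This is a simplified model that assumes an initial RTO of 1 second that doubles after every lost SYN; real TCP stacks differ in their initial RTO and retry limits, and the function name `syn_send_times` is illustrative.

```python
def syn_send_times(connect_timeout: float, initial_rto: float = 1.0) -> list[float]:
    """Return the times (seconds after the first SYN) at which a SYN is sent
    before connect_timeout expires, assuming the RTO doubles after each loss."""
    if initial_rto <= 0:
        raise ValueError("initial_rto must be positive")
    times = []
    elapsed, rto = 0.0, initial_rto
    while elapsed < connect_timeout:
        times.append(elapsed)
        elapsed += rto  # wait one RTO before retransmitting
        rto *= 2        # exponential back-off
    return times


for timeout in (1.0, 5.0, 10.0):
    print(f"timeout {timeout:>4} s -> SYNs sent at {syn_send_times(timeout)}")
```

Under these assumptions a 1-second timeout allows only the initial SYN, so a single lost packet fails the connection attempt, while a 5-second timeout allows two retransmissions (at 1 and 3 seconds).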

Public Internet transit

The chance of transient failures increases with a longer path to the destination and more intermediate systems. Transient failures are expected to be more frequent on this path than within the Azure infrastructure.

Follow the same guidance as in the preceding Azure infrastructure section.

Internet endpoint

The guidance in the previous sections applies, along with considerations for the Internet endpoint that communication is established with. Other factors that can impact connectivity success are:

  • traffic management on the destination side, including
      • API rate limiting imposed by the destination side
      • volumetric DDoS mitigations or transport layer traffic shaping
  • firewall or other components at the destination

Usually packet captures at the source and the destination (if available) are required to determine what is taking place.

Solution:

  • Check for SNAT exhaustion.
  • Validate connectivity to an endpoint in the same region or elsewhere for comparison.
  • If you're creating high volume or transaction rate testing, explore whether reducing the rate reduces the occurrence of failures.
  • If changing the rate impacts the rate of failures, check if API rate limits or other constraints on the destination side might have been reached.
  • If your investigation is inconclusive, open a support case for further troubleshooting.

TCP Resets received

The NAT gateway generates TCP resets on the source VM for traffic that isn't recognized as in progress.

One possible reason is that the TCP connection has idle timed out. You can adjust the idle timeout from 4 minutes to up to 120 minutes.

TCP resets aren't generated on the public side of NAT gateway resources. TCP resets on the destination side are generated by the source VM, not the NAT gateway resource.

Solution:

  • Review the design patterns recommendations.
  • Open a support case for further troubleshooting if necessary.
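The design pattern recommendations suggest TCP keepalives for long-lived flows so that idle connections are not timed out and reset. A minimal sketch of enabling them on a Python socket follows; the helper name `enable_tcp_keepalive` and the timer values are assumptions chosen only so that the first probe is sent well inside the default 4-minute idle timeout, and the per-socket tuning options are platform specific.

```python
import socket


def enable_tcp_keepalive(sock: socket.socket, idle_seconds: int = 120,
                         interval_seconds: int = 30, probe_count: int = 4) -> None:
    """Turn on TCP keepalive and, where the platform exposes them, tune its timers."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The tuning constants below are not defined on every platform, so each is guarded.
    if hasattr(socket, "TCP_KEEPIDLE"):
        # Idle time before the first keepalive probe is sent.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_seconds)
    if hasattr(socket, "TCP_KEEPINTVL"):
        # Interval between probes that go unanswered.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval_seconds)
    if hasattr(socket, "TCP_KEEPCNT"):
        # Number of unanswered probes before the connection is dropped.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probe_count)
```

Many HTTP and database client libraries expose their own keepalive settings; where they do, those settings are usually preferable to configuring raw sockets.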

IPv6 coexistence

Virtual Network NAT supports IPv4 UDP and TCP protocols. Deployment on a subnet with an IPv6 prefix isn't supported.

Solution: Deploy NAT gateway on a subnet without an IPv6 prefix.

You can indicate interest in additional capabilities through Virtual Network NAT UserVoice.

Connection doesn't originate from NAT gateway IP(s)

You configure the NAT gateway, the IP address(es) to use, and which subnet should use the NAT gateway resource. However, connections from virtual machine instances that existed before the NAT gateway was deployed don't use the IP address(es). They appear to be using IP address(es) not associated with the NAT gateway resource.

Solution:

Virtual Network NAT replaces the outbound connectivity for the subnet it is configured on. When transitioning from default SNAT or load balancer outbound SNAT to using NAT gateways, new connections immediately begin using the IP address(es) associated with the NAT gateway resource. However, if a virtual machine still has an established connection during the switch to the NAT gateway resource, the connection continues using the old SNAT IP address that was assigned when the connection was established. Make sure you are really establishing a new connection rather than reusing a connection that already existed because the OS or the browser was caching connections in a connection pool. For example, when using curl in PowerShell, make sure to specify the -DisableKeepalive parameter to force a new connection. If you're using a browser, connections may also be pooled.
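The same check can be sketched in Python: open a fresh connection for each request and ask the server to close it afterwards, so that no pooled connection is reused. The function name `fetch_over_new_connection` is illustrative, and the sketch assumes you have access to some HTTP endpoint that echoes the caller's source IP address; no specific service is implied.

```python
import http.client


def fetch_over_new_connection(host: str, path: str = "/", port: int = 443,
                              use_tls: bool = True, timeout: float = 10.0) -> str:
    """Fetch path over a brand-new connection and return the response body as text."""
    conn_cls = http.client.HTTPSConnection if use_tls else http.client.HTTPConnection
    conn = conn_cls(host, port, timeout=timeout)  # a new TCP connection on every call
    try:
        # "Connection: close" asks the server not to keep the connection alive.
        conn.request("GET", path, headers={"Connection": "close"})
        return conn.getresponse().read().decode("utf-8", errors="replace").strip()
    finally:
        conn.close()
```

If the endpoint returns the caller's public IP address, the value returned after the switch should match one of the IP addresses associated with the NAT gateway resource.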

It's not necessary to reboot a virtual machine when configuring a subnet for a NAT gateway resource. However, if a virtual machine is rebooted, the connection state is flushed. When the connection state has been flushed, all connections begin using the NAT gateway resource's IP address(es). This is a side effect of the virtual machine being rebooted and not an indicator that a reboot is required.

If you are still having trouble, open a support case for further troubleshooting.

Next steps