TCP/IP performance tuning for Azure VMs

This article discusses common TCP/IP performance tuning techniques and some things to consider when you use them for virtual machines running on Azure. It provides a basic overview of the techniques and explores how they can be tuned.

Common TCP/IP tuning techniques

MTU, fragmentation, and large send offload

MTU

The maximum transmission unit (MTU) is the largest size frame (packet), specified in bytes, that can be sent over a network interface. The MTU is a configurable setting. The default MTU used on Azure VMs, and the default setting on most network devices globally, is 1,500 bytes.
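If you want to confirm the MTU a Windows VM is currently using, one option is the Get-NetIPInterface cmdlet. This is a minimal sketch; the property and parameter names are from memory and the interface alias is a placeholder, so verify them on your own OS build:

# View the MTU (NlMtu) configured on each IPv4 interface
Get-NetIPInterface -AddressFamily IPv4 | Format-Table InterfaceAlias, NlMtu

# Example of changing it (not recommended on Azure VMs); "Ethernet" is a placeholder alias
Set-NetIPInterface -InterfaceAlias "Ethernet" -NlMtuBytes 1500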

Fragmentation

Fragmentation occurs when a packet is sent that exceeds the MTU of a network interface. The TCP/IP stack will break the packet into smaller pieces (fragments) that conform to the interface's MTU. Fragmentation occurs at the IP layer and is independent of the underlying protocol (such as TCP). When a 2,000-byte packet is sent over a network interface with an MTU of 1,500, the packet will be broken down into one 1,500-byte packet and one 500-byte packet.

Network devices in the path between a source and destination can either drop packets that exceed the MTU or fragment the packet into smaller pieces.

The Don't Fragment bit in an IP packet

The Don't Fragment (DF) bit is a flag in the IP protocol header. The DF bit indicates that network devices on the path between the sender and receiver must not fragment the packet. This bit could be set for many reasons. (See the "Path MTU Discovery" section of this article for one example.) When a network device receives a packet with the Don't Fragment bit set, and that packet exceeds the device's interface MTU, the standard behavior is for the device to drop the packet. The device sends an ICMP Fragmentation Needed message back to the original source of the packet.

Performance implications of fragmentation

Fragmentation can have negative performance implications. One of the main reasons for the effect on performance is the CPU/memory impact of the fragmentation and reassembly of packets. When a network device needs to fragment a packet, it will have to allocate CPU/memory resources to perform fragmentation.

The same thing happens when the packet is reassembled. The network device has to store all the fragments until they're received so it can reassemble them into the original packet. This process of fragmentation and reassembly can also cause latency.

The other possible negative performance implication of fragmentation is that fragmented packets might arrive out of order. When packets are received out of order, some types of network devices can drop them. When that happens, the whole packet has to be retransmitted.

Fragments are typically dropped by security devices like network firewalls, or when a network device's receive buffers are exhausted. When its receive buffers are exhausted, a network device attempts to reassemble a fragmented packet but doesn't have the resources to store and reassemble it.

Fragmentation can be seen as a negative operation, but support for fragmentation is necessary when you're connecting diverse networks over the internet.

Benefits and consequences of modifying the MTU

Generally speaking, you can create a more efficient network by increasing the MTU. Every packet that's transmitted has header information that's added to the original packet. When fragmentation creates more packets, there's more header overhead, and that makes the network less efficient.

Here's an example. The Ethernet header size is 14 bytes plus a 4-byte frame check sequence to ensure frame consistency. If one 2,000-byte packet is sent, 18 bytes of Ethernet overhead is added on the network. If the packet is fragmented into a 1,500-byte packet and a 500-byte packet, each packet will have 18 bytes of Ethernet header, a total of 36 bytes.

Keep in mind that increasing the MTU won't necessarily create a more efficient network. If an application sends only 500-byte packets, the same header overhead will exist whether the MTU is 1,500 bytes or 9,000 bytes. The network will become more efficient only if it uses larger packet sizes that are affected by the MTU.

Azure and VM MTU

The default MTU for Azure VMs is 1,500 bytes. The Azure Virtual Network stack will attempt to fragment a packet at 1,400 bytes.

Note that the Virtual Network stack isn't inherently inefficient because it fragments packets at 1,400 bytes even though VMs have an MTU of 1,500. A large percentage of network packets are much smaller than 1,400 or 1,500 bytes.

Azure and fragmentation

The Azure Virtual Network stack is set up to drop "out of order fragments," that is, fragmented packets that don't arrive in their original fragmented order. These packets are dropped mainly because of a network security vulnerability announced in November 2018 called FragmentSmack.

FragmentSmack is a defect in the way the Linux kernel handled reassembly of fragmented IPv4 and IPv6 packets. A remote attacker could use this flaw to trigger expensive fragment reassembly operations, which could lead to increased CPU usage and a denial of service on the target system.

Tune the MTU

You can configure an Azure VM MTU, as you can in any other operating system. But you should consider the fragmentation that occurs in Azure, described above, when you're configuring an MTU.

We don't encourage customers to increase VM MTUs. This discussion is meant to explain the details of how Azure implements MTU and performs fragmentation.

Important

Increasing MTU isn't known to improve performance and could have a negative effect on application performance.

Large send offload

Large send offload (LSO) can improve network performance by offloading the segmentation of packets to the Ethernet adapter. When LSO is enabled, the TCP/IP stack creates a large TCP packet and sends it to the Ethernet adapter for segmentation before forwarding it. The benefit of LSO is that it can free the CPU from segmenting packets into sizes that conform to the MTU and offload that processing to the Ethernet interface, where it's performed in hardware. To learn more about the benefits of LSO, see Supporting large send offload.

When LSO is enabled, Azure customers might see large frame sizes when they perform packet captures. These large frame sizes might lead some customers to think fragmentation is occurring or that a large MTU is being used when it's not. With LSO, the Ethernet adapter can advertise a larger maximum segment size (MSS) to the TCP/IP stack to create a larger TCP packet. This entire non-segmented frame is then forwarded to the Ethernet adapter and would be visible in a packet capture performed on the VM. But the packet will be broken down into many smaller frames by the Ethernet adapter, according to the Ethernet adapter's MTU.
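To check whether LSO is enabled on a Windows VM's adapters, the NetAdapter module exposes it. This is a minimal sketch; output details vary by driver, and the adapter name is a placeholder:

# Show large send offload (LSO) state per adapter for IPv4 and IPv6
Get-NetAdapterLso

# Example of disabling LSO on one adapter for troubleshooting; "Ethernet" is a placeholder name
Disable-NetAdapterLso -Name "Ethernet"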

TCP MSS window scaling and PMTUD

TCP maximum segment size

TCP maximum segment size (MSS) is a setting that limits the size of TCP segments, which avoids fragmentation of TCP packets. Operating systems will typically use this formula to set MSS:

MSS = MTU - (IP header size + TCP header size)

The IP header and the TCP header are 20 bytes each, or 40 bytes total. So an interface with an MTU of 1,500 will have an MSS of 1,460. But the MSS is configurable.

This setting is agreed to in the TCP three-way handshake when a TCP session is set up between a source and a destination. Both sides send an MSS value, and the lower of the two is used for the TCP connection.

Keep in mind that the MTUs of the source and destination aren't the only factors that determine the MSS value. Intermediary network devices, like VPN gateways, including Azure VPN Gateway, can adjust the MTU independently of the source and destination to ensure optimal network performance.

Path MTU Discovery

MSS is negotiated, but it might not indicate the actual MSS that can be used. This is because other network devices in the path between the source and the destination might have a lower MTU value than the source and destination. In this case, the device whose MTU is smaller than the packet will drop the packet. The device will send back an ICMP Fragmentation Needed (Type 3, Code 4) message that contains its MTU. This ICMP message allows the source host to reduce its path MTU appropriately. The process is called Path MTU Discovery (PMTUD).

The PMTUD process is inefficient and affects network performance. When packets are sent that exceed a network path's MTU, the packets need to be retransmitted with a lower MSS. If the sender doesn't receive the ICMP Fragmentation Needed message, maybe because of a network firewall in the path (commonly referred to as a PMTUD blackhole), the sender doesn't know it needs to lower the MSS and will continuously retransmit the packet. This is why we don't recommend increasing the Azure VM MTU.
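As a quick, informal way to probe the largest payload that fits a path without fragmentation, you can send pings with the Don't Fragment flag set. The command below is the Windows form (runnable from PowerShell); the destination address is a placeholder, and ICMP must be allowed end to end for the result to mean anything:

# 1,472 bytes of ICMP payload + 28 bytes of ICMP/IP headers = 1,500 bytes on the wire
ping 10.0.0.10 -f -l 1472

# If this fails with "Packet needs to be fragmented but DF set,"
# lower -l until the ping succeeds to approximate the path MTU.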

VPN and MTU

If you use VMs that perform encapsulation (like IPsec VPNs), there are some additional considerations regarding packet size and MTU. VPNs add more headers to packets, which increases the packet size and requires a smaller MSS.

For Azure, we recommend that you set TCP MSS clamping to 1,350 bytes and tunnel interface MTU to 1,400. For more information, see the VPN devices and IPSec/IKE parameters page.

Latency, round-trip time, and TCP window scaling

Latency and round-trip time

Network latency is governed by the speed of light over a fiber optic network. Network throughput of TCP is also effectively governed by the round-trip time (RTT) between two network devices.

Route | Distance | One-way time | RTT
Beijing to Shanghai | 1,080 km | 5.4 ms | 10.8 ms

This table shows the straight-line distance between two locations. In networks, the distance is typically longer than the straight-line distance. Here's a simple formula to calculate minimum RTT as governed by the speed of light:

minimum RTT = 2 * (Distance in kilometers / Speed of propagation)

You can use 200 for the speed of propagation. This is the distance, in kilometers, that light travels in 1 millisecond.

Let's take Beijing to Shanghai as an example. The straight-line distance is 1,080 km. Plugging that value into the equation, we get the following:

Minimum RTT = 2 * (1,080 / 200) = 10.8

The output of the equation is in milliseconds.
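The same arithmetic in PowerShell, as a minimal sketch (the distance and propagation constant are the example values above, not measured figures):

# Theoretical minimum RTT from straight-line distance only
$distanceKm = 1080          # Beijing to Shanghai, straight line
$kmPerMs    = 200           # approximate distance light travels in fiber in 1 ms
$minRttMs   = 2 * ($distanceKm / $kmPerMs)
$minRttMs                   # 10.8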

If you want to get the best network performance, the logical option is to select destinations with the shortest distance between them. You should also design your virtual network to optimize the path of traffic and reduce latency. For more information, see the "Network design considerations" section of this article.

Latency and round-trip time effects on TCP

Round-trip time has a direct effect on maximum TCP throughput. In the TCP protocol, window size is the maximum amount of traffic that can be sent over a TCP connection before the sender needs to receive acknowledgement from the receiver. If the TCP MSS is set to 1,460 and the TCP window size is set to 65,535, the sender can send 45 packets before it has to receive acknowledgement from the receiver. If the sender doesn't get acknowledgement, it will retransmit the data. Here's the formula:

TCP window size / TCP MSS = packets sent

In this example, 65,535 / 1,460 is rounded up to 45.

This "waiting for acknowledgement" state, a mechanism to ensure reliable delivery of data, is what causes RTT to affect TCP throughput. The longer the sender waits for acknowledgement, the longer it needs to wait before sending more data.

Here's the formula for calculating the maximum throughput of a single TCP connection:

Window size / (RTT latency in milliseconds / 1,000) = maximum bytes/second

This table shows the maximum throughput of a single TCP connection in megabytes per second. (For readability, megabytes is used as the unit of measure.)

TCP window size (bytes) | RTT latency (ms) | Maximum megabyte/second throughput | Maximum megabit/second throughput
65,535 | 1 | 65.54 | 524.29
65,535 | 30 | 2.18 | 17.48
65,535 | 60 | 1.09 | 8.74
65,535 | 90 | 0.73 | 5.83
65,535 | 120 | 0.55 | 4.37
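These table values follow directly from the formula above. Here's a minimal PowerShell sketch that reproduces them, using decimal megabytes and megabits to match the table:

# Maximum single-connection TCP throughput for a given window size and RTT
function Get-TcpMaxThroughput {
    param([long]$WindowBytes, [double]$RttMs)
    $bytesPerSecond = $WindowBytes / ($RttMs / 1000)
    [pscustomobject]@{
        WindowBytes = $WindowBytes
        RttMs       = $RttMs
        MBps        = [math]::Round($bytesPerSecond / 1e6, 2)
        Mbps        = [math]::Round($bytesPerSecond * 8 / 1e6, 2)
    }
}

Get-TcpMaxThroughput -WindowBytes 65535 -RttMs 30   # 2.18 MB/s, 17.48 Mbps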

If packets are lost, the maximum throughput of a TCP connection will be reduced while the sender retransmits data it has already sent.

TCP window scaling

TCP window scaling is a technique that dynamically increases the TCP window size to allow more data to be sent before an acknowledgement is required. In the previous example, 45 packets would be sent before an acknowledgement was required. If you increase the number of packets that can be sent before an acknowledgement is needed, you're reducing the number of times a sender is waiting for acknowledgement, which increases the TCP maximum throughput.

This table illustrates those relationships:

TCP window size (bytes) | RTT latency (ms) | Maximum megabyte/second throughput | Maximum megabit/second throughput
65,535 | 30 | 2.18 | 17.48
131,070 | 30 | 4.37 | 34.95
262,140 | 30 | 8.74 | 69.91
524,280 | 30 | 17.48 | 139.81

But the TCP header value for TCP window size is only 2 bytes long, which means the maximum value for a receive window is 65,535. To increase the maximum window size, a TCP window scale factor was introduced.

The scale factor is also a setting that you can configure in an operating system. Here's the formula for calculating the TCP window size by using scale factors:

TCP window size = TCP window size in bytes * (2^scale factor)

Here's the calculation for a window scale factor of 3 and a window size of 65,535:

65,535 * (2^3) = 262,140 bytes

A scale factor of 14 (the maximum offset allowed) results in a TCP window size of 65,535 * (2^14) = 1,073,725,440 bytes (about 8.5 gigabits).
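A quick way to verify that figure in PowerShell (a one-off arithmetic check, nothing more):

# Maximum window size with the largest allowed scale factor (14)
65535 * [math]::Pow(2, 14)   # 1073725440 bytes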

Support for TCP window scaling

Windows can set different scaling factors for different connection types. (Classes of connections include datacenter, internet, and so on.) You can use the Get-NetTCPConnection PowerShell command to view the window scaling connection type:

Get-NetTCPConnection

You can use the Get-NetTCPSetting PowerShell command to view the values of each class:

Get-NetTCPSetting

You can set the initial TCP window size and TCP scaling factor in Windows by using the Set-NetTCPSetting PowerShell command. For more information, see Set-NetTCPSetting.

Set-NetTCPSetting

These are the effective TCP settings for AutoTuningLevel:

AutoTuningLevel | Scaling factor | Scaling multiplier | Formula to calculate maximum window size
Disabled | None | None | Window size
Restricted | 4 | 2^4 | Window size * (2^4)
Highly restricted | 2 | 2^2 | Window size * (2^2)
Normal | 8 | 2^8 | Window size * (2^8)
Experimental | 14 | 2^14 | Window size * (2^14)
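As an illustration of how these cmdlets fit together, the sketch below inspects the current auto-tuning level and then sets it on the custom internet template. The SettingName and AutoTuningLevelLocal values shown are assumptions based on common Windows builds; confirm them against your own Get-NetTCPSetting output before making changes:

# Inspect the auto-tuning level configured for each TCP setting template
Get-NetTCPSetting | Select-Object SettingName, AutoTuningLevelLocal

# Example: set the custom internet template to Normal (scale factor 8)
Set-NetTCPSetting -SettingName InternetCustom -AutoTuningLevelLocal Normal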

These settings are the most likely to affect TCP performance, but keep in mind that many other factors across the internet, outside the control of Azure, can also affect TCP performance.

Increase MTU size

Because a larger MTU means a larger MSS, you might wonder whether increasing the MTU can increase TCP performance. Probably not. There are pros and cons to packet size beyond just TCP traffic. As discussed earlier, the most important factors affecting TCP throughput performance are TCP window size, packet loss, and RTT.

Important

We don't recommend that Azure customers change the default MTU value on virtual machines.

Accelerated networking and receive side scaling

Accelerated networking

Virtual machine network functions have historically been CPU intensive on both the guest VM and the hypervisor/host. Every packet that transits through the host is processed in software by the host CPU, including all virtual network encapsulation and decapsulation. So the more traffic that goes through the host, the higher the CPU load. And if the host CPU is busy with other operations, that will also affect network throughput and latency. Azure addresses this issue with accelerated networking.

Accelerated networking provides consistent ultralow network latency via the in-house programmable hardware of Azure and technologies like SR-IOV. Accelerated networking moves much of the Azure software-defined networking stack off the CPUs and into FPGA-based SmartNICs. This change enables end-user applications to reclaim compute cycles, which puts less load on the VM, decreasing jitter and inconsistency in latency. In other words, performance can be more deterministic.

Accelerated networking improves performance by allowing the guest VM to bypass the host and establish a datapath directly with a host's SmartNIC. Here are some benefits of accelerated networking:

  • Lower latency / higher packets per second (pps): Removing the virtual switch from the datapath eliminates the time packets spend in the host for policy processing and increases the number of packets that can be processed in the VM.

  • Reduced jitter: Virtual switch processing depends on the amount of policy that needs to be applied and the workload of the CPU that's doing the processing. Offloading the policy enforcement to the hardware removes that variability by delivering packets directly to the VM, eliminating the host-to-VM communication and all software interrupts and context switches.

  • Decreased CPU utilization: Bypassing the virtual switch in the host leads to less CPU utilization for processing network traffic.

To use accelerated networking, you need to explicitly enable it on each applicable VM. See Create a Linux virtual machine with Accelerated Networking for instructions.
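The linked article is the authoritative walkthrough. As a rough sketch of what enabling the feature on an existing NIC can look like with Azure PowerShell (the Az.Network module is assumed, the resource group and NIC names are placeholders, and the VM must be a supported size and handled per the official guidance, which may require deallocating it first):

# Enable accelerated networking on an existing network interface
$nic = Get-AzNetworkInterface -ResourceGroupName "myResourceGroup" -Name "myNic"
$nic.EnableAcceleratedNetworking = $true
$nic | Set-AzNetworkInterface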

Receive side scaling

Receive side scaling (RSS) is a network driver technology that distributes the receiving of network traffic more efficiently by distributing receive processing across multiple CPUs in a multiprocessor system. In simple terms, RSS allows a system to process more received traffic because it uses all available CPUs instead of just one. For a more technical discussion of RSS, see Introduction to receive side scaling.

To get the best performance when accelerated networking is enabled on a VM, you need to enable RSS. RSS can also provide benefits on VMs that don't use accelerated networking. For an overview of how to determine if RSS is enabled and how to enable it, see Optimize network throughput for Azure virtual machines.
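On Windows, the NetAdapter cmdlets give a quick view of RSS state. A minimal sketch; the adapter name is a placeholder and the linked article remains the full procedure:

# Check whether RSS is enabled per adapter
Get-NetAdapterRss | Format-Table Name, Enabled

# Enable RSS on a specific adapter if it's disabled; "Ethernet" is a placeholder name
Enable-NetAdapterRss -Name "Ethernet"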

TCP TIME_WAIT and TIME_WAIT assassination

TCP TIME_WAIT is another common setting that affects network and application performance. On busy VMs that are opening and closing many sockets, either as clients or as servers (Source IP:Source Port + Destination IP:Destination Port), during the normal operation of TCP, a given socket can end up in a TIME_WAIT state for a long time. The TIME_WAIT state is meant to allow any additional data to be delivered on a socket before closing it. So TCP/IP stacks generally prevent the reuse of a socket by silently dropping the client's TCP SYN packet.

The amount of time a socket is in TIME_WAIT is configurable. It could range from 30 seconds to 240 seconds. Sockets are a finite resource, and the number of sockets that can be used at any given time is configurable. (The number of available sockets is typically about 30,000.) If the available sockets are consumed, or if clients and servers have mismatched TIME_WAIT settings, and a VM tries to reuse a socket in a TIME_WAIT state, new connections will fail as TCP SYN packets are silently dropped.

The value for the port range for outbound sockets is usually configurable within the TCP/IP stack of an operating system. The same thing is true for TCP TIME_WAIT settings and socket reuse. Changing these numbers can potentially improve scalability. But, depending on the situation, these changes could cause interoperability issues. You should be careful if you change these values.
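As an illustration only, the sketch below uses classic Windows knobs for these settings; treat the value names as assumptions and validate them against current documentation before changing anything in production:

# View the dynamic (outbound) port range for each TCP setting template
Get-NetTCPSetting | Select-Object SettingName, DynamicPortRangeStartPort, DynamicPortRangeNumberOfPorts

# Classic registry value that shortens how long closed sockets linger in TIME_WAIT (seconds);
# a reboot is required and newer Windows versions may ignore or constrain it
Set-ItemProperty -Path 'HKLM:\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters' -Name 'TcpTimedWaitDelay' -Value 30 -Type DWord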

You can use TIME_WAIT assassination to address this scaling limitation. TIME_WAIT assassination allows a socket to be reused in certain situations, like when the sequence number in the IP packet of the new connection exceeds the sequence number of the last packet from the previous connection. In this case, the operating system will allow the new connection to be established (it will accept the new SYN/ACK) and force close the previous connection that was in a TIME_WAIT state. This capability is supported on Windows VMs in Azure. To learn about support in other VMs, check with the OS vendor.

To learn about configuring TCP TIME_WAIT settings and source port range, see Settings that can be modified to improve network performance.

Virtual network factors that can affect performance

VM maximum outbound throughput

Azure provides a variety of VM sizes and types, each with a different mix of performance capabilities. One of these capabilities is network throughput (or bandwidth), which is measured in megabits per second (Mbps). Because virtual machines are hosted on shared hardware, the network capacity needs to be shared fairly among the virtual machines using the same hardware. Larger virtual machines are allocated more bandwidth than smaller virtual machines.

The network bandwidth allocated to each virtual machine is metered on egress (outbound) traffic from the virtual machine. All network traffic leaving the virtual machine is counted toward the allocated limit, regardless of destination. For example, if a virtual machine has a 1,000-Mbps limit, that limit applies whether the outbound traffic is destined for another virtual machine in the same virtual network or one outside of Azure.

Ingress is not metered or limited directly. But there are other factors, like CPU and storage limits, that can affect a virtual machine's ability to process incoming data.

Accelerated networking is designed to improve network performance, including latency, throughput, and CPU utilization. Accelerated networking can improve a virtual machine's throughput, but it can do that only up to the virtual machine's allocated bandwidth.

Azure virtual machines have at least one network interface attached to them. They might have several. The bandwidth allocated to a virtual machine is the sum of all outbound traffic across all network interfaces attached to the machine. In other words, the bandwidth is allocated on a per-virtual machine basis, regardless of how many network interfaces are attached to the machine.

Expected outbound throughput and the number of network interfaces supported by each VM size are detailed in Sizes for Windows virtual machines in Azure. To see maximum throughput, select a type, like General purpose, and then find the section about the size series on the resulting page (for example, "Dv2-series"). For each series, there's a table that provides networking specifications in the last column, which is titled "Max NICs / Expected network bandwidth (Mbps)."

The throughput limit applies to the virtual machine. Throughput is not affected by these factors:

  • Number of network interfaces: The bandwidth limit applies to the sum of all outbound traffic from the virtual machine.

  • Accelerated networking: Though this feature can be helpful in achieving the published limit, it doesn't change the limit.

  • Traffic destination: All destinations count toward the outbound limit.

  • Protocol: All outbound traffic over all protocols counts toward the limit.

For more information, see Virtual machine network bandwidth.

Internet performance considerations

As discussed throughout this article, factors on the internet and outside the control of Azure can affect network performance. Here are some of those factors:

  • Latency: The round-trip time between two destinations can be affected by issues on intermediate networks, by traffic that doesn't take the "shortest" distance path, and by suboptimal peering paths.

  • Packet loss: Packet loss can be caused by network congestion, physical path issues, and underperforming network devices.

  • MTU size/fragmentation: Fragmentation along the path can lead to delays in data arrival or in packets arriving out of order, which can affect the delivery of packets.

Traceroute is a good tool for measuring network performance characteristics (like packet loss and latency) along every network path between a source device and a destination device.
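On Windows you can get a comparable hop-by-hop view from PowerShell. A minimal sketch; the hostname is a placeholder:

# Trace the route to a destination and report basic reachability
Test-NetConnection -ComputerName contoso.com -TraceRoute

# The classic tracert tool works as well
tracert contoso.com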

Network design considerations

Along with the considerations discussed earlier in this article, the topology of a virtual network can affect the network's performance. For example, a hub-and-spoke design that backhauls traffic globally to a single-hub virtual network will introduce network latency, which will affect overall network performance.

The number of network devices that network traffic passes through can also affect overall latency. For example, in a hub-and-spoke design, if traffic passes through a spoke network virtual appliance and a hub virtual appliance before transiting to the internet, the network virtual appliances can introduce latency.

Azure regions, virtual networks, and latency

Azure regions are made up of multiple datacenters that exist within a general geographic area. These datacenters might not be physically next to each other. In some cases they're separated by as much as 10 kilometers. The virtual network is a logical overlay on top of the Azure physical datacenter network. A virtual network doesn't imply any specific network topology within the datacenter.

For example, two VMs that are in the same virtual network and subnet might be in different racks, rows, or even datacenters. They could be separated by feet of fiber optic cable or by kilometers of fiber optic cable. This variation could introduce variable latency (a few milliseconds difference) between different VMs.

The geographic placement of VMs, and the potential resulting latency between two VMs, can be influenced by the configuration of availability sets. But the distance between datacenters in a region is region-specific and primarily influenced by datacenter topology in the region.

Source NAT port exhaustion

A deployment in Azure can communicate with endpoints outside of Azure on the public internet and/or in the public IP space. When an instance initiates an outbound connection, Azure dynamically maps the private IP address to a public IP address. After Azure creates this mapping, return traffic for the outbound originated flow can also reach the private IP address where the flow originated.

For every outbound connection, the Azure Load Balancer needs to maintain this mapping for some period of time. With the multitenant nature of Azure, maintaining this mapping for every outbound flow for every VM can be resource intensive. So there are limits that are set based on the configuration of the Azure virtual network. Or, to say that more precisely, an Azure VM can only make a certain number of outbound connections at a given time. When these limits are reached, the VM won't be able to make more outbound connections.

But this behavior is configurable. For more information about SNAT and SNAT port exhaustion, see this article.

Measure network performance on Azure

A number of the performance maximums in this article are related to the network latency / round-trip time (RTT) between two VMs. This section provides some suggestions for how to test latency/RTT and how to test TCP performance and VM network performance. You can tune and performance test the TCP/IP and network values discussed earlier by using the techniques described in this section. You can plug latency, MTU, MSS, and window size values into the calculations provided earlier and compare theoretical maximums to actual values that you observe during testing.

Measure round-trip time and packet loss

TCP performance relies heavily on RTT and packet loss. The PING utility available in Windows and Linux provides the easiest way to measure RTT and packet loss. The output of PING will show the minimum/maximum/average latency between a source and destination. It will also show packet loss. PING uses the ICMP protocol by default. You can use PsPing to test TCP RTT. For more information, see PsPing.
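For example, runnable from PowerShell (the address and port are placeholders; PsPing is a separate Sysinternals download, so confirm its switches against its own help output):

# ICMP RTT and packet loss over 20 probes
ping 10.0.0.10 -n 20

# TCP "ping" against a listening port, which avoids ICMP deprioritization on the path
psping -n 20 10.0.0.10:443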

Measure actual throughput of a TCP connection

NTttcp is a tool for testing the TCP performance of a Linux or Windows VM. You can change various TCP settings and then test the benefits by using NTttcp. For more information, see these resources:
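A typical test runs a receiver on one VM and a sender on another. The sketch below uses the Windows NTttcp command form; the receiver IP and the thread/CPU mapping are placeholders, so check the tool's own documentation for the options that match your build:

# On the receiving VM (10.0.0.10 is a placeholder for its IP)
ntttcp.exe -r -m 8,*,10.0.0.10 -t 60

# On the sending VM, pointing at the receiver
ntttcp.exe -s -m 8,*,10.0.0.10 -t 60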

Measure actual bandwidth of a virtual machine

You can test the performance of different VM types, accelerated networking, and so on, by using a tool called iPerf. iPerf is also available on Linux and Windows. iPerf can use TCP or UDP to test overall network throughput. iPerf TCP throughput tests are influenced by the factors discussed in this article (like latency and RTT). So UDP might yield better results if you just want to test maximum throughput.
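A minimal iperf3 sketch (the server IP is a placeholder, and the port must be open between the VMs):

# On the server VM
iperf3 -s

# On the client VM: 30-second TCP test with 4 parallel streams
iperf3 -c 10.0.0.10 -t 30 -P 4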

For more information, see these articles:

Detect inefficient TCP behaviors

In packet captures, Azure customers might see TCP packets with TCP flags (SACK, DUP ACK, RETRANSMIT, and FAST RETRANSMIT) that could indicate network performance problems. These packets specifically indicate network inefficiencies that result from packet loss. But packet loss isn't necessarily caused by Azure performance problems. Performance problems could be the result of application problems, operating system problems, or other problems that might not be directly related to the Azure platform.

Also, keep in mind that some retransmission and duplicate ACKs are normal on a network. TCP protocols were built to be reliable. Evidence of these TCP packets in a packet capture doesn't necessarily indicate a systemic network problem, unless they're excessive.

Still, these packet types are indications that TCP throughput isn't achieving its maximum performance, for reasons discussed in other sections of this article.

Next steps

Now that you've learned about TCP/IP performance tuning for Azure VMs, you might want to read about other considerations for planning virtual networks or learn more about connecting and configuring virtual networks.