Troubleshooting network performance

Overview

Azure provides stable and fast ways to connect your on-premises network to Azure. Customers large and small successfully use methods like Site-to-Site VPN and ExpressRoute to run their businesses in Azure. But what happens when performance doesn't meet your expectations or previous experience? This document can help standardize the way you test and baseline your specific environment.

This document shows how you can easily and consistently test network latency and bandwidth between two hosts. It also provides some advice on ways to look at the Azure network and isolate problem points. The PowerShell script and tools discussed require two hosts on the network, one at each end of the link being tested. One host must be a Windows Server or Desktop; the other can be either Windows or Linux.

Note

The approach to troubleshooting, and the tools and methods used, are matters of personal preference. This document describes the approach and tools I often use. Your approach will probably differ, and there's nothing wrong with different approaches to problem solving. However, if you don't have an established approach, this document can get you started on building your own methods, tools, and preferences for troubleshooting network issues.

Network components

Before digging into troubleshooting, let's discuss some common terms and components. This discussion ensures we're thinking about each component in the end-to-end chain that enables connectivity in Azure.

(Diagram: the Azure network, the Internet or WAN, and the corporate network, with the components that connect them)

At the highest level, I describe three major network routing domains:

  • the Azure network (blue cloud on the right)
  • the Internet or WAN (green cloud in the center)
  • the Corporate Network (peach cloud on the left)

Looking at the diagram from right to left, let's briefly discuss each component:

  • Virtual Machine - The server may have multiple NICs. Ensure any static routes, default routes, and operating system settings are sending and receiving traffic the way you think they are. Also, each VM SKU has a bandwidth restriction. If you're using a smaller VM SKU, your traffic is limited by the bandwidth available to the NIC. I usually use a DS5v2 for testing (and delete it once testing is done to save money) to ensure adequate bandwidth at the VM.
  • NIC - Ensure you know the private IP that is assigned to the NIC in question.
  • NIC NSG - There may be specific NSGs applied at the NIC level. Ensure the NSG rule-set is appropriate for the traffic you're trying to pass. For example, ensure ports 5201 for iPerf, 3389 for RDP, or 22 for SSH are open to allow test traffic to pass.
  • VNet Subnet - The NIC is assigned to a specific subnet. Ensure you know which one and the rules associated with that subnet.
  • Subnet NSG - Just like the NIC, the subnet can have NSGs applied. Ensure the NSG rule-set is appropriate for the traffic you're trying to pass. (For traffic inbound to the NIC, the subnet NSG applies first, then the NIC NSG. Conversely, for traffic outbound from the VM, the NIC NSG applies first, then the subnet NSG.)
  • Subnet UDR - User Defined Routes can direct traffic to an intermediate hop (like a firewall or load balancer). Ensure you know whether a UDR is in place for your traffic and, if so, where it goes and what that next hop will do to your traffic. (For example, a firewall could pass some traffic and deny other traffic between the same two hosts.)
  • Gateway subnet / NSG / UDR - Just like the VM subnet, the gateway subnet can have NSGs and UDRs. Make sure you know whether they are there and what effects they have on your traffic.
  • VPN Gateway (ExpressRoute) - Once peering (ExpressRoute) or VPN is enabled, there aren't many settings that can affect how or whether traffic routes. If you have multiple ExpressRoute circuits or VPN tunnels connected to the same VPN Gateway, be aware of the connection weight settings, because this setting affects connection preference and the path your traffic takes.
  • Route Filter (not shown) - A route filter only applies to Microsoft Peering on ExpressRoute, but it's critical to check if you're not seeing the routes you expect on Microsoft Peering.
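
The NSG ordering described in the Subnet NSG item above can be sketched in a few lines of Python. This is only an illustrative model, not an Azure API: the function names and the two boolean inputs (`subnet_nsg_allows`, `nic_nsg_allows`) are hypothetical stand-ins for the outcome of each rule-set for a single flow.

```python
def nsg_evaluation_order(direction):
    """Return the order in which the two NSGs are applied for a direction."""
    if direction == "inbound":
        return ["subnet", "nic"]   # subnet NSG first, then NIC NSG
    if direction == "outbound":
        return ["nic", "subnet"]   # NIC NSG first, then subnet NSG
    raise ValueError("direction must be 'inbound' or 'outbound'")


def traffic_passes(direction, subnet_nsg_allows, nic_nsg_allows):
    """Return (passed, blocked_at) for one flow.

    The two booleans are hypothetical inputs: whether each NSG's
    rule-set allows this particular flow.
    """
    verdicts = {"subnet": subnet_nsg_allows, "nic": nic_nsg_allows}
    for nsg in nsg_evaluation_order(direction):
        if not verdicts[nsg]:
            return False, nsg      # the first NSG that denies stops the flow
    return True, None


# Example: inbound iPerf traffic allowed at the subnet but denied at the NIC.
print(traffic_passes("inbound", subnet_nsg_allows=True, nic_nsg_allows=False))
```

The practical point is that both NSGs must allow the flow, so a rule opened at only one level still leaves test traffic blocked.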

At this point, you're on the WAN portion of the link. This routing domain can be your service provider, your corporate WAN, or the Internet. The many hops, technologies, and companies involved with these links can make them somewhat difficult to troubleshoot. Often, you work to rule out both Azure and your corporate network first, before jumping into this collection of companies and hops.

In the preceding diagram, your corporate network is on the far left. Depending on the size of your company, this routing domain can be a few network devices between you and the WAN, or multiple layers of devices in a campus/enterprise network.

Given the complexities of these three different high-level network environments, it's often best to start at the edges and try to show where performance is good and where it degrades. This approach can help identify which of the three routing domains contains the problem, so you can focus your troubleshooting on that specific environment.

Tools

Most network issues can be analyzed and isolated using basic tools like ping and traceroute. It's rare that you need to go as deep as packet analysis with a tool like Wireshark. To help with troubleshooting, the Azure Connectivity Toolkit (AzureCT) was developed to put some of these tools in an easy package. For performance testing, I like to use iPerf and PSPing. iPerf is a commonly used tool that runs on most operating systems. It's good for basic performance tests and is fairly easy to use. PSPing is a ping tool developed by SysInternals. It's an easy way to perform both ICMP and TCP pings with a single, simple command. Both of these tools are lightweight and are "installed" simply by copying the files to a directory on the host.

I've wrapped all of these tools and methods into a PowerShell module (AzureCT) that you can install and use.
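
To make the idea of a TCP ping concrete, here is a minimal Python sketch that times how long a TCP connection takes to open. It is not PSPing and does not reproduce its output; the function name `tcp_ping` and the default timeout are my own assumptions for illustration.

```python
import socket
import time


def tcp_ping(host, port, timeout=2.0):
    """Return the TCP connect time in milliseconds, or None on failure.

    A rough stand-in for a TCP ping: it measures how long the
    connection takes to open, then closes it immediately.
    """
    start = time.perf_counter()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return (time.perf_counter() - start) * 1000.0
    except OSError:
        # Refused, unreachable, or timed out: the port is not answering.
        return None
```

For example, `tcp_ping("10.0.0.1", 5201)` (a placeholder address and the iPerf port) returns a latency in milliseconds when the port is reachable and `None` when an NSG, firewall, or stopped service is in the way.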

AzureCT - the Azure Connectivity Toolkit

The AzureCT PowerShell module has two components: Availability Testing and Performance Testing. This document is only concerned with Performance Testing, so let's focus on the two Link Performance commands in this PowerShell module.

There are three basic steps to use this toolkit for performance testing: 1) install the PowerShell module, 2) install the supporting applications iPerf and PSPing, and 3) run the performance test.

  1. Install the PowerShell module

    (new-object Net.WebClient).DownloadString("https://aka.ms/AzureCT") | Invoke-Expression
    
    

    This command downloads the PowerShell module and installs it locally.

  2. Install the supporting applications

    Install-LinkPerformance
    

    This AzureCT command installs iPerf and PSPing in a new directory, "C:\ACTTools". It also opens the Windows Firewall ports to allow ICMP and port 5201 (iPerf) traffic.

  3. Run the performance test

    First, install iPerf on the remote host and run it in server mode. Also ensure the remote host is listening on either 3389 (RDP for Windows) or 22 (SSH for Linux) and allows traffic on port 5201 for iPerf. If the remote host is Windows, install AzureCT and run the Install-LinkPerformance command to set up iPerf and the firewall rules needed to start iPerf in server mode successfully.

    Once the remote machine is ready, open PowerShell on the local machine and start the test:

    Get-LinkPerformance -RemoteHost 10.0.0.1 -TestSeconds 10
    

    This command runs a series of concurrent load and latency tests to help estimate the bandwidth capacity and latency of your network link.

  4. Review the output of the tests

    The PowerShell output format looks similar to:

    (Screenshot: sample Get-LinkPerformance output)

    The detailed results of all the iPerf and PSPing tests are in individual text files in the AzureCT tools directory at "C:\ACTTools".

Troubleshooting

If the performance test isn't giving you the results you expect, figuring out why should be a progressive, step-by-step process. Given the number of components in the path, a systematic approach generally provides a faster path to resolution than jumping around and potentially repeating the same tests needlessly.

Note

The scenario here is a performance issue, not a connectivity issue. The steps would be different if traffic wasn't passing at all.

First, challenge your assumptions. Is your expectation reasonable? For instance, if you have a 1-Gbps ExpressRoute circuit and 100 ms of latency, it's unreasonable to expect a single TCP flow to carry the full 1 Gbps of traffic, given the performance characteristics of TCP over high-latency links.
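
A quick back-of-the-envelope calculation shows why. The sketch below uses the simple bound that a single TCP flow can't exceed its window size divided by the round-trip time; it ignores packet loss and congestion control, so treat it as an upper limit rather than a prediction. The function names are my own, and the 65,535-byte window is the classic maximum without TCP window scaling.

```python
def max_tcp_throughput_bps(window_bytes, rtt_seconds):
    """Upper bound for one TCP flow: one window of data per round trip."""
    return window_bytes * 8 / rtt_seconds


def required_window_bytes(link_bps, rtt_seconds):
    """Bandwidth-delay product: bytes in flight needed to fill the link."""
    return link_bps * rtt_seconds / 8


rtt = 0.100  # 100 ms of latency, as in the example above

# A 65,535-byte window (no window scaling) over 100 ms:
print(f"{max_tcp_throughput_bps(65535, rtt) / 1e6:.2f} Mbps")  # about 5.24 Mbps

# Window needed for a single flow to fill a 1-Gbps circuit at 100 ms:
print(f"{required_window_bytes(1e9, rtt) / 1e6:.2f} MB")       # 12.50 MB
```

In other words, at 100 ms a single flow needs roughly 12.5 MB in flight to fill the circuit, which is why multiple parallel flows or larger windows are usually needed on long-haul links.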

Next, I recommend starting at the edges between routing domains and trying to isolate the problem to a single major routing domain: the Corporate Network, the WAN, or the Azure Network. People often blame the "black box" in the path. Blaming the black box is easy to do, but it may significantly delay resolution, especially if the problem is actually in an area where you have the ability to make changes. Make sure you do your due diligence before handing off to your service provider or ISP.

Once you've identified the major routing domain that appears to contain the problem, create a diagram of the area in question. Whether it's on a whiteboard, in a notepad, or in Visio, a diagram provides a concrete "battle map" that allows a methodical approach to further isolating the problem. You can plan testing points, and update the map as you clear areas or dig deeper as the testing progresses.

Now that you have a diagram, start to divide the network into segments and narrow the problem down. Find out where it works and where it doesn't. Keep moving your testing points until you isolate the offending component.
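
The narrowing-down step can be sketched as a small helper. This is a hypothetical illustration, not part of AzureCT: it assumes you measured throughput from the same source to a series of test points ordered from nearest to farthest, and the point names and numbers in the example are made up.

```python
def first_degraded_segment(measurements, min_mbps):
    """Locate the segment where throughput first falls below min_mbps.

    measurements: list of (test_point, throughput_mbps) tuples measured
    from the same source, ordered from nearest to farthest.
    Returns (last_good_point, first_bad_point), or None if every
    test point meets the threshold.
    """
    last_good = "source"
    for point, mbps in measurements:
        if mbps < min_mbps:
            return last_good, point
        last_good = point
    return None


# Made-up results for illustration only.
results = [
    ("core-switch", 940),
    ("edge-router", 910),
    ("provider-edge", 320),
    ("azure-vm", 300),
]
print(first_degraded_segment(results, min_mbps=800))
# ('edge-router', 'provider-edge')
```

The returned pair marks the segment to put on your map next: everything up to the last good point can be cleared, and new test points go between the two.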

Also, don't forget to look at other layers of the OSI model. It's easy to focus on the network and layers 1 - 3 (the Physical, Data Link, and Network layers), but the problem can also be up at Layer 7, the application layer. Keep an open mind and verify assumptions.

Advanced ExpressRoute troubleshooting

If you're not sure where the edge of the cloud actually is, isolating the Azure components can be a challenge. When ExpressRoute is used, the edge is a network component called the Microsoft Enterprise Edge (MSEE). The MSEE is the first point of contact into Microsoft's network, and the last hop leaving the Microsoft network. When you create a connection object between your VPN Gateway and the ExpressRoute circuit, you're actually making a connection to the MSEE. Recognizing the MSEE as the first or last hop (depending on which direction you're going) is crucial to isolating Azure network problems, so you can either prove the issue is in Azure or show that it's further downstream in the WAN or the corporate network.

(Diagram: VNets A and B connected to the same ExpressRoute circuit through the MSEE)

Note

Notice that the MSEE isn't in the Azure cloud. ExpressRoute is at the edge of the Microsoft network, not in Azure itself. Once you're connected with ExpressRoute to an MSEE, you're connected to Microsoft's network, and from there you can go to any of the cloud services, like Azure (with Private Peering).

If two VNets (VNets A and B in the diagram) are connected to the same ExpressRoute circuit, you can perform a series of tests to isolate the problem in Azure (or prove it's not in Azure).

Test plan

  1. Run the Get-LinkPerformance test between VM1 and VM2. This test provides insight into whether the problem is local. If this test produces acceptable latency and bandwidth results, you can mark the local VNet network as good.
  2. Assuming the local VNet traffic is good, run the Get-LinkPerformance test between VM1 and VM3. This test exercises the connection through the Microsoft network down to the MSEE and back into Azure. If this test produces acceptable latency and bandwidth results, you can mark the Azure network as good.
  3. If Azure is ruled out, you can perform a similar sequence of tests on your corporate network. For example, run this test between two branch offices, or between your desk and a data center server. Depending on what you're testing, find endpoints (servers, PCs, etc.) that can exercise that path. If the corporate network also tests well, it's time to work with your service provider or ISP to diagnose your WAN connection.

Important

For each test, it's critical that you mark the time of day you run the test and record the results in a common location (I like OneNote or Excel). Each test run should have identical output so you can compare the resulting data across test runs and not have "holes" in the data. Consistency across multiple tests is the primary reason I use AzureCT for troubleshooting. The magic isn't in the exact load scenarios I run, but in the fact that I get a consistent test and data output from each and every test. Recording the time and having consistent data every single time is especially helpful if you later find that the issue is sporadic. Be diligent with your data collection up front and you'll avoid hours of retesting the same scenarios (I learned this the hard way many years ago).

The problem is isolated, now what?

The more you can isolate the problem, the easier it is to fix. However, you often reach a point where you can't go any deeper or further with your troubleshooting. That's when you should reach out for help. Who you ask depends on the routing domain you isolated the issue to, or, even better, on the specific component if you were able to narrow it down that far.

For corporate network issues, your internal IT department or the service provider supporting your network (which may be the hardware manufacturer) may be able to help with device configuration or hardware repair.

For the WAN, sharing your testing results with your service provider or ISP may help get them started and avoid covering some of the same ground you've already tested. However, don't be offended if they want to verify your results themselves. "Trust but verify" is a good motto when troubleshooting based on other people's reported results.

With Azure, once you've isolated the issue in as much detail as you can, it's time to review the Azure network documentation and then, if still needed, open a support ticket.

Next steps

  1. Download the Azure Connectivity Toolkit from GitHub at http://aka.ms/AzCT
  2. Follow the instructions for link performance testing