Standard Load Balancer diagnostics with metrics

Azure Standard Load Balancer exposes the following diagnostic capability:

  • Multi-dimensional metrics: Provides multi-dimensional diagnostic capabilities through Azure Monitor for Standard Load Balancer configurations. You can monitor, manage, and troubleshoot your Standard Load Balancer resources.

This article gives a quick tour of this capability and shows ways to use it with Standard Load Balancer.

Multi-dimensional metrics

Azure Load Balancer provides multi-dimensional metrics through Azure Metrics in the Azure portal. These metrics give you real-time diagnostic insights into your load balancer resources.

The various Standard Load Balancer configurations provide the following metrics:

Metric | Resource type | Description | Recommended aggregation
Data path availability (VIP availability) | Public and internal load balancer | Standard Load Balancer continuously exercises the data path from within a region to the load balancer front end, all the way to the SDN stack that supports your VM. As long as healthy instances remain, the measurement follows the same path as your application's load-balanced traffic. The data path that your customers use is also validated. The measurement is invisible to your application and does not interfere with other operations. | Average
Health probe status (DIP availability) | Public and internal load balancer | Standard Load Balancer uses a distributed health-probing service that monitors your application endpoint's health according to your configuration settings. This metric provides an aggregate or per-endpoint filtered view of each instance endpoint in the load balancer pool. You can see how Load Balancer views the health of your application, as indicated by your health probe configuration. | Average
SYN (synchronize) packets | Public and internal load balancer | Standard Load Balancer does not terminate Transmission Control Protocol (TCP) connections or interact with TCP or UDP packet flows. Flows and their handshakes are always between the source and the VM instance. To better troubleshoot your TCP protocol scenarios, you can use the SYN packet counters to understand how many TCP connection attempts are made. The metric reports the number of TCP SYN packets that were received. | Sum
SNAT connections | Public load balancer | Standard Load Balancer reports the number of outbound flows that are masqueraded to the public IP address front end. Source network address translation (SNAT) ports are an exhaustible resource. This metric can indicate how heavily your application relies on SNAT for outbound originated flows. Counters for successful and failed outbound SNAT flows are reported and can be used to troubleshoot and understand the health of your outbound flows. | Sum
Byte counters | Public and internal load balancer | Standard Load Balancer reports the data processed per front end. | Sum
Packet counters | Public and internal load balancer | Standard Load Balancer reports the packets processed per front end. | Sum

View your load balancer metrics in the Azure portal

The Azure portal exposes the load balancer metrics on the Metrics page, which is available from both the load balancer resource page for a particular resource and the Azure Monitor page.

To view the metrics for your Standard Load Balancer resources:

  1. Go to the Azure Monitor page and select the load balancer resource.

  2. Set the appropriate aggregation type.

  3. Optionally, configure the required filtering and grouping.

    Figure: Data Path Availability metric for Standard Load Balancer

Retrieve multi-dimensional metrics programmatically via APIs

For API guidance on retrieving multi-dimensional metric definitions and values, see the Azure Monitoring REST API walkthrough.
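As a rough illustration of what such a request can look like, the following Python sketch only builds the request URL for the Azure Monitor metrics endpoint; it does not authenticate or send anything. The API version, the REST metric names in `ASSUMED_METRIC_NAMES`, and the resource ID are assumptions that are not stated in this article. Confirm them against the metric definitions returned for your own load balancer before relying on them.

```python
from urllib.parse import quote, urlencode

ARM_ENDPOINT = "https://management.azure.com"
API_VERSION = "2018-01-01"  # assumption: pick the version the REST walkthrough recommends

# Assumption: REST names for the metrics described in this article.
# Verify them with the metricDefinitions endpoint for your resource.
ASSUMED_METRIC_NAMES = {
    "Data path availability": "VipAvailability",
    "Health probe status": "DipAvailability",
    "SYN packets": "SYNCount",
    "SNAT connections": "SnatConnectionCount",
    "Byte counters": "ByteCount",
    "Packet counters": "PacketCount",
}


def build_metrics_url(resource_id, metric_name, aggregation, timespan,
                      interval="PT1M", dimension_filter=None):
    """Build (but do not send) a metrics query URL for one resource."""
    params = {
        "api-version": API_VERSION,
        "metricnames": metric_name,
        "aggregation": aggregation,
        "timespan": timespan,   # ISO 8601 "start/end"
        "interval": interval,   # ISO 8601 duration
    }
    if dimension_filter:
        params["$filter"] = dimension_filter
    # Keep "$" readable in the key; everything else is percent-encoded.
    query = urlencode(params, safe="$", quote_via=quote)
    return f"{ARM_ENDPOINT}{resource_id}/providers/Microsoft.Insights/metrics?{query}"


# Hypothetical resource ID, for illustration only.
EXAMPLE_RESOURCE_ID = (
    "/subscriptions/00000000-0000-0000-0000-000000000000"
    "/resourceGroups/myResourceGroup"
    "/providers/Microsoft.Network/loadBalancers/myLoadBalancer"
)

if __name__ == "__main__":
    print(build_metrics_url(
        EXAMPLE_RESOURCE_ID,
        ASSUMED_METRIC_NAMES["Data path availability"],
        "Average",
        "2024-01-01T00:00:00Z/2024-01-01T01:00:00Z",
    ))
```

A real call also needs an Azure Active Directory bearer token in the Authorization header, which is outside the scope of this sketch.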

Is the data path up and available for my load balancer VIP?

The VIP availability metric describes the health of the data path within the region to the compute host where your VMs are located. The metric reflects the health of the Azure infrastructure. You can use the metric to:

  • Monitor the external availability of your service.
  • Dig deeper and understand whether the platform on which your service is deployed is healthy, or whether your guest OS or application instance is healthy.
  • Isolate whether an event is related to your service or to the underlying data plane.

Do not confuse this metric with the health probe status ("DIP availability").

To get the Data Path Availability for your Standard Load Balancer resources:

  1. Make sure the correct load balancer resource is selected.
  2. In the Metric drop-down list, select Data Path Availability.
  3. In the Aggregation drop-down list, select Avg.
  4. Additionally, add a filter on the Frontend IP address or Frontend port dimension, set it to the required front-end IP address or front-end port, and then group by the selected dimension.

Figure: Load Balancer Frontend probing details
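When the same filtering and grouping is done through the REST API instead of the portal, the dimensions are expressed as a filter string. The sketch below assembles one in the `name eq 'value'` form used by Azure Monitor, where `'*'` asks for one time series per dimension value (the equivalent of grouping). The dimension names `FrontendIPAddress` and `FrontendPort` are assumptions; check the metric definitions for the exact names.

```python
def build_dimension_filter(dimensions):
    """Join dimension constraints into a filter expression.

    `dimensions` maps a dimension name to a value; use "*" to split the
    result by that dimension instead of pinning it to one value.
    """
    clauses = []
    for name, value in dimensions.items():
        # Single quotes inside a value are doubled, OData style.
        escaped = str(value).replace("'", "''")
        clauses.append(f"{name} eq '{escaped}'")
    return " and ".join(clauses)


if __name__ == "__main__":
    # Hypothetical front-end address; port is split out per value.
    print(build_dimension_filter({
        "FrontendIPAddress": "10.0.0.4",
        "FrontendPort": "*",
    }))
```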

The metric is generated by an active, in-band measurement. A probing service within the region originates traffic for the measurement. The service is activated as soon as you create a deployment with a public front end, and it continues until you remove the front end.

A packet matching your deployment's front end and rule is generated periodically. It traverses the region from the source to the host where a VM in the back-end pool is located. The load balancer infrastructure performs the same load balancing and translation operations as it does for all other traffic. This probe is in-band on your load-balanced endpoint. After the probe arrives on the compute host where a healthy VM in the back-end pool is located, the compute host generates a response to the probing service. Your VM does not see this traffic.

VIP availability fails for the following reasons:

  • Your deployment has no healthy VMs remaining in the back-end pool.
  • An infrastructure outage has occurred.

For diagnostic purposes, you can use the Data Path Availability metric together with the health probe status.

Use Average as the aggregation for most scenarios.
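Once the averaged values have been retrieved, periods of reduced availability can be picked out with a few lines of code. The following is a minimal sketch that assumes the samples are already available as `(timestamp, average)` pairs in time order; the sample values in the usage section are invented for illustration.

```python
def find_availability_dips(samples, threshold=100.0):
    """Return (first_ts, last_ts) pairs for consecutive samples below threshold.

    `samples` is an iterable of (timestamp, average) pairs in time order.
    A value of None (no data for that interval) is not counted as a dip.
    """
    dips = []
    run_start = None
    prev_ts = None
    for ts, value in samples:
        below = value is not None and value < threshold
        if below and run_start is None:
            run_start = ts                      # a new dip begins
        elif not below and run_start is not None:
            dips.append((run_start, prev_ts))   # the dip ended at the previous sample
            run_start = None
        prev_ts = ts
    if run_start is not None:
        dips.append((run_start, prev_ts))       # dip still open at the end of the data
    return dips


if __name__ == "__main__":
    # Invented data: one-minute averages of data path availability (percent).
    data = [("00:00", 100.0), ("00:01", 0.0), ("00:02", 50.0),
            ("00:03", 100.0), ("00:04", 0.0)]
    print(find_availability_dips(data))
```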

Are the back-end instances for my VIP responding to probes?

The health probe status metric describes the health of your application deployment, as defined by the health probe you configure on your load balancer. The load balancer uses the status of the health probe to determine where to send new flows. Health probes originate from an Azure infrastructure address and are visible within the guest OS of the VM.

To get the health probe status for your Standard Load Balancer resources:

  1. Select the Health Probe Status metric with the Avg aggregation type.
  2. Apply a filter on the required Frontend IP address or port (or both).

Health probes fail for the following reasons:

  • You configure a health probe to a port that is not listening, is not responding, or is using the wrong protocol. If your service uses direct server return (DSR, or floating IP) rules, make sure that the service is listening on the IP address of the NIC's IP configuration and not just on the loopback that's configured with the front-end IP address.
  • Your probe is not permitted by the network security group, the VM's guest OS firewall, or the application layer filters.

Use Average as the aggregation for most scenarios.
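To rule out the first failure reason, it can help to check from the VM itself (or from another VM in the same virtual network) whether anything accepts TCP connections on the probed port. The sketch below is a plain TCP connect test using the Python standard library. It is not the load balancer's probe and it does not exercise HTTP probes, network security groups, or the Azure probe source address. The host and port in the usage section are placeholders.

```python
import socket


def tcp_port_is_listening(host, port, timeout=2.0):
    """Return True if a TCP connection to (host, port) can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # Placeholder values: for DSR (floating IP) rules, test the NIC's
    # IP configuration address, not only the loopback address.
    print(tcp_port_is_listening("127.0.0.1", 80))
```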

How do I check my outbound connection statistics?

The SNAT connections metric describes the volume of successful and failed connections for outbound flows.

A failed connections volume of greater than zero indicates SNAT port exhaustion. You must investigate further to determine what may be causing these failures. SNAT port exhaustion manifests as a failure to establish an outbound flow. Review the article about outbound connections to understand the scenarios and mechanisms at work, and to learn how to mitigate and design to avoid SNAT port exhaustion.

To get SNAT connection statistics:

  1. Select the SNAT Connections metric type and Sum as the aggregation.
  2. Group by Connection State to show successful and failed SNAT connection counts as separate lines.

Figure: Load Balancer SNAT connection count
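The "failed count greater than zero" rule is easy to automate once the per-state sums are available. The sketch below assumes the counts have already been grouped by connection state into lists of per-interval sums. The state labels `"Successful"` and `"Failed"` are assumptions for illustration, so pass the label your data actually uses; the numbers are invented.

```python
def summarize_snat(sums_by_state, failed_state="Failed"):
    """Total SNAT connection counts per state and flag any failures.

    `sums_by_state` maps a connection-state label to a list of per-interval
    sums; None entries (intervals without data) are ignored.
    """
    totals = {
        state: sum(p for p in points if p is not None)
        for state, points in sums_by_state.items()
    }
    failed = totals.get(failed_state, 0)
    attempted = sum(totals.values())
    return {
        "totals": totals,
        "failed": failed,
        "failure_ratio": failed / attempted if attempted else 0.0,
        # Per the guidance above: any failed connection warrants investigation.
        "possible_port_exhaustion": failed > 0,
    }


if __name__ == "__main__":
    # Invented per-minute sums.
    print(summarize_snat({"Successful": [10, 20, None], "Failed": [0, 3, 0]}))
```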

How do I check inbound/outbound connection attempts for my service?

The SYN packets metric describes the volume of TCP SYN packets that arrived or were sent (for outbound flows) and that are associated with a specific front end. You can use this metric to understand TCP connection attempts to your service.

Use Total as the aggregation for most scenarios.

Figure: Load Balancer SYN count

How do I check my network bandwidth consumption?

The bytes and packet counters metrics describe the volume of bytes and packets that your service sends or receives, on a per-front-end basis.

Use Total as the aggregation for most scenarios.

To get byte or packet count statistics:

  1. Select the Bytes Count and/or Packet Count metric type, with Sum as the aggregation.
  2. Do either of the following:
    • Apply a filter on a specific front-end IP, front-end port, back-end IP, or back-end port.
    • Get overall statistics for your load balancer resource without any filtering.

Figure: Load Balancer byte count
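Byte totals per interval translate directly into throughput: multiply by 8 to get bits and divide by the interval length in seconds. The sketch below assumes the byte counts were retrieved with a total (sum) aggregation at a fixed interval; the sample numbers are invented.

```python
def throughput_mbps(byte_totals, interval_seconds=60):
    """Convert per-interval byte totals to megabits per second.

    Uses 1 Mbps = 1,000,000 bits per second. None entries (no data)
    are passed through unchanged.
    """
    return [
        None if total is None else (total * 8) / interval_seconds / 1_000_000
        for total in byte_totals
    ]


if __name__ == "__main__":
    # 75,000,000 bytes in one minute is 600,000,000 bits / 60 s = 10 Mbps.
    print(throughput_mbps([75_000_000, None, 150_000_000]))
```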

How do I diagnose my load balancer deployment?

By combining the VIP availability and health probe metrics on a single chart, you can identify where to look for a problem and resolve it. You can gain assurance that Azure is working correctly and use this knowledge to conclusively determine that the configuration or application is the root cause.

You can use health probe metrics to understand how Azure views the health of your deployment, based on the configuration you have provided. Looking at health probes is always a great first step in monitoring or determining a cause.

You can take it a step further and use VIP availability metrics to gain insight into how Azure views the health of the underlying data plane that's responsible for your specific deployment. When you combine both metrics, you can isolate where the fault might be, as illustrated in this example:

Figure: Combining Data Path Availability and Health Probe Status metrics

The chart displays the following information:

  • The infrastructure hosting your VMs was unavailable and at 0 percent at the beginning of the chart. Later, the infrastructure was healthy and the VMs were reachable, and more than one VM was placed in the back end. This information is indicated by the blue trace for data path availability (VIP availability), which was later at 100 percent.
  • The health probe status (DIP availability), indicated by the purple trace, is at 0 percent at the beginning of the chart. The area circled in green highlights where the health probe status (DIP availability) became healthy, at which point the customer's deployment was able to accept new flows.

The chart allows customers to troubleshoot the deployment on their own without having to guess or ask support whether other issues are occurring. The service was unavailable because health probes were failing due to either a misconfiguration or a failed application.
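The reasoning behind combining the two metrics can be written down as a small decision rule. The sketch below is one simplified interpretation of the guidance in this article, not an official algorithm: it takes the averaged VIP availability and DIP availability (both in percent) for the same time window and returns a label for where to look first. The labels and the 100 percent threshold are choices made for this example.

```python
def locate_fault(vip_availability, dip_availability, healthy=100.0):
    """Suggest where to look first, given two averaged percentages.

    Returns one of:
      "no_healthy_backends"     - every probe fails; check the probe port and
                                  protocol, NSG, guest firewall, and application.
                                  VIP availability is expected to fail as well.
      "data_path_degraded"      - instances pass probes but the data path is
                                  impaired; points toward the infrastructure.
      "some_backends_unhealthy" - the data path is fine, but part of the pool
                                  fails its probes.
      "healthy"                 - both metrics are at the healthy level.
    """
    if dip_availability <= 0.0:
        return "no_healthy_backends"
    if vip_availability < healthy:
        return "data_path_degraded"
    if dip_availability < healthy:
        return "some_backends_unhealthy"
    return "healthy"


if __name__ == "__main__":
    # Start of the example chart: both traces at 0 percent.
    print(locate_fault(0.0, 0.0))
    # Later in the chart: both traces at 100 percent.
    print(locate_fault(100.0, 100.0))
```

Checking the health probe status first mirrors the advice above that health probes are the first thing to look at.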

Next steps