Standard Load Balancer diagnostics with metrics and alerts

Azure Standard Load Balancer exposes the following diagnostic capabilities:

  • Multi-dimensional metrics and alerts: Provides multi-dimensional diagnostic capabilities through Azure Monitor for Standard Load Balancer configurations. You can monitor, manage, and troubleshoot your Standard Load Balancer resources.

This article provides a quick tour of these capabilities and shows ways to use them with Standard Load Balancer.

Multi-dimensional metrics

Azure Load Balancer provides multi-dimensional metrics through Azure Metrics in the Azure portal, which help you get real-time diagnostic insights into your load balancer resources.

The various Standard Load Balancer configurations provide the following metrics:

| Metric | Resource type | Description | Recommended aggregation |
| --- | --- | --- | --- |
| Data path availability (VIP availability) | Public and internal load balancer | Standard Load Balancer continuously exercises the data path from within a region to the load balancer front end, all the way to the SDN stack that supports your VM. As long as healthy instances remain, the measurement follows the same path as your application's load-balanced traffic. The data path that your customers use is also validated. The measurement is invisible to your application and does not interfere with other operations. | Average |
| Health probe status (DIP availability) | Public and internal load balancer | Standard Load Balancer uses a distributed health-probing service that monitors your application endpoint's health according to your configuration settings. This metric provides an aggregate or per-endpoint filtered view of each instance endpoint in the load balancer pool. You can see how Load Balancer views the health of your application, as indicated by your health probe configuration. | Average |
| SYN (synchronize) packets | Public and internal load balancer | Standard Load Balancer does not terminate Transmission Control Protocol (TCP) connections or interact with TCP or UDP packet flows. Flows and their handshakes are always between the source and the VM instance. To better troubleshoot your TCP protocol scenarios, you can use the SYN packet counters to understand how many TCP connection attempts are made. The metric reports the number of TCP SYN packets that were received. | Average |
| SNAT connections | Public load balancer | Standard Load Balancer reports the number of outbound flows that are masqueraded to the public IP address front end. Source network address translation (SNAT) ports are an exhaustible resource. This metric can indicate how heavily your application relies on SNAT for outbound originated flows. Counters for successful and failed outbound SNAT flows are reported and can be used to troubleshoot and understand the health of your outbound flows. | Average |
| Allocated SNAT ports | Public load balancer | Standard Load Balancer reports the number of SNAT ports allocated per backend instance. | Average |
| Used SNAT ports | Public load balancer | Standard Load Balancer reports the number of SNAT ports that are utilized per backend instance. | Average |
| Byte counters | Public and internal load balancer | Standard Load Balancer reports the data processed per front end. You may notice that the bytes are not distributed equally across the backend instances. This is expected because Azure's Load Balancer algorithm is based on flows. | Average |
| Packet counters | Public and internal load balancer | Standard Load Balancer reports the packets processed per front end. | Average |

View your load balancer metrics in the Azure portal

The Azure portal exposes the load balancer metrics on the Metrics page, which is available from both the load balancer resource page for a particular resource and the Azure Monitor page.

To view the metrics for your Standard Load Balancer resources:

  1. Go to the Metrics page and do either of the following:
    • On the load balancer resource page, select the metric type in the drop-down list.
    • On the Azure Monitor page, select the load balancer resource.
  2. Set the appropriate metric aggregation type.
  3. Optionally, configure the required filtering and grouping.
  4. Optionally, configure the time range and aggregation. By default, time is displayed in UTC.

Note

Time aggregation is important when interpreting certain metrics because data is sampled once per minute. If time aggregation is set to five minutes and the metric aggregation type Sum is used for a metric such as SNAT Allocation, the graph displays five times the total allocated SNAT ports.
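The effect described in the note can be reproduced with plain arithmetic. The following is a minimal sketch, assuming five one-minute samples of a constant allocation of 1024 ports for one backend instance; the sample values and the `aggregate` helper are hypothetical and are not part of any Azure SDK.

```python
# Hypothetical one-minute samples of allocated SNAT ports for one backend
# instance. The allocation itself stays constant at 1024 ports.
samples_per_minute = [1024, 1024, 1024, 1024, 1024]


def aggregate(samples, kind):
    """Collapse the samples of one time-aggregation window into one point."""
    if kind == "Sum":
        return sum(samples)
    if kind == "Average":
        return sum(samples) / len(samples)
    raise ValueError(f"unsupported aggregation: {kind}")


# A five-minute window holds five samples, so Sum counts the same ports
# five times, while Average still reports the real allocation.
five_minute_sum = aggregate(samples_per_minute, "Sum")
five_minute_avg = aggregate(samples_per_minute, "Average")
print(five_minute_sum)  # 5120
print(five_minute_avg)  # 1024.0
```

This is why Average (or a one-minute time aggregation) is usually the safer choice for gauge-like metrics such as allocated or used SNAT ports.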

Figure: Data Path Availability metric for Standard Load Balancer

Retrieve multi-dimensional metrics programmatically via APIs

For API guidance on retrieving multi-dimensional metric definitions and values, see the Azure Monitoring REST API walkthrough. These metrics can be written to a storage account via the 'All Metrics' option only.
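As a rough illustration of what such a request looks like, the sketch below only builds the request URL and makes no network call. Several details are assumptions to verify against the REST API walkthrough and the metric definitions of your own resource: the `management.azure.com` endpoint (sovereign clouds use a different host), the `2018-01-01` API version, the REST metric name `VipAvailability` for data path availability, and the `FrontendIPAddress` dimension name. The helper `build_metrics_url` and the resource ID are hypothetical.

```python
from urllib.parse import quote, urlencode


def build_metrics_url(resource_id, metric_name, timespan, *,
                      interval="PT1M", aggregation="Average",
                      dimension_filter=None,
                      endpoint="https://management.azure.com",
                      api_version="2018-01-01"):
    """Build an Azure Monitor metrics query URL for a single resource."""
    params = {
        "api-version": api_version,
        "metricnames": metric_name,
        "timespan": timespan,        # ISO 8601 "start/end"
        "interval": interval,        # ISO 8601 duration, here one minute
        "aggregation": aggregation,
    }
    if dimension_filter:
        # Dimension filters are what make the result multi-dimensional.
        params["$filter"] = dimension_filter
    path = quote(resource_id, safe="/")
    return f"{endpoint}{path}/providers/Microsoft.Insights/metrics?{urlencode(params)}"


# Placeholder resource ID; substitute your own subscription, group, and name.
example_url = build_metrics_url(
    "/subscriptions/<subscription-id>/resourceGroups/<resource-group>"
    "/providers/Microsoft.Network/loadBalancers/<load-balancer-name>",
    "VipAvailability",
    "2024-01-01T00:00:00Z/2024-01-01T01:00:00Z",
    dimension_filter="FrontendIPAddress eq '*'",
)
print(example_url)
```

The request itself must carry an Azure Resource Manager bearer token in the `Authorization` header, which is outside the scope of this sketch.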

Configure alerts for multi-dimensional metrics

Azure Standard Load Balancer supports easily configurable alerts for multi-dimensional metrics. Configure custom thresholds for specific metrics to trigger alerts with varying severity levels, so that resources can be monitored without manual checks.

To configure alerts:

  1. Go to the Alerts sub-blade for the load balancer.
  2. Create a new alert rule:
    1. Configure the alert condition.
    2. (Optional) Add an action group for automated repair.
    3. Assign an alert severity, name, and description that make the appropriate reaction clear.
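To make the idea of several thresholds with different severities concrete, here is a purely conceptual sketch. It is not how Azure Monitor evaluates rules internally, and the thresholds are hypothetical; the only fact taken from Azure is that severities run from 0 (most urgent) to 4. The example assumes rules on an averaged availability percentage, where lower values are worse.

```python
# Hypothetical rules as (severity, threshold) pairs. A rule is breached when
# the averaged metric value drops below its threshold.
RULES = [
    (0, 25.0),  # critical: data path almost entirely down
    (1, 90.0),  # error: clearly degraded
    (2, 99.0),  # warning: slightly degraded
]


def triggered_severity(average_value, rules=RULES):
    """Return the most urgent severity whose threshold is breached, or None."""
    for severity, threshold in sorted(rules):
        if average_value < threshold:
            return severity
    return None


print(triggered_severity(100.0))  # None
print(triggered_severity(95.0))   # 2
print(triggered_severity(10.0))   # 0
```

In the portal, each of these pairs would be a separate alert rule with its own condition and severity.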

Is the data path up and available for my Load Balancer Frontend?

The data path availability metric describes the health of the data path within the region to the compute host where your VMs are located. The metric is a reflection of the health of the Azure infrastructure. You can use the metric to:

  • Monitor the external availability of your service.
  • Dig deeper and understand whether the platform on which your service is deployed is healthy or whether your guest OS or application instance is healthy.
  • Isolate whether an event is related to your service or the underlying data plane. Do not confuse this metric with the health probe status ("Backend Instance availability").

To get the Data Path Availability for your Standard Load Balancer resources:

  1. Make sure the correct load balancer resource is selected.
  2. In the Metric drop-down list, select Data Path Availability.
  3. In the Aggregation drop-down list, select Avg.
  4. Additionally, add a filter on the Frontend IP address or Frontend port as the dimension, with the required front-end IP address or front-end port, and then group by the selected dimension.

Figure: Load Balancer Frontend probing details

The metric is generated by an active, in-band measurement. A probing service within the region originates traffic for the measurement. The service is activated as soon as you create a deployment with a public front end, and it continues until you remove the front end.

A packet matching your deployment's front end and rule is generated periodically. It traverses the region from the source to the host where a VM in the back-end pool is located. The load balancer infrastructure performs the same load balancing and translation operations as it does for all other traffic. This probe is in-band on your load-balanced endpoint. After the probe arrives on the compute host where a healthy VM in the back-end pool is located, the compute host generates a response to the probing service. Your VM does not see this traffic.

Data path availability fails for the following reasons:

  • Your deployment has no healthy VMs remaining in the back-end pool.
  • An infrastructure outage has occurred.

For diagnostic purposes, you can use the Data Path Availability metric together with the health probe status.

Use Average as the aggregation for most scenarios.

Are the Backend Instances for my Load Balancer responding to probes?

The health probe status metric describes the health of your application deployment, as determined by the health probe that you configure for your load balancer. The load balancer uses the status of the health probe to determine where to send new flows. Health probes originate from an Azure infrastructure address and are visible within the guest OS of the VM.

To get the health probe status for your Standard Load Balancer resources:

  1. Select the Health Probe Status metric with the Avg aggregation type.
  2. Apply a filter on the required Frontend IP address or port (or both).

Health probes fail for the following reasons:

  • You configure a health probe to a port that is not listening, is not responding, or is using the wrong protocol. If your service is using direct server return (DSR, or floating IP) rules, make sure that the service is listening on the IP address of the NIC's IP configuration and not just on the loopback that's configured with the front-end IP address.
  • Your probe is not permitted by the Network Security Group, the VM's guest OS firewall, or the application layer filters.

Use Average as the aggregation for most scenarios.
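When a TCP health probe is failing, a quick first check is whether anything accepts connections on the probed port. The sketch below uses only the Python standard library; the helper name `tcp_port_accepts_connections` is hypothetical. It tests reachability from wherever it runs (for example, on the VM itself against the NIC's IP address), so it does not reproduce the path that Azure's probes take through the Network Security Group.

```python
import socket


def tcp_port_accepts_connections(host, port, timeout=2.0):
    """Return True if a TCP connection to (host, port) can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # Hypothetical target: the VM's own NIC address and the probed port.
    print(tcp_port_accepts_connections("127.0.0.1", 80))
```

If this returns False on the VM itself, the application or its listening address is the likely cause; if it returns True locally but probes still fail, the firewall and Network Security Group rules are the next place to look.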

How do I check my outbound connection statistics?

The SNAT connections metric describes the volume of successful and failed connections for outbound flows.

A failed connections volume greater than zero indicates SNAT port exhaustion. You must investigate further to determine what may be causing these failures. SNAT port exhaustion manifests as a failure to establish an outbound flow. Review the article about outbound connections to understand the scenarios and mechanisms at work, and to learn how to mitigate and design to avoid SNAT port exhaustion.

To get SNAT connection statistics:

  1. Select the SNAT Connections metric type and Sum as the aggregation.
  2. Group by Connection State so that successful and failed SNAT connection counts are shown as separate lines.
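The same grouping can be applied to exported data points. This is a minimal sketch, assuming the points have already been retrieved as `(connection_state, count)` pairs; the state labels `"Successful"` and `"Failed"` and the sample values are illustrative assumptions, so check the labels your own data uses.

```python
from collections import defaultdict


def sum_by_connection_state(points):
    """Sum SNAT connection counts per connection state."""
    totals = defaultdict(int)
    for state, count in points:
        totals[state] += count
    return dict(totals)


def snat_exhaustion_suspected(points, failed_label="Failed"):
    """A failed-connection volume above zero points to SNAT port exhaustion."""
    return sum_by_connection_state(points).get(failed_label, 0) > 0


# Hypothetical per-minute points grouped by connection state.
points = [("Successful", 120), ("Failed", 0), ("Successful", 140), ("Failed", 3)]
print(sum_by_connection_state(points))    # {'Successful': 260, 'Failed': 3}
print(snat_exhaustion_suspected(points))  # True
```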

Figure: Load Balancer SNAT connection count

How do I check my SNAT port usage and allocation?

The SNAT Usage metric indicates how many unique flows are established between an internet source and a backend VM or virtual machine scale set that is behind a load balancer and does not have a public IP address. By comparing this metric with the SNAT Allocation metric, you can determine whether your service is experiencing, or is at risk of, SNAT exhaustion and the resulting outbound flow failures.

If your metrics indicate a risk of outbound flow failure, refer to the article about outbound connections and take steps to mitigate the risk to ensure service health.

To view SNAT port usage and allocation:

  1. Set the time aggregation of the graph to 1 minute to ensure that the desired data is displayed.
  2. Select SNAT Usage and/or SNAT Allocation as the metric type and Average as the aggregation.
    • By default, this is the average number of SNAT ports allocated to or used by each backend VM or virtual machine scale set, corresponding to all frontend public IPs mapped to the load balancer, aggregated over TCP and UDP.
    • To view the total SNAT ports used by or allocated for the load balancer, use the metric aggregation Sum.
  3. Filter to a specific Protocol Type, a set of Backend IPs, and/or Frontend IPs.
  4. To monitor health per backend or frontend instance, apply splitting.
    • Note that splitting allows only a single metric to be displayed at a time.
  5. For example, to monitor SNAT usage for TCP flows per machine, aggregate by Average, split by Backend IPs, and filter by Protocol Type.
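Comparing usage against allocation per backend instance comes down to a simple ratio. The sketch below assumes the two metrics have already been retrieved as averages split by backend IP; the IP addresses, port counts, helper names, and the 75 percent warning threshold are all hypothetical choices for illustration, not values recommended by Azure.

```python
def snat_utilization(used_by_backend, allocated_by_backend):
    """Return used/allocated SNAT ports as a fraction per backend IP."""
    utilization = {}
    for backend_ip, allocated in allocated_by_backend.items():
        used = used_by_backend.get(backend_ip, 0)
        utilization[backend_ip] = used / allocated if allocated else 0.0
    return utilization


def backends_at_risk(utilization, threshold=0.75):
    """List backend IPs whose utilization meets or exceeds the threshold."""
    return sorted(ip for ip, ratio in utilization.items() if ratio >= threshold)


# Hypothetical one-minute averages for TCP, split by backend IP.
allocated = {"10.0.0.4": 1024, "10.0.0.5": 1024}
used = {"10.0.0.4": 256, "10.0.0.5": 896}

utilization = snat_utilization(used, allocated)
print(utilization)                    # {'10.0.0.4': 0.25, '10.0.0.5': 0.875}
print(backends_at_risk(utilization))  # ['10.0.0.5']
```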

Figure: Average TCP SNAT port allocation and usage for a set of backend VMs

Figure: TCP SNAT port usage per backend instance

How do I check inbound/outbound connection attempts for my service?

The SYN packets metric describes the volume of TCP SYN packets that have arrived or were sent (for [outbound flows](/load-balancer/load-balancer-outbound-connections)) and that are associated with a specific front end. You can use this metric to understand TCP connection attempts to your service.

Use Total as the aggregation for most scenarios.

Figure: Load Balancer SYN count

How do I check my network bandwidth consumption?

The bytes and packet counters metric describes the volume of bytes and packets that are sent or received by your service on a per-front-end basis.

Use Total as the aggregation for most scenarios.

To get byte or packet count statistics:

  1. Select the Bytes Count and/or Packet Count metric type, with Avg as the aggregation.
  2. Do either of the following:
    • Apply a filter on a specific front-end IP, front-end port, back-end IP, or back-end port.
    • Get overall statistics for your load balancer resource without any filtering.
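A byte count becomes a bandwidth figure once it is divided by the length of the window it covers. The following sketch assumes the byte count was retrieved as a total (Sum) over a known time-aggregation window; the helper name and the sample numbers are hypothetical.

```python
def average_mbps(total_bytes, window_seconds):
    """Convert a byte total over a window into average megabits per second."""
    bits = total_bytes * 8
    return bits / window_seconds / 1_000_000


# Hypothetical: 75,000,000 bytes processed during a one-minute window.
print(average_mbps(75_000_000, 60))  # 10.0
```

Because this is an average over the whole window, short bursts inside the window can be considerably higher than the reported value.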

Figure: Load Balancer byte count

How do I diagnose my load balancer deployment?

By combining the Data Path Availability and Health Probe Status metrics on a single chart, you can identify where to look for a problem and resolve it. You can gain assurance that Azure is working correctly and use this knowledge to conclusively determine that the configuration or application is the root cause.

You can use health probe metrics to understand how Azure views the health of your deployment according to the configuration you have provided. Looking at health probes is always a good first step in monitoring or determining a cause.

You can take it a step further and use the Data Path Availability metric to gain insight into how Azure views the health of the underlying data plane that's responsible for your specific deployment. When you combine both metrics, you can isolate where the fault might be, as illustrated in this example:

Figure: Combining Data Path Availability and Health Probe Status metrics

The chart displays the following information:

  • The infrastructure hosting your VMs was unavailable, at 0 percent, at the beginning of the chart. Later, the infrastructure was healthy and the VMs were reachable, and more than one VM was placed in the back end. This information is indicated by the blue trace for data path availability, which was later at 100 percent.
  • The health probe status, indicated by the purple trace, is at 0 percent at the beginning of the chart. The area circled in green highlights where the health probe status became healthy, at which point the customer's deployment was able to accept new flows.

The chart allows customers to troubleshoot the deployment on their own without having to guess or ask support whether other issues are occurring. The service was unavailable because health probes were failing due to either a misconfiguration or a failed application.
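The reasoning behind combining the two metrics can be written down as a small decision function. This is a simplified sketch of the logic described in this article, not an Azure feature: the function name, the returned labels, and the choice of 100 percent as the "healthy" level are assumptions. Both inputs are averaged percentages from 0 to 100.

```python
def locate_fault(data_path_availability, health_probe_status, healthy_at=100.0):
    """Suggest where to look first, given the two averaged percentages."""
    data_path_ok = data_path_availability >= healthy_at
    if data_path_ok:
        if health_probe_status >= healthy_at:
            return "no fault indicated"
        # Azure's data path works, but instances fail their probes.
        return "probe configuration or application"
    if health_probe_status > 0:
        # Healthy instances remain, so the data path should have been measurable.
        return "Azure infrastructure (data plane)"
    # Data path availability also fails when no healthy VMs remain in the pool.
    return "no healthy backend instances or infrastructure outage"


print(locate_fault(100, 0))  # probe configuration or application
print(locate_fault(0, 80))   # Azure infrastructure (data plane)
```

The later part of the example chart corresponds to the first call: data path availability at 100 percent while the health probe status is still at 0 percent.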

Next steps