Network Performance Monitor solution FAQ


This article captures the frequently asked questions (FAQs) about Network Performance Monitor (NPM) in Azure.

Network Performance Monitor detects network issues like traffic blackholing, routing errors, and issues that conventional network monitoring methods aren't able to detect. The solution generates alerts and notifies you when a threshold is breached for a network link. It also ensures timely detection of network performance issues and localizes the source of the problem to a particular network segment or device.

Set up and configure agents

What are the platform requirements for the nodes to be used for monitoring by NPM?

Listed below are the platform requirements for NPM's various capabilities:

  • NPM's Performance Monitor and Service Connectivity Monitor capabilities support both Windows Server and Windows desktop/client operating systems. Supported Windows Server OS versions are 2008 SP1 or later. Supported Windows desktop/client versions are Windows 10, Windows 8.1, Windows 8, and Windows 7.
  • NPM's ExpressRoute Monitor capability supports only the Windows Server (2008 SP1 or later) operating system.

Can I use Linux machines as monitoring nodes in NPM?

The capability to monitor networks using Linux-based nodes is currently in preview. Reach out to your Account Manager to learn more. Linux agents provide monitoring only for NPM's Performance Monitor capability, and are not available for the Service Connectivity Monitor and ExpressRoute Monitor capabilities.

What are the size requirements of the nodes to be used for monitoring by NPM?

To run the NPM solution on node VMs to monitor networks, the nodes should have at least 500 MB of memory and one core. You don't need to use separate nodes for running NPM. The solution can run on nodes that have other workloads running on them. The solution can stop the monitoring process if it uses more than 5% CPU.

To use NPM, should I connect my nodes as Direct Agents or through System Center Operations Manager?

Both the Performance Monitor and the Service Connectivity Monitor capabilities support nodes connected as Direct Agents.

For the ExpressRoute Monitor capability, the Azure nodes should be connected as Direct Agents only. Azure nodes that are connected through Operations Manager are not supported. For on-premises nodes, both nodes connected as Direct Agents and nodes connected through Operations Manager are supported for monitoring an ExpressRoute circuit.

Which protocol among TCP and ICMP should be chosen for monitoring?

If you're monitoring your network using Windows Server-based nodes, we recommend you use TCP as the monitoring protocol since it provides better accuracy.

ICMP is recommended for nodes based on Windows desktop/client operating systems. This platform doesn't allow TCP data to be sent over raw sockets, which NPM uses to discover network topology.

You can get more details on the relative advantages of each protocol here.

How can I configure a node to support monitoring using the TCP protocol?

For the node to support monitoring using the TCP protocol:

  • Ensure that the node platform is Windows Server (2008 SP1 or later).
  • Run the EnableRules.ps1 PowerShell script on the node. See instructions for more details.

How can I change the TCP port being used by NPM for monitoring?

You can change the TCP port used by NPM for monitoring by running the EnableRules.ps1 script. You need to enter the port number you intend to use as a parameter. For example, to enable TCP on port 8060, run EnableRules.ps1 8060. Ensure that you use the same TCP port on all the nodes being used for monitoring.

The script configures only Windows Firewall locally. If you have network firewall or Network Security Group (NSG) rules, make sure that they allow the traffic destined for the TCP port used by NPM.

How many agents should I use?

You should use at least one agent for each subnet that you want to monitor.

What is the maximum number of agents I can use, or why do I see the error "... you've reached your Configuration limit"?

NPM limits the number of IPs to 5000 per workspace. If a node has both IPv4 and IPv6 addresses, they count as 2 IPs for that node. Hence, this limit of 5000 IPs decides the upper limit on the number of agents. You can delete inactive agents from the Nodes tab in NPM >> Configure. NPM also maintains a history of all the IPs that were ever assigned to the VM hosting the agent, and each one is counted as a separate IP toward the upper limit of 5000 IPs. To free up IPs for your workspace, you can use the Nodes page to delete the IPs that are not in use.
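The counting rule above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not NPM's implementation: the `nodes` structure, the node names, and the addresses are hypothetical, and the sketch assumes each node's current and historical addresses are tallied per node.

```python
from typing import Iterable

IP_LIMIT = 5000  # per-workspace limit stated in the answer above


def count_workspace_ips(nodes: Iterable[dict]) -> int:
    """Count every distinct IP (current or historical) held by each node."""
    total = 0
    for node in nodes:
        # Each address counts separately, so a dual-stack node contributes 2.
        total += len(set(node["ips"]))
    return total


# Hypothetical inventory: names and addresses are made up for illustration.
nodes = [
    {"name": "node-a", "ips": ["10.0.0.4", "fd00::4"]},   # IPv4 + IPv6
    {"name": "node-b", "ips": ["10.0.1.5", "10.0.1.9"]},  # current + historical
]
used = count_workspace_ips(nodes)
print(f"{used} of {IP_LIMIT} IPs used")
```

Two nodes already consume four IPs here, which is why deleting unused IPs from the Nodes page can free up room well before the number of agents looks large.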


How are loss and latency calculated?

Source agents send either TCP SYN requests (if TCP is chosen as the protocol for monitoring) or ICMP ECHO requests (if ICMP is chosen as the protocol for monitoring) to the destination IP at regular intervals to ensure that all the paths between the source-destination IP combination are covered. The percentage of packets received and the packet round-trip time are measured to calculate the loss and latency of each path. This data is aggregated over the polling interval and over all the paths to get the aggregated values of loss and latency for the IP combination for the particular polling interval.

With what frequency does the source agent send packets to the destination for monitoring?

For the Performance Monitor and ExpressRoute Monitor capabilities, the source sends packets every 5 seconds and records the network measurements. This data is aggregated over a 3-minute polling interval to calculate the average and peak values of loss and latency. For the Service Connectivity Monitor capability, the frequency of sending packets for network measurement is determined by the frequency entered by the user for the specific test while configuring the test.
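As a rough illustration of that aggregation, the Python sketch below reduces one polling interval of 5-second measurements to average and peak values. The tuple layout and function name are assumptions made for this example; the actual aggregation NPM performs (including how it combines paths) is not described here.

```python
from statistics import mean

PROBE_INTERVAL_S = 5      # one measurement every 5 seconds
POLLING_INTERVAL_S = 180  # aggregated over a 3-minute polling interval


def aggregate_window(measurements):
    """Reduce (loss_percent, latency_ms) measurements to average and peak values."""
    losses = [loss for loss, _ in measurements]
    latencies = [latency for _, latency in measurements]
    return {
        "avg_loss": mean(losses),
        "peak_loss": max(losses),
        "avg_latency": mean(latencies),
        "peak_latency": max(latencies),
    }


# One window holds 180 / 5 = 36 measurements; the values here are made up.
rounds = POLLING_INTERVAL_S // PROBE_INTERVAL_S
window = [(0.0, 20.0)] * (rounds - 1) + [(100.0, 80.0)]
print(aggregate_window(window))
```

A single bad measurement barely moves the average but sets the peak, which is one reason both values are worth tracking.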

How many packets are sent for monitoring?

The number of packets sent by the source agent to the destination in a polling interval is adaptive and is decided by our proprietary algorithm, so it can differ across network topologies. The more network paths there are between the source-destination IP combination, the more packets are sent. The system ensures that all paths between the source-destination IP combination are covered.

How does NPM discover network topology between source and destination?

NPM uses a proprietary algorithm based on Traceroute to discover all the paths and hops between source and destination.

Does NPM provide routing and switching level info?

Though NPM can detect all the possible routes between the source agent and the destination, it does not provide visibility into which route was taken by the packets sent by your specific workloads. The solution can help you identify the paths and underlying network hops that are adding more latency than you expected.

Why are some of the paths unhealthy?

Different network paths can exist between the source and destination IPs, and each path can have a different value of loss and latency. NPM marks as unhealthy (shown in red) those paths whose loss and/or latency values are greater than the respective thresholds set in the monitoring configuration.

What does a hop in red color signify in the network topology map?

If a hop is red, it is part of at least one unhealthy path. NPM only marks the paths as unhealthy; it does not assign a separate health status to each hop. To identify the troublesome hops, you can view the hop-by-hop latency and single out the ones adding more latency than expected.
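The two rules above (a path is unhealthy when loss or latency exceeds its threshold, and a hop is red when it belongs to at least one unhealthy path) can be sketched as follows. The data layout, hop names, and threshold values are hypothetical and chosen only to show the logic.

```python
def classify(paths, loss_threshold, latency_threshold):
    """Return the unhealthy paths and the set of hops that would be shown in red."""
    unhealthy = [
        p for p in paths
        if p["loss"] > loss_threshold or p["latency"] > latency_threshold
    ]
    red_hops = set()
    for path in unhealthy:
        # A hop is red as soon as it appears on any unhealthy path.
        red_hops.update(path["hops"])
    return unhealthy, red_hops


# Two hypothetical paths between the same source and destination.
paths = [
    {"hops": ["A", "B", "C"], "loss": 0.0, "latency": 20.0},
    {"hops": ["A", "D", "C"], "loss": 5.0, "latency": 20.0},
]
unhealthy, red_hops = classify(paths, loss_threshold=1.0, latency_threshold=50.0)
print(sorted(red_hops))  # ['A', 'C', 'D']
```

Hops A and C turn red even though they also sit on the healthy path, which is why a red hop alone does not pinpoint the fault and the hop-by-hop latency still needs to be inspected.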

How does fault localization in Performance Monitor work?

NPM uses a probabilistic mechanism to assign fault probabilities to each network path, network segment, and the constituent network hops based on the number of unhealthy paths they are a part of. As network segments and hops become part of more unhealthy paths, the fault probability associated with them increases. This algorithm works best when you have many nodes with the NPM agent connected to each other, as this increases the data points for calculating the fault probabilities.

How can I create alerts in NPM?

Creating alerts from the NPM UI is currently failing due to an issue. Please create alerts manually.

What are the default Log Analytics queries for alerts?

Performance monitor query

 NetworkMonitoring
 | where (SubType == "SubNetwork" or SubType == "NetworkPath")
 | where (LossHealthState == "Unhealthy" or LatencyHealthState == "Unhealthy") and RuleName == "<<your rule name>>"

Service connectivity monitor query

 NetworkMonitoring
 | where (SubType == "EndpointHealth" or SubType == "EndpointPath")
 | where (LossHealthState == "Unhealthy" or LatencyHealthState == "Unhealthy" or ServiceResponseHealthState == "Unhealthy") and TestName == "<<your test name>>"

ExpressRoute monitor queries: Circuits query

 NetworkMonitoring
 | where (SubType == "ERCircuitTotalUtilization") and (UtilizationHealthState == "Unhealthy") and CircuitResourceId == "<<your circuit resource ID>>"

Private peering

 NetworkMonitoring
 | where (SubType == "ExpressRoutePeering" or SubType == "ERVNetConnectionUtilization" or SubType == "ExpressRoutePath")
 | where (LossHealthState == "Unhealthy" or LatencyHealthState == "Unhealthy" or UtilizationHealthState == "Unhealthy") and CircuitName == "<<your circuit name>>" and VirtualNetwork == "<<vnet name>>"

Microsoft peering

 NetworkMonitoring
 | where (SubType == "ExpressRoutePeering" or SubType == "ERMSPeeringUtilization" or SubType == "ExpressRoutePath")
 | where (LossHealthState == "Unhealthy" or LatencyHealthState == "Unhealthy" or UtilizationHealthState == "Unhealthy") and CircuitName == "<<your circuit name>>" and PeeringType == "MicrosoftPeering"

Common query

 NetworkMonitoring
 | where (SubType == "ExpressRoutePeering" or SubType == "ERVNetConnectionUtilization" or SubType == "ERMSPeeringUtilization" or SubType == "ExpressRoutePath")
 | where (LossHealthState == "Unhealthy" or LatencyHealthState == "Unhealthy" or UtilizationHealthState == "Unhealthy")

Can NPM monitor routers and servers as individual devices?

NPM only identifies the IP and host name of the underlying network hops (switches, routers, servers, etc.) between the source and destination IPs. It also identifies the latency between these identified hops. It does not individually monitor these underlying hops.

Can NPM be used to monitor network connectivity between Azure and AWS?

Yes. Refer to the article Monitor Azure, AWS, and on-premises networks using NPM for details.

Is the ExpressRoute bandwidth usage incoming or outgoing?

Bandwidth usage is the total of incoming and outgoing bandwidth. It is expressed in bits/sec.

Can we get incoming and outgoing bandwidth information for ExpressRoute?

Incoming and outgoing values for both Primary and Secondary bandwidth can be captured.

For MS peering level information, use the following query in Log Search:

 NetworkMonitoring
 | where SubType == "ERMSPeeringUtilization"
 | project CircuitName, PeeringName, BitsInPerSecond, BitsOutPerSecond

For private peering level information, use the following query in Log Search:

 NetworkMonitoring
 | where SubType == "ERVNetConnectionUtilization"
 | project CircuitName, PeeringName, BitsInPerSecond, BitsOutPerSecond

For circuit level information, use the following query in Log Search:

 NetworkMonitoring
 | where SubType == "ERCircuitTotalUtilization"
 | project CircuitName, BitsInPerSecond, BitsOutPerSecond

Which regions are supported for NPM's Performance Monitor?

NPM can monitor connectivity between networks in any part of the world, from a workspace that is hosted in one of the supported regions.

Which regions are supported for NPM's Service Connectivity Monitor?

NPM can monitor connectivity to services in any part of the world, from a workspace that is hosted in one of the supported regions.


Why are some of the hops marked as unidentified in the network topology view?

NPM uses a modified version of traceroute to discover the topology from the source agent to the destination. An unidentified hop means that the network hop did not respond to the source agent's traceroute request. If three consecutive network hops do not respond to the agent's traceroute, the solution marks the unresponsive hops as unidentified and does not try to discover more hops.
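The stopping rule can be illustrated with a small Python sketch. It only models the rule as described above, not NPM's modified traceroute: the input is a hypothetical list with one entry per probed hop, holding the responder's address or `None` when nothing answered.

```python
def discover_hops(responses, max_unresponsive=3):
    """Label silent hops and stop after three consecutive ones."""
    hops = []
    consecutive = 0
    for responder in responses:
        if responder is None:
            hops.append("unidentified")
            consecutive += 1
            if consecutive == max_unresponsive:
                break  # give up: no further hops are discovered
        else:
            hops.append(responder)
            consecutive = 0  # any answer resets the streak
    return hops


# Hypothetical probe results; the last hop is never reached.
probes = ["10.0.0.1", None, "10.0.2.1", None, None, None, "10.0.9.9"]
print(discover_hops(probes))
```

A single silent hop in the middle of a path is simply labeled unidentified, while three in a row end discovery, so hops beyond that point never appear in the topology view.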

A hop may not respond to a traceroute in one or more of the following scenarios:

  • The routers have been configured not to reveal their identity.
  • The network devices are not allowing ICMP_TTL_EXCEEDED traffic.
  • A firewall is blocking the ICMP_TTL_EXCEEDED response from the network device.

When either of the endpoints lies in Azure, traceroute shows unidentified hops because Azure infrastructure does not reveal its identity to traceroute.

I get alerts for unhealthy tests but I do not see the high values in NPM's loss and latency graph. How do I check what is unhealthy?

NPM raises an alert if the end-to-end latency between source and destination crosses the threshold for any path between them. Some networks have multiple paths connecting the same source and destination. NPM raises an alert if any path is unhealthy. The loss and latency seen in the graphs are the average values for all the paths, so they may not show the exact value of a single path. To understand where the threshold has been breached, look for the "SubType" column in the alert. If the issue is caused by a path, the SubType value will be NetworkPath (for Performance Monitor tests), EndpointPath (for Service Connectivity Monitor tests), or ExpressRoutePath (for ExpressRoute Monitor tests).

Sample query to find whether a path is unhealthy:

 NetworkMonitoring
 | where (SubType == "ExpressRoutePath")
 | where (LossHealthState == "Unhealthy" or LatencyHealthState == "Unhealthy" or UtilizationHealthState == "Unhealthy") and CircuitResourceId == "<your ER circuit ID>" and ConnectionResourceId == "<your ER connection resource id>"
 | project SubType, LossHealthState, LatencyHealthState, MedianLatency

Why does my test show unhealthy but the topology does not?

NPM monitors end-to-end loss, latency, and topology at different intervals. Loss and latency are measured once every 5 seconds and aggregated every three minutes (for Performance Monitor and ExpressRoute Monitor), while topology is calculated using traceroute once every 10 minutes. For example, between 3:44 and 4:04, topology may be updated three times (3:44, 3:54, 4:04), but loss and latency are updated about seven times (3:44, 3:47, 3:50, 3:53, 3:56, 3:59, 4:02). The topology generated at 3:54 will be rendered for the loss and latency values calculated at 3:56, 3:59, and 4:02. Suppose you get an alert that your ER circuit was unhealthy at 3:59. You log on to NPM and try to set the topology time to 3:59. NPM will render the topology generated at 3:54. To understand the last known topology of your network, compare the fields TimeProcessed (the time at which loss and latency were calculated) and TracerouteCompletedTime (the time at which topology was calculated).
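The example above amounts to picking the most recent topology snapshot at or before a given loss/latency timestamp. A minimal Python sketch of that lookup follows; the function name and the calendar date are arbitrary choices for illustration, and only the clock times come from the example.

```python
from bisect import bisect_right
from datetime import datetime


def topology_for(timestamp, topology_times):
    """Return the latest topology time at or before timestamp, or None."""
    # topology_times must be sorted in ascending order.
    index = bisect_right(topology_times, timestamp)
    return topology_times[index - 1] if index else None


day = (2024, 1, 1)  # arbitrary date; only the times of day matter here
topology_times = [
    datetime(*day, 3, 44),
    datetime(*day, 3, 54),
    datetime(*day, 4, 4),
]
rendered = topology_for(datetime(*day, 3, 59), topology_times)
print(rendered.strftime("%H:%M"))  # prints 03:54
```

The same comparison can be made on real records by reading TimeProcessed and TracerouteCompletedTime side by side.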

What is the difference between the fields E2EMedianLatency and AvgHopLatencyList in the NetworkMonitoring table?

E2EMedianLatency is the latency updated every three minutes after aggregating the results of TCP ping tests, whereas AvgHopLatencyList is updated every 10 minutes based on traceroute. To find the exact time at which E2EMedianLatency was calculated, use the field TimeProcessed. To find the exact time at which traceroute completed and updated AvgHopLatencyList, use the field TracerouteCompletedTime.

Why do hop-by-hop latency numbers differ from HopLatencyValues?

HopLatencyValues are measured from the source to each hop, not between adjacent hops. For example, with hops A, B, C and AvgHopLatency values 10, 15, 20, the source-to-A latency is 10, the source-to-B latency is 15, and the source-to-C latency is 20. The UI calculates the A-B hop latency as 5 in the topology.
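The subtraction in that example can be written out as a short Python sketch. It assumes the input list holds source-to-hop latencies in path order, as in the example; the function name is made up for illustration.

```python
def per_hop_latency(source_to_hop):
    """Convert source-to-hop latencies into latencies between adjacent hops."""
    result = []
    previous = 0
    for value in source_to_hop:
        result.append(value - previous)
        previous = value
    return result


# Hops A, B, C with source-to-hop latencies 10, 15, 20.
print(per_hop_latency([10, 15, 20]))  # [10, 5, 5]
```

The first entry is the source-to-A latency, and the second entry is the A-B value of 5 that the UI shows.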

The solution shows 100% loss but there is connectivity between the source and destination

This can happen if either the host firewall or an intermediate firewall (network firewall or Azure NSG) is blocking the communication between the source agent and the destination over the port being used for monitoring by NPM (by default the port is 8084, unless the customer has changed it).

  • To verify that the host firewall is not blocking the communication on the required port, view the health status of the source and destination nodes from the following view: Network Performance Monitor -> Configuration -> Nodes. If they are unhealthy, view the instructions and take corrective action. If the nodes are healthy, move to the next step.
  • To verify that an intermediate network firewall or Azure NSG is not blocking the communication on the required port, use the third-party PsPing utility as follows:
    • The PsPing utility is available for download here.
    • Run the following command from the source node:
      • psping -n 15 <destination node IPAddress>:portNumber
      • By default, NPM uses port 8084. If you have explicitly changed it by using the EnableRules.ps1 script, enter the custom port number you are using. This is a ping from the Azure machine to on-premises.
  • Check if the pings are successful. If not, an intermediate network firewall or Azure NSG is blocking the traffic on this port.
  • Now, run the command from the destination node to the source node IP.
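Where PsPing is not at hand, a single TCP connection attempt gives a rough equivalent of the reachability check in the steps above. The Python sketch below is not PsPing and does not measure latency; it only reports whether a TCP connection to the port can be opened, which assumes something (such as the NPM agent) is listening there. The function name and the 3-second timeout are choices made for this example.

```python
import socket


def tcp_port_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection performs the TCP handshake and raises OSError
        # (refused, unreachable, or timed out) when it cannot complete.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, calling `tcp_port_reachable("<destination node IPAddress>", 8084)` from the source node, and then the reverse from the destination node, mirrors the two directions tested above. A `False` result does not say which device dropped the traffic, only that the handshake did not complete.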

There is loss from node A to B, but not from node B to A. Why?

Because the network paths from A to B can differ from the network paths from B to A, different values for loss and latency can be observed.

Why are all my ExpressRoute circuits and peering connections not being discovered?

NPM now discovers ExpressRoute circuits and peering connections in all subscriptions to which the user has access. Choose all the subscriptions where your ExpressRoute resources are linked and enable monitoring for each discovered resource. NPM looks for connection objects when discovering a private peering, so check whether a VNET is associated with your peering. NPM does not detect circuits and peerings that are in a different tenant from the Log Analytics workspace.

The ER Monitor capability has a diagnostic message "Traffic is not passing through ANY circuit". What does that mean?

There can be a scenario where there is a healthy connection between the on-premises and Azure nodes but the traffic is not going over the ExpressRoute circuit configured to be monitored by NPM.

This can happen if:

  • The ER circuit is down.
  • The route filters are configured in such a manner that they give priority to other routes (such as a VPN connection or another ExpressRoute circuit) over the intended ExpressRoute circuit.
  • The on-premises and Azure nodes chosen for monitoring the ExpressRoute circuit in the monitoring configuration do not have connectivity to each other over the intended ExpressRoute circuit. Ensure that you have chosen the correct nodes, which have connectivity to each other over the ExpressRoute circuit you intend to monitor.

Why does ExpressRoute Monitor report my circuit/peering as unhealthy when it is available and passing data?

ExpressRoute Monitor compares the network performance values (loss, latency, and bandwidth utilization) reported by the agents/service with the thresholds set during configuration. For a circuit, if the reported bandwidth utilization is greater than the threshold set in the configuration, the circuit is marked as unhealthy. For peerings, if the reported loss, latency, or bandwidth utilization is greater than the threshold set in the configuration, the peering is marked as unhealthy. NPM does not use metrics or any other form of data to decide the health state.

Why does ExpressRoute Monitor's bandwidth utilization report a value different from the metrics bits in/out?

For ExpressRoute Monitor, bandwidth utilization is the average of incoming and outgoing bandwidth over the last 20 minutes. It is expressed in bits/sec. For ExpressRoute metrics, bits in/out are per-minute data points. Internally, the dataset used for both is the same, but the aggregation varies between NPM and ER metrics. For granular, minute-by-minute monitoring and fast alerts, we recommend setting alerts directly on ER metrics.

While configuring monitoring of my ExpressRoute circuit, the Azure nodes are not being detected.

This can happen if the Azure nodes are connected through Operations Manager. The ExpressRoute Monitor capability supports only those Azure nodes that are connected as Direct Agents.

I cannot discover my ExpressRoute circuits in the OMS portal

Though NPM can be used from both the Azure portal and the OMS portal, circuit discovery in the ExpressRoute Monitor capability works only through the Azure portal. Once the circuits are discovered through the Azure portal, you can then use the capability in either of the two portals.

In the Service Connectivity Monitor capability, the service response time, network loss, and latency are all shown as NA

This can happen if one or more of the following is true:

  • The service is down.
  • The node used for checking network connectivity to the service is down.
  • The target entered in the test configuration is incorrect.
  • The node doesn't have any network connectivity.

In the Service Connectivity Monitor capability, a valid service response time is shown but network loss and latency are shown as NA

This can happen if one or more of the following is true:

  • If the node used for checking network connectivity to the service is a Windows client machine, either the target service is blocking ICMP requests or a network firewall is blocking ICMP requests that originate from the node.
  • The Perform network measurements check box is blank in the test configuration.

In the Service Connectivity Monitor capability, the service response time is NA but network loss and latency are valid

This can happen if the target service is not a web application but the test is configured as a Web test. Edit the test configuration, and choose the test type as Network instead of Web.


Is there a performance impact on the node being used for monitoring?

The NPM process is configured to stop if it uses more than 5% of the host CPU resources. This ensures that you can keep using the nodes for their usual workloads without impacting performance.

Does NPM edit firewall rules for monitoring?

NPM only creates a local Windows Firewall rule on the nodes on which the EnableRules.ps1 PowerShell script is run, to allow the agents to create TCP connections with each other on the specified port. The solution does not modify any network firewall or Network Security Group (NSG) rules.

How can I check the health of the nodes being used for monitoring?

You can view the health status of the nodes being used for monitoring from the following view: Network Performance Monitor -> Configuration -> Nodes. If a node is unhealthy, you can view the error details and take the suggested action.

Can NPM report latency numbers in microseconds?

NPM rounds the latency numbers shown in the UI and displays them in milliseconds. The same data is stored at a higher granularity (sometimes up to four decimal places).

Next steps