How to troubleshoot issues with the Log Analytics agent for Windows

This article helps you troubleshoot errors you might encounter with the Log Analytics agent for Windows in Azure Monitor and suggests possible solutions.

If none of these steps work for you, the following support channels are also available:

Log Analytics Troubleshooting Tool

The Log Analytics Agent Windows Troubleshooting Tool is a collection of PowerShell scripts designed to help find and diagnose issues with the Log Analytics agent. It is automatically included with the agent upon installation. Running the tool should be the first step in diagnosing an issue.

How to use

  1. Open a PowerShell prompt as Administrator on the machine where the Log Analytics agent is installed.

  2. Navigate to the directory where the tool is located.

    • cd "C:\Program Files\Microsoft Monitoring Agent\Agent\Troubleshooter"
  3. Execute the main script using this command:

    • .\GetAgentInfo.ps1
  4. Select a troubleshooting scenario.

  5. Follow the instructions on the console. Note: the trace logs step requires manual intervention to stop log collection. Depending on how reproducible the issue is, wait long enough to capture it, then press 's' to stop log collection and proceed to the next step.

    The location of the results file is logged upon completion, and a new Explorer window opens with the file highlighted.


The Troubleshooting Tool is automatically included when you install Log Analytics agent build 10.20.18053.0 or later.
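
The build requirement above can be checked with a small comparison. The following is a minimal Python sketch, assuming you already have the agent's build string at hand; the function names `parse_build` and `includes_troubleshooter` are hypothetical and not part of the agent or the tool.

```python
# Minimum build that ships the Troubleshooting Tool, per the text above.
MIN_BUILD = (10, 20, 18053, 0)


def parse_build(version):
    """Turn a dotted build string such as '10.20.18053.0' into a 4-tuple of ints.

    Shorter strings are padded with zeros so that '10.20.18053'
    compares equal to '10.20.18053.0'.
    """
    parts = tuple(int(part) for part in version.strip().split("."))
    return parts + (0,) * (4 - len(parts))


def includes_troubleshooter(version):
    """Return True if the build is 10.20.18053.0 or later."""
    return parse_build(version) >= MIN_BUILD
```

Comparing tuples of integers avoids the pitfall of comparing version strings lexically, where "10.9" would sort after "10.20".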

Scenarios covered

The Troubleshooting Tool checks the following scenarios:

  • Agent not reporting data or heartbeat data missing
  • Agent extension deployment failing
  • Agent crashing
  • Agent consuming high CPU/memory
  • Installation/uninstallation failures
  • Custom logs issue
  • OMS Gateway issue
  • Performance counters issue
  • Collect all logs


Run the Troubleshooting Tool when you experience an issue. When you open a support ticket, providing the logs up front helps the support team resolve your issue more quickly.

Important troubleshooting sources

To assist with troubleshooting issues related to the Log Analytics agent for Windows, the agent logs events to the Windows Event Log, specifically under Application and Services\Operations Manager.

Connectivity issues

If the agent communicates through a proxy server or firewall, restrictions might be in place that prevent communication between the source computer and the Azure Monitor service. If communication is blocked because of misconfiguration, registration with a workspace might fail while you install the agent or when you configure the agent after setup to report to an additional workspace. Agent communication can also fail after successful registration. This section describes how to troubleshoot this type of issue with the Windows agent.

Double-check that the firewall or proxy is configured to allow the ports and URLs described in the following table. Also confirm that HTTPS inspection is not enabled for web traffic, because it can prevent a secure TLS channel between the agent and Azure Monitor.

Agent Resource | Ports | Direction | Bypass HTTPS inspection
*.ods.opinsights.azure.cn | Port 443 | Outbound | Yes
*.oms.opinsights.azure.cn | Port 443 | Outbound | Yes
*.blob.core.chinacloudapi.cn | Port 443 | Outbound | Yes
*.azure-automation.cn | Port 443 | Outbound | Yes
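
As a rough first check of the table above, outbound TCP reachability on port 443 can be probed from the agent machine. The following is a minimal Python sketch, not a replacement for the agent's own tools. It assumes the `*` in the first two rows stands for the workspace ID (the article shows this form only for the oms endpoint), it connects directly rather than through a configured proxy, and it does not validate TLS. The names `workspace_endpoints` and `can_connect` are hypothetical.

```python
import socket

# Suffixes taken from the first two rows of the table above.
# Assumption: the wildcard prefix is the workspace ID.
ENDPOINT_SUFFIXES = ("ods.opinsights.azure.cn", "oms.opinsights.azure.cn")


def workspace_endpoints(workspace_id):
    """Build the workspace-specific host names to probe."""
    return [f"{workspace_id}.{suffix}" for suffix in ENDPOINT_SUFFIXES]


def can_connect(host, port=443, timeout=5.0):
    """Return True if a direct TCP connection to host:port succeeds.

    DNS failures, refused connections, and timeouts all surface as
    OSError subclasses and are reported as False.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

To use it, call `can_connect(host)` for each host returned by `workspace_endpoints("<workspaceID>")`, substituting your own workspace ID. A False result only shows that a direct connection failed; if the agent is configured to use a proxy, the result says nothing about the proxied path.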

If you plan to use the Azure Automation Hybrid Runbook Worker to connect to and register with the Automation service to use runbooks or management solutions in your environment, it must have access to the port number and the URLs described in Configure your network for the Hybrid Runbook Worker.

There are several ways to verify that the agent is successfully communicating with Azure Monitor.

  • Run the following query to confirm the agent is sending a heartbeat to the workspace it is configured to report to. Replace <ComputerName> with the actual name of the machine.

    Heartbeat
    | where Computer like "<ComputerName>"
    | summarize arg_max(TimeGenerated, *) by Computer

    If the computer is successfully communicating with the service, the query should return a result. If the query does not return a result, first verify that the agent is configured to report to the correct workspace. If it is configured correctly, proceed to the third method below and search the Windows Event Log to see whether the agent is logging the issue that prevents it from communicating with Azure Monitor.

  • Another method to identify a connectivity issue is to run the TestCloudConnectivity tool. The tool is installed by default with the agent in the folder %SystemDrive%\Program Files\Microsoft Monitoring Agent\Agent. From an elevated command prompt, navigate to the folder and run the tool. The tool returns the results and highlights where the test failed (for example, whether the failure relates to a particular port or URL that was blocked).

    TestCloudConnection tool execution results

  • Filter the Operations Manager event log by event sources Health Service Modules, HealthService, and Service Connector, and by event levels Warning and Error, to confirm whether the agent has written any of the events in the following table. If it has, review the resolution steps included for each event.

    Event ID | Source | Description | Resolution
    2133 & 2129 | Health Service | Connection to the service from the agent failed | This error can occur when the agent cannot communicate directly or through a firewall/proxy server to the Azure Monitor service. Verify the agent proxy settings, or verify that the network firewall/proxy allows TCP traffic from the computer to the service.
    2138 | Health Service Modules | Proxy requires authentication | Configure the agent proxy settings and specify the username/password required to authenticate with the proxy server.
    2129 | Health Service Modules | Failed connection/Failed TLS negotiation | Check your network adapter TCP/IP settings and agent proxy settings.
    2127 | Health Service Modules | Failure sending data received error code | If it only happens occasionally during the day, it could be a random anomaly that can be ignored. Monitor to understand how often it happens. If it happens often throughout the day, first check your network configuration and proxy settings. If the description includes HTTP error code 404 and it's the first time that the agent tries to send data to the service, it will include a 500 error with an inner 404 error code. 404 means not found, which indicates that the storage area for the new workspace is still being provisioned. On the next retry, data will be written to the workspace as expected. An HTTP error 403 might indicate a permission or credentials issue. More information is included with the 403 error to help troubleshoot the issue.
    4000 | Service Connector | DNS name resolution failed | The machine could not resolve the Internet address used when sending data to the service. The cause might be the DNS resolver settings on your machine, incorrect proxy settings, or a temporary DNS issue with your provider. If it happens periodically, it could be caused by a transient network-related issue.
    4001 | Service Connector | Connection to the service failed. | This error can occur when the agent cannot communicate directly or through a firewall/proxy server to the Azure Monitor service. Verify the agent proxy settings, or verify that the network firewall/proxy allows TCP traffic from the computer to the service.
    4002 | Service Connector | The service returned HTTP status code 403 in response to a query. Check with the service administrator for the health of the service. The query will be retried later. | This error is written during the agent's initial registration phase, and you'll see a URL similar to the following: https://<workspaceID>.oms.opinsights.azure.cn/AgentService.svc/AgentTopologyRequest. Error code 403 means forbidden and can be caused by a mistyped workspace ID or key, or by an incorrect date and time on the computer. If the computer's time differs from the current time by more than 15 minutes, onboarding fails. To correct this, update the date and/or time zone of your Windows computer.

Data collection issues

After the agent is installed and reports to its configured workspace or workspaces, it might stop receiving configuration, or stop collecting or forwarding performance data, logs, or other data to the service, depending on what is enabled and targeting the computer. You need to determine the following:

  • Is it a particular data type or all data that is not available in the workspace?
  • Is the data type specified by a solution or specified as part of the workspace data collection configuration?
  • How many computers are affected? Is it a single computer or multiple computers reporting to the workspace?
  • Was data collection working and then stopped at a particular time of day, or has the data never been collected?
  • Is the log search query you are using syntactically correct?
  • Has the agent ever received its configuration from Azure Monitor?

The first step in troubleshooting is to determine whether the computer is sending a heartbeat event.

    Heartbeat
    | where Computer like "<ComputerName>"
    | summarize arg_max(TimeGenerated, *) by Computer
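
If you run this check for many machines, the query text can be generated rather than edited by hand. The following is a minimal Python sketch that only builds the query string. It assumes the query reads from the `Heartbeat` table and does not submit the query anywhere. The function name `heartbeat_query` is hypothetical, and the escaping covers only backslashes and double quotes.

```python
def heartbeat_query(computer_name):
    """Build the heartbeat query text for one computer name.

    Backslashes and double quotes are escaped so the name stays
    inside the quoted string literal.
    """
    escaped = computer_name.replace("\\", "\\\\").replace('"', '\\"')
    return (
        "Heartbeat\n"
        f'| where Computer like "{escaped}"\n'
        "| summarize arg_max(TimeGenerated, *) by Computer"
    )
```

For example, `print(heartbeat_query("web01"))` prints the three-line query with `web01` in place of `<ComputerName>`.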

If the query returns results, you need to determine whether a particular data type is not being collected and forwarded to the service. This could be caused by the agent not receiving updated configuration from the service, or by some other symptom that prevents the agent from operating normally. Perform the following steps to troubleshoot further.

  1. Open an elevated command prompt on the computer and restart the agent service by typing net stop healthservice && net start healthservice.

  2. Open the Operations Manager event log and search for event IDs 7023, 7024, 7025, 7028, and 1210 from event source HealthService. These events indicate that the agent is successfully receiving configuration from Azure Monitor and is actively monitoring the computer. The event description for event ID 1210 also specifies, on its last line, all of the solutions and Insights that are included in the scope of monitoring on the agent.

    Event ID 1210 description

  3. If after several minutes you do not see the expected data in the query results or visualization (depending on whether you are viewing the data from a solution or an Insight), search the Operations Manager event log for event sources HealthService and Health Service Modules, and filter by event levels Warning and Error to confirm whether the agent has written any of the events in the following table.

    Event ID | Source | Description | Resolution
    8000 | HealthService | This event specifies whether a workflow related to performance, event, or other collected data type is unable to forward data to the service for ingestion into the workspace. | Event ID 2136 from source HealthService is written together with this event and can indicate that the agent is unable to communicate with the service, possibly because of misconfigured proxy and authentication settings, a network outage, or a network firewall/proxy that does not allow TCP traffic from the computer to the service.
    10102 and 10103 | Health Service Modules | Workflow could not resolve data source. | This can occur if the specified performance counter or instance does not exist on the computer or is incorrectly defined in the workspace data settings. If this is a user-specified performance counter, verify that the information specified follows the correct format and that the counter exists on the target computers.
    26002 | Health Service Modules | Workflow could not resolve data source. | This can occur if the specified Windows event log does not exist on the computer. This error can be safely ignored if the computer is not expected to have this event log registered. Otherwise, if this is a user-specified event log, verify that the information specified is correct.