Log data ingestion time in Azure Monitor

Azure Monitor is a high-scale data service that serves thousands of customers sending terabytes of data each month at a growing pace. There are often questions about how long it takes for log data to become available after it's collected. This article explains the different factors that affect this latency.

Typical latency

Latency refers to the time between when data is created on the monitored system and when it becomes available for analysis in Azure Monitor. The typical latency to ingest log data is between 2 and 5 minutes. The specific latency for any particular data varies depending on the factors explained below.

Factors affecting latency

The total ingestion time for a particular set of data can be broken down into the following high-level areas.

  • Agent time - The time to discover an event, collect it, and then send it to the Azure Monitor ingestion point as a log record. In most cases, this process is handled by an agent.
  • Pipeline time - The time for the ingestion pipeline to process the log record. This includes parsing the properties of the event and potentially adding calculated information.
  • Indexing time - The time spent to ingest a log record into the Azure Monitor big data store.

The different sources of latency introduced in this process are described in detail below.

Agent collection latency

Agents and management solutions use different strategies to collect data from a virtual machine, which can affect latency. Some specific examples include the following:

  • Windows events, syslog events, and performance metrics are collected immediately. Linux performance counters are polled at 30-second intervals.
  • IIS logs and custom logs are collected once their timestamp changes. For IIS logs, this is influenced by the rollover schedule configured on IIS.
  • The Active Directory Replication solution performs its assessment every five days, while the Active Directory Assessment solution performs a weekly assessment of your Active Directory infrastructure. The agent collects these logs only when the assessment is complete.

Agent upload frequency

To keep the Log Analytics agent lightweight, the agent buffers logs and periodically uploads them to Azure Monitor. Upload frequency varies between 30 seconds and 2 minutes depending on the type of data. Most data is uploaded in under 1 minute. Network conditions can add to the time it takes for this data to reach the Azure Monitor ingestion point.

Azure activity logs, resource logs, and metrics

Azure data takes additional time to become available at the Log Analytics ingestion point for processing:

  • Data from diagnostic logs takes 2 to 15 minutes, depending on the Azure service. See the query below to examine this latency in your environment.
  • Azure platform metrics take 3 minutes to be sent to the Log Analytics ingestion point.
  • Activity log data takes about 10 to 15 minutes to be sent to the Log Analytics ingestion point.

Once available at the ingestion point, data takes an additional 2 to 5 minutes to become available for querying.

Management solutions collection

Some solutions don't collect their data from an agent and may use a collection method that introduces additional latency. Some solutions collect data at regular intervals without attempting near-real-time collection. Specific examples include the following:

  • The Office 365 solution polls activity logs by using the Office 365 Management Activity API, which currently doesn't provide any near-real-time latency guarantees.
  • Windows Analytics solutions data (Update Compliance, for example) is collected by the solution at a daily frequency.

Refer to the documentation for each solution to determine its collection frequency.

Pipeline processing time

Once log records are ingested into the Azure Monitor pipeline (as identified in the _TimeReceived property), they're written to temporary storage to ensure tenant isolation and to make sure that data isn't lost. This process typically adds 5 to 15 seconds. Some management solutions implement heavier algorithms to aggregate data and derive insights as data streams in. For example, Network Performance Monitor aggregates incoming data over 3-minute intervals, effectively adding 3 minutes of latency. Another process that adds latency is the one that handles custom logs. In some cases, this process can add a few minutes of latency to logs that the agent collects from files.
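Because _TimeReceived marks arrival at the pipeline, the backend portion of the latency (pipeline processing plus indexing) can be estimated by comparing it to ingestion_time(). The following is a minimal sketch using the Heartbeat table as an example; any table that exposes _TimeReceived would work:

```kusto
// Estimate backend latency: time between arrival at the ingestion
// point (_TimeReceived) and availability for queries (ingestion_time()).
Heartbeat
| where TimeGenerated > ago(8h)
| extend BackendLatency = ingestion_time() - _TimeReceived
| summarize percentiles(BackendLatency, 50, 95)
```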

New custom data types provisioning

When a new type of custom data is created from a custom log or the Data Collector API, the system creates a dedicated storage container. This is a one-time overhead that occurs only on the first appearance of this data type.

Surge protection

The top priority of Azure Monitor is to ensure that no customer data is lost, so the system has built-in protection for data surges. This includes buffers that keep the system functioning even under immense load. Under normal load, these controls add less than a minute, but in extreme conditions and failures they can add significant time while ensuring data is safe.

Indexing time

Every big data platform has a built-in balance between providing analytics and advanced search capabilities and providing immediate access to the data. Azure Monitor allows you to run powerful queries on billions of records and get results within a few seconds. This is made possible because the infrastructure transforms the data dramatically during ingestion and stores it in unique compact structures. The system buffers the data until enough of it is available to create these structures. This must complete before the log record appears in search results.

This process currently takes about 5 minutes when the volume of data is low, and less time at higher data rates. This seems counterintuitive, but it allows latency to be optimized for high-volume production workloads.

Checking ingestion time

Ingestion time can vary for different resources under different circumstances. You can use log queries to identify the specific behavior of your environment. The following table specifies how to determine the different times for a record as it's created and sent to Azure Monitor.

Step | Property or function | Comments
Record created at data source | TimeGenerated | If the data source doesn't set this value, it's set to the same time as _TimeReceived.
Record received by Azure Monitor ingestion endpoint | _TimeReceived |
Record stored in workspace and available for queries | ingestion_time() |
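The three timestamps above can be inspected side by side. The following is an illustrative sketch using the Heartbeat table; substitute any table of interest:

```kusto
// Show all three timestamps for recent heartbeat records.
Heartbeat
| where TimeGenerated > ago(1h)
| project Computer, TimeGenerated, _TimeReceived, IngestionTime = ingestion_time()
| take 10
```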

Ingestion latency delays

You can measure the latency of a specific record by comparing the result of the ingestion_time() function to the TimeGenerated field. This data can be used with various aggregations to find how ingestion latency behaves. Examine a percentile of the ingestion time to get insights for large amounts of data.

For example, the following query shows which computers had the highest ingestion time over the prior 8 hours:

Heartbeat
| where TimeGenerated > ago(8h) 
| extend E2EIngestionLatency = ingestion_time() - TimeGenerated 
| extend AgentLatency = _TimeReceived - TimeGenerated 
| summarize percentiles(E2EIngestionLatency,50,95), percentiles(AgentLatency,50,95) by Computer 
| top 20 by percentile_E2EIngestionLatency_95 desc

The preceding percentile checks are good for finding general trends in latency. To identify a short-term spike in latency, using the maximum (max()) might be more effective.
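As a sketch of that max()-based approach, the percentile query above can be adapted as follows:

```kusto
// Worst-case (maximum) end-to-end ingestion latency per computer,
// which surfaces short-lived spikes that percentiles smooth over.
Heartbeat
| where TimeGenerated > ago(8h)
| extend E2EIngestionLatency = ingestion_time() - TimeGenerated
| summarize MaxLatency = max(E2EIngestionLatency) by Computer
| top 20 by MaxLatency desc
```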

If you want to drill down on the ingestion time for a specific computer over a period of time, use the following query, which also visualizes the data from the past day in a chart:

Heartbeat 
| where TimeGenerated > ago(24h) //and Computer == "ContosoWeb2-Linux"  
| extend E2EIngestionLatencyMin = todouble(datetime_diff("Second",ingestion_time(),TimeGenerated))/60 
| extend AgentLatencyMin = todouble(datetime_diff("Second",_TimeReceived,TimeGenerated))/60 
| summarize percentiles(E2EIngestionLatencyMin,50,95), percentiles(AgentLatencyMin,50,95) by bin(TimeGenerated,30m) 
| render timechart

Use the following query to show computer ingestion time by the country/region in which the computers are located, based on their IP address:

Heartbeat 
| where TimeGenerated > ago(8h) 
| extend E2EIngestionLatency = ingestion_time() - TimeGenerated 
| extend AgentLatency = _TimeReceived - TimeGenerated 
| summarize percentiles(E2EIngestionLatency,50,95),percentiles(AgentLatency,50,95) by RemoteIPCountry 

Different data types originating from the agent might have different ingestion latency times, so the previous queries could be used with other types. Use the following query to examine the ingestion time of various Azure services:

AzureDiagnostics 
| where TimeGenerated > ago(8h) 
| extend E2EIngestionLatency = ingestion_time() - TimeGenerated 
| extend AgentLatency = _TimeReceived - TimeGenerated 
| summarize percentiles(E2EIngestionLatency,50,95), percentiles(AgentLatency,50,95) by ResourceProvider

Resources that stop responding

In some cases, a resource could stop sending data. To understand whether a resource is still sending data, look at its most recent record, which can be identified by the standard TimeGenerated field.
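A generic sketch of that check, using the Event table as a stand-in for whichever table the resource writes to:

```kusto
// Find the most recent record per computer in a given table and
// surface the ones that have been silent the longest.
Event
| summarize LastRecord = max(TimeGenerated) by Computer
| order by LastRecord asc
```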

Use the Heartbeat table to check the availability of a VM, since the agent sends a heartbeat once a minute. Use the following query to list the active computers that haven't reported a heartbeat recently:

Heartbeat  
| where TimeGenerated > ago(1d) //show only VMs that were active in the last day 
| summarize NoHeartbeatPeriod = now() - max(TimeGenerated) by Computer  
| top 20 by NoHeartbeatPeriod desc 

Next steps