Optimize log queries in Azure Monitor

Azure Monitor Logs uses Azure Data Explorer (ADX) to store log data and run queries for analyzing that data. It creates, manages, and maintains the ADX clusters for you, and optimizes them for your log analysis workload. When you run a query, it's optimized and routed to the appropriate ADX cluster that stores the workspace data. Both Azure Monitor Logs and Azure Data Explorer use many automatic query optimization mechanisms. While automatic optimizations provide a significant boost, there are cases where you can dramatically improve query performance yourself. This article explains the performance considerations and several techniques to address them.

Most of the techniques are common to queries that are run directly on Azure Data Explorer and on Azure Monitor Logs, though several considerations unique to Azure Monitor Logs are discussed here. For more Azure Data Explorer optimization tips, see Query best practices.

Optimized queries will:

  • Run faster, reducing the overall duration of query execution.
  • Have a smaller chance of being throttled or rejected.

Pay particular attention to queries that are used for recurrent and bursty usage, such as dashboards, alerts, Logic Apps, and Power BI. The impact of an ineffective query in these cases is substantial.

Query performance pane

After you run a query in Log Analytics, click the down arrow above the query results to view the query performance pane, which shows the results of several performance indicators for the query. Each of these performance indicators is described in the following section.


Query performance indicators

The following query performance indicators are available for every query that is executed:

  • Total CPU: Overall compute used to process the query across all compute nodes. It represents time used for computing, parsing, and data fetching.

  • Data used for processed query: Overall data that was accessed to process the query. Influenced by the size of the target table, time span used, filters applied, and the number of columns referenced.

  • Time span of the processed query: The gap between the newest and the oldest data that was accessed to process the query. Influenced by the explicit time range specified for the query.

  • Age of processed data: The gap between now and the oldest data that was accessed to process the query. It highly influences the efficiency of data fetching.

  • Number of workspaces: How many workspaces were accessed during the query processing due to implicit or explicit selection.

  • Number of regions: How many regions were accessed during the query processing, based on the implicit or explicit selection of workspaces. Multi-region queries are much less efficient, and the performance indicators present only partial coverage.

  • Parallelism: Indicates how much the system was able to execute this query on multiple nodes. Relevant only to queries that have high CPU consumption. Influenced by usage of specific functions and operators.

Total CPU

The actual compute CPU that was invested to process this query across all the query processing nodes. Since most queries are executed on large numbers of nodes, this will usually be much larger than the duration the query actually took to execute.

A query that uses more than 100 seconds of CPU is considered a query that consumes excessive resources. A query that uses more than 1,000 seconds of CPU is considered an abusive query and might be throttled.

Query processing time is spent on:

  • Data retrieval - retrieval of old data will consume more time than retrieval of recent data.
  • Data processing - logic and evaluation of the data.

Other than the time spent in the query processing nodes, Azure Monitor Logs spends additional time to authenticate the user and verify that they are permitted to access the data, locate the data store, parse the query, and allocate the query processing nodes. This time is not included in the query's total CPU time.

Early filtering of records prior to using high-CPU functions

Some query commands and functions are heavy in their CPU consumption. This is especially true for commands that parse JSON and XML or extract complex regular expressions. Such parsing can happen explicitly via the parse_json() or parse_xml() functions, or implicitly when referring to dynamic columns.

These functions consume CPU in proportion to the number of rows they process. The most efficient optimization is to add where conditions early in the query that filter out as many records as possible before the CPU-intensive function is executed.

For example, the following queries produce exactly the same result, but the second one is by far the more efficient, because the where conditions before parsing exclude many records:

//less efficient
SecurityEvent
| extend Details = parse_xml(EventData)
| extend FilePath = tostring(Details.UserData.RuleAndFileData.FilePath)
| extend FileHash = tostring(Details.UserData.RuleAndFileData.FileHash)
| where FileHash != "" and FilePath !startswith "%SYSTEM32"  // Problem: irrelevant results are filtered after all processing and parsing is done
| summarize count() by FileHash, FilePath
//more efficient
SecurityEvent
| where EventID == 8002 //Only this event has FileHash
| where EventData !has "%SYSTEM32" //Early removal of unwanted records
| extend Details = parse_xml(EventData)
| extend FilePath = tostring(Details.UserData.RuleAndFileData.FilePath)
| extend FileHash = tostring(Details.UserData.RuleAndFileData.FileHash)
| where FileHash != "" and FilePath !startswith "%SYSTEM32"  // exact removal of results. Early filter is not accurate enough
| summarize count() by FileHash, FilePath
| where FileHash != "" // No need to filter out %SYSTEM32 here as it was removed before

Avoid using evaluated where clauses

Queries that contain where clauses on an evaluated column, rather than on columns that are physically present in the dataset, lose efficiency. Filtering on evaluated columns prevents some system optimizations when large sets of data are handled. For example, the following queries produce exactly the same result, but the second one is more efficient because its where condition refers to a built-in column:

//less efficient
Syslog
| extend Msg = strcat("Syslog: ",SyslogMessage)
| where  Msg  has "Error"
| count 
//more efficient
Syslog
| where  SyslogMessage  has "Error"
| count 

In some cases, the evaluated column is created implicitly by the query processing engine, because the filtering is done not just on the field itself:

//less efficient
SecurityEvent
| where tolower(Process) == "conhost.exe"
| count 
//more efficient
SecurityEvent
| where Process =~ "conhost.exe"
| count 

Use effective aggregation commands and dimensions in summarize and join

While some aggregation commands like max(), sum(), count(), and avg() have low CPU impact due to their logic, others are more complex and include heuristics and estimations that allow them to be executed efficiently. For example, dcount() uses the HyperLogLog algorithm to provide a close estimation of the distinct count of large sets of data without actually counting each value; the percentile functions perform similar approximations using the nearest-rank percentile algorithm. Several of the commands include optional parameters to reduce their impact. For example, the makeset() function has an optional parameter to define the maximum set size, which significantly affects CPU and memory.
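For illustration, here is a minimal sketch of those optional parameters (the table and threshold values are chosen arbitrarily, and make_set() is used as the current name for makeset()): the second argument of dcount() lowers the estimation accuracy, and the second argument of make_set() caps the set size.

//hypothetical sketch - optional arguments trade accuracy and set size for lower CPU and memory
Perf
| summarize ComputerCount = dcount(Computer, 0),          //accuracy level 0: fastest, least precise estimate
            SampleCounters = make_set(CounterName, 10)    //cap the set at 10 items to bound CPU and memory
    by ObjectName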

The join and summarize commands may cause high CPU utilization when they process a large set of data. Their complexity is directly related to the number of possible values, referred to as cardinality, of the columns that are used as the by clause in summarize or as the join attributes. For explanation and optimization of join and summarize, see their documentation articles and optimization tips.

For example, the following queries produce exactly the same result because CounterPath is always mapped one-to-one to CounterName and ObjectName. The second one is more efficient because the aggregation dimension is smaller:

//less efficient
Perf
| summarize avg(CounterValue) 
by CounterName, CounterPath, ObjectName
//making the group expression more compact improves performance
Perf
| summarize avg(CounterValue), any(CounterName), any(ObjectName) 
by CounterPath

CPU consumption might also be impacted by where conditions or extended columns that require intensive computing. All trivial string comparisons, such as equal == and startswith, have roughly the same CPU impact, while advanced text matches have more impact. Specifically, the has operator is more efficient than the contains operator. Due to string handling techniques, it is more efficient to look for strings that are longer than four characters than for short strings.

For example, the following queries produce similar results, depending on the Computer naming policy, but the last one is more efficient:

//less efficient - due to filter based on contains
Heartbeat
| where Computer contains "Production" 
| summarize count() by ComputerIP 
//less efficient - due to filter based on extend
Heartbeat
| extend MyComputer = Computer
| where MyComputer startswith "Production" 
| summarize count() by ComputerIP 
//more efficient
Heartbeat
| where Computer startswith "Production" 
| summarize count() by ComputerIP 

Note

This indicator presents only CPU from the immediate cluster. In a multi-region query, it represents only one of the regions. In a multi-workspace query, it might not include all workspaces.

Avoid full XML and JSON parsing when string parsing works

Full parsing of an XML or JSON object may consume high CPU and memory resources. In many cases, when only one or two parameters are needed and the XML or JSON objects are simple, it is easier to parse them as strings using the parse operator or other text parsing techniques. The performance boost becomes more significant as the number of records increases, and it is essential when the number of records reaches tens of millions.

For example, the following query returns exactly the same results as the queries above without performing full XML parsing. Note that it makes some assumptions about the XML structure, such as that the FilePath element comes after FileHash and that neither of them has attributes.

//even more efficient
SecurityEvent
| where EventID == 8002 //Only this event has FileHash
| where EventData !has "%SYSTEM32" //Early removal of unwanted records
| parse EventData with * "<FilePath>" FilePath "</FilePath>" * "<FileHash>" FileHash "</FileHash>" *
| summarize count() by FileHash, FilePath
| where FileHash != "" // No need to filter out %SYSTEM32 here as it was removed before

Data used for processed query

A critical factor in the processing of the query is the volume of data that is scanned and used for the query processing. Azure Data Explorer uses aggressive optimizations that dramatically reduce the data volume compared to other data platforms. Still, there are critical factors in the query that can impact the data volume that is used.

A query that processes more than 2,000KB of data is considered a query that consumes excessive resources. A query that processes more than 20,000KB of data is considered an abusive query and might be throttled.

In Azure Monitor Logs, the TimeGenerated column is used as a way to index the data. Restricting the TimeGenerated values to as narrow a range as possible significantly improves query performance by limiting the amount of data that has to be processed.
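As a minimal sketch (the table and the one-hour window are illustrative), a narrow TimeGenerated filter placed as the first operator keeps the scan to a single hour of data:

//hypothetical sketch - the narrow time filter comes first, before any other processing
Heartbeat
| where TimeGenerated > ago(1h)
| summarize LastHeartbeat = max(TimeGenerated) by Computer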

Avoid unnecessary use of search and union operators

Another factor that increases the amount of data processed is the use of a large number of tables. This usually happens when search * and union * commands are used. These commands force the system to evaluate and scan data from all tables in the workspace. In some cases, there might be hundreds of tables in the workspace. Try to avoid as much as possible using "search *" or any search without scoping it to a specific table.

For example, the following queries produce exactly the same result, but the last one is by far the most efficient:

// This version scans all tables though only Perf has this kind of data
search "Processor Time" 
| summarize count(), avg(CounterValue)  by Computer
// This version scans all strings in the Perf table - much more efficient
Perf
| search "Processor Time" 
| summarize count(), avg(CounterValue)  by Computer
// This is the most efficient version 
Perf 
| where CounterName == "% Processor Time"  
| summarize count(), avg(CounterValue)  by Computer

Add early filters to the query

Another method to reduce the data volume is to place where conditions early in the query. The Azure Data Explorer platform includes a cache that lets it know which partitions include data that is relevant for a specific where condition. For example, if a query contains where EventID == 4624, it distributes the query only to nodes that handle partitions with matching events.

The following example queries produce exactly the same result, but the second one is more efficient:

//less efficient
SecurityEvent
| summarize LoginSessions = dcount(LogonGuid) by Account
//more efficient
SecurityEvent
| where EventID == 4624 //Logon GUID is relevant only for logon event
| summarize LoginSessions = dcount(LogonGuid) by Account

Avoid multiple scans of the same source data by using conditional aggregation functions and the materialize function

When a query has several sub-queries that are merged using join or union operators, each sub-query scans the entire source separately, and the results are then merged. This multiplies the number of times the data is scanned, a critical factor in very large data sets.

A technique to avoid this is to use the conditional aggregation functions. Most of the aggregation functions that are used in the summarize operator have a conditional version that allows you to use a single summarize operator with multiple conditions.

For example, the following queries show the number of login events and the number of process execution events for each account. They return the same results, but the first scans the data twice while the second scans it only once:

//Scans the SecurityEvent table twice and perform expensive join
SecurityEvent
| where EventID == 4624 //Login event
| summarize LoginCount = count() by Account
| join 
(
    SecurityEvent
    | where EventID == 4688 //Process execution event
    | summarize ExecutionCount = count(), ExecutedProcesses = make_set(Process) by Account
) on Account
//Scan only once with no join
SecurityEvent
| where EventID == 4624 or EventID == 4688 //early filter
| summarize LoginCount = countif(EventID == 4624), ExecutionCount = countif(EventID == 4688), ExecutedProcesses = make_set_if(Process,EventID == 4688)  by Account

Another case where sub-queries are unnecessary is pre-filtering for the parse operator to make sure that it processes only records that match a specific pattern. This is unnecessary, because the parse operator and other similar operators return empty results when the pattern doesn't match. The following two queries return exactly the same results, but the second query scans the data only once. In the second query, each parse command is relevant only for its events. The extend operator afterwards shows how to handle the empty-data situation.

//Scan SecurityEvent table twice
union(
SecurityEvent
| where EventID == 8002 
| parse EventData with * "<FilePath>" FilePath "</FilePath>" * "<FileHash>" FileHash "</FileHash>" *
| distinct FilePath
),(
SecurityEvent
| where EventID == 4799
| parse EventData with * "CallerProcessName\">" CallerProcessName1 "</Data>" * 
| distinct CallerProcessName1
)
//Single scan of the SecurityEvent table
SecurityEvent
| where EventID == 8002 or EventID == 4799
| parse EventData with * "<FilePath>" FilePath "</FilePath>" * "<FileHash>" FileHash "</FileHash>" * //Relevant only for event 8002
| parse EventData with * "CallerProcessName\">" CallerProcessName1 "</Data>" *  //Relevant only for event 4799
| extend FilePath = iif(isempty(CallerProcessName1),FilePath,"")
| distinct FilePath, CallerProcessName1

When the above doesn't allow you to avoid using sub-queries, another technique is to hint to the query engine that a single source of data is used in each of them by using the materialize() function. This is useful when the source data comes from a function that is used several times within the query. Materialize is effective when the output of the sub-query is much smaller than its input. The query engine caches and reuses the output in all occurrences.
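As a hedged sketch (the event ID and one-day window are illustrative), materialize() caches the filtered rows so that both branches of the join below reuse a single scan instead of reading SecurityEvent twice:

//hypothetical sketch - cache the expensive sub-query once, then reuse it in both join branches
let LogonEvents = materialize(
    SecurityEvent
    | where TimeGenerated > ago(1d)
    | where EventID == 4624
);
LogonEvents
| summarize LoginCount = count() by Account
| join kind=inner (
    LogonEvents
    | summarize Computers = dcount(Computer) by Account
) on Account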

Reduce the number of columns that are retrieved

Since Azure Data Explorer is a columnar data store, retrieval of every column is independent of the others. The number of columns that are retrieved directly influences the overall data volume. You should only include in the output the columns that are needed, by summarizing the results or projecting the specific columns. Azure Data Explorer has several optimizations to reduce the number of retrieved columns. If it determines that a column isn't needed, for example if it's not referenced in the summarize command, it won't retrieve it.

For example, the second query may process three times more data, because it needs to fetch not one column but three:

//Less columns --> Less data
SecurityEvent
| summarize count() by Computer  
//More columns --> More data
SecurityEvent
| summarize count(), dcount(EventID), avg(Level) by Computer  

Time span of the processed query

All logs in Azure Monitor Logs are partitioned according to the TimeGenerated column. The number of partitions that are accessed is directly related to the time span. Reducing the time range is the most efficient way to assure prompt query execution.

A query with a time span of more than 15 days is considered a query that consumes excessive resources. A query with a time span of more than 90 days is considered an abusive query and might be throttled.

The time range can be set using the time range selector in the Log Analytics screen, as described in Log query scope and time range in Azure Monitor Log Analytics. This is the recommended method, as the selected time range is passed to the backend using the query metadata.

An alternative method is to explicitly include a where condition on TimeGenerated in the query. You should use this method because it assures that the time span is fixed, even when the query is used from a different interface. Ensure that all parts of the query have TimeGenerated filters. When a query has sub-queries fetching data from various tables or from the same table, each has to include its own where condition.

Make sure all sub-queries have a TimeGenerated filter

For example, in the following query, while the Perf table will be scanned only for the last day, the Heartbeat table will be scanned for all of its history, which might be up to two years:

Perf
| where TimeGenerated > ago(1d)
| summarize avg(CounterValue) by Computer, CounterName
| join kind=leftouter (
    Heartbeat
    //No time span filter in this part of the query
    | summarize IPs = makeset(ComputerIP, 10) by  Computer
) on Computer

A common case where such a mistake occurs is when arg_max() is used to find the most recent occurrence. For example:

Perf
| where TimeGenerated > ago(1d)
| summarize avg(CounterValue) by Computer, CounterName
| join kind=leftouter (
    Heartbeat
    //No time span filter in this part of the query
    | summarize arg_max(TimeGenerated, *), min(TimeGenerated)   
by Computer
) on Computer

This can be easily corrected by adding a time filter in the inner query:

Perf
| where TimeGenerated > ago(1d)
| summarize avg(CounterValue) by Computer, CounterName
| join kind=leftouter (
    Heartbeat
    | where TimeGenerated > ago(1d) //filter for this part
    | summarize arg_max(TimeGenerated, *), min(TimeGenerated)   
by Computer
) on Computer

Another example of this fault is when the time scope filtering is performed just after a union over several tables. When performing the union, each sub-query should be scoped. You can use a let statement to assure scoping consistency.

For example, the following query will scan all the data in the Heartbeat and Perf tables, not just the last day:

Heartbeat 
| summarize arg_min(TimeGenerated,*) by Computer
| union (
    Perf 
    | summarize arg_min(TimeGenerated,*) by Computer) 
| where TimeGenerated > ago(1d)
| summarize min(TimeGenerated) by Computer

This query should be fixed as follows:

let MinTime = ago(1d);
Heartbeat 
| where TimeGenerated > MinTime
| summarize arg_min(TimeGenerated,*) by Computer
| union (
    Perf 
    | where TimeGenerated > MinTime
    | summarize arg_min(TimeGenerated,*) by Computer) 
| summarize min(TimeGenerated) by Computer

Time span measurement limitations

The measurement is always larger than the actual time specified. For example, if the filter on the query is 7 days, the system might scan 7.5 or 8.1 days. This is because the system partitions the data into chunks of variable size. To assure that all relevant records are scanned, it scans the entire partition, which might cover several hours and even more than a day.

There are several cases where the system cannot provide an accurate measurement of the time range. This happens in most cases where the query's span is less than a day, or in multi-workspace queries.

Important

This indicator presents only data processed in the immediate cluster. In a multi-region query, it represents only one of the regions. In a multi-workspace query, it might not include all workspaces.

Age of processed data

Azure Data Explorer uses several storage tiers: in-memory, local SSD disks, and much slower Azure Blobs. The newer the data, the higher the chance that it is stored in a more performant tier with smaller latency, reducing the query duration and CPU. Other than the data itself, the system also has a cache for metadata. The older the data, the lower the chance that its metadata will be in the cache.

A query that processes data that is more than 14 days old is considered a query that consumes excessive resources.

While some queries require the use of old data, there are cases where old data is used by mistake. This happens when queries are executed without providing a time range in their metadata and not all table references include a filter on the TimeGenerated column. In these cases, the system scans all the data that is stored in the table. When the data retention is long, that can cover long time ranges, and thus data that is as old as the data retention period.

Examples of such cases include:

  • Not setting the time range in Log Analytics when a sub-query isn't limited. See the example above.
  • Using the API without the optional time range parameters.
  • Using a client that doesn't force a time range, such as the Power BI connector.

See the examples and notes in the previous section, as they are also relevant in this case.

Number of regions

There are several situations where a single query might be executed across different regions:

  • When several workspaces are explicitly listed and they are located in different regions.
  • When a resource-scoped query fetches data and the data is stored in multiple workspaces that are located in different regions.

Cross-region query execution requires the system to serialize and transfer, in the backend, large chunks of intermediate data that are usually much larger than the query's final results. It also limits the system's ability to perform optimizations and heuristics and to utilize caches. If there is no real reason to scan all these regions, adjust the scope so it covers fewer regions. If the resource scope is minimized but many regions are still used, it might be due to misconfiguration. For example, audit logs and diagnostic settings are sent to different workspaces in different regions, or there are multiple diagnostic settings configurations.

A query that spans more than 3 regions is considered a query that consumes excessive resources. A query that spans more than 6 regions is considered an abusive query and might be throttled.

Important

When a query runs across several regions, the CPU and data measurements will not be accurate; they represent the measurement of only one of the regions.

Number of workspaces

Workspaces are logical containers that are used to segregate and administer log data. The backend optimizes workspace placement on physical clusters within the selected region.

Usage of multiple workspaces can result from:

  • Several workspaces being explicitly listed, as shown in the sketch after this list.
  • A resource-scoped query fetching data that is stored in multiple workspaces.
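For the first case, here is a minimal sketch (the workspace names are hypothetical); each workspace() reference pulls another workspace's data into the query:

//hypothetical sketch - explicitly listing two workspaces makes this a cross-workspace query
union
    workspace("contoso-workspace-east").Heartbeat,
    workspace("contoso-workspace-west").Heartbeat
| summarize count() by Computer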

Cross-region and cross-cluster execution of queries requires the system to serialize and transfer, in the backend, large chunks of intermediate data that are usually much larger than the query's final results. It also limits the system's ability to perform optimizations and heuristics and to utilize caches.

A query that spans more than 5 workspaces is considered a query that consumes excessive resources. Queries cannot span more than 100 workspaces.

Important

In some multi-workspace scenarios, the CPU and data measurements will not be accurate; they represent the measurement of only a few of the workspaces.

Parallelism

Azure Monitor Logs uses large clusters of Azure Data Explorer to run queries. These clusters vary in scale, potentially reaching dozens of compute nodes. The system automatically scales the clusters according to workspace placement logic and capacity.

To execute a query efficiently, the system partitions and distributes it to compute nodes based on the data that is required for its processing. In some situations, the system cannot do this efficiently, which can lead to a long query duration.

Query behaviors that can reduce parallelism include:

  • Use of serialization and window functions, such as the serialize operator, next(), prev(), and the row functions. Time series and user analytics functions can be used in some of these cases. Inefficient serialization may also happen if the following operators are used anywhere other than at the end of the query: range, sort, order, top, top-hitters, getschema.
  • Use of the dcount() aggregation function, which forces the system to keep a central copy of the distinct values. When the scale of data is high, consider using the dcount function's optional parameters to reduce accuracy.
  • In many cases, use of the join operator, which lowers overall parallelism. Examine shuffle join as an alternative when performance is problematic; see the sketch after this list.
  • In resource-scoped queries, the pre-execution Kubernetes RBAC or Azure RBAC checks, which may linger when there is a very large number of Azure role assignments. This can lead to longer checks that result in lower parallelism. For example, a query executed on a subscription that has thousands of resources, where each resource has many role assignments at the resource level rather than on the subscription or resource group.
  • Processing small chunks of data: the query's parallelism will be low, because the system will not spread it across many compute nodes.
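Where join performance is the problem, a shuffle-strategy hint is one alternative to examine. The following is a minimal sketch (the event IDs and the one-day window are illustrative), assuming Account is a high-cardinality join key:

//hypothetical sketch - a shuffle join spreads a high-cardinality join key across nodes
SecurityEvent
| where TimeGenerated > ago(1d)
| where EventID == 4624 //Login event
| summarize LoginCount = count() by Account
| join hint.strategy=shuffle (
    SecurityEvent
    | where TimeGenerated > ago(1d)
    | where EventID == 4688 //Process execution event
    | summarize ExecutionCount = count() by Account
) on Account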

Next steps