partition 运算符partition operator

partition 运算符根据指定列的值将其输入表分区为多个子表,对每个子表执行子查询,并生成一个输出表,该表是所有子查询的结果的并集。The partition operator partitions its input table into multiple sub-tables according to the values of the specified column, executes a sub-query over each sub-table, and produces a single output table that is the union of the results of all sub-queries.

T | partition by Col1 ( top 10 by MaxValue )

T | partition by Col1 { U | where Col2=toscalar(Col1) }

语法Syntax

T | partition [PartitionParameters] by Column ( ContextualSubquery )T | partition [PartitionParameters] by Column ( ContextualSubquery )

T | partition [PartitionParameters] by Column { Subquery }T | partition [PartitionParameters] by Column { Subquery }

参数Arguments

  • T:要由运算符处理其数据的表格源。T: The tabular source whose data is to be processed by the operator.

  • Column:T 中某个列的名称,其值决定了如何对输入表进行分区。Column: The name of a column in T whose values determine how the input table is to be partitioned. 请参阅下文中的说明See Notes below.

  • ContextualSubquery:一个表格表达式,其源是 partition 运算符的源,其作用域为单个键值。ContextualSubquery: A tabular expression, which source is the source of the partition operator, scoped for a single key value.

  • Subquery:没有源的表格表达式。Subquery: A tabular expression without source. 键值可通过 toscalar() 调用获得。The key value can be obtained via toscalar() call.

  • PartitionParameters:零个或零个以上的(使用空格进行分隔的)参数,它们控制运算符的行为,格式为:Name = ValuePartitionParameters: Zero or more (space-separated) parameters in the form of: Name = Value that control the behavior of the operator. 支持以下参数:The following parameters are supported:

    名称Name Values 说明Description
    hint.materialized true,falsetrue,false 如果设置为 true,将对 partition 运算符的源进行具体化(默认值:falseIf set to true will materialize the source of the partition operator (default: false)
    hint.concurrency 数字Number 提示系统应并行执行 partition 运算符的多少个并发子查询。Hints the system how many concurrent subqueries of the partition operator should be executed in parallel. 默认:群集的单个节点上的 CPU 核心数(2 到 16 个)。Default: Amount of CPU cores on the single node of the cluster (2 to 16).
    hint.spread 数字Number 提示系统并发的 partition 子查询执行应使用多少个节点。Hints the system how many nodes should be used by the concurrentpartition subqueries execution. 默认:1.Default: 1.

返回Returns

此运算符返回对输入数据的每个分区应用子查询得到的结果的并集。The operator returns a union of the results of applying the subquery to each partition of the input data.

备注Notes

  • partition 运算符当前受分区数的限制。The partition operator is currently limited by the number of partitions. 最多可以创建 64 个不同的分区。Up to 64 distinct partitions may be created. 如果分区列(Column)的非重复值超过 64 个,则该运算符会产生错误。The operator will yield an error if the partition column (Column) has more than 64 distinct values.

  • 子查询隐式引用输入分区(子查询中没有分区的“名称”)。The subquery references the input partition implicitly (there's no "name" for the partition in the subquery). 若要在子查询中多次引用输入分区,请使用 as 运算符,如下文中的示例:partition 参考所示。To reference the input partition multiple times within the subquery, use the as operator, as in Example: partition-reference below.

示例:top-nested 案例Example: top-nested case

在某些情况下,使用 partition 运算符编写查询比使用 top-nested 运算符性能更高且更容易。下一个示例将针对以 W 开头的以下每个 State 运行子查询来计算 summarizetop:(WYOMING、WASHINGTON、WEST VIRGINIA、WISCONSIN)At some cases - it is more performant and easier to write query using partition operator rather using top-nested operator The next example runs a sub-query calculating summarize and top for-each of States starting with W: (WYOMING, WASHINGTON, WEST VIRGINIA, WISCONSIN)

StormEvents
| where State startswith 'W'
| partition by State 
(
    summarize Events=count(), Injuries=sum(InjuriesDirect) by EventType, State
    | top 3 by Events 
) 

EventTypeEventType 状态State 事件Events InjuriesInjuries
冰雹Hail 怀俄明州WYOMING 108108 00
疾风High Wind 怀俄明州WYOMING 8181 55
冬季风暴Winter Storm 怀俄明州WYOMING 7272 00
大雪Heavy Snow 华盛顿州WASHINGTON 8282 00
疾风High Wind 华盛顿州WASHINGTON 5858 1313
野火Wildfire 华盛顿州WASHINGTON 2929 00
雷雨大风Thunderstorm Wind 西弗吉尼亚州WEST VIRGINIA 180180 11
冰雹Hail 西弗吉尼亚州WEST VIRGINIA 103103 00
冬季天气Winter Weather 西弗吉尼亚州WEST VIRGINIA 8888 00
雷雨大风Thunderstorm Wind 威斯康星州WISCONSIN 416416 11
冬季风暴Winter Storm 威斯康星州WISCONSIN 310310 00
冰雹Hail 威斯康星州WISCONSIN 303303 11

示例:查询非重叠数据分区Example: query non-overlapping data partitions

有时候,以映射/化简样式对非重叠数据分区运行复杂子查询很有用(在性能方面)。Sometimes it is useful (perf-wise) to run a complex subquery over non-overlapping data partitions in a map/reduce style. 下面的示例展示了如何在 10 个分区上创建聚合的手动分布。The example below shows how to create a manual distribution of aggregation over 10 partitions.

StormEvents
| extend p = hash(EventId, 10)
| partition by p
(
    summarize Count=count() by Source 
)
| summarize Count=sum(Count) by Source
| top 5 by Count
SourceSource 计数Count
专业观测员Trained Spotter 1277012770
执法机构Law Enforcement 85708570
公用Public 61576157
灾害管理Emergency Manager 49004900
COOP ObserverCOOP Observer 30393039

示例:查询时分区Example: query-time partitioning

下面的示例展示了如何将查询分区为 N = 10 个分区,其中的每个分区都计算其自己的 Count,然后全部汇总到 TotalCount。The following example shows how query can be partitioned into N=10 partitions, where each partition calculates its own Count, and all later summarized into TotalCount.

let N = 10;                 // Number of query-partitions
range p from 0 to N-1 step 1  // 
| partition by p            // Run the sub-query partitioned 
{
    StormEvents 
    | where hash(EventId, N) == toscalar(p) // Use toscalar() to fetch partition key value
    | summarize Count = count()
}
| summarize TotalCount=sum(Count) 
TotalCountTotalCount
5906659066

示例:分区引用Example: partition-reference

下面的示例展示了如何使用 as 运算符为每个数据分区指定“名称”,然后在子查询中重复使用该名称:The following example shows how one can use the as operator to give a "name" to each data partition and then reuse that name within the subquery:

T
| partition by Dim
(
    as Partition
    | extend MetricPct = Metric * 100.0 / toscalar(Partition | summarize sum(Metric))
)

示例:由函数调用隐藏的复杂子查询Example: complex subquery hidden by a function call

同样的技术可以应用于复杂得多的子查询。The same technique can be applied with much more complex subqueries. 若要简化语法,可以将子查询包装在函数调用中:To simplify the syntax, one can wrap the subquery in a function call:

let partition_function = (T:(Source:string)) 
{
    T
    | summarize Count=count() by Source
};
StormEvents
| extend p = hash(EventId, 10)
| partition by p
(
    invoke partition_function()
)
| summarize Count=sum(Count) by Source
| top 5 by Count
SourceSource 计数Count
专业观测员Trained Spotter 1277012770
执法机构Law Enforcement 85708570
公用Public 61576157
灾害管理Emergency Manager 49004900
COOP ObserverCOOP Observer 30393039