Use repartitioning to optimize processing with Azure Stream Analytics

This article shows you how to use repartitioning to scale your Azure Stream Analytics query for scenarios that can't be fully parallelized.

You might not be able to use parallelization if:

  • You don't control the partition key for your input stream.
  • Your source "sprays" input across multiple partitions that later need to be merged.

Repartitioning, or reshuffling, is required when you process data on a stream that's not sharded according to a natural input scheme, such as PartitionId for Event Hubs. When you repartition, each shard can be processed independently, which allows you to linearly scale out your streaming pipeline.

How to repartition

To repartition, use the keyword INTO after a PARTITION BY statement in your query. The following example partitions the data by DeviceID into a partition count of 10. Hashing of DeviceID is used to determine which partition accepts which substream. The data is flushed independently for each partitioned stream, assuming the output supports partitioned writes and has 10 partitions.

SELECT * 
INTO output
FROM input
PARTITION BY DeviceID 
INTO 10

The following example query joins two streams of repartitioned data. When joining two streams of repartitioned data, the streams must have the same partition key and count. The outcome is a stream that has the same partition scheme.

WITH step1 AS (SELECT * FROM input1 PARTITION BY DeviceID INTO 10),
step2 AS (SELECT * FROM input2 PARTITION BY DeviceID INTO 10)

SELECT * INTO output FROM step1 PARTITION BY DeviceID UNION step2 PARTITION BY DeviceID

The output scheme should match the stream scheme key and count so that each substream can be flushed independently. The stream could also be merged and repartitioned again by a different scheme before flushing, but you should avoid that method because it adds to the general latency of the processing and increases resource utilization.

Streaming Units for repartitions

Experiment and observe the resource usage of your job to determine the exact number of partitions you need. The number of streaming units (SU) must be adjusted according to the physical resources needed for each partition. In general, six SUs are needed for each partition. If there are insufficient resources assigned to the job, the system will only apply the repartition if it benefits the job.
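As a rough sizing sketch, the six-SUs-per-partition rule of thumb above can be expressed as a quick calculation (a hypothetical helper for illustration only; measure your job's actual resource usage before settling on an allocation):

```python
def estimate_streaming_units(partition_count: int, sus_per_partition: int = 6) -> int:
    """Back-of-the-envelope SU estimate, assuming roughly six SUs per partition.

    The six-SU figure is the general guidance from the text; real workloads
    vary, so treat the result as a starting point, not a hard requirement.
    """
    return partition_count * sus_per_partition

# A query that repartitions INTO 10 would need on the order of:
print(estimate_streaming_units(10))  # prints 60
```

For example, the INTO 10 query shown earlier would start from an allocation of about 60 SUs, which you would then tune up or down based on observed resource usage.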

Repartitions for SQL output

When your job uses a SQL database for output, use explicit repartitioning to match the optimal partition count and maximize throughput. Because SQL works best with eight writers, repartitioning the stream to eight before flushing, or somewhere further upstream, may benefit job performance.

When there are more than eight input partitions, inheriting the input partitioning scheme might not be an appropriate choice.

The following example reads from the input, regardless of it being naturally partitioned, and repartitions the stream tenfold according to the DeviceID dimension and flushes the data to output.

SELECT * INTO [output] FROM [input] PARTITION BY DeviceID INTO 10
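The SQL-output guidance above (inherit the input scheme when it has eight or fewer partitions, otherwise repartition down to eight) can be sketched as a small decision helper. This is hypothetical illustration code, assuming the eight-writer behavior described for SQL output; it is not part of any Azure SDK:

```python
def choose_output_partition_count(input_partitions: int, max_sql_writers: int = 8) -> int:
    """Hypothetical helper: pick a partition count for a SQL database output.

    Assumes SQL output performs best with at most eight writers, per the
    guidance above. With eight or fewer input partitions, the input scheme
    can simply be inherited; beyond that, repartition explicitly to eight
    (for example, with PARTITION BY ... INTO 8 upstream of the output).
    """
    if input_partitions > max_sql_writers:
        return max_sql_writers  # repartition down to the SQL writer sweet spot
    return input_partitions     # inherit the input partitioning scheme

print(choose_output_partition_count(32))  # prints 8
print(choose_output_partition_count(4))   # prints 4
```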

For more information, see Azure Stream Analytics output to Azure SQL Database.

Next steps