Use repartitioning to optimize processing with Azure Stream Analytics

This article shows you how to use repartitioning to scale your Azure Stream Analytics query for scenarios that can't be fully parallelized.

You might not be able to use parallelization if:

  • You don't control the partition key for your input stream.
  • Your source "sprays" input across multiple partitions that later need to be merged.

Repartitioning, or reshuffling, is required when you process data on a stream that's not sharded according to a natural input scheme, such as PartitionId for Event Hubs. When you repartition, each shard can be processed independently, which allows you to linearly scale out your streaming pipeline.

How to repartition

You can repartition your input in two ways:

  1. Use a separate Stream Analytics job that does the repartitioning
  2. Use a single job, but do the repartitioning first, before your custom analytics logic

Creating a separate Stream Analytics job to repartition input

You can create a job that reads input and writes to an Event Hub output using a partition key. This Event Hub can then serve as input for another Stream Analytics job, where you implement your analytics logic. When you configure this Event Hub output in your job, you must specify the partition key by which Stream Analytics will repartition your data.

-- For compatibility level 1.2 or higher, the partition key is
-- specified in the Event Hub output configuration, not in the query.
SELECT *
INTO output
FROM input

-- For compatibility level 1.1 or lower, specify the partition key
-- with PARTITION BY in the query.
SELECT *
INTO output
FROM input PARTITION BY PartitionId

Repartition input within a single Stream Analytics job

You can also introduce a step in your query that first repartitions the input, which can then be used by other steps in your query. For example, if you want to repartition input based on DeviceId, your query would be:

WITH RepartitionedInput AS
(
SELECT *
FROM input PARTITION BY DeviceId
)

SELECT DeviceId, AVG(Reading) AS AvgNormalReading
INTO output
FROM RepartitionedInput
GROUP BY DeviceId, TumblingWindow(minute, 1)

The following example query joins two streams of repartitioned data. When you join two repartitioned streams, they must have the same partition key and partition count. The outcome is a stream that has the same partition scheme.

WITH step1 AS (SELECT * FROM input1 PARTITION BY DeviceId),
step2 AS (SELECT * FROM input2 PARTITION BY DeviceId)

SELECT * INTO output FROM step1 PARTITION BY DeviceId UNION step2 PARTITION BY DeviceId

The output scheme should match the stream scheme key and count so that each substream can be flushed independently. The stream could also be merged and repartitioned again by a different scheme before flushing, but you should avoid that approach because it adds to the overall processing latency and increases resource utilization.

Streaming Units for repartitions

Experiment and observe the resource usage of your job to determine the exact number of partitions you need. The number of streaming units (SU) must be adjusted according to the physical resources needed for each partition. In general, six SUs are needed for each partition. If there are insufficient resources assigned to the job, the system will only apply the repartition if it benefits the job.
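As an illustrative sizing sketch (the 18-partition figure is hypothetical; the arithmetic simply applies the six-SU-per-partition guideline above):

SUs required ≈ number of partitions × 6
            = 18 × 6
            = 108 SUs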

Repartitions for SQL output

When your job uses a SQL database for output, use explicit repartitioning to match the optimal partition count and maximize throughput. Because SQL works best with eight writers, repartitioning the stream to eight before flushing, or somewhere further upstream, can benefit job performance.
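For example, a query along the following lines (the input and output names here are hypothetical) repartitions the stream into eight substreams, one per SQL writer, before flushing:

-- Hypothetical names; INTO 8 creates eight substreams to match
-- the eight writers that SQL Database output handles best.
SELECT * INTO [sql-output] FROM [input] PARTITION BY DeviceId INTO 8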

When there are more than eight input partitions, inheriting the input partitioning scheme might not be an appropriate choice. Consider using INTO in your query to explicitly specify the number of output writers.

The following example reads from the input, regardless of how it's naturally partitioned, repartitions the stream tenfold according to the DeviceId dimension, and flushes the data to the output.

SELECT * INTO [output] FROM [input] PARTITION BY DeviceId INTO 10

For more information, see Azure Stream Analytics output to Azure SQL Database.

Next steps