Checkpoint and replay concepts in Azure Stream Analytics jobs

This article describes the internal checkpoint and replay concepts in Azure Stream Analytics, and the impact they have on job recovery. Each time a Stream Analytics job runs, state information is maintained internally. That state information is saved in a checkpoint periodically. In some scenarios, the checkpoint information is used for job recovery if a job failure or upgrade occurs. In other circumstances, the checkpoint cannot be used for recovery, and a replay is necessary.

Stateful query logic in temporal elements

One of the unique capabilities of Azure Stream Analytics jobs is to perform stateful processing, such as windowed aggregates, temporal joins, and temporal analytic functions. Each of these operators keeps state information while the job runs. The maximum window size for these query elements is seven days.

The temporal window concept appears in several Stream Analytics query elements:

  1. Windowed aggregates (GROUP BY of Tumbling, Hopping, and Sliding windows)

  2. Temporal joins (JOIN with DATEDIFF)

  3. Temporal analytic functions (ISFIRST, LAST, and LAG with LIMIT DURATION)

Job recovery from node failure, including OS upgrade

Each time a Stream Analytics job runs, it is internally scaled out to do work across multiple worker nodes. Each worker node's state is checkpointed every few minutes, which helps the system recover if a failure occurs.

At times, a given worker node may fail, or an operating system upgrade may occur on that worker node. To recover automatically, Stream Analytics acquires a new healthy node, and the prior worker node's state is restored from the latest available checkpoint. To resume the work, a small amount of replay is necessary to restore the state from the time the checkpoint was taken. Usually, the restore gap is only a few minutes. When enough Streaming Units are selected for the job, the replay should complete quickly.

In a fully parallel query, the time it takes to catch up after a worker node failure is proportional to:

[input event rate] x [gap length] / [number of processing partitions]
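This proportionality can be sketched as a back-of-the-envelope calculation. Note that the per-partition throughput constant below is a placeholder assumption for illustration, not a published service figure:

```python
# Rough estimate of catch-up time after a worker node failure, based on
# the proportionality above. The per-partition re-processing throughput
# is an assumed placeholder; measure your own job to get a real value.

def estimate_catchup_seconds(input_events_per_sec: float,
                             gap_seconds: float,
                             partitions: int,
                             per_partition_throughput: float = 10_000.0) -> float:
    """Events accumulated during the gap, spread across partitions,
    divided by how fast one partition can re-process them."""
    events_to_replay = input_events_per_sec * gap_seconds
    return events_to_replay / (partitions * per_partition_throughput)

# Example: 50,000 events/sec input, a 5-minute gap, 6 partitions.
print(estimate_catchup_seconds(50_000, 300, 6))   # 250.0 seconds

# Doubling the partitions halves the catch-up time.
print(estimate_catchup_seconds(50_000, 300, 12))  # 125.0 seconds
```

This is why the mitigation below recommends a fully parallel query and more Streaming Units: both increase the denominator.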

If you observe significant processing delay because of node failures and OS upgrades, consider making the query fully parallel, and scale the job to allocate more Streaming Units. For more information, see Scale an Azure Stream Analytics job to increase throughput.

Stream Analytics does not currently surface a report while this kind of recovery process is taking place.

Job recovery from a service upgrade

Azure occasionally upgrades the binaries that run Stream Analytics jobs in the Azure service. At these times, users' running jobs are upgraded to the newer version, and the jobs restart automatically.

Currently, the recovery checkpoint format is not preserved between upgrades. As a result, the state of the streaming query must be restored entirely using the replay technique. To allow Stream Analytics jobs to replay exactly the same input from before, it's important to set the retention policy for the source data to at least the window sizes in your query. Failing to do so may result in incorrect or partial results during a service upgrade, because the source data may not be retained far enough back to cover the full window size.
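The retention requirement can be expressed as a simple pre-flight check; the function name here is illustrative, not part of any SDK:

```python
from datetime import timedelta

# Sanity check before relying on replay: source retention must cover the
# largest temporal window in the query, or replay cannot rebuild state.

def retention_covers_windows(retention: timedelta,
                             largest_window: timedelta) -> bool:
    return retention >= largest_window

# 7-day retention safely covers a 3-day window.
print(retention_covers_windows(timedelta(days=7), timedelta(days=3)))   # True

# 1-hour retention cannot replay a 3-day window: expect partial results.
print(retention_covers_windows(timedelta(hours=1), timedelta(days=3)))  # False
```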

In general, the amount of replay needed is proportional to the size of the window multiplied by the average event rate. For example, for a job with an input rate of 1,000 events per second, a window size greater than one hour is considered to have a large replay size. Up to one hour of data may need to be re-processed to initialize the state so the job can produce full and correct results, which may cause delayed output (no output) for an extended period. Queries with no windows or other temporal operators, like JOIN or LAG, have zero replay.
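The example above works out as follows:

```python
# Replay size for a service upgrade: temporal state must be rebuilt from
# source, so the events to re-process are roughly window length x avg rate.

def replay_event_count(avg_events_per_sec: float, window_seconds: float) -> int:
    return int(avg_events_per_sec * window_seconds)

# The example from the text: 1,000 events/sec with a one-hour window.
print(replay_event_count(1000, 3600))  # 3600000 events to re-process

# A query with no windows or temporal operators keeps no window state.
print(replay_event_count(1000, 0))     # 0
```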

Estimate replay catch-up time

To estimate the length of the delay caused by a service upgrade, you can use this technique:

  1. Load the input Event Hub with enough data to cover the largest window size in your query, at the expected event rate. The events' timestamps should be close to wall-clock time throughout that period, as if it were a live input feed. For example, if you have a 3-day window in your query, send events to the Event Hub for three days, and keep sending events.

  2. Start the job using Now as the start time.

  3. Measure the time between the start time and when the first output is generated. This interval is roughly how much delay the job would incur during a service upgrade.

  4. If the delay is too long, try partitioning your job and increasing the number of SUs, so the load is spread across more nodes. Alternatively, consider reducing the window sizes in your query, and perform further aggregation or other stateful processing on the output produced by the Stream Analytics job in a downstream sink (for example, using Azure SQL Database).
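Step 1 above calls for test events whose timestamps track the wall clock, as a live feed would. A minimal sketch of such a generator follows; the payload fields are hypothetical, and the actual publish call to Event Hubs is left as a commented placeholder rather than a real SDK invocation:

```python
import json
import time
from datetime import datetime, timezone

# Generate test events with near-wall-clock timestamps, simulating a live
# input feed. Payload shape is illustrative; adapt it to your query's schema.

def make_event(device_id: str) -> bytes:
    event = {
        "deviceId": device_id,
        "timestamp": datetime.now(timezone.utc).isoformat(),  # near wall clock
        "reading": 42.0,  # placeholder payload field
    }
    return json.dumps(event).encode("utf-8")

# Emit a couple of events at a fixed rate (2 per second here, for brevity;
# a real test would sustain your expected rate for the full window length).
for _ in range(2):
    payload = make_event("sensor-001")
    # producer.send_batch([...])  # real code would publish to the Event Hub
    print(payload)
    time.sleep(0.5)
```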

For general service-stability concerns during upgrades of mission-critical jobs, consider running duplicate jobs in paired Azure regions. For more information, see Guarantee Stream Analytics job reliability during service updates.

Job recovery from a user-initiated stop and start

To edit the query syntax on a streaming job, or to adjust inputs and outputs, the job needs to be stopped so you can make the changes and upgrade the job design. In such scenarios, when a user stops the streaming job and starts it again, the recovery scenario is similar to a service upgrade.

Checkpoint data cannot be used for a user-initiated job restart. To estimate the output delay during such a restart, use the same procedure described in the previous section, and apply similar mitigations if the delay is too long.

Next steps

For more information on reliability and scalability, see these articles: