Troubleshoot Azure Stream Analytics outputs

This article describes common issues with Azure Stream Analytics output connections and how to troubleshoot them. Many troubleshooting steps require resource logs and other diagnostic logs to be enabled for your Stream Analytics job. If you don't have resource logs enabled, see Troubleshoot Azure Stream Analytics by using resource logs.

The job doesn't produce output

  1. Verify connectivity to outputs by using the Test Connection button for each output.

  2. Look at Monitoring metrics on the Monitor tab. Because the values are aggregated, the metrics are delayed by a few minutes.

    • If the Input Events value is greater than zero, the job can read the input data. If the Input Events value isn't greater than zero, there's an issue with the job's input. For more information, see Troubleshoot input connections. If your job has reference data input, apply splitting by logical name when you look at the Input Events metric. If there are no input events from your reference data alone, that input source likely isn't configured properly to fetch the right reference dataset.
    • If the Data Conversion Errors value is greater than zero and climbing, see Azure Stream Analytics data errors for detailed information about data conversion errors.
    • If the Runtime Errors value is greater than zero, your job receives data but generates errors while processing the query. To find the errors, go to the audit logs, and then filter on the Failed status.
    • If the Input Events value is greater than zero and the Output Events value equals zero, one of the following statements is true:
      • The query processing resulted in zero output events.
      • Events or fields might be malformed, resulting in zero output after the query processing.
      • The job was unable to push data to the output sink for connectivity or authentication reasons.

    Operations log messages explain additional details, including what's happening, except in cases where the query logic filters out all events. If the processing of multiple events generates errors, the errors are aggregated every 10 minutes.

The first output is delayed

When a Stream Analytics job starts, the input events are read. But in certain circumstances, there can be a delay in the output.

Large time values in temporal query elements can contribute to the output delay. To produce the correct output over large time windows, the streaming job reads data from the latest time possible to fill the time window. The data can be up to seven days in the past. No output is produced until the outstanding input events are read. This problem can surface when the system upgrades the streaming jobs. When an upgrade takes place, the job restarts. Such upgrades generally occur once every couple of months.

Use discretion when designing your Stream Analytics query. If you use a large time window for temporal elements in the job's query syntax, it can lead to a delay in the first output when the job starts or restarts. More than several hours, up to seven days, is considered a large time window.

One mitigation for this kind of first output delay is to use query parallelization techniques, such as partitioning the data. Or, you can add more Streaming Units to improve the throughput until the job catches up. For more information, see Considerations when creating Stream Analytics jobs.
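The partitioning mitigation can be sketched in the Stream Analytics query language. The input, output, and field names below ([my-input], [my-output], DeviceId, PartitionId) are hypothetical; the point is that aligning the query with the input's partitions lets each partition catch up independently:

```sql
-- Hypothetical names. Partition-aligned query: each input partition
-- is processed in parallel, so backlogged data drains faster.
SELECT DeviceId, PartitionId, COUNT(*) AS EventCount
INTO [my-output]
FROM [my-input] PARTITION BY PartitionId
GROUP BY DeviceId, PartitionId, TumblingWindow(minute, 1)
```

With compatibility level 1.2 or later, Stream Analytics parallelizes partitioned inputs without an explicit PARTITION BY clause, but grouping by the partition key is still required for a fully parallel query.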

These factors affect the timeliness of the first output:

  • The use of windowed aggregates, such as a GROUP BY clause with tumbling, hopping, or sliding windows:

    • For tumbling or hopping window aggregates, results are generated at the end of the window timeframe.
    • For a sliding window, results are generated when an event enters or exits the sliding window.
    • If you're planning to use a large window size, such as more than one hour, it's best to choose a hopping or sliding window. These window types let you see the output more frequently.
  • The use of temporal joins, such as JOIN with DATEDIFF:

    • Matches are generated as soon as both sides of the matched events arrive.
    • Data that lacks a match, like in a LEFT OUTER JOIN, is generated at the end of the DATEDIFF window for each event on the left side.
  • The use of temporal analytic functions, such as ISFIRST, LAST, and LAG with LIMIT DURATION:

    • For analytic functions, output is generated for every event. There is no delay.
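The window and join behavior above can be illustrated with two sketches in the Stream Analytics query language. The stream names and fields ([entry-stream], [exit-stream], TollId, LicensePlate, and the timestamp columns) are hypothetical:

```sql
-- Hypothetical names. A 24-hour aggregate that emits a partial result
-- every 15 minutes instead of once per day, by using a hopping window:
SELECT TollId, COUNT(*) AS VehicleCount
FROM [entry-stream] TIMESTAMP BY EntryTime
GROUP BY TollId, HoppingWindow(minute, 1440, 15)

-- A temporal join that matches entry and exit events up to 15 minutes
-- apart; unmatched left-side events would surface only after the
-- DATEDIFF window closes if this were a LEFT OUTER JOIN:
SELECT E.TollId, E.LicensePlate, DATEDIFF(minute, E, X) AS DurationInMinutes
FROM [entry-stream] E TIMESTAMP BY EntryTime
JOIN [exit-stream] X TIMESTAMP BY ExitTime
  ON E.LicensePlate = X.LicensePlate
  AND DATEDIFF(minute, E, X) BETWEEN 0 AND 15
```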

The output falls behind

During the normal operation of a job, the output might have longer and longer periods of latency. If the output falls behind like this, you can pinpoint the root cause by examining the following factors:

  • Whether the downstream sink is throttled
  • Whether the upstream source is throttled
  • Whether the processing logic in the query is compute-intensive

Key violation warning with Azure SQL Database output

When you configure an Azure SQL database as the output of a Stream Analytics job, it bulk inserts records into the destination table. In general, Azure Stream Analytics guarantees at-least-once delivery to the output sink. You can still achieve exactly-once delivery to a SQL output when the SQL table has a unique constraint defined.

When you set up unique key constraints on the SQL table, Azure Stream Analytics removes duplicate records. It splits the data into batches and recursively inserts the batches until a single duplicate record is found. The split and insert process ignores the duplicates one at a time. For a streaming job that has many duplicate rows, the process is inefficient and time-consuming. If you see multiple key violation warning messages in your Activity log for the previous hour, it's likely that your SQL output is slowing down the entire job.

To resolve this issue, configure the index that's causing the key violation by enabling the IGNORE_DUP_KEY option. This option allows SQL to ignore duplicate values during bulk inserts. Azure SQL Database simply produces a warning message instead of an error. As a result, Azure Stream Analytics no longer produces primary key violation errors.

Note the following observations when you configure IGNORE_DUP_KEY for several types of indexes:

  • You can't set IGNORE_DUP_KEY on a primary key or a unique constraint by using ALTER INDEX. You need to drop the index and recreate it.
  • You can set IGNORE_DUP_KEY by using ALTER INDEX on a unique index. This type of index is different from a PRIMARY KEY/UNIQUE constraint and is created by using a CREATE INDEX or INDEX definition.
  • The IGNORE_DUP_KEY option doesn't apply to columnstore indexes because you can't enforce uniqueness on them.
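As a T-SQL sketch of the two cases above (the table, column, and index names are hypothetical), a plain unique index accepts the option through ALTER INDEX ... REBUILD, while a primary key has to be dropped and recreated:

```sql
-- Hypothetical names: dbo.Events, EventId, IX_Events_EventId, PK_Events.
-- Case 1: a unique index can be rebuilt in place with the option enabled.
ALTER INDEX IX_Events_EventId ON dbo.Events
REBUILD WITH (IGNORE_DUP_KEY = ON);

-- Case 2: a primary key constraint must be dropped and recreated.
ALTER TABLE dbo.Events DROP CONSTRAINT PK_Events;
ALTER TABLE dbo.Events ADD CONSTRAINT PK_Events
PRIMARY KEY CLUSTERED (EventId) WITH (IGNORE_DUP_KEY = ON);
```

After the option is enabled, a duplicate row in a bulk insert produces the informational message "Duplicate key was ignored." instead of failing the batch.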

SQL output retry logic

When a Stream Analytics job with SQL output receives the first batch of events, the following steps occur:

  1. The job attempts to connect to SQL.
  2. The job fetches the schema of the destination table.
  3. The job validates column names and types against the destination table schema.
  4. The job prepares an in-memory data table from the output records in the batch.
  5. The job writes the data table to SQL by using the BulkCopy API.

During these steps, the SQL output can experience the following types of errors:

  • Transient errors, which are retried by using an exponential backoff retry strategy. The minimum retry interval depends on the individual error code, but the intervals are typically less than 60 seconds. The upper limit can be at most five minutes.

    Login failures and firewall issues are retried at least 5 minutes after the previous attempt, and retries continue until they succeed.

  • Data errors, such as casting errors and schema constraint violations, which are handled with the output error policy. These errors are handled by retrying binary-split batches until the individual record causing the error is handled by skip or retry. Primary key and unique key constraint violations are always handled.

  • Non-transient errors, which can happen when there are SQL service issues or internal code defects. For example, when an error like code 1132 (an elastic pool reaching its storage limit) occurs, retries don't resolve it. In these scenarios, the Stream Analytics job experiences degradation.

  • BulkCopy timeouts, which can happen during BulkCopy in step 5. BulkCopy can occasionally experience operation timeouts. The default minimum configured timeout is five minutes, and it's doubled when hit consecutively. Once the timeout is above 15 minutes, the maximum batch size hint to BulkCopy is reduced by half until 100 events per batch are left.

Column names are lowercase in Azure Stream Analytics (1.0)

When using the original compatibility level (1.0), Azure Stream Analytics changes column names to lowercase. This behavior was fixed in later compatibility levels. To preserve the case, move to compatibility level 1.1 or later. For more information, see Compatibility level for Stream Analytics jobs.
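For example, with a query like the following (the input, output, and field names are hypothetical), the casing of the output fields depends on the job's compatibility level:

```sql
-- Hypothetical names: [my-input], [my-output], DeviceId, Temperature.
SELECT DeviceId, Temperature
INTO [my-output]
FROM [my-input]
-- Compatibility level 1.0: fields are written as deviceid, temperature.
-- Compatibility level 1.1+: casing is preserved as DeviceId, Temperature.
```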

Next steps