Monitor Data Flows

Applies to: Azure Data Factory, Azure Synapse Analytics

After you have finished building and debugging your data flow, you'll want to schedule it to execute within the context of a pipeline. You can schedule the pipeline from Azure Data Factory using triggers. Alternatively, you can use the Trigger Now option in the Azure Data Factory pipeline builder to perform a single run and test your data flow within the pipeline context.

When you execute your pipeline, you can monitor the pipeline and all of the activities it contains, including the Data Flow activity. Select the monitor icon in the left-hand Azure Data Factory UI panel to see a screen similar to the one below. The highlighted icons let you drill into the activities in the pipeline, including the Data Flow activity.

Screenshot shows icons to select for pipelines for more information.

You see statistics at this level as well, including the run times and status. The Run ID at the activity level is different from the Run ID at the pipeline level; the Run ID at the previous level is for the pipeline. Selecting the eyeglasses icon gives you deep details on your data flow execution.

Screenshot shows the eyeglasses icon to see details of data flow execution.

When you're in the graphical node monitoring view, you can see a simplified view-only version of your data flow graph.

Screenshot shows the view-only version of the graph.

View Data Flow Execution Plans

When your data flow is executed in Spark, Azure Data Factory determines optimal code paths based on the entirety of your data flow. Additionally, the execution paths may occur on different scale-out nodes and data partitions. Therefore, the monitoring graph represents the design of your flow, taking into account the execution path of your transformations. When you select individual nodes, you can see "groupings" that represent code that was executed together on the cluster. The timings and counts that you see represent those groups, as opposed to the individual steps in your design.

Screenshot shows the page for a data flow.

  • When you select the open space in the monitoring window, the stats in the bottom pane display timing and row counts for each sink, along with the transformations that led to the sink data, for transformation lineage.

  • When you select individual transformations, you receive additional feedback in the right-hand panel showing partition stats, column counts, skewness (how evenly the data is distributed across partitions), and kurtosis (how spiky the data is).
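Skewness and kurtosis are standard statistical moments computed over the partition row counts. As a rough illustration (this is not ADF's internal code, and the partition sizes are hypothetical), the two measures can be computed like this:

```python
from statistics import mean

def partition_moments(counts):
    """Return (skewness, excess kurtosis) for a list of partition row counts."""
    m = mean(counts)
    n = len(counts)
    sd = (sum((c - m) ** 2 for c in counts) / n) ** 0.5
    skew = sum(((c - m) / sd) ** 3 for c in counts) / n       # 0 for symmetric data
    kurt = sum(((c - m) / sd) ** 4 for c in counts) / n - 3   # 0 for a normal distribution
    return skew, kurt

# Hypothetical row counts per partition; evenly spread data gives skewness near 0.
skew, kurt = partition_moments([90, 110, 95, 105, 100])
```

A skewness near zero means rows are spread evenly across partitions; a large positive kurtosis means a few "hot" partitions hold most of the data, which usually indicates a poor partitioning key.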

  • When you select the Sink in the node view, you can see column lineage. Columns are accumulated throughout your data flow to land in the Sink in three different ways:

    • Computed: You use the column for conditional processing or within an expression in your data flow, but don't land it in the Sink
    • Derived: The column is a new column that you generated in your flow, that is, it was not present in the Source
    • Mapped: The column originated from the source and you are mapping it to a sink field

The monitoring view also reports overall execution details:

    • Data flow status: The current status of your execution
    • Cluster startup time: The amount of time it took to acquire the JIT Spark compute environment for your data flow execution
    • Number of transforms: The number of transformation steps executed in your flow

Screenshot shows the Refresh option.

Total Sink Processing Time vs. Transformation Processing Time

Each transformation stage includes the total time for that stage to complete, with each partition's execution time totaled together. When you click the Sink, you'll see "Sink Processing Time". This time includes the total transformation time plus the I/O time it took to write your data to the destination store. The difference between the Sink Processing Time and the total transformation time is the I/O time spent writing the data.
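The relationship above is simple arithmetic. In this sketch, the stage names and timing values are hypothetical, not taken from a real run:

```python
# Hypothetical per-stage transformation timings (ms) from the monitoring view.
transformation_times_ms = {"source": 1200, "derivedColumn": 300, "filter": 150}
sink_processing_time_ms = 2100  # figure shown when you click the Sink

total_transformation_ms = sum(transformation_times_ms.values())
# The remainder is the I/O time spent writing data to the destination store.
io_write_time_ms = sink_processing_time_ms - total_transformation_ms
```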

You can also see detailed timing for each partition transformation step if you open the JSON output from your data flow activity in the ADF pipeline monitoring view. The JSON contains millisecond timing for each partition, whereas the UX monitoring view is an aggregate timing of the partitions added together:

     "stage": 4,
     "partitionTimes": [
         ...
     ]
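As an illustration, a small script can aggregate the per-partition millisecond timings from the activity's JSON output into the single figure the UX view shows. The JSON fragment below is hypothetical:

```python
import json

# Hypothetical fragment of the Data Flow activity's JSON monitoring output.
raw = '{"stage": 4, "partitionTimes": [420, 515, 480, 505]}'
stage = json.loads(raw)

total_ms = sum(stage["partitionTimes"])    # aggregate timing, as shown in the UX view
slowest_ms = max(stage["partitionTimes"])  # the slowest partition bounds wall-clock time
```

Comparing the slowest partition against the others is a quick way to spot data skew from the raw output.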

Sink processing time

When you select a sink transformation icon in your map, the slide-in panel on the right shows an additional data point called "post processing time" at the bottom. This is the amount of time spent executing your job on the Spark cluster after your data has been loaded, transformed, and written. It can include closing connection pools, driver shutdown, deleting files, coalescing files, and so on. When you perform actions in your flow like "move files" or "output to single file", you'll likely see an increase in the post-processing time value.

  • Write stage duration: The time to write the data to a staging location for Synapse SQL
  • Table operation SQL duration: The time spent moving data from temp tables to the target table
  • Pre SQL duration & Post SQL duration: The time spent running pre/post SQL commands
  • Pre commands duration & post commands duration: The time spent running any pre/post operations for file-based sources/sinks. For example, moving or deleting files after processing.
  • Merge duration: The time spent merging files. Merge files are used for file-based sinks when writing to a single file or when "File name as column data" is used. If significant time is spent in this metric, you should avoid using these options.

Error rows

Enabling error row handling in your data flow sink is reflected in the monitoring output. When you set the sink to "report success on error", the monitoring output shows the number of successful and failed rows when you click on the sink monitoring node.

Screenshot shows error rows.

When you select "report failure on error", the same output appears only in the activity monitoring output text. This is because the data flow activity returns failure for the execution, and the detailed monitoring view is unavailable.

Screenshot shows error rows in activity.

Monitor Icons

This icon means that the transformation data was already cached on the cluster, so the timings and execution path have taken that into account:

Screenshot shows the disk icon.

You also see green circle icons in the transformations. They represent a count of the number of sinks that data is flowing into.