Mapping data flow Debug Mode

Applies to: Azure Data Factory and Azure Synapse Analytics

Overview

Azure Data Factory mapping data flow's debug mode lets you interactively watch the data shape transform while you build and debug your data flows. The debug session can be used both in data flow design sessions and during pipeline debug execution of data flows. To turn on debug mode, use the Data Flow Debug button at the top of the design surface.

Debug slider

Once you turn on the slider, you'll be prompted to select which integration runtime configuration you wish to use. If AutoResolveIntegrationRuntime is chosen, a cluster with eight cores of general compute and a default 60-minute time to live will be spun up. If you'd like to allow for more idle time before your session times out, you can choose a higher TTL setting. For more information on data flow integration runtimes, see Data flow performance.
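The same session can be started programmatically through the Data Factory REST API. The sketch below builds the request body for the create-debug-session call under the 2018-06-01 API version; the endpoint and field names are assumptions based on that API surface, so verify them against the current REST reference before use.

```python
import json

# Sketch: body for POST .../factories/{factory}/createDataFlowDebugSession.
# Field names and defaults mirror the UI behavior described above; treat them
# as assumptions to verify against the REST API reference.
def debug_session_body(core_count=8, ttl_minutes=60, compute_type="General"):
    """Build the JSON body for starting a data flow debug session."""
    return {
        "computeType": compute_type,   # general-purpose compute, the UI default
        "coreCount": core_count,       # AutoResolveIntegrationRuntime default: 8 cores
        "timeToLive": ttl_minutes,     # idle minutes before the session times out
    }

# A longer TTL for an extended design session:
body = debug_session_body(ttl_minutes=120)
print(json.dumps(body, indent=2))
```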

Debug IR selection

When debug mode is on, you'll interactively build your data flow with an active Spark cluster. The session closes once you turn debug off in Azure Data Factory. Be aware of the hourly charges incurred by Azure Databricks during the time that you have the debug session turned on.

In most cases, it's a good practice to build your data flows in debug mode so that you can validate your business logic and view your data transformations before publishing your work in Azure Data Factory. Use the Debug button on the pipeline panel to test your data flow in a pipeline.

View data flow debug sessions

Note

Every debug session that a user starts from their ADF browser UI is a new session with its own Spark cluster. You can use the monitoring view for debug sessions above to view and manage debug sessions per factory. You are charged for every hour that each debug session is executing, including the TTL time.
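Active sessions can also be listed per factory over the REST API. The helper below only constructs the management-plane URL for that call; the endpoint path and api-version are assumptions based on the 2018-06-01 Data Factory REST API, and the subscription, resource group, and factory names are placeholders.

```python
# Sketch: URL for querying active debug sessions in a factory
# (queryDataFlowDebugSessions). Verify the path against the current API docs.
BASE = "https://management.azure.com"

def query_sessions_url(subscription_id, resource_group, factory_name,
                       api_version="2018-06-01"):
    return (
        f"{BASE}/subscriptions/{subscription_id}"
        f"/resourceGroups/{resource_group}"
        f"/providers/Microsoft.DataFactory/factories/{factory_name}"
        f"/queryDataFlowDebugSessions?api-version={api_version}"
    )

# POST to this URL with a bearer token to enumerate sessions you may be billed for.
url = query_sessions_url("my-sub", "my-rg", "my-factory")
```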

Cluster status

The cluster status indicator at the top of the design surface turns green when the cluster is ready for debug. If your cluster is already warm, the green indicator appears almost instantly. If your cluster wasn't already running when you entered debug mode, you'll have to wait 5-7 minutes for it to spin up. The indicator spins until it's ready.

When you're finished with your debugging, turn the Debug switch off so that your Azure Databricks cluster can terminate and you'll no longer be billed for debug activity.

Debug settings

Once you turn on debug mode, you can edit how a data flow previews data. Debug settings can be edited by clicking Debug Settings on the data flow canvas toolbar. You can select the row limit or file source to use for each of your source transformations here. The row limits in this setting apply only to the current debug session. You can also select the staging linked service to be used for an Azure Synapse Analytics source.
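A per-session settings object like the one the UI edits can be sketched as plain JSON: a row limit per source transformation, plus an optional staging linked service for Synapse sources. The field names below mirror the shape the debug REST APIs accept but are assumptions to verify.

```python
# Sketch: debug settings payload — row limits per source, optional staging.
def make_debug_settings(row_limits, staging_linked_service=None):
    """row_limits: {source_transformation_name: max_rows_to_preview}"""
    settings = {
        "sourceSettings": [
            {"sourceName": name, "rowLimit": limit}
            for name, limit in row_limits.items()
        ]
    }
    if staging_linked_service:
        # Staging is only needed for Azure Synapse Analytics sources.
        settings["staging"] = {
            "linkedService": {
                "referenceName": staging_linked_service,  # hypothetical name
                "type": "LinkedServiceReference",
            }
        }
    return settings

settings = make_debug_settings({"source1": 100}, staging_linked_service="AzureBlobStaging")
```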

Debug settings

If you have parameters in your data flow or any of its referenced datasets, you can specify what values to use during debugging by selecting the Parameters tab.

Debug settings parameters

The default IR used for debug mode in ADF data flows is a small 4-core single worker node with a 4-core single driver node. This works fine with smaller samples of data when testing your data flow logic. If you expand the row limits in your debug settings during data preview, or set a higher number of sampled rows in your source during pipeline debug, consider setting a larger compute environment in a new Azure Integration Runtime. Then you can restart your debug session using the larger compute environment.
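A larger debug compute environment is defined on a managed Azure IR through its data flow compute properties. The sketch below builds such an IR definition; the property names follow the Data Factory integration runtime JSON shape but should be treated as assumptions to verify against the current schema, and the IR name is hypothetical.

```python
# Sketch: a managed Azure IR with a larger data flow cluster than the
# default 4-core debug IR described above.
def azure_ir_definition(compute_type="MemoryOptimized", core_count=16, ttl=60):
    return {
        "name": "DataFlowDebugLargeIR",  # hypothetical IR name
        "properties": {
            "type": "Managed",
            "typeProperties": {
                "computeProperties": {
                    "location": "AutoResolve",
                    "dataFlowProperties": {
                        "computeType": compute_type,  # e.g. General / MemoryOptimized
                        "coreCount": core_count,      # larger than the 4-core default
                        "timeToLive": ttl,            # minutes of idle time before shutdown
                    },
                }
            },
        },
    }

ir = azure_ir_definition(core_count=32)
```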

Data preview

With debug on, the Data Preview tab lights up on the bottom panel. Without debug mode on, the data flow shows you only the current metadata in and out of each of your transformations in the Inspect tab. The data preview queries only the number of rows that you've set as your limit in your debug settings. Click Refresh to fetch the data preview.
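Under the hood, clicking Refresh issues a command against the debug session. The sketch below shapes such a command payload; the command name and payload fields are assumptions modeled on the REST debug command API, so check them against the API specification before relying on them.

```python
# Sketch: the debug command sent when refreshing the Data Preview tab.
def preview_command(session_id, stream_name, row_limit=100):
    return {
        "sessionId": session_id,
        "command": "executePreviewQuery",  # assumed command name; verify in the API spec
        "commandPayload": {
            "streamName": stream_name,     # the transformation whose output to preview
            "rowLimits": row_limit,        # bounded by the debug-settings row limit
        },
    }

cmd = preview_command("0d1e2f3a-session", "source1", row_limit=50)
```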

Data preview

Note

File sources only limit the rows that you see, not the rows being read. For very large datasets, it's recommended that you take a small portion of that file and use it for your testing. You can select a temporary file in Debug Settings for each source that is a file dataset type.

When running in debug mode in a data flow, your data won't be written to the sink transform. A debug session is intended to serve as a test harness for your transformations. Sinks aren't required during debug and are ignored in your data flow. If you wish to test writing the data in your sink, execute the data flow from an Azure Data Factory pipeline and use the Debug execution from a pipeline.

Data preview is a snapshot of your transformed data using row limits and data sampling from data frames in Spark memory. Therefore, the sink drivers aren't utilized or tested in this scenario.

Testing join conditions

When unit testing joins, exists, or lookup transformations, make sure that you use a small set of known data for your test. You can use the Debug Settings option above to set a temporary file to use for your testing. This is needed because, when limiting or sampling rows from a large dataset, you can't predict which rows and which keys will be read into the flow for testing. The result is non-deterministic, meaning that your join conditions may fail.
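One way to build such a fixture is to write a tiny CSV with known keys and point the source's temporary file at it, so both sides of the join are guaranteed to contain matching rows. This is a local sketch; the column names and file path are illustrative.

```python
import csv
import io

# Sketch: write a small, deterministic CSV fixture with known join keys,
# for use as a temporary debug source in Debug Settings.
def write_join_fixture(rows, path_or_buf):
    """rows: list of dicts with known 'id' keys that the other side will match."""
    writer = csv.DictWriter(path_or_buf, fieldnames=["id", "name"])
    writer.writeheader()
    writer.writerows(rows)

buf = io.StringIO()
write_join_fixture([{"id": 1, "name": "a"}, {"id": 2, "name": "b"}], buf)
# In practice, write to a file and select it in Debug Settings for the source.
```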

Quick actions

Once you see the data preview, you can generate a quick transformation to typecast, remove, or modify a column. Click the column header and then select one of the options from the data preview toolbar.

Screenshot shows the data preview toolbar with options: Typecast, Modify, Statistics, and Remove.

Once you select a modification, the data preview refreshes immediately. Click Confirm in the top-right corner to generate a new transformation.

Screenshot shows the Confirm button.

Typecast and Modify generate a Derived Column transformation, and Remove generates a Select transformation.

Screenshot shows Derived Column's Settings.

Note

If you edit your data flow, you need to re-fetch the data preview before adding a quick transformation.

Data profiling

Selecting a column in your data preview tab and clicking Statistics in the data preview toolbar pops up a chart on the far right of your data grid with detailed statistics about each field. Azure Data Factory makes a determination, based on the data sampling, of which type of chart to display. High-cardinality fields default to NULL/NOT NULL charts, while categorical and numeric data with low cardinality displays bar charts showing data value frequency. You'll also see the max/min length of string fields, min/max values in numeric fields, standard deviation, percentiles, counts, and average.
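The chart-type decision described above can be sketched as a simple cardinality heuristic: many distinct values means only a NULL/NOT NULL summary is useful, while few distinct values allow a value-frequency bar chart. The threshold below is an illustrative assumption, not ADF's actual cutoff.

```python
from collections import Counter

# Sketch of the chart-type heuristic: high-cardinality columns get a
# NULL/NOT NULL summary; low-cardinality columns get value frequencies.
def profile_column(values, cardinality_threshold=20):
    non_null = [v for v in values if v is not None]
    distinct = set(non_null)
    if len(distinct) > cardinality_threshold:       # high cardinality
        return {
            "chart": "null_not_null",
            "null": len(values) - len(non_null),
            "not_null": len(non_null),
        }
    # Low cardinality: categorical/numeric data gets a frequency bar chart.
    return {"chart": "bar", "frequencies": dict(Counter(non_null))}

profile_column(["a", "b", "a", None])
# → {'chart': 'bar', 'frequencies': {'a': 2, 'b': 1}}
```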

Column statistics

Next steps