Iterative development and debugging with Azure Data Factory

Applies to: Azure Data Factory and Azure Synapse Analytics

Azure Data Factory lets you iteratively develop and debug Data Factory pipelines as you build your data integration solutions. These features let you test your changes before creating a pull request or publishing them to the Data Factory service.

Debugging a pipeline

As you author on the pipeline canvas, you can test your activities using the Debug capability. When you do test runs, you don't have to publish your changes to the data factory before you select Debug. This capability is helpful when you want to make sure your changes work as expected before you update the data factory workflow.

Debug capability on the pipeline canvas

As the pipeline runs, you can see the results of each activity on the Output tab of the pipeline canvas.

View the results of your test runs in the Output window of the pipeline canvas.

Output window of the pipeline canvas

After a test run succeeds, add more activities to your pipeline and continue debugging iteratively. You can also Cancel a test run while it's in progress.

Important

Selecting Debug actually runs the pipeline. For example, if the pipeline contains a copy activity, the test run copies data from source to destination. As a result, we recommend that you use test folders in your copy activities and other activities when debugging. After you've debugged the pipeline, switch to the actual folders that you want to use in normal operations.
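One common way to follow this recommendation is to parameterize the folder path so a debug run targets a test folder and a published run targets the real one. The sketch below illustrates the idea in plain Python; the folder names and the `is_debug` flag are hypothetical (in an actual pipeline you would express this with a pipeline parameter or an expression on the dataset's folder path).

```python
# Sketch: choose a test folder for Debug runs and the real folder otherwise.
# Folder names and the is_debug flag are illustrative placeholders, not part
# of any Data Factory API.

def resolve_output_folder(is_debug: bool,
                          production_folder: str = "output/prod",
                          test_folder: str = "output/debug-test") -> str:
    """Return the sink folder a copy activity should write to."""
    return test_folder if is_debug else production_folder
```

In Data Factory itself, the equivalent pattern is a pipeline parameter (for example, an `outputFolder` parameter referenced from the dataset), so that switching from debug to normal operation is a parameter change rather than an edit to the activity.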

Setting breakpoints

Azure Data Factory also lets you debug a pipeline until you reach a particular activity on the pipeline canvas. Put a breakpoint on the activity up to which you want to test, and select Debug. Data Factory ensures that the test runs only until the breakpoint activity on the pipeline canvas. This Debug Until feature is useful when you don't want to test the entire pipeline, but only a subset of activities inside the pipeline.

Breakpoints on the pipeline canvas

To set a breakpoint, select an element on the pipeline canvas. A Debug Until option appears as an empty red circle at the upper right corner of the element.

Before setting a breakpoint on the selected element

After you select the Debug Until option, it changes to a filled red circle to indicate the breakpoint is enabled.

After setting a breakpoint on the selected element

Monitoring debug runs

When you run a pipeline debug run, the results appear in the Output window of the pipeline canvas. The Output tab contains only the most recent run from the current browser session.

Output window of the pipeline canvas

To view a historical view of debug runs or see a list of all active debug runs, go to the Monitor experience.

Select the View active debug runs icon

Note

The Azure Data Factory service only persists debug run history for 15 days.
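If you query run history programmatically, it's worth clamping the requested time window to this 15-day retention so you never ask for history the service has already discarded. A minimal sketch, assuming the helper name and window logic are our own (the resulting timestamps would map to the `lastUpdatedAfter`/`lastUpdatedBefore` fields used by Data Factory's run-query APIs):

```python
from datetime import datetime, timedelta, timezone
from typing import Optional, Tuple

RETENTION_DAYS = 15  # Data Factory keeps debug run history for 15 days


def debug_run_query_window(requested_days: int,
                           now: Optional[datetime] = None
                           ) -> Tuple[datetime, datetime]:
    """Return a (start, end) window for a run-history query, clamped so it
    never reaches further back than the service's retention period."""
    now = now or datetime.now(timezone.utc)
    days = min(requested_days, RETENTION_DAYS)
    return now - timedelta(days=days), now
```

Asking for 30 days of history, for example, yields a window that starts only 15 days back.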

Debugging mapping data flows

Mapping data flows allow you to build code-free data transformation logic that runs at scale. When building your logic, you can turn on a debug session to work interactively with your data using a live Spark cluster. To learn more, read about mapping data flow debug mode.

You can monitor active data flow debug sessions across a factory in the Monitor experience.

View data flow debug sessions

Data preview in the data flow designer and pipeline debugging of data flows work best with small samples of data. However, if you need to test your logic in a pipeline or data flow against large amounts of data, increase the size of the Azure Integration Runtime used in the debug session, with more cores and a minimum of general purpose compute.

Debugging a pipeline with a data flow activity

When executing a debug pipeline run with a data flow, you have two options for the compute to use. You can either use an existing debug cluster or spin up a new just-in-time cluster for your data flows.

Using an existing debug session greatly reduces data flow startup time because the cluster is already running, but it isn't recommended for complex or parallel workloads because it may fail when multiple jobs run at once.

Using the activity runtime creates a new cluster using the settings specified in each data flow activity's integration runtime. This isolates each job, and it should be used for complex workloads or performance testing. You can also control the TTL in the Azure IR so that the cluster resources used for debugging remain available for that period to serve additional job requests.

Note

If you have a pipeline with data flows executing in parallel, or data flows that need to be tested with large datasets, choose "Use Activity Runtime" so that Data Factory can use the Integration Runtime you've selected in your data flow activity. This lets the data flows execute on multiple clusters and can accommodate your parallel data flow executions.

Run a pipeline with data flows