Data ingestion with Azure Data Factory

In this article, you learn how to build a data ingestion pipeline with Azure Data Factory (ADF). This pipeline is used to ingest data for use with Azure Machine Learning. Azure Data Factory allows you to easily extract, transform, and load (ETL) data. Once the data has been transformed and loaded into storage, it can be used to train your machine learning models.

Simple data transformations can be handled with native ADF activities and instruments such as data flow. For more complicated scenarios, the data can be processed with custom code, for example Python or R code.

There are several common techniques for using Azure Data Factory to transform data during ingestion. Each technique has pros and cons that determine whether it is a good fit for a specific use case:

| Technique | Pros | Cons |
| --- | --- | --- |
| ADF + Azure Functions | Low latency, serverless compute<br>Stateful functions<br>Reusable functions | Only good for short-running processing |
| ADF + custom component | Large-scale parallel computing<br>Suited for heavy algorithms | Wrapping code into an executable<br>Complexity of handling dependencies and IO |
| ADF + Azure Databricks notebook | Apache Spark<br>Native Python environment | Can be expensive<br>Creating clusters initially takes time and adds latency |

ADF with Azure Functions

(Diagram: ADF pipeline invoking an Azure Function)

Azure Functions allows you to run small pieces of code (functions) without worrying about application infrastructure. In this option, the data is processed with custom Python code wrapped into an Azure Function.

The function is invoked with the ADF Azure Function activity. This approach is a good option for lightweight data transformations.

  • Pros:
    • The data is processed on serverless compute with relatively low latency
    • An ADF pipeline can invoke a Durable Azure Function, which can implement a sophisticated data transformation flow
    • The details of the data transformation are abstracted away by the Azure Function, which can be reused and invoked from other places
  • Cons:
    • The Azure Functions must be created before use with ADF
    • Azure Functions is good only for short-running data processing
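
To make the idea concrete, here is a minimal sketch of the kind of lightweight transformation that could live inside such a function. The function name (`clean_records`), the JSON input shape, and the cleaning rules are hypothetical; the Azure Functions binding code (the HTTP-triggered `main` handler that ADF's Azure Function activity would call) is left out.

```python
import json


def clean_records(raw_json: str) -> str:
    """Hypothetical lightweight transform: drop records with missing
    values and normalize field names to lowercase without padding.

    In a real function app, the ADF Azure Function activity would POST
    the payload and this logic would run inside the function's handler.
    """
    records = json.loads(raw_json)
    cleaned = [
        {key.strip().lower(): value for key, value in row.items()}
        for row in records
        if all(value is not None for value in row.values())
    ]
    return json.dumps(cleaned)
```

Because the transformation is a small, stateless piece of code, it fits the short-running constraint noted above; anything long-running belongs in one of the other techniques.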

ADF with Custom Component Activity

(Diagram: ADF pipeline invoking a custom component on an Azure Batch pool)

In this option, the data is processed with custom Python code wrapped into an executable, which is invoked with an ADF Custom Component activity. This approach is a better fit for large data volumes than the previous technique.

  • Pros:
    • The data is processed on an Azure Batch pool, which provides large-scale parallel and high-performance computing
    • Can be used to run heavy algorithms and process significant amounts of data
  • Cons:
    • The Azure Batch pool must be created before use with ADF
    • Over-engineering related to wrapping Python code into an executable, and complexity of handling dependencies and input/output parameters
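
The dependency and input/output complexity mentioned above mostly comes from the executable's entry point, which must accept the locations ADF passes in. A hedged sketch, where the `--input`/`--output` argument names and the filtering rule are placeholders rather than anything ADF prescribes:

```python
import argparse
import csv


def transform(in_path: str, out_path: str) -> int:
    """Hypothetical heavy step reduced to a toy: keep CSV rows whose
    'amount' column parses as a positive number. Returns rows kept."""
    kept = 0
    with open(in_path, newline="") as src, open(out_path, "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            try:
                if float(row["amount"]) > 0:
                    writer.writerow(row)
                    kept += 1
            except ValueError:
                # Skip rows with a non-numeric amount
                continue
    return kept


if __name__ == "__main__":
    # ADF would supply these paths (e.g. mounted blob locations) when the
    # Custom Component activity launches the executable on the Batch pool.
    parser = argparse.ArgumentParser()
    parser.add_argument("--input", required=True)
    parser.add_argument("--output", required=True)
    args = parser.parse_args()
    transform(args.input, args.output)
```

Packaging this script, its interpreter, and its dependencies into a self-contained executable is the over-engineering cost the cons list refers to.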

Consuming data in Azure Machine Learning pipelines

(Diagram: Azure Machine Learning consuming ADF output through a datastore and dataset)

The transformed data from the ADF pipeline is saved to data storage (such as Azure Blob storage). Azure Machine Learning can access this data using datastores and datasets.

Each time the ADF pipeline runs, the data is saved to a different location in storage. To pass the location to Azure Machine Learning, the ADF pipeline calls an Azure Machine Learning pipeline. When calling the ML pipeline, the data location and run ID are sent as parameters. The ML pipeline can then create a datastore and dataset using that data location.
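
One way this hand-off can work is through the published ML pipeline's REST endpoint, which accepts an experiment name and parameter assignments in its request body. A minimal sketch of building that body; the parameter names (`data_path`, `adf_run_id`) and experiment name are assumptions for illustration, not fixed by either service:

```python
import json


def build_pipeline_request(data_path: str, adf_run_id: str) -> str:
    """Request body for invoking a published Azure ML pipeline.

    ADF (e.g. via a web activity) would POST this to the pipeline's
    REST endpoint; the ML pipeline reads the values as PipelineParameters.
    """
    return json.dumps({
        "ExperimentName": "adf-ingestion",      # hypothetical experiment name
        "ParameterAssignments": {
            "data_path": data_path,             # blob folder written by ADF
            "adf_run_id": adf_run_id,           # for lineage and tagging
        },
    })
```

Inside the ML pipeline, the `data_path` value can then be combined with a registered datastore to build a dataset pointing at exactly the folder this ADF run produced.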

Tip

Datasets support versioning, so the ML pipeline can register a new version of the dataset that points to the most recent data from the ADF pipeline.

Once the data is accessible through a datastore or dataset, you can use it to train an ML model. The training process might be part of the same ML pipeline that is called from ADF, or it might be a separate process, such as experimentation in a Jupyter notebook.

Since datasets support versioning, and each run from the pipeline creates a new version, it's easy to tell which version of the data was used to train a model.
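
One lightweight way to preserve that link is to record the dataset name and version as tags when the trained model is registered. The helper below is a sketch; the tag keys and names are arbitrary choices, and the actual Azure ML registration call is shown only as a comment since it requires a live workspace.

```python
def lineage_tags(dataset_name: str, dataset_version: int, adf_run_id: str) -> dict:
    """Tags linking a trained model back to its data version and ADF run."""
    return {
        "dataset": f"{dataset_name}:v{dataset_version}",
        "adf_run_id": adf_run_id,
    }


# In the training step, these tags could be attached at registration time,
# e.g. (azureml-core SDK, names hypothetical):
# model = Model.register(ws, model_path="outputs/model.pkl",
#                        model_name="my-model",
#                        tags=lineage_tags("ingested-data", dataset.version, run_id))
```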

Next steps