Data ingestion with Azure Data Factory

In this article, you learn about the available options for building a data ingestion pipeline with Azure Data Factory. This Azure Data Factory pipeline is used to ingest data for use with Azure Machine Learning. Data Factory allows you to easily extract, transform, and load (ETL) data. Once the data has been transformed and loaded into storage, it can be used to train your machine learning models in Azure Machine Learning.

Simple data transformations can be handled with native Data Factory activities and instruments such as data flow. For more complicated scenarios, the data can be processed with custom code, for example Python or R code.

Compare Azure Data Factory data ingestion pipelines

There are several common techniques for using Data Factory to transform data during ingestion. Each technique has advantages and disadvantages that help determine whether it is a good fit for a specific use case:

| Technique | Advantages | Disadvantages |
| --- | --- | --- |
| Data Factory + Azure Functions | Low latency, serverless compute; stateful functions; reusable functions | Only good for short-running processing |
| Data Factory + custom component | Large-scale parallel computing; suited for heavy algorithms | Requires wrapping code into an executable; complexity of handling dependencies and IO |
| Data Factory + Azure Databricks notebook | Apache Spark; native Python environment | Can be expensive; creating clusters initially takes time and adds latency |
Azure Data Factory with Azure Functions

Azure Functions allows you to run small pieces of code (functions) without worrying about application infrastructure. In this option, the data is processed with custom Python code wrapped into an Azure Function.

The function is invoked with the Azure Data Factory Azure Function activity. This approach is a good option for lightweight data transformations.

Diagram: an Azure Data Factory pipeline containing an Azure Function activity and a Run ML Pipeline activity, an Azure Machine Learning pipeline containing a Train Model step, and how they interact with the raw data and prepared data.
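
For illustration, here is a minimal sketch of the kind of Python code such a function might contain. It assumes an HTTP-triggered function and hypothetical container names, blob names, and request parameters; treat it as a starting point rather than a definitive implementation.

    import io
    import os

    import azure.functions as func
    import pandas as pd
    from azure.storage.blob import BlobServiceClient

    def main(req: func.HttpRequest) -> func.HttpResponse:
        # Blob to process, passed by the Data Factory Azure Function activity
        # as a query parameter; the parameter and default names are hypothetical.
        blob_name = req.params.get('blob', 'raw-data.csv')

        # 'AzureWebJobsStorage' is the function app's default storage connection;
        # the 'raw-data' and 'prepared-data' containers are placeholder names.
        service = BlobServiceClient.from_connection_string(os.environ['AzureWebJobsStorage'])
        raw = service.get_blob_client('raw-data', blob_name).download_blob().readall()

        # Lightweight transformation: drop incomplete rows, normalize column names.
        df = pd.read_csv(io.BytesIO(raw)).dropna()
        df.columns = [c.strip().lower() for c in df.columns]

        # Write the prepared data where the ML pipeline expects it.
        service.get_blob_client('prepared-data', blob_name).upload_blob(
            df.to_csv(index=False), overwrite=True)

        return func.HttpResponse(f"Prepared {len(df)} rows from {blob_name}.")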

• Advantages:
  • The data is processed on serverless compute with a relatively low latency
  • A Data Factory pipeline can invoke a Durable Azure Function that implements a sophisticated data transformation flow
  • The details of the data transformation are abstracted away by the Azure Function, which can be reused and invoked from other places
• Disadvantages:
  • The Azure Function must be created before it can be used with ADF
  • Azure Functions is good only for short-running data processing

Azure Data Factory with Custom Component activity

In this option, the data is processed with custom Python code wrapped into an executable. The executable is invoked with an Azure Data Factory Custom Component activity. This approach is a better fit for large data than the previous technique.

Diagram: an Azure Data Factory pipeline containing a Custom Component activity and a Run ML Pipeline activity, an Azure Machine Learning pipeline containing a Train Model step, and how they interact with the raw data and prepared data.
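
As a sketch of what the wrapped code might look like, the script below reads its input and output locations from command-line arguments, one common way for a Custom Activity to pass values to the executable it runs on the Batch pool. The argument names and the transformation itself are illustrative assumptions.

    import argparse

    import pandas as pd

    def main():
        # The Custom Activity supplies locations on the command line;
        # these argument names are hypothetical.
        parser = argparse.ArgumentParser()
        parser.add_argument('--input-path', required=True)
        parser.add_argument('--output-path', required=True)
        args = parser.parse_args()

        # The heavy algorithm would go here; this cleanup is a placeholder.
        df = pd.read_csv(args.input_path)
        df = df.dropna().drop_duplicates()
        df.to_csv(args.output_path, index=False)

    if __name__ == '__main__':
        main()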

• Advantages:
  • The data is processed on an Azure Batch pool, which provides large-scale parallel and high-performance computing
  • Can be used to run heavy algorithms and process significant amounts of data
• Disadvantages:
  • The Azure Batch pool must be created before it can be used with Data Factory
  • Wrapping Python code into an executable adds engineering overhead, and handling dependencies and input/output parameters is complex

Azure Data Factory with Azure Databricks Python notebook

Azure Databricks is an Apache Spark-based analytics platform in the Microsoft cloud.

In this technique, the data transformation is performed by a Python notebook running on an Azure Databricks cluster. This is probably the most common approach, and it leverages the full power of the Azure Databricks service. It is designed for distributed data processing at scale.

Diagram: an Azure Data Factory pipeline containing an Azure Databricks Python activity and a Run ML Pipeline activity, an Azure Machine Learning pipeline containing a Train Model step, and how they interact with the raw data and prepared data.
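
A minimal sketch of such a notebook follows. It assumes the Data Factory Notebook activity passes the input and output paths as base parameters (read with dbutils.widgets) and that the cluster can already reach the storage account; the dbutils and spark objects are provided by the Databricks runtime, and the paths are placeholders.

    # Parameters passed from the Data Factory Notebook activity;
    # the widget names are hypothetical.
    input_path = dbutils.widgets.get("input_path")    # e.g. an ADLS path to raw data
    output_path = dbutils.widgets.get("output_path")  # where prepared data is written

    # Read the raw data with Spark for distributed processing.
    raw_df = spark.read.option("header", "true").csv(input_path)

    # Example transformation: drop incomplete rows and deduplicate at scale.
    prepared_df = raw_df.dropna().dropDuplicates()

    # Persist the prepared data for Azure Machine Learning to consume.
    prepared_df.write.mode("overwrite").option("header", "true").csv(output_path)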

• Advantages:
  • The data is transformed on the most powerful data processing Azure service, which is backed by the Apache Spark environment
  • Native support of Python along with data science frameworks and libraries, including TensorFlow, PyTorch, and scikit-learn
  • There is no need to wrap the Python code into functions or executable modules; the code works as is
• Disadvantages:
  • The Azure Databricks infrastructure must be created before it can be used with Data Factory
  • Can be expensive depending on the Azure Databricks configuration
  • Spinning up compute clusters from "cold" mode takes some time, which adds latency to the solution

Consume data in Azure Machine Learning

The Data Factory pipeline saves the prepared data to your cloud storage (such as Azure Blob or Azure Data Lake).
Consume your prepared data in Azure Machine Learning by either invoking an Azure Machine Learning pipeline from Data Factory or reading the data directly from storage.

Invoke Azure Machine Learning pipeline from Data Factory

This method is recommended for Machine Learning Operations (MLOps) workflows. If you don't want to set up an Azure Machine Learning pipeline, see Read data directly from storage.

Each time the Data Factory pipeline runs:

1. The data is saved to a different location in storage.
2. To pass the location to Azure Machine Learning, the Data Factory pipeline calls an Azure Machine Learning pipeline. When calling the ML pipeline, the data location and run ID are sent as parameters; a sketch of the receiving pipeline follows this list.
3. The ML pipeline can then create an Azure Machine Learning datastore and dataset with the data location. Learn more in Execute Azure Machine Learning pipelines in Data Factory.
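
As a hedged sketch, the following shows one way the receiving Azure Machine Learning pipeline might expose those parameters, using PipelineParameter objects that the calling Data Factory activity sets at invocation time. The parameter names, script name, and compute target are illustrative assumptions, not prescribed values.

    from azureml.core import Workspace
    from azureml.pipeline.core import Pipeline, PipelineParameter
    from azureml.pipeline.steps import PythonScriptStep

    ws = Workspace.from_config()

    # Parameters Data Factory sets when it invokes the published pipeline;
    # the names are hypothetical but must match the ADF activity's settings.
    data_path_param = PipelineParameter(name="data_path", default_value="data/latest")
    run_id_param = PipelineParameter(name="adf_run_id", default_value="manual")

    prepare_step = PythonScriptStep(
        name="register-prepared-data",
        script_name="register_dataset.py",  # turns the data location into a dataset
        arguments=["--data-path", data_path_param, "--adf-run-id", run_id_param],
        compute_target="cpu-cluster",       # an existing compute target is assumed
    )

    pipeline = Pipeline(workspace=ws, steps=[prepare_step])
    published = pipeline.publish(name="prepared-data-training")

Publishing the pipeline gives Data Factory an endpoint it can invoke with those parameter values on each run.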

Diagram: the Azure Data Factory pipeline and the Azure Machine Learning pipeline, and how they interact with the raw data and prepared data.

Tip

Datasets support versioning, so the ML pipeline can register a new version of the dataset that points to the most recent data from the ADF pipeline.
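
For example, the registration inside the ML pipeline might look like this minimal sketch, where the datastore name, path, and dataset name are all placeholders:

    from azureml.core import Dataset, Datastore, Workspace

    ws = Workspace.from_config()

    # Point at the output of the latest ADF run; names and paths are placeholders.
    datastore = Datastore.get(ws, "adls_datastore")
    latest = Dataset.Tabular.from_delimited_files(
        path=[(datastore, "data/prepared-data.csv")])

    # create_new_version=True registers a new version instead of overwriting,
    # preserving earlier versions for traceability.
    latest.register(workspace=ws, name="prepared-data", create_new_version=True)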

Once the data is accessible through a datastore or dataset, you can use it to train an ML model. The training process might be part of the same ML pipeline that is called from ADF. Or it might be a separate process, such as experimentation in a Jupyter notebook.

Since datasets support versioning, and each run from the pipeline creates a new version, it's easy to understand which version of the data was used to train a model.

Read data directly from storage

If you don't want to create an ML pipeline, you can access the data directly from the storage account where your prepared data is saved, using an Azure Machine Learning datastore and dataset.

The following Python code demonstrates how to create a datastore that connects to Azure Data Lake Storage Gen2. Learn more about datastores and where to find service principal permissions.

    import os
    from azureml.core import Workspace, Datastore

    ws = Workspace.from_config()
    adlsgen2_datastore_name = '<ADLS gen2 storage account alias>'  # set ADLS Gen2 storage account alias in AML

    subscription_id = os.getenv("ADL_SUBSCRIPTION", "<ADLS account subscription ID>")  # subscription id of ADLS account
    resource_group = os.getenv("ADL_RESOURCE_GROUP", "<ADLS account resource group>")  # resource group of ADLS account

    account_name = os.getenv("ADLSGEN2_ACCOUNTNAME", "<ADLS account name>")  # ADLS Gen2 account name
    tenant_id = os.getenv("ADLSGEN2_TENANT", "<tenant id of service principal>")  # tenant id of service principal
    client_id = os.getenv("ADLSGEN2_CLIENTID", "<client id of service principal>")  # client id of service principal
    client_secret = os.getenv("ADLSGEN2_CLIENT_SECRET", "<secret of service principal>")  # secret of service principal

    adlsgen2_datastore = Datastore.register_azure_data_lake_gen2(
        workspace=ws,
        datastore_name=adlsgen2_datastore_name,
        account_name=account_name,        # ADLS Gen2 account name
        filesystem='<filesystem name>',   # ADLS Gen2 filesystem
        tenant_id=tenant_id,              # tenant id of service principal
        client_id=client_id,              # client id of service principal
        client_secret=client_secret)      # secret of service principal

Next, create a dataset to reference the file(s) you want to use in your machine learning task.

The following code creates a TabularDataset from a csv file, prepared-data.csv. Learn more about dataset types and accepted file formats.

    from azureml.core import Workspace, Datastore, Dataset

    # Retrieve the prepared data via the AML datastore registered above;
    # Datastore.get expects the datastore name.
    datastore = Datastore.get(ws, adlsgen2_datastore_name)
    datastore_path = [(datastore, '/data/prepared-data.csv')]

    prepared_dataset = Dataset.Tabular.from_delimited_files(path=datastore_path)

From here, use prepared_dataset to reference your prepared data, like in your training scripts. Learn how to train models with datasets in Azure Machine Learning.
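
For instance, a training script might load the dataset into a pandas DataFrame; the following is a small illustrative snippet, where the label column name is an assumption about your data:

    # Materialize the TabularDataset for training or exploration.
    df = prepared_dataset.to_pandas_dataframe()

    # Split features and label; "label" is a hypothetical column name.
    X = df.drop(columns=["label"])
    y = df["label"]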

Next steps