Data ingestion options for Azure Machine Learning workflows

In this article, you learn the pros and cons of the data ingestion options available with Azure Machine Learning.

Choose from the two options this article covers: Azure Data Factory pipelines and the Azure Machine Learning Python SDK.

Data ingestion is the process in which unstructured data is extracted from one or more sources and then prepared for training machine learning models. It is time intensive, especially when done manually and when large amounts of data come from multiple sources. Automating this effort frees up resources and ensures your models use the most recent and applicable data.

Azure Data Factory

Azure Data Factory offers native support for data source monitoring and triggers for data ingestion pipelines.

The following table summarizes the pros and cons of using Azure Data Factory for your data ingestion workflows.

| Pros | Cons |
| --- | --- |
| Specifically built to extract, load, and transform data. | Currently offers a limited set of Azure Data Factory pipeline tasks. |
| Allows you to create data-driven workflows for orchestrating data movement and transformations at scale. | Expensive to construct and maintain. See Azure Data Factory's pricing page for more information. |
| Integrated with various Azure tools like Azure Databricks and Azure Functions. | Doesn't natively run scripts; instead relies on separate compute for script runs. |
| Natively supports data source triggered data ingestion. | |
| Data preparation and model training processes are separate. | |
| Embedded data lineage capability for Azure Data Factory dataflows. | |
| Provides a low-code user interface for non-scripting approaches. | |

These steps and the following diagram illustrate Azure Data Factory's data ingestion workflow.

  1. Pull the data from its sources.

  2. Transform the data and save it to an output blob container, which serves as data storage for Azure Machine Learning.

  3. With the prepared data stored, the Azure Data Factory pipeline invokes a training Machine Learning pipeline that receives the prepared data for model training.

    [Figure: ADF data ingestion]
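The three steps above can be sketched in plain Python to make the hand-off between them concrete. This is a minimal illustration under stated assumptions, not Azure Data Factory code: a local directory stands in for the output blob container, a callback stands in for invoking the training pipeline, and every name (`pull`, `transform_and_save`, `run_ingestion`, `FIELDS`, `prepared.csv`) is hypothetical.

```python
import csv
from pathlib import Path
from typing import Any, Callable, Dict, Iterable, List

# Hypothetical column layout for the illustration.
FIELDS = ["id", "value"]


def pull(sources: Iterable[Iterable[Dict[str, str]]]) -> List[Dict[str, str]]:
    """Step 1: pull rows from every source into a single list."""
    return [row for source in sources for row in source]


def transform_and_save(rows: List[Dict[str, str]], output_dir: Path) -> Path:
    """Step 2: drop incomplete rows and save the result to the output location.

    The directory is a local stand-in for the output blob container.
    """
    output_dir.mkdir(parents=True, exist_ok=True)
    prepared = output_dir / "prepared.csv"
    complete = [r for r in rows if all(r.get(f) not in ("", None) for f in FIELDS)]
    with prepared.open("w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(complete)
    return prepared


def run_ingestion(
    sources: Iterable[Iterable[Dict[str, str]]],
    output_dir: Path,
    invoke_training: Callable[[Path], Any],
) -> Any:
    """Step 3: once prepared data is stored, hand its location to training."""
    prepared = transform_and_save(pull(sources), output_dir)
    return invoke_training(prepared)
```

The point of the sketch is the ordering: training is only invoked after the prepared data has been written, and it receives the location of that data rather than the raw sources.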

Learn how to build a data ingestion pipeline for Machine Learning with Azure Data Factory.

Azure Machine Learning Python SDK

With the Python SDK, you can incorporate data ingestion tasks into an Azure Machine Learning pipeline step.

The following table summarizes the pros and cons of using the SDK and an ML pipeline step for data ingestion tasks.

| Pros | Cons |
| --- | --- |
| Configure your own Python scripts. | Does not natively support data source change triggering. Requires Logic App or Azure Function implementations. |
| Data preparation as part of every model training execution. | Requires development skills to create a data ingestion script. |
| Supports data preparation scripts on various compute targets, including Azure Machine Learning compute. | Does not provide a user interface for creating the ingestion mechanism. |

In the following diagram, the Azure Machine Learning pipeline consists of two steps: data ingestion and model training. The data ingestion step encompasses tasks that can be accomplished using Python libraries and the Python SDK, such as extracting data from local or web sources, and data transformations, such as missing value imputation. The training step then uses the prepared data as input to your training script to train your machine learning model.

[Figure: Azure pipeline + SDK data ingestion]
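As a rough illustration of the two steps, the sketch below performs missing value imputation in an ingestion function and passes the prepared rows to a training function. It uses only the Python standard library and does not call the Azure Machine Learning SDK. The sample data, the function names (`ingest`, `train`), the choice of column-mean imputation, and the least-squares line used as the "model" are all assumptions made for the example.

```python
import csv
import io
from statistics import mean
from typing import List, Optional, Tuple

# Hypothetical raw data with two missing cells.
RAW_CSV = """x,y
1.0,2.0
2.0,
3.0,6.0
,8.0
"""


def parse_cell(cell: str) -> Optional[float]:
    """Return the cell as a float, or None when the cell is empty."""
    cell = cell.strip()
    return float(cell) if cell else None


def ingest(raw_csv: str) -> List[List[float]]:
    """Ingestion step: parse the CSV and fill each gap with its column mean."""
    reader = csv.reader(io.StringIO(raw_csv))
    next(reader)  # skip the header row
    rows = [[parse_cell(c) for c in row] for row in reader if row]
    n_cols = len(rows[0])
    col_means = [
        mean(r[i] for r in rows if r[i] is not None) for i in range(n_cols)
    ]
    return [
        [col_means[i] if value is None else value for i, value in enumerate(r)]
        for r in rows
    ]


def train(rows: List[List[float]]) -> Tuple[float, float]:
    """Training step: fit y = slope * x + intercept by ordinary least squares."""
    xs = [r[0] for r in rows]
    ys = [r[1] for r in rows]
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


if __name__ == "__main__":
    prepared = ingest(RAW_CSV)      # step 1: data ingestion
    slope, intercept = train(prepared)  # step 2: model training
    print(prepared, slope, intercept)
```

Because ingestion runs inside the same pipeline as training, the data is prepared again on every training run, which matches the "data preparation as part of every model training execution" entry in the table above.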

Next steps

Follow these how-to articles: