Secure data access in Azure Machine Learning

Azure Machine Learning makes it easy to connect to your data in the cloud. It provides an abstraction layer over the underlying storage service, so you can securely access and work with your data without writing code specific to your storage type. Azure Machine Learning also provides the following data capabilities:

  • Versioning and tracking of data lineage
  • Data labeling
  • Data drift monitoring
  • Interoperability with pandas and Spark DataFrames

Data workflow

When you're ready to use the data in your cloud-based storage solution, we recommend the following data delivery workflow. This workflow assumes you have an Azure storage account and data in a cloud-based storage service in Azure.

  1. Create an Azure Machine Learning datastore to store the connection information for your Azure storage.

  2. From that datastore, create an Azure Machine Learning dataset that points to one or more specific files in your underlying storage.

  3. To use that dataset in your machine learning experiment, you can either:

    1. Mount it to your experiment's compute target for model training.

      OR

    2. Consume it directly in Azure Machine Learning solutions such as automated machine learning (automated ML) experiment runs, machine learning pipelines, or the Azure Machine Learning designer.

  4. Create dataset monitors for your model output dataset to detect data drift.

  5. If data drift is detected, update your input dataset and retrain your model accordingly.

The following diagram illustrates this recommended workflow.

Data-concept-diagram

Datastores

Azure Machine Learning datastores securely keep the connection information for your Azure storage, so you don't have to code it in your scripts. Register and create a datastore to easily connect to your storage account and access the data in your underlying Azure storage service.

The following cloud-based storage services in Azure can be registered as datastores:

  • Azure Blob Container
  • Azure File Share
  • Azure Data Lake
  • Azure Data Lake Gen2
  • Azure SQL Database
  • Azure Database for PostgreSQL
  • Databricks File System
  • Azure Database for MySQL
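
As a rough illustration of registering a datastore, the sketch below uses the `azureml-core` Python SDK (v1) to register an Azure Blob container. It is a minimal sketch rather than the definitive procedure: the datastore, container, and account names in the usage comment are placeholders, and reading the account key from an environment variable named `STORAGE_ACCOUNT_KEY` is an assumption made for this example.

```python
import os


def register_blob_datastore(workspace, datastore_name, container_name, account_name):
    """Register an Azure Blob container as a datastore in the given workspace."""
    # Imported inside the function so the module loads even where the SDK
    # is not installed; calling the function still requires azureml-core.
    from azureml.core import Datastore

    # ASSUMPTION: the storage account key is supplied through an environment
    # variable with this hypothetical name, so it never appears in the script.
    account_key = os.environ["STORAGE_ACCOUNT_KEY"]

    return Datastore.register_azure_blob_container(
        workspace=workspace,
        datastore_name=datastore_name,
        container_name=container_name,
        account_name=account_name,
        account_key=account_key,
    )


# Usage (needs a workspace config.json and real storage details; the names
# below are placeholders):
#   from azureml.core import Workspace
#   ws = Workspace.from_config()
#   register_blob_datastore(ws, "training_blob", "training-data", "mystorageaccount")
```

Once a datastore is registered, later scripts can refer to it by name only, which is what keeps the connection details out of the code.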

Datasets

Azure Machine Learning datasets are references that point to the data in your storage service. They aren't copies of your data, so no extra storage cost is incurred and the integrity of your original data sources isn't at risk.

To interact with your data in storage, create a dataset to package your data into a consumable object for machine learning tasks. Register the dataset to your workspace to share and reuse it across different experiments without data ingestion complexities.

Datasets can be created from local files, public URLs, Azure Open Datasets, or Azure storage services via datastores. To create a dataset from an in-memory pandas DataFrame, write the data to a local file, such as a Parquet file, and create your dataset from that file.

We support two types of datasets:

  • A FileDataset references single or multiple files in your datastores or public URLs. If your data is already cleansed and ready to use in training experiments, you can download or mount the files referenced by a FileDataset to your compute target.

  • A TabularDataset represents data in a tabular format by parsing the provided file or list of files. You can load a TabularDataset into a pandas or Spark DataFrame for further manipulation and cleansing. For a complete list of data formats you can create TabularDatasets from, see the TabularDatasetFactory class.
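
To make the difference between the two dataset types concrete, here is a minimal sketch using the `azureml-core` Python SDK (v1). It assumes a datastore has already been registered in the workspace; the file paths `images/**` and `tables/train.csv` and the dataset name `train-table` are hypothetical and only serve the example.

```python
def create_datasets(workspace, datastore_name):
    """Create one FileDataset and one registered TabularDataset from a datastore."""
    # Imported inside the function so the module loads even where the SDK
    # is not installed; calling the function still requires azureml-core.
    from azureml.core import Dataset, Datastore

    # Look up a datastore that was registered earlier, by name.
    datastore = Datastore.get(workspace, datastore_name)

    # FileDataset: references the files as they are (hypothetical path).
    image_files = Dataset.File.from_files(path=(datastore, "images/**"))

    # TabularDataset: parses a delimited file into rows and columns
    # (hypothetical path).
    table = Dataset.Tabular.from_delimited_files(
        path=(datastore, "tables/train.csv")
    )

    # Registering makes the dataset reusable across experiments by name.
    table = table.register(
        workspace=workspace, name="train-table", create_new_version=True
    )
    return image_files, table


# Usage (needs a workspace config.json and an existing datastore):
#   from azureml.core import Workspace
#   ws = Workspace.from_config()
#   image_files, table = create_datasets(ws, "training_blob")
#   df = table.to_pandas_dataframe()  # load the tabular data into pandas
```

Both objects are references only, so creating them does not copy the underlying files.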

Additional dataset capabilities are described in the dataset documentation.

Work with your data

With datasets, you can accomplish a number of machine learning tasks through seamless integration with Azure Machine Learning features.

Data labeling

Labeling large amounts of data has often been a headache in machine learning projects. Projects with a computer vision component, such as image classification or object detection, generally require thousands of images and corresponding labels.

Azure Machine Learning gives you a central location to create, manage, and monitor labeling projects. Labeling projects help coordinate the data, labels, and team members, allowing you to manage the labeling tasks more efficiently. Currently supported tasks are image classification, either multi-label or multi-class, and object identification using bounding boxes.

Create a data labeling project, and output a dataset for use in machine learning experiments.

Data drift

In the context of machine learning, data drift is the change in model input data that leads to model performance degradation. It is one of the top reasons model accuracy degrades over time, so monitoring data drift helps detect model performance issues.
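
To give a feel for what a change in input data can look like numerically, the following standard-library sketch compares the mean of recent values for one feature against a baseline, in units of the baseline standard deviation. This is an illustrative stand-in only: it is not the metric that Azure Machine Learning dataset monitors compute, and the function names, sample values, and threshold of 1.0 are assumptions chosen for the example.

```python
from statistics import mean, stdev


def mean_shift(baseline, target):
    """Distance between the target and baseline means, measured in
    baseline standard deviations."""
    spread = stdev(baseline)
    if spread == 0:
        raise ValueError("baseline has no variation to compare against")
    return abs(mean(target) - mean(baseline)) / spread


def has_drifted(baseline, target, threshold=1.0):
    """Flag drift when the mean moved more than `threshold` baseline
    standard deviations (the threshold is an arbitrary example value)."""
    return mean_shift(baseline, target) > threshold


# Hypothetical readings of a single input feature.
baseline = [20.0, 21.0, 19.0, 20.0, 22.0, 18.0]   # data the model was trained on
recent_same = [19.0, 21.0, 20.0, 20.0]            # new data, same distribution
recent_shifted = [26.0, 27.0, 25.0, 26.0]         # new data, clearly shifted

print(has_drifted(baseline, recent_same))     # no drift flagged
print(has_drifted(baseline, recent_shifted))  # drift flagged
```

A single-feature mean comparison like this misses changes in spread or shape, which is one reason a dedicated monitoring service is preferable in practice.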

See the Create a dataset monitor article to learn more about how to detect and alert on data drift in new data in a dataset.

Next steps