Connect to Azure storage services

APPLIES TO: Basic edition, Enterprise edition (Upgrade to Enterprise edition)

In this article, learn how to connect to Azure storage services via Azure Machine Learning datastores. Datastores securely connect to your Azure storage service without putting your authentication credentials or the integrity of your original data source at risk. They store connection information, like your subscription ID and token authorization, in the Key Vault associated with your workspace, so you can securely access your storage without having to hard-code it in your scripts. You can use the Azure Machine Learning Python SDK or Azure Machine Learning studio to create and register datastores.

If you prefer to create and manage datastores with the Azure Machine Learning VS Code extension, visit the VS Code resource management how-to guide to learn more.

You can create datastores from these Azure storage solutions. If your storage solution isn't supported, move your data to a supported Azure storage solution to avoid data egress costs during ML experiments.

To understand where datastores fit in Azure Machine Learning's overall data access workflow, see the Securely access data article.

Prerequisites

You'll need:

  • An Azure subscription. If you don't have an Azure subscription, create a trial account before you begin. Try the Azure trial.

  • An Azure storage account with a supported storage type.

  • The Azure Machine Learning SDK for Python, or access to Azure Machine Learning studio.

  • An Azure Machine Learning workspace.

    Either create an Azure Machine Learning workspace or use an existing one via the Python SDK.

    Import the Workspace and Datastore classes, and load your subscription information from the config.json file by using the from_config() function. By default, this looks for the JSON file in the current directory, but you can also point it to the file with a path parameter: from_config(path="your/file/path").

    import azureml.core
    from azureml.core import Workspace, Datastore
    
    ws = Workspace.from_config()
    

    When you create a workspace, an Azure blob container and an Azure file share are automatically registered as datastores to the workspace. They're named workspaceblobstore and workspacefilestore, respectively. The workspaceblobstore is used to store workspace artifacts and your machine learning experiment logs. It's also set as the default datastore and can't be deleted from the workspace. The workspacefilestore is used to store notebooks and R scripts authorized via compute instance.

    Note

    Azure Machine Learning designer (preview) automatically creates a datastore named azureml_globaldatasets when you open a sample on the designer homepage. This datastore contains only sample datasets. Don't use it for any confidential data access.

Supported data storage service types

Datastores currently support storing connection information to the storage services listed in the following table. Depending on the service, you can create these datastores through Azure Machine Learning studio, the Python SDK, the CLI, the REST API, and the VS Code extension.

| Storage type | Authentication type |
| --- | --- |
| Azure Blob Storage | Account key, SAS token |
| Azure File Share | Account key, SAS token |
| Azure Data Lake Storage Gen2 | Service principal |
| Azure SQL Database | SQL authentication, Service principal |
| Azure PostgreSQL | SQL authentication |
| Azure Database for MySQL | SQL authentication* |

*MySQL is supported only for pipeline DataTransferStep.
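For quick reference in scripts, the table above can be captured as a small lookup helper. This is only a convenience sketch that restates the table; it isn't part of the Azure Machine Learning SDK:

```python
# Supported authentication types per storage service, restating the table above.
SUPPORTED_AUTH = {
    "Azure Blob Storage": {"account key", "SAS token"},
    "Azure File Share": {"account key", "SAS token"},
    "Azure Data Lake Storage Gen2": {"service principal"},
    "Azure SQL Database": {"SQL authentication", "service principal"},
    "Azure PostgreSQL": {"SQL authentication"},
    "Azure Database for MySQL": {"SQL authentication"},  # pipeline DataTransferStep only
}

def auth_supported(storage_type: str, auth_type: str) -> bool:
    """Return True if the given authentication type is listed for the storage type."""
    return auth_type in SUPPORTED_AUTH.get(storage_type, set())
```

A check like `auth_supported("Azure Blob Storage", "service principal")` returns False, matching the table: blob datastores take an account key or SAS token instead.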

Storage guidance

We recommend creating a datastore for an Azure Blob container. Both standard and premium storage are available for blobs. Although premium storage is more expensive, its faster throughput speeds might improve the speed of your training runs, particularly if you train against a large dataset. For information about the cost of storage accounts, see the Azure pricing calculator.

Azure Data Lake Storage Gen2 is built on top of Azure Blob storage and designed for enterprise big data analytics. A fundamental part of Data Lake Storage Gen2 is the addition of a hierarchical namespace to Blob storage. The hierarchical namespace organizes objects/files into a hierarchy of directories for efficient data access.
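To see why a hierarchical namespace matters, compare it with flat blob naming, where "directories" are only name prefixes. The sketch below (plain Python, no Azure dependency) groups flat blob names by their first path segment, which is the kind of client-side work a hierarchical namespace makes unnecessary:

```python
from collections import defaultdict

def group_by_top_level(blob_names):
    """Group flat blob names by their first 'directory' segment.

    In flat blob storage, 'raw/2023/a.csv' is just a name; listing a
    'directory' means filtering names on a prefix. A hierarchical
    namespace makes such directories first-class objects instead.
    """
    groups = defaultdict(list)
    for name in blob_names:
        top = name.split("/", 1)[0]
        groups[top].append(name)
    return dict(groups)
```

With real directories, renaming or deleting a folder is a single metadata operation rather than an operation on every blob that shares the prefix.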

Storage access and permissions

To ensure you securely connect to your Azure storage service, Azure Machine Learning requires that you have permission to access the corresponding data storage container. This access depends on the authentication credentials used to register the datastore.

Virtual network

If your data storage account is in a virtual network, additional configuration steps are required to ensure Azure Machine Learning has access to your data. See Network isolation & privacy to ensure the appropriate configuration steps are applied when you create and register your datastore.

Access validation

As part of the initial datastore creation and registration process, Azure Machine Learning automatically validates that the underlying storage service exists and that the user-provided principal (username, service principal, or SAS token) has access to the specified storage.

After datastore creation, this validation is performed only for methods that require access to the underlying storage container, not each time datastore objects are retrieved. For example, validation happens if you want to download files from your datastore, but not if you just want to change your default datastore.
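This access-time validation behavior can be illustrated with a minimal sketch. The class below is hypothetical, not the SDK's actual implementation; it only models the idea that nothing is checked until a method actually touches the underlying storage:

```python
class LazyDatastore:
    """Toy model of access-time validation."""

    def __init__(self, name, validator):
        self.name = name
        self._validator = validator  # callable that raises on bad credentials
        self.validated = False

    def set_as_default(self, workspace_state):
        # Changing the default datastore doesn't touch storage: no validation.
        workspace_state["default_datastore"] = self.name

    def download(self, path):
        # Accessing storage triggers validation first.
        self._validator()
        self.validated = True
        return f"downloaded {path} from {self.name}"
```

Calling `set_as_default` never invokes the validator, while the first `download` does, mirroring the SDK behavior described above.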

To authenticate your access to the underlying storage service, you can provide your account key, shared access signature (SAS) token, or service principal in the corresponding register_azure_*() method of the datastore type you want to create. The storage type matrix lists the supported authentication types that correspond to each datastore type.

You can find account key, SAS token, and service principal information in the Azure portal.

  • If you plan to use an account key or SAS token for authentication, select Storage Accounts on the left pane, and choose the storage account that you want to register.

    • The Overview page provides information such as the account name, container, and file share name.
      1. For account keys, go to Access keys on the Settings pane.
      2. For SAS tokens, go to Shared access signatures on the Settings pane.
  • If you plan to use a service principal for authentication, go to your App registrations and select which app you want to use.

    • Its corresponding Overview page contains required information like the tenant ID and client ID.

Important

For security reasons, you may need to change your access keys for an Azure Storage account (account key or SAS token). When you do, be sure to sync the new credentials with your workspace and the datastores connected to it. Learn how to sync your updated credentials with these steps.

Permissions

For Azure blob container and Azure Data Lake Storage Gen2, make sure your authentication credentials have Storage Blob Data Reader access. Learn more about Storage Blob Data Reader.

Create and register datastores via the SDK

When you register an Azure storage solution as a datastore, you automatically create and register that datastore to a specific workspace. Review the storage access and permissions section to understand where to find the required authentication credentials.

This section includes examples of how to create and register a datastore via the Python SDK for the following storage types. The parameters provided in these examples are the ones required to create and register a datastore.

To create datastores for other supported storage services, see the reference documentation for the applicable register_azure_* methods.

If you prefer a low-code experience, see Create datastores in Azure Machine Learning studio.

Note

Datastore names should consist only of lowercase letters, digits, and underscores.
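A quick client-side check for that naming rule can save a round trip. This is a convenience sketch; the service performs its own validation when you register the datastore:

```python
import re

# Only lowercase letters, digits, and underscores are allowed in datastore names.
_DATASTORE_NAME = re.compile(r"^[a-z0-9_]+$")

def is_valid_datastore_name(name: str) -> bool:
    """Return True if name contains only lowercase letters, digits, and underscores."""
    return bool(_DATASTORE_NAME.match(name))
```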

Azure blob container

To register an Azure blob container as a datastore, use register_azure_blob_container().

The following code creates the blob_datastore_name datastore and registers it to the ws workspace. This datastore accesses the my-container-name blob container on the my-account-name storage account by using the provided account access key.

import os

blob_datastore_name='azblobsdk' # Name of the datastore to register with the workspace
container_name=os.getenv("BLOB_CONTAINER", "<my-container-name>") # Name of Azure blob container
account_name=os.getenv("BLOB_ACCOUNTNAME", "<my-account-name>") # Storage account name
account_key=os.getenv("BLOB_ACCOUNT_KEY", "<my-account-key>") # Storage account access key

blob_datastore = Datastore.register_azure_blob_container(workspace=ws, 
                                                         datastore_name=blob_datastore_name, 
                                                         container_name=container_name, 
                                                         account_name=account_name,
                                                         account_key=account_key)

Azure file share

To register an Azure file share as a datastore, use register_azure_file_share().

The following code creates the file_datastore_name datastore and registers it to the ws workspace. This datastore accesses the my-fileshare-name file share on the my-account-name storage account by using the provided account access key.

import os

file_datastore_name='azfilesharesdk' # Name of the datastore to register with the workspace
file_share_name=os.getenv("FILE_SHARE_CONTAINER", "<my-fileshare-name>") # Name of Azure file share container
account_name=os.getenv("FILE_SHARE_ACCOUNTNAME", "<my-account-name>") # Storage account name
account_key=os.getenv("FILE_SHARE_ACCOUNT_KEY", "<my-account-key>") # Storage account access key

file_datastore = Datastore.register_azure_file_share(workspace=ws,
                                                     datastore_name=file_datastore_name, 
                                                     file_share_name=file_share_name, 
                                                     account_name=account_name,
                                                     account_key=account_key)

Azure Data Lake Storage Gen2

For an Azure Data Lake Storage Gen2 (ADLS Gen2) datastore, use register_azure_data_lake_gen2() to register a credential datastore connected to an Azure Data Lake Gen2 storage with service principal permissions.

In order to use your service principal, you need to register your application and grant the service principal Storage Blob Data Reader access. Learn more about access control setup for ADLS Gen2.

The following code creates the adlsgen2_datastore_name datastore and registers it to the ws workspace. This datastore accesses the file system test in the account_name storage account by using the provided service principal credentials.

import os

adlsgen2_datastore_name = 'adlsgen2datastore'

subscription_id=os.getenv("ADL_SUBSCRIPTION", "<my_subscription_id>") # subscription id of ADLS account
resource_group=os.getenv("ADL_RESOURCE_GROUP", "<my_resource_group>") # resource group of ADLS account

account_name=os.getenv("ADLSGEN2_ACCOUNTNAME", "<my_account_name>") # ADLS Gen2 account name
tenant_id=os.getenv("ADLSGEN2_TENANT", "<my_tenant_id>") # tenant id of service principal
client_id=os.getenv("ADLSGEN2_CLIENTID", "<my_client_id>") # client id of service principal
client_secret=os.getenv("ADLSGEN2_CLIENT_SECRET", "<my_client_secret>") # the secret of service principal

adlsgen2_datastore = Datastore.register_azure_data_lake_gen2(workspace=ws,
                                                             datastore_name=adlsgen2_datastore_name,
                                                             account_name=account_name, # ADLS Gen2 account name
                                                             filesystem='test', # ADLS Gen2 filesystem
                                                             tenant_id=tenant_id, # tenant id of service principal
                                                             client_id=client_id, # client id of service principal
                                                             client_secret=client_secret) # the secret of service principal

Create datastores in the studio

Create a new datastore in a few steps with Azure Machine Learning studio.

Important

If your data storage account is in a virtual network, additional configuration steps are required to ensure the studio has access to your data. See Network isolation & privacy to ensure the appropriate configuration steps are applied.

  1. Sign in to Azure Machine Learning studio.
  2. Select Datastores on the left pane under Manage.
  3. Select + New datastore.
  4. Complete the form for a new datastore. The form intelligently updates itself based on your selections for Azure storage type and authentication type. See the storage access and permissions section to understand where to find the authentication credentials you need to populate this form.

The following example demonstrates what the form looks like when you create an Azure blob datastore:

[Screenshot: Form for a new datastore]

Use data in your datastores

After you create a datastore, create an Azure Machine Learning dataset to interact with your data. Datasets package your data into a lazily evaluated, consumable object for machine learning tasks, like training. They also provide the ability to download or mount files of any format from Azure storage services like Azure Blob storage and ADLS Gen2. You can also use them to load tabular data into a pandas or Spark DataFrame.

Get datastores from your workspace

To get a specific datastore registered in the current workspace, use the get() static method on the Datastore class:

# Get a named datastore from the current workspace
datastore = Datastore.get(ws, datastore_name='your datastore name')

To get the list of datastores registered with a given workspace, use the datastores property on a workspace object:

# List all datastores registered in the current workspace
datastores = ws.datastores
for name, datastore in datastores.items():
    print(name, datastore.datastore_type)

To get the workspace's default datastore, use this line:

datastore = ws.get_default_datastore()

You can also change the default datastore with the following code. This ability is supported only via the SDK.

ws.set_default_datastore(new_default_datastore)

Access data during scoring

Azure Machine Learning provides several ways to use your models for scoring. Some of these methods don't provide access to datastores. Use the following table to understand which methods allow you to access datastores during scoring:

| Method | Datastore access | Description |
| --- | --- | --- |
| Batch prediction | ✓ | Make predictions on large quantities of data asynchronously. |
| Web service | | Deploy models as a web service. |
| Azure IoT Edge module | | Deploy models to IoT Edge devices. |

For situations where the SDK doesn't provide access to datastores, you might be able to create custom code by using the relevant Azure SDK to access the data. For example, the Azure Storage SDK for Python is a client library that you can use to access data stored in blobs or files.
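For instance, with the azure-storage-blob package you would typically build the blob URL and read the blob directly. The helper below only constructs the standard public-cloud blob endpoint URL (a pure function you can test offline); the commented lines sketch hypothetical usage, and the account, container, and credential names are placeholders, not values from this article:

```python
def blob_url(account_name: str, container: str, blob_path: str) -> str:
    """Build the standard public-cloud URL for a blob."""
    return f"https://{account_name}.blob.core.windows.net/{container}/{blob_path}"

# Hypothetical usage with the azure-storage-blob client library:
# from azure.storage.blob import BlobClient
# client = BlobClient.from_blob_url(
#     blob_url("myaccount", "mycontainer", "data/train.csv"),
#     credential="<account-key-or-sas>")
# data = client.download_blob().readall()
```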

Move data to supported Azure storage solutions

Azure Machine Learning supports accessing data from Azure Blob storage, Azure Files, Azure Data Lake Storage Gen1, Azure Data Lake Storage Gen2, Azure SQL Database, and Azure Database for PostgreSQL. If you're using unsupported storage, we recommend that you move your data to supported Azure storage solutions by using Azure Data Factory and these steps. Moving data to supported storage can help you save data egress costs during machine learning experiments.

Azure Data Factory provides efficient and resilient data transfer, with more than 80 prebuilt connectors, at no additional cost. These connectors include Azure data services, on-premises data sources, Amazon S3 and Redshift, and Google BigQuery.

Next steps