Azure Blob storage

Azure Blob storage is a service for storing large amounts of unstructured object data, such as text or binary data. You can use Blob storage to expose data publicly to the world, or to store application data privately. Common uses of Blob storage include:

  • Serving images or documents directly to a browser
  • Storing files for distributed access
  • Streaming video and audio
  • Storing data for backup and restore, disaster recovery, and archiving
  • Storing data for analysis by an on-premises or Azure-hosted service

Note

Azure Databricks also supports the following Azure data sources: Azure Data Lake Storage Gen2, Azure Cosmos DB, and Azure Synapse Analytics.

This article explains how to access Azure Blob storage by mounting storage using DBFS or directly using APIs.

Requirements

You can read data from public storage accounts without any additional settings. To read data from a private storage account, you must configure a Shared Key or a Shared Access Signature (SAS). To use credentials safely in Azure Databricks, we recommend that you follow the Secret management user guide, as shown in Mount an Azure Blob storage container.
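
As a quick sketch of the no-setup case, a file in a public container can be read directly by its wasbs URI. This is an illustrative assumption about the anonymous-access path; all angle-bracket values are placeholders, not real names:

# Python
# Reading from a public (anonymous-access) storage account needs no
# credentials; replace the placeholders with real container, account,
# and path names.
df = spark.read.text("wasbs://<container-name>@<storage-account-name>.blob.core.chinacloudapi.cn/<directory-name>")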

Mount Azure Blob storage containers to DBFS

You can mount a Blob storage container or a folder inside a container to Databricks File System (DBFS). The mount is a pointer to a Blob storage container, so the data is never synced locally.

Important

  • Azure Blob storage supports three blob types: block, append, and page. You can only mount block blobs to DBFS.
  • All users have read and write access to the objects in Blob storage containers mounted to DBFS.
  • Once a mount point is created through a cluster, users of that cluster can immediately access it. To use the mount point in another running cluster, you must run dbutils.fs.refreshMounts() on that cluster to make the newly created mount point available for use.

DBFS uses the credential that you provide when you create the mount point to access the mounted Blob storage container. If a Blob storage container is mounted using a storage account access key, DBFS uses temporary SAS tokens derived from the storage account key when it accesses this mount point.
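
As a small illustrative sketch, you can inspect the mount points a cluster currently sees with dbutils.fs.mounts(), and pick up mounts created from another cluster with dbutils.fs.refreshMounts():

# Python
# List every mount point visible to this cluster, with its source URI.
for m in dbutils.fs.mounts():
    print(m.mountPoint, "->", m.source)

# Run on another running cluster to make newly created mounts visible there.
dbutils.fs.refreshMounts()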

Mount an Azure Blob storage container

  1. To mount a Blob storage container or a folder inside a container, use the following command. (A filled-in sketch with hypothetical values follows these steps.)

    Python

    dbutils.fs.mount(
      source = "wasbs://<container-name>@<storage-account-name>.blob.core.chinacloudapi.cn",
      mount_point = "/mnt/<mount-name>",
      extra_configs = {"<conf-key>":dbutils.secrets.get(scope = "<scope-name>", key = "<key-name>")})
    

    Scala

    dbutils.fs.mount(
      source = "wasbs://<container-name>@<storage-account-name>.blob.core.chinacloudapi.cn/<directory-name>",
      mountPoint = "/mnt/<mount-name>",
      extraConfigs = Map("<conf-key>" -> dbutils.secrets.get(scope = "<scope-name>", key = "<key-name>")))
    

    where

    • <mount-name> is a DBFS path representing where the Blob storage container or a folder inside the container (specified in source) will be mounted in DBFS.
    • <conf-key> can be either fs.azure.account.key.<storage-account-name>.blob.core.chinacloudapi.cn or fs.azure.sas.<container-name>.<storage-account-name>.blob.core.chinacloudapi.cn
    • dbutils.secrets.get(scope = "<scope-name>", key = "<key-name>") gets the key that has been stored as a secret in a secret scope.
  2. Access files in your container as if they were local files, for example:

    Python

    # python
    df = spark.read.text("/mnt/<mount-name>/...")
    df = spark.read.text("dbfs:/<mount-name>/...")
    

    Scala

    // scala
    val df = spark.read.text("/mnt/<mount-name>/...")
    val df = spark.read.text("dbfs:/<mount-name>/...")
    
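
To make the placeholders concrete, here is a hedged, filled-in Python sketch of both steps. The storage account mystorageacct, container logs, secret scope my-scope, and secret key storage-key are all hypothetical; substitute your own values:

# Python
# Hypothetical names throughout -- replace with your own account,
# container, secret scope, and secret key.
dbutils.fs.mount(
  source = "wasbs://logs@mystorageacct.blob.core.chinacloudapi.cn",
  mount_point = "/mnt/logs",
  extra_configs = {"fs.azure.account.key.mystorageacct.blob.core.chinacloudapi.cn":
    dbutils.secrets.get(scope = "my-scope", key = "storage-key")})

# The mounted container can then be read like a local path.
df = spark.read.text("/mnt/logs/")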

Unmount a mount point

To unmount a mount point, use the following command:

dbutils.fs.unmount("/mnt/<mount-name>")

Access Azure Blob storage directly

This section explains how to access Azure Blob storage using the Spark DataFrame and RDD APIs.

Access Azure Blob storage using the DataFrame API

You can read data from Azure Blob storage using the Spark API and Databricks APIs:

  • Set up an account access key:

    spark.conf.set(
      "fs.azure.account.key.<storage-account-name>.blob.core.chinacloudapi.cn",
      "<storage-account-access-key>")
    
  • Set up a SAS for a container:

    spark.conf.set(
      "fs.azure.sas.<container-name>.<storage-account-name>.blob.core.chinacloudapi.cn",
      "<complete-query-string-of-sas-for-the-container>")
    

Once an account access key or a SAS is set up in your notebook, you can use standard Spark and Databricks APIs to read from the storage account:

val df = spark.read.parquet("wasbs://<container-name>@<storage-account-name>.blob.core.chinacloudapi.cn/<directory-name>")

dbutils.fs.ls("wasbs://<container-name>@<storage-account-name>.blob.core.chinacloudapi.cn/<directory-name>")
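
For Python notebooks, the same pattern reads as the following hedged sketch: configure a container SAS pulled from a secret, then read directly by URI. The secret scope my-scope and key container-sas are hypothetical, and the placeholders must be filled in:

# Python
# Configure a container SAS from a secret, then read directly by URI.
spark.conf.set(
  "fs.azure.sas.<container-name>.<storage-account-name>.blob.core.chinacloudapi.cn",
  dbutils.secrets.get(scope = "my-scope", key = "container-sas"))

df = spark.read.parquet("wasbs://<container-name>@<storage-account-name>.blob.core.chinacloudapi.cn/<directory-name>")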

Access Azure Blob storage using the RDD API

Hadoop configuration options set using spark.conf.set(...) are not accessible via SparkContext. This means that while they are visible to the DataFrame and Dataset APIs, they are not visible to the RDD API. If you are using the RDD API to read from Azure Blob storage, you must set the credentials using one of the following methods (a minimal read sketch follows the list):

  • Specify the Hadoop credential configuration options as Spark options when you create the cluster. You must add the spark.hadoop. prefix to the corresponding Hadoop configuration keys to tell Spark to propagate them to the Hadoop configurations that are used for your RDD jobs:

    # Using an account access key
    spark.hadoop.fs.azure.account.key.<storage-account-name>.blob.core.chinacloudapi.cn <storage-account-access-key>
    
    # Using a SAS token
    spark.hadoop.fs.azure.sas.<container-name>.<storage-account-name>.blob.core.chinacloudapi.cn <complete-query-string-of-sas-for-the-container>
    
  • Scala users can set the credentials in spark.sparkContext.hadoopConfiguration:

    // Using an account access key
    spark.sparkContext.hadoopConfiguration.set(
      "fs.azure.account.key.<storage-account-name>.blob.core.chinacloudapi.cn",
      "<storage-account-access-key>"
    )
    
    // Using a SAS token
    spark.sparkContext.hadoopConfiguration.set(
      "fs.azure.sas.<container-name>.<storage-account-name>.blob.core.chinacloudapi.cn",
      "<complete-query-string-of-sas-for-the-container>"
    )
    
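
Assuming the credentials were supplied as cluster Spark options with the spark.hadoop. prefix (the first method above), a minimal RDD read looks like the following sketch; the placeholders are hypothetical:

# Python
# Works because the spark.hadoop.* cluster options were propagated to the
# Hadoop configuration used by RDD jobs.
rdd = spark.sparkContext.textFile("wasbs://<container-name>@<storage-account-name>.blob.core.chinacloudapi.cn/<directory-name>")
print(rdd.take(5))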

Warning

These credentials are available to all users who access the cluster.