The Azure Blob Filesystem driver (ABFS): A dedicated Azure Storage driver for Hadoop

One of the primary access methods for data in Azure Data Lake Storage Gen2 is the Hadoop FileSystem. Data Lake Storage Gen2 gives users of Azure Blob Storage access to a new driver, the Azure Blob File System driver (ABFS). ABFS is part of Apache Hadoop and is included in many of the commercial distributions of Hadoop. Using this driver, many applications and frameworks can access data in Azure Blob Storage without any code that explicitly references Data Lake Storage Gen2.

Prior capability: The Windows Azure Storage Blob driver

The Windows Azure Storage Blob driver (WASB) provided the original support for Azure Blob Storage. This driver performed the complex task of mapping file system semantics (as required by the Hadoop FileSystem interface) to the object-store-style interface exposed by Azure Blob Storage. WASB continues to support this model and provides high-performance access to data stored in blobs, but it contains a significant amount of code that performs this mapping, which makes it difficult to maintain. Additionally, because object stores lack support for directories, some operations such as FileSystem.rename() and FileSystem.delete() require the driver to perform a vast number of operations when applied to directories, which often leads to degraded performance. The ABFS driver was designed to overcome these inherent deficiencies of WASB.

The Azure Blob File System driver

The Azure Data Lake Storage REST interface is designed to support file system semantics over Azure Blob Storage. Because the Hadoop FileSystem is designed to support the same semantics, the driver does not need a complex mapping layer. The Azure Blob File System driver (ABFS) is therefore little more than a client shim for the REST API.

However, there are some functions that the driver must still perform:

URI scheme to reference data

Consistent with other FileSystem implementations within Hadoop, the ABFS driver defines its own URI scheme so that resources (directories and files) may be distinctly addressed. The URI scheme is documented in Use the Azure Data Lake Storage Gen2 URI. The structure of the URI is: abfs[s]://file_system@account_name.dfs.core.chinacloudapi.cn/<path>/<path>/<file_name>

Using the above URI format, standard Hadoop tools and frameworks can be used to reference these resources:

hdfs dfs -mkdir -p abfs://fileanalysis@myanalytics.dfs.core.chinacloudapi.cn/tutorials/flightdelays/data 
hdfs dfs -put flight_delays.csv abfs://fileanalysis@myanalytics.dfs.core.chinacloudapi.cn/tutorials/flightdelays/data/ 
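
The URI structure described above can also be assembled from its parts in a script. The following is a minimal sketch, not part of the driver or of Hadoop: the helper name abfs_uri and its argument order are hypothetical, and the account and file system names are taken from the commands above.

```shell
# Hypothetical helper (not part of Hadoop): build an ABFS URI from its parts.
# Usage: abfs_uri <scheme> <file_system> <account_name> <path>
abfs_uri() {
  uri_scheme=$1          # abfs or abfss
  uri_file_system=$2     # file system (container) name
  uri_account=$3         # storage account name
  uri_rel_path=${4#/}    # drop one leading slash, if present
  printf '%s://%s@%s.dfs.core.chinacloudapi.cn/%s\n' \
    "$uri_scheme" "$uri_file_system" "$uri_account" "$uri_rel_path"
}

# Same directory as in the commands above.
data_dir=$(abfs_uri abfs fileanalysis myanalytics tutorials/flightdelays/data)
echo "$data_dir"
```

The resulting value can be passed to any tool that accepts a Hadoop path, for example hdfs dfs -ls "$data_dir".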

Internally, the ABFS driver translates the resources specified in the URI to files and directories and calls the Azure Data Lake Storage REST API with those references.
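
To make the parts of such a URI concrete, the sketch below splits one into scheme, file system, account name, and path using POSIX parameter expansion. It only illustrates the URI layout; it is not the driver's actual parsing code, and the variable names are chosen here for illustration.

```shell
# Illustrative only: split an ABFS URI into its components.
uri='abfs://fileanalysis@myanalytics.dfs.core.chinacloudapi.cn/tutorials/flightdelays/data/flight_delays.csv'

scheme=${uri%%://*}            # text before "://"
rest=${uri#*://}               # everything after "://"
authority=${rest%%/*}          # file_system@account_name.dfs.core...
resource_path=/${rest#*/}      # path inside the file system
file_system=${authority%%@*}   # text before "@"
host=${authority#*@}           # text after "@"
account_name=${host%%.*}       # first DNS label of the host

printf 'scheme:       %s\n' "$scheme"
printf 'file system:  %s\n' "$file_system"
printf 'account name: %s\n' "$account_name"
printf 'host:         %s\n' "$host"
printf 'path:         %s\n' "$resource_path"
```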

Authentication

The ABFS driver supports two forms of authentication so that a Hadoop application can securely access resources contained within a Data Lake Storage Gen2 capable account. Full details of the available authentication schemes are provided in the Azure Storage security guide. They are:

  • Shared Key: This permits users access to ALL resources in the account. The key is encrypted and stored in Hadoop configuration.

  • Azure Active Directory OAuth Bearer Token: The driver acquires and refreshes Azure AD bearer tokens using either the identity of the end user or a configured Service Principal. With this authentication model, all access is authorized on a per-call basis using the identity associated with the supplied token and is evaluated against the assigned POSIX Access Control List (ACL).

    Note

    Azure Data Lake Storage Gen2 supports only Azure AD v1.0 endpoints.

Configuration

All configuration for the ABFS driver is stored in the core-site.xml configuration file. On Hadoop distributions featuring Ambari, the configuration may also be managed using the web portal or the Ambari REST API.

Details of all supported configuration entries are specified in the Official Hadoop documentation.

Hadoop documentation

The ABFS driver is fully documented in the Official Hadoop documentation.

Next steps