Azure Data Lake Storage Gen2 overview in HDInsight

For more information on Azure Data Lake Storage Gen2, see Introduction to Azure Data Lake Storage Gen2.

Core functionality of Azure Data Lake Storage Gen2

  • Access that is compatible with Hadoop: In Azure Data Lake Storage Gen2, you can manage and access data just as you would with a Hadoop Distributed File System (HDFS). The Azure Blob File System (ABFS) driver is available within all Apache Hadoop environments, including Azure HDInsight and Azure Databricks. Use ABFS to access data stored in Data Lake Storage Gen2.

  • A superset of POSIX permissions: The security model for Data Lake Storage Gen2 supports access control lists (ACLs) and POSIX permissions, along with some extra granularity specific to Data Lake Storage Gen2. Settings can be configured through admin tools or through frameworks like Apache Hive and Apache Spark.

  • Cost effectiveness: Data Lake Storage Gen2 offers low-cost storage capacity and transactions. Azure Blob storage lifecycle management helps lower costs by adjusting billing rates as data moves through its life cycle.

  • Compatibility with Blob storage tools, frameworks, and apps: Data Lake Storage Gen2 continues to work with a wide array of tools, frameworks, and applications for Blob storage.

  • Optimized driver: The ABFS driver is optimized specifically for big data analytics. The corresponding REST APIs are surfaced through the distributed file system (DFS) endpoint, dfs.core.windows.net.

What's new for Azure Data Lake Storage Gen2

Managed identities for secure file access

Azure HDInsight uses managed identities to secure cluster access to files in Azure Data Lake Storage Gen2. Managed identities are a feature of Azure Active Directory that provides Azure services with a set of automatically managed credentials. These credentials can be used to authenticate to any service that supports Active Directory authentication. Using managed identities doesn't require you to store credentials in code or configuration files.

For more information, see Managed identities for Azure resources.

Azure Blob File System driver

Apache Hadoop applications natively expect to read and write data from local disk storage. A Hadoop file system driver like ABFS enables Hadoop applications to work with cloud storage by emulating regular Hadoop file system operations. The driver converts the commands it receives from the application into operations that the actual cloud storage platform understands.

Previously, the Hadoop file system driver converted all file system operations to Azure Storage REST API calls on the client side and then invoked the REST API. This client-side conversion, however, resulted in multiple REST API calls for a single file system operation, such as renaming a file. ABFS has moved the Hadoop file system logic from the client side to the server side. The Azure Data Lake Storage Gen2 API now runs in parallel with the Blob API. This migration improves performance because common Hadoop file system operations can now be executed with one REST API call.

For more information, see The Azure Blob Filesystem driver (ABFS): A dedicated Azure Storage driver for Hadoop.

URI scheme for Azure Data Lake Storage Gen2

Azure Data Lake Storage Gen2 uses a new URI scheme to access files in Azure Storage from HDInsight:

abfs://<FILE_SYSTEM_NAME>@<ACCOUNT_NAME>.dfs.core.windows.net/<PATH>

The URI scheme provides SSL-encrypted access.

<FILE_SYSTEM_NAME> identifies the Data Lake Storage Gen2 file system.

<ACCOUNT_NAME> identifies the Azure Storage account name. A fully qualified domain name (FQDN) is required.

<PATH> is the HDFS path name of the file or directory.
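The way the three components combine can be sketched in a few lines of Python. This is only an illustration of the URI layout described above, not part of any HDInsight or Azure SDK: the function name build_abfs_uri is hypothetical, and the sample values myfilesystempath and myaccount are placeholders.

```python
def build_abfs_uri(file_system: str, account: str, path: str) -> str:
    """Assemble abfs://<FILE_SYSTEM_NAME>@<ACCOUNT_NAME>.dfs.core.windows.net/<PATH>.

    Hypothetical helper for illustration only; it performs string
    formatting and does not contact Azure Storage.
    """
    if not file_system or not account:
        raise ValueError("file_system and account must both be non-empty")
    # The account name is expanded to the FQDN of the DFS endpoint.
    fqdn = f"{account}.dfs.core.windows.net"
    # Drop leading slashes so the result never contains a doubled '/'.
    return f"abfs://{file_system}@{fqdn}/{path.lstrip('/')}"


if __name__ == "__main__":
    print(build_abfs_uri("myfilesystempath", "myaccount",
                         "example/jars/hadoop-mapreduce-examples.jar"))
```

The helper always emits the fully qualified form, so it does not depend on which file system the cluster treats as its default.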

If values for <FILE_SYSTEM_NAME> and <ACCOUNT_NAME> aren't specified, the default file system is used. For the files on the default file system, use a relative path or an absolute path. For example, the hadoop-mapreduce-examples.jar file that comes with HDInsight clusters can be referred to by using one of the following paths:

abfs://myfilesystempath@myaccount.dfs.core.windows.net/example/jars/hadoop-mapreduce-examples.jar
abfs:///example/jars/hadoop-mapreduce-examples.jar
/example/jars/hadoop-mapreduce-examples.jar

Note

The file name is hadoop-examples.jar in HDInsight versions 2.1 and 1.6 clusters. When you're working with files outside of HDInsight, you'll find that most utilities don't recognize the ABFS format but instead expect a basic path format, such as example/jars/hadoop-mapreduce-examples.jar.
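When a tool outside HDInsight expects the basic path format, one option is to strip the ABFS prefix before passing the path along. The following is a minimal sketch that uses only the Python standard library. The function name to_basic_path is hypothetical, and the sketch assumes the input is either an abfs URI or a plain path.

```python
from urllib.parse import urlsplit


def to_basic_path(reference: str) -> str:
    """Reduce an ABFS URI or HDFS-style path to the basic path format.

    Hypothetical helper for illustration only; it discards the file
    system and account, so it is only meaningful when the target tool
    already knows which storage location to use.
    """
    parts = urlsplit(reference)
    if parts.scheme not in ("", "abfs"):
        raise ValueError(f"unsupported scheme: {parts.scheme!r}")
    # Keep only the path component and drop the leading '/'.
    return parts.path.lstrip("/")


if __name__ == "__main__":
    for ref in (
        "abfs://myfilesystempath@myaccount.dfs.core.windows.net/example/jars/hadoop-mapreduce-examples.jar",
        "abfs:///example/jars/hadoop-mapreduce-examples.jar",
        "/example/jars/hadoop-mapreduce-examples.jar",
    ):
        print(to_basic_path(ref))
```

All three inputs in the usage loop reduce to the same basic path, which mirrors how the fully qualified, default-file-system, and absolute forms refer to the same file.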

For more information, see Use the Azure Data Lake Storage Gen2 URI.

Next steps