Migrate on-premises Apache Hadoop clusters to Azure HDInsight

This article gives recommendations for data storage in Azure HDInsight systems. It's part of a series that provides best practices to assist with migrating on-premises Apache Hadoop systems to Azure HDInsight.

Choose the right storage system for HDInsight clusters

The on-premises Apache Hadoop Distributed File System (HDFS) directory structure can be re-created in Azure Storage or Azure Data Lake Storage. You can then safely delete HDInsight clusters that are used for computation without losing user data. Both services can be used as either the default file system or an additional file system for an HDInsight cluster. The HDInsight cluster and the storage account must be hosted in the same region.

Azure Storage

HDInsight clusters can use a blob container in Azure Storage as either the default file system or an additional file system. Standard tier storage accounts are supported for use with HDInsight clusters. The Premium tier is not supported. The default blob container stores cluster-specific information such as job history and logs. Sharing one blob container as the default file system for multiple clusters is not supported.

The storage accounts that are defined during cluster creation and their respective keys are stored in %HADOOP_HOME%/conf/core-site.xml on the cluster nodes. They can also be accessed under the "Custom core site" section of the HDFS configuration in the Ambari UI. The storage account key is encrypted by default, and a custom decryption script decrypts the key before it is passed on to the Hadoop daemons. Jobs, including Hive, MapReduce, Hadoop streaming, and Pig, carry a description of storage accounts and metadata with them.

Azure Storage can be geo-replicated. Although geo-replication provides geographic recovery and data redundancy, a failover to the geo-replicated location severely impacts performance and may incur additional costs. Choose geo-replication only if the value of the data is worth the additional cost.

One of the following formats can be used to access data that is stored in Azure Storage:

Data access format | Description
wasb:/// | Access default storage using unencrypted communication.
wasbs:/// | Access default storage using encrypted communication.
wasb://<container-name>@<account-name>.blob.core.chinacloudapi.cn/ | Used when communicating with a non-default storage account.
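
As a minimal sketch of how the formats in the table above fit together, the following Python helper builds a WASB URI from its parts. The function name wasb_uri and its parameters are hypothetical and exist only for this illustration; they are not part of HDInsight or Hadoop. The sketch also assumes that the encrypted wasbs scheme can be combined with the fully qualified form, which the table lists only for wasb.

```python
def wasb_uri(path="", container=None, account=None, secure=True,
             endpoint_suffix="blob.core.chinacloudapi.cn"):
    """Build a WASB URI (hypothetical helper, for illustration only).

    With no container/account, the URI targets the cluster's default
    storage. With both, it targets an explicitly named storage account.
    """
    # wasbs uses encrypted communication, wasb does not.
    scheme = "wasbs" if secure else "wasb"
    relative = path.lstrip("/")
    if container is None and account is None:
        # Default storage: the authority part of the URI stays empty.
        return f"{scheme}:///{relative}"
    if not container or not account:
        raise ValueError("container and account must be given together")
    return f"{scheme}://{container}@{account}.{endpoint_suffix}/{relative}"


if __name__ == "__main__":
    # Placeholder names; substitute your own container and account.
    print(wasb_uri("example/data/sample.log"))
    print(wasb_uri("logs/", container="mycontainer",
                   account="myaccount", secure=False))
```

Defaulting to the encrypted scheme is a design choice of this sketch, so that unencrypted access has to be requested explicitly.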

The article "Scalability targets for standard storage accounts" lists the current limits on Azure Storage accounts. If the needs of the application exceed the scalability targets of a single storage account, the application can be built to use multiple storage accounts and then partition data objects across those storage accounts.
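
One possible way to partition data objects across several storage accounts is to derive the target account from a stable hash of the object name. The sketch below illustrates that idea only; the function pick_account and the account names are hypothetical, and a real application might instead partition by date, tenant, or data set.

```python
import hashlib


def pick_account(object_name, accounts):
    """Map an object name to one of several storage accounts.

    Hypothetical helper: uses a stable hash so the same object name
    always maps to the same account across runs and machines.
    """
    if not accounts:
        raise ValueError("at least one storage account is required")
    digest = hashlib.sha256(object_name.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer and reduce it to an index.
    index = int.from_bytes(digest[:8], "big") % len(accounts)
    return accounts[index]


if __name__ == "__main__":
    # Placeholder account names, not real resources.
    accounts = ["storageacct0", "storageacct1", "storageacct2"]
    for name in ["logs/2019/01/01.log", "logs/2019/01/02.log"]:
        print(name, "->", pick_account(name, accounts))
```

Note that with this simple modulo scheme, changing the number of accounts remaps most objects, so the account list should be fixed up front or replaced with a consistent-hashing scheme.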

Azure Storage Analytics provides metrics for all storage services, and the Azure portal can be configured to collect these metrics and visualize them through charts. Alerts can be created to notify you when thresholds have been reached for storage resource metrics.

Azure Storage offers soft delete for blob objects to help recover data when it is accidentally modified or deleted by an application or another storage account user.

You can create blob snapshots. A snapshot is a read-only version of a blob that's taken at a point in time, and it provides a way to back up a blob. Once a snapshot has been created, it can be read, copied, or deleted, but not modified.


Older versions of on-premises Hadoop distributions may not have the certificate required for "wasbs" connections. In that case, the certificate needs to be imported into the Java trust store.

The following steps can be used to import the certificate into the Java trust store:

Download the Azure Blob TLS/SSL cert to a file

echo -n | openssl s_client -connect <storage-account>.blob.core.chinacloudapi.cn:443 | sed -ne '/-BEGIN CERTIFICATE-/,/-END CERTIFICATE-/p' > Azure_Storage.cer

Import the above file to the Java trust store on all the nodes

keytool -import -trustcacerts -keystore /path/to/jre/lib/security/cacerts -storepass changeit -noprompt -alias blobtrust -file Azure_Storage.cer

Verify that the added cert is in the trust store

keytool -list -v -keystore /path/to/jre/lib/security/cacerts

For more information, see the following articles:

Azure Data Lake Storage Gen2

Azure Data Lake Storage Gen2 is the latest storage offering. It unifies the core capabilities of the first generation of Azure Data Lake Storage with a Hadoop-compatible file system endpoint directly integrated into Azure Blob Storage. This enhancement combines the scale and cost benefits of object storage with the reliability and performance typically associated only with on-premises file systems.

ADLS Gen2 is built on top of Azure Blob storage and allows you to interface with data using both file system and object storage paradigms. In Data Lake Storage Gen2, all the qualities of object storage remain, while the advantages of a file system interface optimized for analytics workloads are added.

A fundamental feature of Data Lake Storage Gen2 is the addition of a hierarchical namespace to the Blob storage service, which organizes objects/files into a hierarchy of directories for performant data access. The hierarchical structure enables operations such as renaming or deleting a directory to be single atomic metadata operations on the directory, rather than enumerating and processing all objects that share the name prefix of the directory.

In the past, cloud-based analytics had to compromise in areas of performance, management, and security. The key features of Azure Data Lake Storage (ADLS) Gen2 are as follows:

  • Hadoop compatible access: Azure Data Lake Storage Gen2 allows you to manage and access data just as you would with a Hadoop Distributed File System (HDFS). The new ABFS driver is available within all Apache Hadoop environments that are included in Azure HDInsight. This driver allows you to access data stored in Data Lake Storage Gen2.

  • A superset of POSIX permissions: The security model for Data Lake Storage Gen2 fully supports ACL and POSIX permissions, along with some extra granularity specific to Data Lake Storage Gen2. Settings may be configured through admin tools or through frameworks like Hive and Spark.

  • Cost effective: Data Lake Storage Gen2 features low-cost storage capacity and transactions. As data transitions through its complete life cycle, billing rates change to minimize costs via built-in features such as Azure Blob storage lifecycle management.

  • Works with Blob storage tools, frameworks, and apps: Data Lake Storage Gen2 continues to work with a wide array of tools, frameworks, and applications that exist today for Blob storage.

  • Optimized driver: The Azure Blob Filesystem driver (ABFS) is optimized specifically for big data analytics. The corresponding REST APIs are surfaced through the dfs endpoint, dfs.core.chinacloudapi.cn.

One of the following formats can be used to access data that is stored in ADLS Gen2:

Secure Azure Storage keys within on-premises Hadoop cluster configuration

The Azure Storage keys that are added to the Hadoop configuration files establish connectivity between on-premises HDFS and Azure Blob storage. These keys can be protected by encrypting them with the Hadoop credential provider framework. Once encrypted, they can be stored and accessed securely.

To provision the credentials:

hadoop credential create fs.azure.account.key.account.blob.core.chinacloudapi.cn -value <storage key> -provider jceks://hdfs@headnode.xx.internal.chinacloudapp.cn/path/to/jceks/file

To add the above provider path to core-site.xml or to the Ambari configuration under custom core-site:

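<property>
    <name>hadoop.security.credential.provider.path</name>
    <value>jceks://hdfs@headnode.xx.internal.chinacloudapp.cn/path/to/jceks/file</value>
    <description>Path to interrogate for protected credentials.</description>
</property>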


Instead of storing the key at the cluster level in core-site.xml, the provider path property can also be added to the distcp command line, as follows:

hadoop distcp -D hadoop.security.credential.provider.path=jceks://hdfs@headnode.xx.internal.chinacloudapp.cn/path/to/jceks /user/user1/ wasb://yourcontainer@youraccount.blob.core.chinacloudapi.cn/user1

Restrict Azure Storage data access using SAS

By default, HDInsight has full access to data in the Azure Storage accounts associated with the cluster. Shared Access Signatures (SAS) on the blob container can be used to restrict access to the data, for example to give users read-only access.

Using the SAS token created with Python

  1. Open the SASToken.py file and change the following values:

    Token property | Description
    policy_name | The name to use for the stored policy to create.
    storage_account_name | The name of your storage account.
    storage_account_key | The key for the storage account.
    storage_container_name | The container in the storage account that you want to restrict access to.
    example_file_path | The path to a file that is uploaded to the container.

  2. The SASToken.py file comes with the ContainerPermissions.READ + ContainerPermissions.LIST permissions, which can be adjusted based on the use case.

  3. Execute the script as follows: python SASToken.py

  4. When the script completes, it displays a SAS token similar to the following text: sr=c&si=policyname&sig=dOAi8CXuz5Fm15EjRUu5dHlOzYNtcK3Afp1xqxniEps%3D&sv=2014-02-14

  5. To limit access to a container with a shared access signature, add a custom entry to the core-site configuration for the cluster under Ambari > HDFS > Configs > Advanced > Custom core-site > Add property.

  6. Use the following values for the Key and Value fields:

    Key: fs.azure.sas.YOURCONTAINER.YOURACCOUNT.blob.core.chinacloudapi.cn
    Value: The SAS token returned by the Python application in step 4 above.

  7. Click the Add button to save this key and value, then click the Save button to save the configuration changes. When prompted, add a description of the change ("adding SAS storage access", for example) and then click Save.

  8. In the Ambari web UI, select HDFS from the list on the left, and then select Restart All Affected from the Service Actions drop-down list on the right. When prompted, select Confirm Restart All.

  9. Repeat this process for MapReduce2 and YARN.
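
To make steps 4 to 6 above more concrete, the following Python sketch derives the Key and Value for the custom core-site entry from a container name, an account name, and the token printed by the script. The helper sas_core_site_entry is hypothetical and is not part of SASToken.py or Ambari; the container and account names are placeholders. The check for the sr, sig, and sv fields is an assumption based on the sample token shown in step 4.

```python
from urllib.parse import parse_qs


def sas_core_site_entry(container, account, sas_token,
                        endpoint_suffix="blob.core.chinacloudapi.cn"):
    """Return the (key, value) pair for the custom core-site property.

    Hypothetical helper for illustration; it only formats strings and
    does not contact Azure or Ambari.
    """
    token = sas_token.lstrip("?")  # tolerate a leading '?' from a URL
    fields = parse_qs(token)
    # Sanity check: the sample token carries these query fields.
    missing = [name for name in ("sr", "sig", "sv") if name not in fields]
    if missing:
        raise ValueError(f"SAS token is missing fields: {missing}")
    key = f"fs.azure.sas.{container}.{account}.{endpoint_suffix}"
    return key, token


if __name__ == "__main__":
    sample = ("sr=c&si=policyname"
              "&sig=dOAi8CXuz5Fm15EjRUu5dHlOzYNtcK3Afp1xqxniEps%3D"
              "&sv=2014-02-14")
    key, value = sas_core_site_entry("mycontainer", "myaccount", sample)
    print("Key:  ", key)
    print("Value:", value)
```

The value is returned unchanged (still URL-encoded), because the token is pasted into the Value field exactly as the script printed it.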

There are three important things to remember about the use of SAS tokens in Azure:

  1. When SAS tokens are created with "READ + LIST" permissions, users who access the blob container with that SAS token can't write or delete data. Users who try a write or delete operation with that SAS token receive a message like "This request is not authorized to perform this operation".

  2. When SAS tokens are generated with READ + LIST + WRITE permissions (that is, to withhold only DELETE), commands like hadoop fs -put first write to a _COPYING_ file and then try to rename the file. This HDFS operation maps to a copy+delete for WASB. Since the DELETE permission was not provided, the "put" fails. The _COPYING_ operation is a Hadoop feature intended to provide some concurrency control. Currently, there is no way to restrict just the "DELETE" operation without affecting "WRITE" operations as well.

  3. Unfortunately, the Hadoop credential provider and decryption key provider (ShellDecryptionKeyProvider) currently do not work with SAS tokens, so a SAS token stored in the configuration cannot currently be protected from visibility.

For more information, see "Use Azure Storage Shared Access Signatures to restrict access to data in HDInsight".

Use data encryption and replication

All data written to Azure Storage is automatically encrypted using Storage Service Encryption (SSE). The data in the Azure Storage account is always replicated for high availability. When you create a storage account, you can choose one of the following replication options:

For more information, see the following articles:

Attach additional Azure Storage accounts to the cluster

During the HDInsight creation process, an Azure Storage account or Azure Data Lake Storage account is chosen as the default file system. In addition to this default storage account, additional storage accounts can be added from the same Azure subscription or from different Azure subscriptions, either during the cluster creation process or after a cluster has been created.

Additional storage accounts can be added in one of the following ways:

  • In Ambari, under HDFS > Configs > Advanced > Custom core-site, add the storage account name and key, and then restart the services.
  • Use a script action, passing the storage account name and key.


For valid use cases, the limits on Azure Storage can be increased through a request made to Azure Support.

For more information, see "Add additional storage accounts to HDInsight".

Next steps

Read the next article in this series: