Azure Storage overview in HDInsight

Azure Storage is a robust general-purpose storage solution that integrates seamlessly with HDInsight. HDInsight can use a blob container in Azure Storage as the default file system for the cluster. Through an HDFS interface, the full set of components in HDInsight can operate directly on structured or unstructured data stored as blobs.

We recommend using separate storage containers for your default cluster storage and your business data. The separation isolates the HDInsight logs and temporary files from your own business data. We also recommend deleting the default blob container, which contains application and system logs, after each use to reduce storage cost. Make sure to retrieve the logs before deleting the container.

If you choose to secure your storage account with the Firewalls and virtual networks restrictions on Selected networks, be sure to enable the exception Allow trusted Microsoft services.... This exception allows HDInsight to access your storage account.

HDInsight storage architecture

The following diagram provides an abstract view of the HDInsight storage architecture with Azure Storage:

HDInsight Storage Architecture

HDInsight provides access to the distributed file system that is locally attached to the compute nodes. This file system can be accessed by using the fully qualified URI, for example:
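A minimal sketch of the URI form for the local HDFS file system; the host and path are placeholders:

```
hdfs://<namenodehost>/<path>
```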


Through HDInsight, you can also access data in Azure Storage. The syntax is as follows:
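A sketch of the wasb URI form used for blobs in Azure Storage; the container name, account name, and path are placeholders (use `wasbs://` for TLS-encrypted access):

```
wasb://<containername>@<accountname>.blob.core.windows.net/<path>
```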


Consider the following principles when using an Azure Storage account with HDInsight clusters:

  • Containers in the storage accounts that are connected to a cluster: Because the account name and key are associated with the cluster during creation, you have full access to the blobs in those containers.

  • Public containers or public blobs in storage accounts that aren't connected to a cluster: You have read-only permission to the blobs in the containers.


    Public containers allow you to get a list of all blobs that are available in that container and to get container metadata. Public blobs allow you to access the blobs only if you know the exact URL.

  • Private containers in storage accounts that aren't connected to a cluster: You can't access the blobs in the containers unless you define the storage account when you submit the WebHCat jobs.

The storage accounts that are defined in the creation process and their keys are stored in %HADOOP_HOME%/conf/core-site.xml on the cluster nodes. By default, HDInsight uses the storage accounts defined in the core-site.xml file. You can modify this setting by using Apache Ambari.
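As a sketch, the entries in core-site.xml follow the Hadoop Azure convention of one account-key property per storage account; the account name and value here are placeholders, and HDInsight may store the key in encrypted form rather than in plain text:

```xml
<property>
  <!-- One entry per attached storage account; the key grants full access. -->
  <name>fs.azure.account.key.<accountname>.blob.core.windows.net</name>
  <value>...</value>
</property>
```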

Multiple WebHCat jobs, including Apache Hive, MapReduce, Apache Hadoop streaming, and Apache Pig, carry a description of storage accounts and metadata. (This aspect is currently true for Pig with storage accounts but not for metadata.) For more information, see Using an HDInsight cluster with alternate storage accounts and metastores.

Blobs can be used for structured and unstructured data. Blob containers store data as key/value pairs and have no directory hierarchy. However, the key name can include a slash character (/) to make it appear as if a file is stored within a directory structure. For example, a blob's key can be input/log1.txt. No actual input directory exists, but because of the slash character in the key name, the key looks like a file path.
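The flat key namespace can be illustrated with a short Python sketch (the key names are hypothetical): filtering keys on a shared prefix makes them look like files in a directory, even though no directory object exists.

```python
# Blob containers store a flat set of key/value pairs; "directories" are an
# illusion created by slash characters inside the key names.
keys = ["input/log1.txt", "input/log2.txt", "output/result.txt"]

def list_prefix(keys, prefix):
    """Emulate a directory listing by filtering keys on a prefix."""
    return [k for k in keys if k.startswith(prefix)]

# Looks like listing an "input" directory, but no such directory exists.
print(list_prefix(keys, "input/"))  # → ['input/log1.txt', 'input/log2.txt']
```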

Benefits of Azure Storage

Compute clusters and storage resources that aren't colocated have implied performance costs. These costs are mitigated by creating the compute clusters close to the storage account resources inside the Azure region, where the compute nodes can efficiently access the data in Azure Storage over the high-speed network.

When you store the data in Azure Storage instead of HDFS, you get several benefits:

  • Data reuse and sharing: The data in HDFS is located inside the compute cluster. Only the applications that have access to the compute cluster can use the data by using HDFS APIs. The data in Azure Storage, by contrast, can be accessed through either the HDFS APIs or the Blob storage REST APIs. Because of this arrangement, a larger set of applications (including other HDInsight clusters) and tools can be used to produce and consume the data.

  • Data archiving: When data is stored in Azure Storage, the HDInsight clusters used for computation can be safely deleted without losing user data.

  • Data storage cost: Storing data in DFS for the long term is more costly than storing it in Azure Storage, because the cost of a compute cluster is higher than the cost of Azure Storage. Also, because the data doesn't have to be reloaded for every compute cluster generation, you save on data-loading costs as well.

  • Elastic scale-out: Although HDFS provides you with a scaled-out file system, the scale is determined by the number of nodes that you create for your cluster. Changing the scale can be more complicated than the elastic scaling capabilities that you get automatically in Azure Storage.

  • Geo-replication: Your Azure Storage can be geo-replicated. Although geo-replication gives you geographic recovery and data redundancy, a failover to the geo-replicated location severely affects your performance, and it might incur additional costs. So choose geo-replication cautiously, and only if the value of the data justifies the additional cost.

Certain MapReduce jobs and packages might create intermediate results that you wouldn't want to store in Azure Storage. In that case, you can choose to store the data in the local HDFS. HDInsight uses DFS for several of these intermediate results in Hive jobs and other processes.


Most HDFS commands (for example, ls, copyFromLocal, and mkdir) work as expected in Azure Storage. Only the commands that are specific to the native HDFS implementation (which is referred to as DFS), such as fsck and dfsadmin, show different behavior in Azure Storage.
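As a sketch, assuming an HDInsight cluster whose default file system is a wasb-backed container (the container name, account name, and paths below are placeholders), the familiar file system shell commands work unchanged:

```shell
# Standard HDFS commands operate on the wasb-backed default file system.
hdfs dfs -mkdir -p /example/data
hdfs dfs -copyFromLocal log1.txt /example/data/
hdfs dfs -ls /example/data

# Fully qualified wasb URIs address the same blobs explicitly.
hdfs dfs -ls wasb://<containername>@<accountname>.blob.core.windows.net/example/data
```

These commands must be run on a cluster node (or a client configured for the cluster), because the wasb file system and the account credentials are resolved from core-site.xml.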

Next steps