Use Azure Storage with Azure HDInsight clusters

To analyze data in an HDInsight cluster, you can store the data in Azure Storage, Azure Data Lake Storage Gen 2, or both. Both storage options let you safely delete HDInsight clusters that are used for computation without losing user data.

Apache Hadoop supports a notion of the default file system. The default file system implies a default scheme and authority. It can also be used to resolve relative paths. During the HDInsight cluster creation process, you can specify a blob container in Azure Storage as the default file system. With HDInsight 3.6, you can select either Azure Storage or Azure Data Lake Storage Gen 2 as the default file system, with a few exceptions.

In this article, you learn how Azure Storage works with HDInsight clusters. For more information about creating an HDInsight cluster, see Create Apache Hadoop clusters in HDInsight.

Important

Storage account kind BlobStorage can only be used as secondary storage for HDInsight clusters.

Storage account kind           | Supported services | Supported performance tiers | Supported access tiers
StorageV2 (general-purpose v2) | Blob               | Standard                    | Hot, Cool, Archive*
Storage (general-purpose v1)   | Blob               | Standard                    | N/A
BlobStorage                    | Blob               | Standard                    | Hot, Cool, Archive*

We don't recommend using the default blob container to store business data. It is good practice to delete the default blob container after each use to reduce storage cost. The default container contains application and system logs, so make sure to retrieve the logs before deleting the container.

Sharing one blob container as the default file system for multiple clusters isn't supported.

Note

The Archive access tier is an offline tier that has a retrieval latency of several hours and isn't recommended for use with HDInsight. For more information, see Archive access tier.

Access files from within the cluster

There are several ways to access the files in Azure Storage from an HDInsight cluster. The URI scheme provides unencrypted access (with the wasb: prefix) and SSL-encrypted access (with the wasbs: prefix). We recommend using wasbs wherever possible, even when accessing data that lives inside the same Azure region.

  • Using the fully qualified name. With this approach, you provide the full path to the file that you want to access.

    wasb://<containername>@<accountname>.blob.core.chinacloudapi.cn/<file.path>/
    wasbs://<containername>@<accountname>.blob.core.chinacloudapi.cn/<file.path>/
    
  • Using the shortened path format. With this approach, you replace the path up to the cluster root with:

    wasb:///<file.path>/
    wasbs:///<file.path>/
    
  • Using the relative path. With this approach, you provide only the relative path to the file that you want to access.

    /<file.path>/
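
The three path forms above can be sketched as a small helper. This is an illustration only: the function name wasb_paths and its parameters are hypothetical, and the endpoint suffix is copied from the examples in this article. The shortened and relative forms only make sense for files in the cluster's default container.

```python
# Hypothetical helper (not part of any HDInsight SDK): builds the three
# path forms described above for one file path.
# Assumption: the endpoint suffix matches the examples in this article.
ENDPOINT_SUFFIX = "blob.core.chinacloudapi.cn"


def wasb_paths(container, account, file_path, secure=True):
    """Return (fully qualified, shortened, relative) forms of file_path."""
    scheme = "wasbs" if secure else "wasb"  # wasbs is the encrypted variant
    path = file_path.strip("/")             # normalize leading/trailing slashes
    fully_qualified = f"{scheme}://{container}@{account}.{ENDPOINT_SUFFIX}/{path}"
    shortened = f"{scheme}:///{path}"       # resolved against the default container
    relative = f"/{path}"                   # resolved against the default file system
    return fully_qualified, shortened, relative


if __name__ == "__main__":
    for form in wasb_paths("CONTAINERNAME", "STORAGEACCOUNT", "example/data"):
        print(form)
```

Passing secure=False yields the unencrypted wasb: forms, which the recommendation above suggests avoiding where possible.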
    

Data access examples

The examples are based on an ssh connection to the head node of the cluster. The examples use all three URI schemes. Replace CONTAINERNAME and STORAGEACCOUNT with the relevant values.

A few hdfs commands

  1. Create a simple file on local storage.

    touch testFile.txt
    
  2. Create directories on cluster storage.

    hdfs dfs -mkdir wasbs://CONTAINERNAME@STORAGEACCOUNT.blob.core.chinacloudapi.cn/sampledata1/
    hdfs dfs -mkdir wasbs:///sampledata2/
    hdfs dfs -mkdir /sampledata3/
    
  3. Copy data from local storage to cluster storage.

    hdfs dfs -copyFromLocal testFile.txt  wasbs://CONTAINERNAME@STORAGEACCOUNT.blob.core.chinacloudapi.cn/sampledata1/
    hdfs dfs -copyFromLocal testFile.txt  wasbs:///sampledata2/
    hdfs dfs -copyFromLocal testFile.txt  /sampledata3/
    
  4. List directory contents on cluster storage.

    hdfs dfs -ls wasbs://CONTAINERNAME@STORAGEACCOUNT.blob.core.chinacloudapi.cn/sampledata1/
    hdfs dfs -ls wasbs:///sampledata2/
    hdfs dfs -ls /sampledata3/
    

Note

When working with blobs outside of HDInsight, most utilities do not recognize the WASB format and instead expect a basic path format, such as example/jars/hadoop-mapreduce-examples.jar.
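
As a rough sketch of how a fully qualified WASB URI maps to the basic path that such utilities expect, the following splits the URI into its container, account host, and blob path. The function name split_wasb_uri is hypothetical; only the standard-library urllib.parse.urlsplit is used.

```python
from urllib.parse import urlsplit


def split_wasb_uri(uri):
    """Split a fully qualified wasb/wasbs URI into (container, account host, blob path)."""
    parts = urlsplit(uri)
    if parts.scheme not in ("wasb", "wasbs"):
        raise ValueError(f"not a WASB URI: {uri}")
    # The authority has the form <containername>@<accountname>.<endpoint>.
    container, sep, host = parts.netloc.partition("@")
    if not sep or not container or not host:
        # Shortened (wasbs:///...) forms carry no container or account.
        raise ValueError("URI does not name a container and storage account")
    # The basic path format has no leading slash.
    return container, host, parts.path.lstrip("/")


if __name__ == "__main__":
    print(split_wasb_uri(
        "wasbs://CONTAINERNAME@STORAGEACCOUNT.blob.core.chinacloudapi.cn"
        "/example/jars/hadoop-mapreduce-examples.jar"
    ))
```

The third element of the result is the blob path to hand to tools that work outside HDInsight; the first two identify where that blob lives.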

Creating a Hive table

Three file locations are shown for illustrative purposes. For actual execution, use only one of the LOCATION entries.

-- IF EXISTS avoids an error when the table has not been created yet.
DROP TABLE IF EXISTS myTable;
CREATE EXTERNAL TABLE myTable (
    t1 string,
    t2 string,
    t3 string,
    t4 string,
    t5 string,
    t6 string,
    t7 string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
STORED AS TEXTFILE
LOCATION 'wasbs://CONTAINERNAME@STORAGEACCOUNT.blob.core.chinacloudapi.cn/example/data/';
-- Alternative locations; a statement accepts only one LOCATION clause:
-- LOCATION 'wasbs:///example/data/';
-- LOCATION '/example/data/';

Access files from outside the cluster

Microsoft provides the following tools to work with Azure Storage:

Tool | Linux | OS X | Windows
Azure portal
Azure CLI
Azure PowerShell
AzCopy

Identify storage path from Ambari

  • To identify the complete path to the configured default store, navigate to:

    HDFS > Configs and enter fs.defaultFS in the filter input box.

  • To check whether a wasb store is configured as secondary storage, navigate to:

    HDFS > Configs and enter blob.core.chinacloudapi.cn in the filter input box.

To obtain the path using the Ambari REST API, see Get the default storage.

Blob containers

To use blobs, you first create an Azure Storage account. As part of this step, you specify the Azure region where the storage account is created. The cluster and the storage account must be hosted in the same region. The Hive metastore SQL Server database and the Apache Oozie metastore SQL Server database must also be located in the same region.

Wherever it lives, each blob you create belongs to a container in your Azure Storage account. This container may be an existing container that was created outside of HDInsight, or it may be a container that is created for an HDInsight cluster.

The default blob container stores cluster-specific information such as job history and logs. Don't share a default blob container among multiple HDInsight clusters, because doing so might corrupt job history. We recommend using a different container for each cluster and putting shared data on a linked storage account that is specified in the deployment of all relevant clusters, rather than on the default storage account. For more information on configuring linked storage accounts, see Create HDInsight clusters. You can, however, reuse a default storage container after the original HDInsight cluster has been deleted. For HBase clusters, you can keep the HBase table schema and data by creating a new HBase cluster that uses the default blob container of a deleted HBase cluster.
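
One way to guard against accidentally sharing a default container is to check a deployment inventory before creating clusters. The sketch below assumes a hypothetical inventory: a dictionary that maps each cluster name to the (storage account, container) pair planned as its default storage. Neither the inventory format nor the function name comes from HDInsight itself.

```python
from collections import defaultdict


def shared_default_containers(clusters):
    """Return the default containers that more than one cluster would use.

    clusters: dict mapping cluster name -> (storage account, container).
    Result: dict mapping (account, container) -> sorted list of cluster names.
    """
    users = defaultdict(list)
    for name, (account, container) in clusters.items():
        # Compare case-insensitively so differently cased entries still match.
        users[(account.lower(), container.lower())].append(name)
    return {key: sorted(names) for key, names in users.items() if len(names) > 1}


if __name__ == "__main__":
    # Hypothetical inventory for illustration.
    planned = {
        "hbase-a": ("acct1", "data"),
        "spark-b": ("acct1", "data"),   # conflicts with hbase-a
        "hive-c": ("acct1", "logs"),
    }
    print(shared_default_containers(planned))
```

An empty result means every cluster in the inventory has its own default container.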

Note

The secure transfer required feature forces all requests to your account to go through a secure connection. Only HDInsight cluster version 3.6 or newer supports this feature. For more information, see Create Apache Hadoop cluster with secure transfer storage accounts in Azure HDInsight.

Use additional storage accounts

While creating an HDInsight cluster, you specify the Azure Storage account you want to associate with it. In addition to this storage account, you can add storage accounts from the same Azure subscription or from different Azure subscriptions, either during the creation process or after the cluster has been created. For instructions about adding storage accounts, see Create HDInsight clusters.

Warning

Using an additional storage account in a different location than the HDInsight cluster is not supported.

Next steps

In this article, you learned how to use HDFS-compatible Azure Storage with HDInsight. This allows you to build scalable, long-term, archiving data acquisition solutions and use HDInsight to unlock the information inside the stored structured and unstructured data.

For more information, see: