Use Azure Data Lake Storage Gen2 with Azure HDInsight clusters

Azure Data Lake Storage Gen2 is a cloud storage service dedicated to big data analytics, built on Azure Blob storage. Data Lake Storage Gen2 combines file system semantics, directory-level and file-level security, and scalability with the low-cost, tiered storage, high-availability, and disaster-recovery capabilities of Azure Blob storage.

Warning

Billing for HDInsight clusters is prorated per minute, whether you use them or not. Be sure to delete your cluster after you finish using it. See How to delete an HDInsight cluster.

Data Lake Storage Gen2 availability

Data Lake Storage Gen2 is available as a storage option for almost all Azure HDInsight cluster types, as both a default and an additional storage account. HBase, however, can have only one account with Data Lake Storage Gen2.

Create HDInsight clusters using Data Lake Storage Gen2

Use the following links for detailed instructions on how to create HDInsight clusters with access to Data Lake Storage Gen2.

Access control for Data Lake Storage Gen2 in HDInsight

What kinds of permissions does Data Lake Storage Gen2 support?

Data Lake Storage Gen2 uses an access control model that supports both role-based access control (RBAC) and POSIX-like access control lists (ACLs).

RBAC uses role assignments to apply sets of permissions to users, groups, and service principals for Azure resources. Typically, those Azure resources are constrained to top-level resources (for example, Azure Blob storage accounts). For Azure Blob storage, and also for Data Lake Storage Gen2, this mechanism has been extended to the file system resource.

For more information about assigning file permissions with ACLs, see Access control lists on files and directories.

How do I control access to my data in Data Lake Storage Gen2?

Your HDInsight cluster's ability to access files in Data Lake Storage Gen2 is controlled through managed identities. A managed identity is an identity registered in Azure Active Directory (Azure AD) whose credentials are managed by Azure. With managed identities, you don't need to register service principals in Azure AD or maintain credentials such as certificates.

Azure services have two types of managed identities: system-assigned and user-assigned. HDInsight uses user-assigned managed identities to access Data Lake Storage Gen2. A user-assigned managed identity is created as a standalone Azure resource. During the creation process, Azure creates an identity in the Azure AD tenant that's trusted by the subscription in use. After the identity is created, it can be assigned to one or more Azure service instances.

The lifecycle of a user-assigned identity is managed separately from the lifecycle of the Azure service instances to which it's assigned. For more information about managed identities, see What are managed identities for Azure resources?
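As a sketch of this setup with the Azure CLI (the resource group, identity name, and scope below are hypothetical placeholders; substitute your own), a user-assigned managed identity is created and then granted a data-access role on the storage account:

```shell
# Hypothetical resource names; substitute your own.
# Create a user-assigned managed identity as a standalone resource.
az identity create --resource-group myResourceGroup --name myHdiIdentity

# Look up the identity's service principal object ID.
principalId=$(az identity show --resource-group myResourceGroup \
    --name myHdiIdentity --query principalId --output tsv)

# Grant the identity the Storage Blob Data Owner role on the storage
# account so the HDInsight cluster can read and write Gen2 data.
az role assignment create --assignee-object-id "$principalId" \
    --role "Storage Blob Data Owner" \
    --scope "/subscriptions/<subscription-id>/resourceGroups/myResourceGroup/providers/Microsoft.Storage/storageAccounts/STORAGEACCOUNT"
```

During cluster creation, this identity is then selected so the cluster can authenticate to the storage account without certificates or stored secrets.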

How do I set permissions for Azure AD users to query data in Data Lake Storage Gen2 by using Hive or other services?

To set permissions for users to query data, use Azure AD security groups as the assigned principal in ACLs. Don't directly assign file-access permissions to individual users or service principals. When Azure AD security groups control the flow of permissions, you can add and remove users or service principals without reapplying ACLs to an entire directory structure; you only have to add or remove them from the appropriate Azure AD security group. Keep in mind that ACLs aren't inherited, so reapplying ACLs requires updating the ACL on every file and subdirectory.
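For example, from the cluster you can add a security group to a directory's ACL with the standard HDFS ACL commands (the group object ID and directory below are placeholders for illustration):

```shell
# Placeholder object ID of an Azure AD security group; substitute your own.
GROUP_OID="00000000-0000-0000-0000-000000000000"

# Give the group read/execute access to a directory on cluster storage.
hdfs dfs -setfacl -m "group:${GROUP_OID}:r-x" /example/data

# Also set a default entry so files and directories created later under
# this directory receive the same group permission (existing ACL entries
# aren't inherited or applied retroactively).
hdfs dfs -setfacl -m "default:group:${GROUP_OID}:r-x" /example/data

# Verify the resulting ACL.
hdfs dfs -getfacl /example/data
```

After this, granting or revoking a user's access is a group-membership change in Azure AD, with no further ACL updates on the files themselves.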

Access files from the cluster

There are several ways you can access the files in Data Lake Storage Gen2 from an HDInsight cluster.

  • Using the fully qualified name. With this approach, you provide the full path to the file that you want to access.

    abfs://<containername>@<accountname>.dfs.core.chinacloudapi.cn/<file.path>/
    
  • Using the shortened path format. With this approach, you replace the path up to the cluster root with:

    abfs:///<file.path>/
    
  • Using the relative path. With this approach, you provide only the relative path to the file that you want to access.

    /<file.path>/
    
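All three forms resolve to the same file when the container is the cluster's default file system. As a quick sketch (the container, account, and path below are placeholder values), the fully qualified URI is assembled from its parts like this:

```shell
# Placeholder values; substitute your container, account, and file path.
CONTAINERNAME="mycontainer"
STORAGEACCOUNT="mystorageaccount"
FILEPATH="example/data/file.txt"

# Fully qualified name:
FULL="abfs://${CONTAINERNAME}@${STORAGEACCOUNT}.dfs.core.chinacloudapi.cn/${FILEPATH}"
# Shortened path format (resolved against the cluster's default file system):
SHORT="abfs:///${FILEPATH}"
# Relative path:
REL="/${FILEPATH}"

echo "$FULL"
echo "$SHORT"
echo "$REL"
```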

Data access examples

The following examples are based on an ssh connection to the head node of the cluster, and they use all three URI schemes. Replace CONTAINERNAME and STORAGEACCOUNT with the relevant values.

A few hdfs commands

  1. Create a file on local storage.

    touch testFile.txt
    
  2. Create directories on cluster storage.

    hdfs dfs -mkdir abfs://CONTAINERNAME@STORAGEACCOUNT.dfs.core.chinacloudapi.cn/sampledata1/
    hdfs dfs -mkdir abfs:///sampledata2/
    hdfs dfs -mkdir /sampledata3/
    
  3. Copy data from local storage to cluster storage.

    hdfs dfs -copyFromLocal testFile.txt  abfs://CONTAINERNAME@STORAGEACCOUNT.dfs.core.chinacloudapi.cn/sampledata1/
    hdfs dfs -copyFromLocal testFile.txt  abfs:///sampledata2/
    hdfs dfs -copyFromLocal testFile.txt  /sampledata3/
    
  4. List directory contents on cluster storage.

    hdfs dfs -ls abfs://CONTAINERNAME@STORAGEACCOUNT.dfs.core.chinacloudapi.cn/sampledata1/
    hdfs dfs -ls abfs:///sampledata2/
    hdfs dfs -ls /sampledata3/
    

Creating a Hive table

Three file locations are shown for illustrative purposes. For actual execution, use only one of the LOCATION entries.

DROP TABLE myTable;
CREATE EXTERNAL TABLE myTable (
    t1 string,
    t2 string,
    t3 string,
    t4 string,
    t5 string,
    t6 string,
    t7 string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
STORED AS TEXTFILE
LOCATION 'abfs://CONTAINERNAME@STORAGEACCOUNT.dfs.core.chinacloudapi.cn/example/data/';
LOCATION 'abfs:///example/data/';
LOCATION '/example/data/';
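Once the table exists, it can be queried from the same head-node session; a minimal sketch using Beeline with HDInsight's default HiveServer2 connection (adjust the URL for your cluster):

```shell
# Connect to HiveServer2 on the head node over HTTP transport and run a
# sample query against the external table defined above.
beeline -u 'jdbc:hive2://headnodehost:10001/;transportMode=http' \
    -e 'SELECT t1, t4 FROM myTable LIMIT 10;'
```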

Next steps