Use Azure Data Lake Storage Gen2 with Azure HDInsight clusters

Azure Data Lake Storage Gen2 is a cloud storage service dedicated to big data analytics, built on Azure Blob storage. It adds file system semantics, directory-level and file-level security, and adaptability to the low-cost, tiered storage, high availability, and disaster-recovery capabilities of Azure Blob storage.

Data Lake Storage Gen2 availability

Data Lake Storage Gen2 is available as a storage option for almost all Azure HDInsight cluster types, as both a default and an additional storage account. HBase, however, can have only one Data Lake Storage Gen2 account.

Create a cluster with Data Lake Storage Gen2 through the Azure portal

To create an HDInsight cluster that uses Data Lake Storage Gen2 for storage, follow these steps to configure a Data Lake Storage Gen2 account.

Create a user-assigned managed identity

Create a user-assigned managed identity, if you don't already have one.

  1. Sign in to the Azure portal.
  2. In the upper-left corner, click Create a resource.
  3. In the search box, type user assigned and click User Assigned Managed Identity.
  4. Click Create.
  5. Enter a name for your managed identity, and select the correct subscription, resource group, and location.
  6. Click Create.

For more information on how managed identities work in Azure HDInsight, see Managed identities in Azure HDInsight.

[Screenshot: creating a user-assigned managed identity]

Create a Data Lake Storage Gen2 account

Create an Azure Data Lake Storage Gen2 storage account.

  1. Sign in to the Azure portal.
  2. In the upper-left corner, click Create a resource.
  3. In the search box, type storage and click Storage account.
  4. Click Create.
  5. On the Create storage account screen:
    1. Select the correct subscription and resource group.
    2. Enter a name for your Data Lake Storage Gen2 account.
    3. Click the Advanced tab.
    4. Click Enabled next to Hierarchical namespace under Data Lake Storage Gen2.
    5. Click Review + create.
    6. Click Create.

For more information on other options during storage account creation, see Quickstart: Create an Azure Data Lake Storage Gen2 storage account.

[Screenshot: storage account creation in the Azure portal]

Set up permissions for the managed identity on the Data Lake Storage Gen2 account

Assign the managed identity to the Storage Blob Data Owner role on the storage account.

  1. In the Azure portal, go to your storage account.

  2. Select your storage account, then select Access control (IAM) to display the access control settings for the account. Select the Role assignments tab to see the list of role assignments.

    [Screenshot: storage access control settings]

  3. Select the + Add role assignment button to add a new role.

  4. In the Add role assignment window, select the Storage Blob Data Owner role. Then select the subscription that contains the managed identity and the storage account. Next, search for the user-assigned managed identity that you created previously. Finally, select the managed identity; it is then listed under Selected members.

    [Screenshot: assigning an RBAC role]

  5. Select Save. The user-assigned identity that you selected is now listed under the selected role.

  6. After this initial setup is complete, you can create a cluster through the portal. The cluster must be in the same Azure region as the storage account. In the Storage tab of the cluster creation menu, select the following options:

    • For Primary storage type, select Azure Data Lake Storage Gen2.

    • Under Primary Storage account, search for and select the newly created Data Lake Storage Gen2 storage account.

    • Under Identity, select the newly created user-assigned managed identity.

      [Screenshot: storage settings for using Data Lake Storage Gen2 with Azure HDInsight]

    Note

    • To add a secondary Data Lake Storage Gen2 account, assign the managed identity created earlier, at the storage account level, to the new Data Lake Storage Gen2 storage account that you want to add. Adding a secondary Data Lake Storage Gen2 account through the "Additional storage accounts" blade on HDInsight isn't supported.
    • You can enable RA-GRS or RA-ZRS on the Azure storage account that HDInsight uses. However, creating a cluster against the RA-GRS or RA-ZRS secondary endpoint isn't supported.
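
The role assignment in steps 1 through 5 can also be scripted. The following is a minimal sketch, not part of the original procedure: the function name assign_blob_data_owner is hypothetical, and it assumes you have already run az login and that your account is allowed to create role assignments. It looks up the identity's principal ID and the storage account's resource ID, then assigns the Storage Blob Data Owner role at storage-account scope.

```shell
# Hypothetical helper (name chosen for illustration only).
# Arguments: resource group, managed identity name, storage account name.
assign_blob_data_owner() {
    rg="$1"; identity="$2"; account="$3"

    # Object (principal) ID of the user-assigned managed identity.
    principal_id=$(az identity show --resource-group "$rg" --name "$identity" \
        --query principalId --output tsv) || return 1

    # Full resource ID of the storage account, used as the assignment scope.
    scope=$(az storage account show --resource-group "$rg" --name "$account" \
        --query id --output tsv) || return 1

    # Assign the role to the identity at storage-account scope.
    az role assignment create \
        --assignee-object-id "$principal_id" \
        --assignee-principal-type ServicePrincipal \
        --role "Storage Blob Data Owner" \
        --scope "$scope"
}

# Example call (placeholders must be replaced with real names):
# assign_blob_data_owner <RESOURCEGROUPNAME> <MANAGEDIDENTITYNAME> <STORAGEACCOUNTNAME>
```

A new role assignment can take a few minutes to propagate, so it may be worth confirming it on the Role assignments tab before creating the cluster.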

Create a cluster with Data Lake Storage Gen2 through the Azure CLI

You can download a sample template file and a sample parameters file. Before using the template and the Azure CLI code snippet below, replace the following placeholders with their correct values:

Placeholder             Description
<SUBSCRIPTION_ID>       The ID of your Azure subscription.
<RESOURCEGROUPNAME>     The resource group where you want the new cluster and storage account created.
<MANAGEDIDENTITYNAME>   The name of the managed identity that will be given permissions on your Azure Data Lake Storage Gen2 account.
<STORAGEACCOUNTNAME>    The new Azure Data Lake Storage Gen2 account that will be created.
<FILESYSTEMNAME>        The name of the file system that this cluster should use in the storage account.
<CLUSTERNAME>           The name of your HDInsight cluster.
<PASSWORD>              Your chosen password for signing in to the cluster using SSH and the Ambari dashboard.

The code snippet below performs the following initial steps:

  1. Logs in to your Azure account.
  2. Sets the active subscription where the create operations will be done.
  3. Creates a new resource group for the new deployment activities.
  4. Creates a user-assigned managed identity.
  5. Adds an extension to the Azure CLI to use features for Data Lake Storage Gen2.
  6. Creates a new Data Lake Storage Gen2 account by using the --hierarchical-namespace true flag.

az login
az account set --subscription <SUBSCRIPTION_ID>

# Create resource group
az group create --name <RESOURCEGROUPNAME> --location chinaeast

# Create managed identity
az identity create -g <RESOURCEGROUPNAME> -n <MANAGEDIDENTITYNAME>

az extension add --name storage-preview

az storage account create --name <STORAGEACCOUNTNAME> \
    --resource-group <RESOURCEGROUPNAME> \
    --location chinaeast --sku Standard_LRS \
    --kind StorageV2 --hierarchical-namespace true

Next, sign in to the portal. Add the new user-assigned managed identity to the Storage Blob Data Contributor role on the storage account. This step is described under "Set up permissions for the managed identity on the Data Lake Storage Gen2 account" in the Azure portal section.

Important

Ensure that your storage account has the user-assigned identity with Storage Blob Data Contributor role permissions; otherwise, cluster creation will fail.

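# Deploy the cluster from the template and parameters files.
# Newer Azure CLI versions expose this command as: az deployment group create
az group deployment create --name HDInsightADLSGen2Deployment \
    --resource-group <RESOURCEGROUPNAME> \
    --template-file hdinsight-adls-gen2-template.json \
    --parameters parameters.json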

Create a cluster with Data Lake Storage Gen2 through Azure PowerShell

Using PowerShell to create an HDInsight cluster with Azure Data Lake Storage Gen2 is not currently supported.

Access control for Data Lake Storage Gen2 in HDInsight

What kinds of permissions does Data Lake Storage Gen2 support?

Data Lake Storage Gen2 uses an access control model that supports both role-based access control (RBAC) and POSIX-like access control lists (ACLs). Data Lake Storage Gen1 supports only access control lists for controlling access to data.

RBAC uses role assignments to apply sets of permissions to users, groups, and service principals for Azure resources. Typically, those Azure resources are constrained to top-level resources (for example, Azure storage accounts). For Azure Storage, and also Data Lake Storage Gen2, this mechanism has been extended to the file system resource.

For more information about file permissions with RBAC, see Azure role-based access control (RBAC).

For more information about file permissions with ACLs, see Access control lists on files and directories.

How do I control access to my data in Data Lake Storage Gen2?

Your HDInsight cluster's ability to access files in Data Lake Storage Gen2 is controlled through managed identities. A managed identity is an identity registered in Azure Active Directory (Azure AD) whose credentials are managed by Azure. With managed identities, you don't need to register service principals in Azure AD or maintain credentials such as certificates.

Azure services have two types of managed identities: system-assigned and user-assigned. HDInsight uses user-assigned managed identities to access Data Lake Storage Gen2. A user-assigned managed identity is created as a standalone Azure resource. During creation, Azure creates an identity in the Azure AD tenant that's trusted by the subscription in use. After the identity is created, it can be assigned to one or more Azure service instances.

The lifecycle of a user-assigned identity is managed separately from the lifecycle of the Azure service instances to which it's assigned. For more information about managed identities, see What are managed identities for Azure resources?.

How do I set permissions for Azure AD users to query data in Data Lake Storage Gen2 by using Hive or other services?

To set permissions for users to query data, use Azure AD security groups as the assigned principal in ACLs. Don't directly assign file-access permissions to individual users or service principals. When Azure AD security groups control the flow of permissions, you can add and remove users or service principals without reapplying ACLs to an entire directory structure. You only have to add or remove the users from the appropriate Azure AD security group. ACLs aren't inherited, so reapplying ACLs requires updating the ACL on every file and subdirectory.

Access files from the cluster

There are several ways you can access the files in Data Lake Storage Gen2 from an HDInsight cluster.

  • Using the fully qualified name. With this approach, you provide the full path to the file that you want to access.

    abfs://<containername>@<accountname>.dfs.core.chinacloudapi.cn/<file.path>/

  • Using the shortened path format. With this approach, you replace the path up to the cluster root with:

    abfs:///<file.path>/

  • Using the relative path. With this approach, you provide only the relative path to the file that you want to access.

    /<file.path>/
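
To make the relationship between the three forms concrete, here is a minimal shell sketch that builds each one from the same inputs. The helper names (abfs_full_uri, abfs_short_uri, abfs_relative_path) and the sample container and account names are hypothetical; the endpoint suffix is the one used in this article. The shortened and relative forms contain no container or account because they resolve against the cluster's default storage.

```shell
# Endpoint suffix as used in this article.
ABFS_SUFFIX="dfs.core.chinacloudapi.cn"

# Fully qualified form: names the container and storage account explicitly.
abfs_full_uri() {
    container="$1"; account="$2"; path="${3#/}"   # drop one leading slash, if any
    printf 'abfs://%s@%s.%s/%s\n' "$container" "$account" "$ABFS_SUFFIX" "$path"
}

# Shortened form: the cluster's default container and account are implied.
abfs_short_uri() {
    path="${1#/}"
    printf 'abfs:///%s\n' "$path"
}

# Relative form: just the path under the cluster root.
abfs_relative_path() {
    path="${1#/}"
    printf '/%s\n' "$path"
}

# Hypothetical sample values.
abfs_full_uri mycontainer myaccount example/data/
abfs_short_uri example/data/
abfs_relative_path example/data/
```

Only the fully qualified form can address a container or account other than the cluster's default one.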
    

Data access examples

The examples assume an ssh connection to the head node of the cluster and use all three URI schemes. Replace CONTAINERNAME and STORAGEACCOUNT with the relevant values.

A few hdfs commands

  1. Create a file on local storage.

    touch testFile.txt

  2. Create directories on cluster storage.

    hdfs dfs -mkdir abfs://CONTAINERNAME@STORAGEACCOUNT.dfs.core.chinacloudapi.cn/sampledata1/
    hdfs dfs -mkdir abfs:///sampledata2/
    hdfs dfs -mkdir /sampledata3/

  3. Copy data from local storage to cluster storage.

    hdfs dfs -copyFromLocal testFile.txt abfs://CONTAINERNAME@STORAGEACCOUNT.dfs.core.chinacloudapi.cn/sampledata1/
    hdfs dfs -copyFromLocal testFile.txt abfs:///sampledata2/
    hdfs dfs -copyFromLocal testFile.txt /sampledata3/

  4. List directory contents on cluster storage.

    hdfs dfs -ls abfs://CONTAINERNAME@STORAGEACCOUNT.dfs.core.chinacloudapi.cn/sampledata1/
    hdfs dfs -ls abfs:///sampledata2/
    hdfs dfs -ls /sampledata3/
    

Creating a Hive table

Three file locations are shown for illustrative purposes. For actual execution, use only one of the LOCATION entries; the two alternatives are commented out here so that the statement runs as written.

-- Use exactly one LOCATION clause. To try an alternative, swap it in for the active one.
DROP TABLE IF EXISTS myTable;
CREATE EXTERNAL TABLE myTable (
    t1 string,
    t2 string,
    t3 string,
    t4 string,
    t5 string,
    t6 string,
    t7 string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
STORED AS TEXTFILE
LOCATION 'abfs://CONTAINERNAME@STORAGEACCOUNT.dfs.core.chinacloudapi.cn/example/data/';
-- LOCATION 'abfs:///example/data/';
-- LOCATION '/example/data/';

Next steps