Use DistCp to copy data between Azure Storage Blobs and Azure Data Lake Storage Gen2

You can use DistCp to copy data between a general purpose V2 storage account and a general purpose V2 storage account with hierarchical namespace enabled. This article provides instructions on how to use the DistCp tool.

DistCp provides a variety of command-line parameters, and we strongly encourage you to read this article to optimize your usage of it. This article shows basic functionality while focusing on copying data to an account with hierarchical namespace enabled.

Prerequisites

  • An Azure subscription. See Get Azure 1rmb trial.
  • An existing Azure Storage account without Data Lake Storage Gen2 capabilities (hierarchical namespace) enabled.
  • An Azure Storage account with Data Lake Storage Gen2 capabilities (hierarchical namespace) enabled. For instructions on how to create one, see Create an Azure Storage account.
  • A container that has been created in the storage account with hierarchical namespace enabled.
  • An Azure HDInsight cluster with access to a storage account with the hierarchical namespace feature enabled. See Use Azure Data Lake Storage Gen2 with Azure HDInsight clusters. Make sure you enable Remote Desktop for the cluster.

Use DistCp from an HDInsight Linux cluster

An HDInsight cluster comes with the DistCp utility, which can be used to copy data from different sources into an HDInsight cluster. If you have configured the HDInsight cluster to use Azure Blob Storage and Azure Data Lake Storage together, the DistCp utility can also be used out of the box to copy data between them. This section shows how to use the DistCp utility.

  1. Create an SSH session to your HDI cluster. See Connect to a Linux-based HDInsight cluster.

  2. Verify that you can access your existing general purpose V2 account (without hierarchical namespace enabled).

    hdfs dfs -ls wasbs://<container-name>@<storage-account-name>.blob.core.chinacloudapi.cn/
    

    The output should provide a list of contents in the container.

  3. Similarly, verify that you can access the storage account with hierarchical namespace enabled from the cluster. Run the following command:

    hdfs dfs -ls abfss://<container-name>@<storage-account-name>.dfs.core.chinacloudapi.cn/
    

    The output should provide a list of files and folders in the Data Lake Storage account.

  4. Use DistCp to copy data from WASB to a Data Lake Storage account.

    hadoop distcp wasbs://<container-name>@<storage-account-name>.blob.core.chinacloudapi.cn/example/data/gutenberg abfss://<container-name>@<storage-account-name>.dfs.core.chinacloudapi.cn/myfolder
    

    The command copies the contents of the /example/data/gutenberg/ folder in Blob storage to /myfolder in the Data Lake Storage account.

  5. Similarly, use DistCp to copy data from the Data Lake Storage account to Blob Storage (WASB).

    hadoop distcp abfss://<container-name>@<storage-account-name>.dfs.core.chinacloudapi.cn/myfolder wasbs://<container-name>@<storage-account-name>.blob.core.chinacloudapi.cn/example/data/gutenberg
    

    The command copies the contents of /myfolder in the Data Lake Storage account to the /example/data/gutenberg/ folder in WASB.
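
The container and account placeholders in steps 2 through 5 are repeated in every command, so it can help to build the URIs once and reuse them. The following is a minimal sketch, not part of Hadoop or HDInsight: the helper names `wasbs_uri` and `abfss_uri` are hypothetical, and the endpoint suffixes are simply the ones used in the commands above.

```shell
# Hypothetical helpers (not part of Hadoop or HDInsight) that assemble the
# storage URIs used in the steps above from container, account, and path.
# The endpoint suffixes are copied from the commands in this article.

wasbs_uri() (
  path="${3:-}"
  # Drop one leading slash so the result has exactly one slash after the host.
  printf 'wasbs://%s@%s.blob.core.chinacloudapi.cn/%s\n' "$1" "$2" "${path#/}"
)

abfss_uri() (
  path="${3:-}"
  printf 'abfss://%s@%s.dfs.core.chinacloudapi.cn/%s\n' "$1" "$2" "${path#/}"
)

# Example usage on the cluster (container and account names are placeholders):
#   SRC="$(wasbs_uri mycontainer myblobaccount /example/data/gutenberg)"
#   DST="$(abfss_uri mycontainer mydatalakeaccount /myfolder)"
#   hdfs dfs -ls "$SRC" && hdfs dfs -ls "$DST" && hadoop distcp "$SRC" "$DST"
```

Keeping the URIs in variables means the verification commands and the copy command are guaranteed to point at the same locations.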

Performance considerations while using DistCp

Because DistCp's lowest granularity is a single file, the maximum number of simultaneous copies is the most important parameter for optimizing it against Data Lake Storage. The number of simultaneous copies is set by the number of mappers (-m) parameter on the command line, which specifies the maximum number of mappers used to copy data. The default value is 20.

Example

hadoop distcp -m 100 wasbs://<container-name>@<storage-account-name>.blob.core.chinacloudapi.cn/example/data/gutenberg abfss://<container-name>@<storage-account-name>.dfs.core.chinacloudapi.cn/myfolder

How do I determine the number of mappers to use?

Here's some guidance that you can use.

  • Step 1: Determine the total memory available to the 'default' YARN app queue - This information is available in the Ambari portal associated with the cluster. Navigate to YARN and view the Configs tab to see the YARN memory available to the 'default' app queue. This is the total memory available to your DistCp job (which is actually a MapReduce job).

  • Step 2: Calculate the number of mappers - The value of m is the total YARN memory divided by the YARN container size. The YARN container size is also available in the Ambari portal: navigate to YARN and view the Configs tab. The equation for the number of mappers (m) is:

    m = (number of nodes * YARN memory for each node) / YARN container size

Example

Assume that you have a 4x D14v2s cluster and you are trying to transfer 10 TB of data from 10 different folders. Each folder contains a different amount of data, and the file sizes within each folder vary.

  • Total YARN memory: From the Ambari portal you determine that the YARN memory is 96 GB for a D14 node. So, the total YARN memory for a four-node cluster is:

    YARN memory = 4 * 96 GB = 384 GB

  • Number of mappers: From the Ambari portal you determine that the YARN container size is 3,072 MB for a D14 cluster node. So, the number of mappers is:

    m = (4 nodes * 96 GB) / 3,072 MB = 393,216 MB / 3,072 MB = 128 mappers

If other applications are using memory, you can choose to use only a portion of your cluster's YARN memory for DistCp.
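
The calculation above can be sketched as a small shell function. This is an illustration only: `calc_mappers` is a hypothetical helper, the inputs are the values you read from Ambari, 1 GB is taken as 1,024 MB, and the optional fourth argument (a percentage of YARN memory to reserve for DistCp) is an assumption added to reflect the note about sharing memory with other applications.

```shell
# Hypothetical helper: estimate the DistCp mapper count (-m) from
# values read in the Ambari portal.
#   $1  number of nodes
#   $2  YARN memory per node, in GB
#   $3  YARN container size, in MB
#   $4  optional percentage of YARN memory to give DistCp (default 100)
calc_mappers() (
  nodes="$1"
  mem_gb="$2"
  container_mb="$3"
  percent="${4:-100}"
  # Total YARN memory in MB, treating 1 GB as 1024 MB.
  total_mb=$(( nodes * mem_gb * 1024 ))
  # Portion of that memory made available to DistCp.
  usable_mb=$(( total_mb * percent / 100 ))
  # Integer division: partial containers cannot host a mapper.
  echo $(( usable_mb / container_mb ))
)

calc_mappers 4 96 3072      # prints 128, matching the example above
calc_mappers 4 96 3072 50   # prints 64 when only half the memory is used
```

The result is an upper bound to pass to -m, not a guarantee of throughput.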

Copying large datasets

When the dataset to be moved is large (for example, >1 TB) or if you have many different folders, you should consider using multiple DistCp jobs. There is likely no performance gain, but splitting the work means that if any job fails, you only need to restart that specific job rather than the entire transfer.
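
One way to split the work is to run one DistCp job per source folder. The sketch below only prints the commands so you can review them before running anything; `print_distcp_jobs` is a hypothetical helper, and the layout (each folder copied to a same-named folder under the destination root) is an assumption rather than something this article prescribes.

```shell
# Hypothetical helper: print one "hadoop distcp" command per folder.
#   $1  source root URI (for example, a wasbs:// URI)
#   $2  destination root URI (for example, an abfss:// URI)
#   $3  number of mappers to pass with -m
#   $4+ folder names that exist under the source root
print_distcp_jobs() (
  src_root="$1"
  dst_root="$2"
  mappers="$3"
  shift 3
  for folder in "$@"; do
    # Each line is an independent job that can be rerun on its own.
    printf 'hadoop distcp -m %s %s/%s %s/%s\n' \
      "$mappers" "$src_root" "$folder" "$dst_root" "$folder"
  done
)

# Example (account, container, and folder names are placeholders):
#   print_distcp_jobs \
#     "wasbs://mycontainer@myblobaccount.blob.core.chinacloudapi.cn/example/data" \
#     "abfss://mycontainer@mydatalakeaccount.dfs.core.chinacloudapi.cn/myfolder" \
#     100 folder1 folder2 folder3
```

Run the printed commands one at a time; if one fails, rerun only that line.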

Limitations

  • DistCp tries to create mappers that are similar in size to optimize performance. Increasing the number of mappers may not always increase performance.

  • DistCp is limited to one mapper per file, so you should not have more mappers than you have files. This also limits the concurrency available for copying large files.

  • If you have a small number of large files, then you should split them into 256 MB file chunks to give you more potential concurrency.
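
This article does not say how to split large files. As one possible approach, and assuming the file is available on a local file system before upload, the standard `split` utility can cut it into fixed-size pieces. The helper name `split_into_chunks` and the `.part-` naming are hypothetical. Byte-based splitting can cut a record in half, so this only suits data that you can reassemble (for example, with `cat`) or that tolerates arbitrary boundaries.

```shell
# Hypothetical helper: split a local file into fixed-size chunks.
#   $1  path to the local file
#   $2  output directory for the chunks (created if missing)
#   $3  optional chunk size accepted by "split -b" (default 256M, as
#       understood by GNU split; pass a plain byte count for portability)
split_into_chunks() (
  file="$1"
  out_dir="$2"
  chunk_size="${3:-256M}"
  mkdir -p "$out_dir" || exit 1
  # Chunks are named <file>.part-aa, <file>.part-ab, ... in order.
  split -b "$chunk_size" "$file" "$out_dir/$(basename "$file").part-"
)

# Example (paths are placeholders):
#   split_into_chunks /data/large-file.bin /data/chunks
#   cat /data/chunks/large-file.bin.part-* > /data/restored.bin   # reassemble
```

Each chunk is then a separate file, so DistCp can assign it its own mapper.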