Upload data for Apache Hadoop jobs in HDInsight

Azure HDInsight provides a full-featured Hadoop distributed file system (HDFS) over Azure Storage and Azure Data Lake Storage (Gen2). Azure Storage and Data Lake Storage Gen2 are designed as HDFS extensions to provide a seamless experience to customers. They enable the full set of components in the Hadoop ecosystem to operate directly on the data they manage. Azure Storage, Data Lake Storage Gen1, and Data Lake Storage Gen2 are distinct file systems that are optimized for storage of data and computations on that data. For information about the benefits of using Azure Storage, see Use Azure Storage with HDInsight.

Prerequisites

Note the following requirements before you begin:

Upload data to Azure Storage

Utilities

Microsoft provides the following utilities to work with Azure Storage:

  • Azure portal
  • Azure CLI
  • Azure PowerShell
  • AzCopy
  • Hadoop command

Note

The Hadoop command is only available on the HDInsight cluster. The command only allows loading data from the local file system into Azure Storage.
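As a rough sketch, an upload from a workstation with the Azure CLI or AzCopy installed might look like the following. The storage account, container, and file names are placeholders, the blob endpoint shown assumes the Azure China cloud, and the commands assume you have already authenticated (for example with az login, or with a valid SAS token for AzCopy).

# Upload a local file with the Azure CLI (placeholder account and container names).
az storage blob upload \
    --account-name <StorageAccountName> \
    --container-name <ContainerName> \
    --name example/data/data.txt \
    --file data.txt \
    --auth-mode login

# Or copy the same file with AzCopy, authorizing with a SAS token.
azcopy copy "data.txt" "https://<StorageAccountName>.blob.core.chinacloudapi.cn/<ContainerName>/example/data/data.txt?<SAS>"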

Hadoop command line

The Hadoop command line is only useful for storing data into an Azure Storage blob when the data is already present on the cluster head node.

To use the Hadoop command, you must first connect to the head node by using SSH or PuTTY.
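For example, assuming the default sshuser account and the Azure China SSH endpoint pattern, the connection command looks roughly like this (replace CLUSTERNAME with the name of your cluster):

ssh sshuser@CLUSTERNAME-ssh.azurehdinsight.cn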

Once connected, you can use the following syntax to upload a file to storage.

hadoop fs -copyFromLocal <localFilePath> <storageFilePath>

For example: hadoop fs -copyFromLocal data.txt /example/data/data.txt

Because the default file system for HDInsight is in Azure Storage, /example/data/data.txt is actually in Azure Storage. You can also refer to the file as:

wasbs:///example/data/data.txt

or

wasbs://<ContainerName>@<StorageAccountName>.blob.core.chinacloudapi.cn/example/data/data.txt

For a list of other Hadoop commands that work with files, see https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/FileSystemShell.html
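For instance, you can confirm the upload by listing the directory with either form of the path (a minimal sketch using the example path above):

hadoop fs -ls wasbs:///example/data/
hadoop fs -ls wasbs://<ContainerName>@<StorageAccountName>.blob.core.chinacloudapi.cn/example/data/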

Warning

On Apache HBase clusters, the default block size used when writing data is 256 KB. While this works fine when using HBase APIs or REST APIs, using the hadoop or hdfs dfs commands to write data larger than ~12 GB results in an error. For more information, see the storage exception for write on blob section in this article.

Graphical clients

There are also several applications that provide a graphical interface for working with Azure Storage. The following is a list of a few of these applications:

  • Microsoft Visual Studio Tools for HDInsight
  • Azure Storage Explorer
  • Cerulea
  • CloudXplorer
  • CloudBerry Explorer for Microsoft Azure
  • Cyberduck

Mount Azure Storage as Local Drive

See Mount Azure Storage as Local Drive.

Upload using services

Apache Sqoop

Sqoop is a tool designed to transfer data between Hadoop and relational databases. You can use it to import data from a relational database management system (RDBMS), such as SQL Server, MySQL, or Oracle, into the Hadoop distributed file system (HDFS), transform the data in Hadoop with MapReduce or Hive, and then export the data back into an RDBMS.

For more information, see Use Sqoop with HDInsight.
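As a rough sketch, a Sqoop import from SQL Server into the cluster's default storage might look like the following. The server, database, table, and credential values are placeholders, the SQL endpoint shown assumes the Azure China cloud, and the command is run from an SSH session on the cluster head node.

sqoop import \
    --connect "jdbc:sqlserver://<ServerName>.database.chinacloudapi.cn:1433;database=<DatabaseName>" \
    --username <SqlUser> --password <SqlPassword> \
    --table <TableName> \
    --target-dir /example/data/imported \
    --num-mappers 1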

Development SDKs

Azure Storage can also be accessed using an Azure SDK from the following programming languages:

  • .NET
  • Java
  • Node.js
  • PHP
  • Python
  • Ruby

For more information on installing the Azure SDKs, see Azure downloads.

Troubleshooting

Storage exception for write on blob

Symptoms: When using the hadoop or hdfs dfs commands to write files that are ~12 GB or larger on an HBase cluster, you may encounter the following error:

ERROR azure.NativeAzureFileSystem: Encountered Storage Exception for write on Blob : example/test_large_file.bin._COPYING_ Exception details: null Error Code : RequestBodyTooLarge
copyFromLocal: java.io.IOException
        at com.microsoft.azure.storage.core.Utility.initIOException(Utility.java:661)
        at com.microsoft.azure.storage.blob.BlobOutputStream$1.call(BlobOutputStream.java:366)
        at com.microsoft.azure.storage.blob.BlobOutputStream$1.call(BlobOutputStream.java:350)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
Caused by: com.microsoft.azure.storage.StorageException: The request body is too large and exceeds the maximum permissible limit.
        at com.microsoft.azure.storage.StorageException.translateException(StorageException.java:89)
        at com.microsoft.azure.storage.core.StorageRequest.materializeException(StorageRequest.java:307)
        at com.microsoft.azure.storage.core.ExecutionEngine.executeWithRetry(ExecutionEngine.java:182)
        at com.microsoft.azure.storage.blob.CloudBlockBlob.uploadBlockInternal(CloudBlockBlob.java:816)
        at com.microsoft.azure.storage.blob.CloudBlockBlob.uploadBlock(CloudBlockBlob.java:788)
        at com.microsoft.azure.storage.blob.BlobOutputStream$1.call(BlobOutputStream.java:354)
        ... 7 more

Cause: HBase on HDInsight clusters defaults to a block size of 256 KB when writing to Azure Storage. While this works for HBase APIs or REST APIs, it results in an error when using the hadoop or hdfs dfs command-line utilities.

Resolution: Use fs.azure.write.request.size to specify a larger block size. You can do this on a per-use basis by using the -D parameter. The following command is an example of using this parameter with the hadoop command:

hadoop fs -D fs.azure.write.request.size=4194304 -copyFromLocal test_large_file.bin /example/data

You can also increase the value of fs.azure.write.request.size globally by using Apache Ambari. The following steps can be used to change the value in the Ambari Web UI:

  1. In your browser, go to the Ambari Web UI for your cluster. The address is https://CLUSTERNAME.azurehdinsight.cn, where CLUSTERNAME is the name of your cluster.

     When prompted, enter the admin name and password for the cluster.

  2. From the left side of the screen, select HDFS, and then select the Configs tab.

  3. In the Filter... field, enter fs.azure.write.request.size. This displays the field and its current value in the middle of the page.

  4. Change the value from 262144 (256 KB) to the new value, for example 4194304 (4 MB). A quick way to verify the change afterward is shown below.

Image of changing the value through the Ambari Web UI
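After saving the change and restarting the services that Ambari flags as needing a restart, you can check the value carried by the client configuration from an SSH session on the head node, for example (a minimal sketch):

hdfs getconf -confKey fs.azure.write.request.size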

For more information on using Ambari, see Manage HDInsight clusters using the Apache Ambari Web UI.

Next steps

Now that you understand how to get data into HDInsight, read the following articles to learn how to perform analysis: