Troubleshoot Apache Hadoop HDFS by using Azure HDInsight

Learn top issues and resolutions when working with the Hadoop Distributed File System (HDFS). For a full list of commands, see the HDFS Commands Guide and the File System Shell Guide.

How do I access the local HDFS from inside a cluster?

Issue

Access the local HDFS from the command line and application code instead of by using Azure Blob storage from inside the HDInsight cluster.

Resolution steps

  1. At the command prompt, use hdfs dfs -D "fs.default.name=hdfs://mycluster/" ... literally, as in the following command:

    hdfs dfs -D "fs.default.name=hdfs://mycluster/" -ls /
    Found 3 items
    drwxr-xr-x   - hdiuser hdfs          0 2017-03-24 14:12 /EventCheckpoint-30-8-24-11102016-01
    drwx-wx-wx   - hive    hdfs          0 2016-11-10 18:42 /tmp
    drwx------   - hdiuser hdfs          0 2016-11-10 22:22 /user
    
  2. From source code, use the URI hdfs://mycluster/ literally, as in the following sample application:

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.*;
    
    public class JavaUnitTests {
    
        public static void main(String[] args) throws Exception {
    
            // Point the default file system at the cluster-local HDFS.
            Configuration conf = new Configuration();
            String hdfsUri = "hdfs://mycluster/";
            conf.set("fs.defaultFS", hdfsUri);
            FileSystem fileSystem = FileSystem.get(URI.create(hdfsUri), conf);
    
            // Recursively list every file under /tmp and print its full path.
            RemoteIterator<LocatedFileStatus> fileStatusIterator = fileSystem.listFiles(new Path("/tmp"), true);
            while (fileStatusIterator.hasNext()) {
                System.out.println(fileStatusIterator.next().getPath().toString());
            }
        }
    }
    
  3. Run the compiled .jar file (for example, a file named java-unit-tests-1.0.jar) on the HDInsight cluster with the following command:

    hadoop jar java-unit-tests-1.0.jar JavaUnitTests
    hdfs://mycluster/tmp/hive/hive/5d9cf301-2503-48c7-9963-923fb5ef79a7/inuse.info
    hdfs://mycluster/tmp/hive/hive/5d9cf301-2503-48c7-9963-923fb5ef79a7/inuse.lck
    hdfs://mycluster/tmp/hive/hive/a0be04ea-ae01-4cc4-b56d-f263baf2e314/inuse.info
    hdfs://mycluster/tmp/hive/hive/a0be04ea-ae01-4cc4-b56d-f263baf2e314/inuse.lck
    

Storage exception for write on blob

Issue

When using the hadoop or hdfs dfs commands to write files that are ~12 GB or larger on an HBase cluster, you may encounter the following error:

ERROR azure.NativeAzureFileSystem: Encountered Storage Exception for write on Blob : example/test_large_file.bin._COPYING_ Exception details: null Error Code : RequestBodyTooLarge
copyFromLocal: java.io.IOException
        at com.microsoft.azure.storage.core.Utility.initIOException(Utility.java:661)
        at com.microsoft.azure.storage.blob.BlobOutputStream$1.call(BlobOutputStream.java:366)
        at com.microsoft.azure.storage.blob.BlobOutputStream$1.call(BlobOutputStream.java:350)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
Caused by: com.microsoft.azure.storage.StorageException: The request body is too large and exceeds the maximum permissible limit.
        at com.microsoft.azure.storage.StorageException.translateException(StorageException.java:89)
        at com.microsoft.azure.storage.core.StorageRequest.materializeException(StorageRequest.java:307)
        at com.microsoft.azure.storage.core.ExecutionEngine.executeWithRetry(ExecutionEngine.java:182)
        at com.microsoft.azure.storage.blob.CloudBlockBlob.uploadBlockInternal(CloudBlockBlob.java:816)
        at com.microsoft.azure.storage.blob.CloudBlockBlob.uploadBlock(CloudBlockBlob.java:788)
        at com.microsoft.azure.storage.blob.BlobOutputStream$1.call(BlobOutputStream.java:354)
        ... 7 more

Cause

HBase on HDInsight clusters defaults to a block size of 256 KB when writing to Azure storage. Although this works for HBase APIs or REST APIs, it results in an error when you use the hadoop or hdfs dfs command-line utilities.

Resolution

Use fs.azure.write.request.size to specify a larger block size. You can make this change on a per-use basis by using the -D parameter. The following command is an example of using this parameter with the hadoop command:

hadoop fs -D fs.azure.write.request.size=4194304 -copyFromLocal test_large_file.bin /example/data
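
If you write to Azure storage from your own application rather than from the command line, you can set the same property on the Hadoop Configuration object before the file system is created. The following is a minimal sketch rather than code from the product documentation; the class name, wasb:// container URI, and file paths are placeholders to replace with your own:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LargeBlobWrite {

    public static void main(String[] args) throws Exception {

        Configuration conf = new Configuration();
        // Raise the per-request write size to 4 MB before opening the file system.
        conf.set("fs.azure.write.request.size", "4194304");

        // Placeholder container URI - substitute your own storage account and container.
        String wasbUri = "wasb://CONTAINER@ACCOUNT.blob.core.chinacloudapi.cn/";
        FileSystem fileSystem = FileSystem.get(URI.create(wasbUri), conf);

        // Equivalent of -copyFromLocal for a large local file.
        fileSystem.copyFromLocalFile(new Path("test_large_file.bin"), new Path("/example/data/test_large_file.bin"));
    }
}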

You can also increase the value of fs.azure.write.request.size globally by using Apache Ambari. Use the following steps to change the value in the Ambari Web UI:

  1. In your browser, go to the Ambari Web UI for your cluster. The URL is https://CLUSTERNAME.azurehdinsight.cn, where CLUSTERNAME is the name of your cluster. When prompted, enter the admin name and password for the cluster.

  2. From the left side of the screen, select HDFS, and then select the Configs tab.

  3. In the Filter... field, enter fs.azure.write.request.size.

  4. Change the value from 262144 (256 KB) to the new value, for example 4194304 (4 MB).

    Image of changing the value in the Ambari Web UI

For more information on using Ambari, see Manage HDInsight clusters using the Apache Ambari Web UI.

du

The -du command displays the sizes of the files and directories contained in the given directory, or the length of a file in case it's just a file.

The -s option produces an aggregate summary of the file lengths being displayed.
The -h option formats the file sizes into a human-readable form.

Example:

hdfs dfs -du -s -h hdfs://mycluster/
hdfs dfs -du -s -h hdfs://mycluster/tmp
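
The same aggregate size is available from application code through the FileSystem API. The following is a minimal sketch, assuming the same hdfs://mycluster/ URI used earlier in this article:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.ContentSummary;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DuExample {

    public static void main(String[] args) throws Exception {

        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://mycluster/");
        FileSystem fileSystem = FileSystem.get(URI.create("hdfs://mycluster/"), conf);

        // Aggregate size of everything under /tmp, comparable to: hdfs dfs -du -s /tmp
        ContentSummary summary = fileSystem.getContentSummary(new Path("/tmp"));
        System.out.println("Bytes under /tmp: " + summary.getLength());
    }
}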

rm

The -rm command deletes the files specified as arguments.

Example:

hdfs dfs -rm hdfs://mycluster/tmp/testfile
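
From application code, the corresponding call is FileSystem.delete. A minimal sketch under the same assumptions as the previous example:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RmExample {

    public static void main(String[] args) throws Exception {

        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://mycluster/");
        FileSystem fileSystem = FileSystem.get(URI.create("hdfs://mycluster/"), conf);

        // The second argument enables recursive deletion; false is enough for a single file.
        boolean deleted = fileSystem.delete(new Path("/tmp/testfile"), false);
        System.out.println("Deleted: " + deleted);
    }
}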

Next steps

If you didn't see your problem or are unable to solve your issue, visit the following channel for more support:

  • If you need more help, you can submit a support request from the Azure portal. Select Support from the menu bar or open the Help + support hub.