Upload data for Apache Hadoop jobs in HDInsight

Azure HDInsight provides a full-featured Hadoop distributed file system (HDFS) over Azure Storage and Azure Data Lake Storage (Gen2). Azure Storage and Data Lake Storage Gen2 are designed as HDFS extensions to provide a seamless experience to customers. They enable the full set of components in the Hadoop ecosystem to operate directly on the data they manage. Azure Storage, Data Lake Storage Gen1, and Data Lake Storage Gen2 are distinct file systems that are optimized for storage of data and computations on that data. For information about the benefits of using Azure Storage, see [Use Azure Storage with HDInsight][hdinsight-storage].

Prerequisites

Note the following requirements before you begin:

  • An Azure HDInsight cluster.
  • Knowledge of Azure Storage or Azure Data Lake Storage Gen2.

Upload data to Azure Storage

Utilities

Microsoft provides the following utilities to work with Azure Storage on Linux, OS X, and Windows (availability varies by tool); a sample upload command follows the list:

  • Azure portal
  • Azure CLI
  • Azure PowerShell
  • AzCopy
  • Hadoop command
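
As an illustration, uploading a local file with the Azure CLI might look like the following minimal sketch. The storage account, container, and file names are placeholder assumptions, and --auth-mode login assumes you have already signed in with az login:

# Upload a local file to a blob container (all names are placeholders)
az storage blob upload \
    --account-name mystorageaccount \
    --container-name mycontainer \
    --name example/data/data.txt \
    --file ./data.txt \
    --auth-mode login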

Note

The Hadoop command is only available on the HDInsight cluster. The command only allows loading data from the local file system into Azure Storage.

Hadoop command line

The Hadoop command line is only useful for storing data into an Azure Storage blob when the data is already present on the cluster head node.

To use the Hadoop command, you must first connect to the head node by using SSH or PuTTY.
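
For example, an SSH connection might look like the following sketch, where mycluster and sshuser are placeholder names (on the Azure China cloud, the SSH endpoint is assumed to be <clustername>-ssh.azurehdinsight.cn):

ssh sshuser@mycluster-ssh.azurehdinsight.cn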

Once connected, you can use the following syntax to upload a file to storage.

hadoop fs -copyFromLocal <localFilePath> <storageFilePath>

For example: hadoop fs -copyFromLocal data.txt /example/data/data.txt

Because the default file system for HDInsight is in Azure Storage, /example/data/data.txt is actually in Azure Storage. You can also refer to the file as:

wasbs:///example/data/data.txt

or

wasbs://<ContainerName>@<StorageAccountName>.blob.core.chinacloudapi.cn/example/data/data.txt

For a list of other Hadoop commands that work with files, see https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/FileSystemShell.html
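
For instance, the following sketch uses two of those commands to verify the upload and copy the file back to the local file system, reusing the sample path from above:

# List the contents of the sample directory in the default file system
hadoop fs -ls /example/data

# Copy the uploaded file back to the local file system
hadoop fs -copyToLocal /example/data/data.txt ./data-copy.txt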

Warning

On Apache HBase clusters, the default block size used when writing data is 256 KB. While this works fine when using HBase APIs or REST APIs, using the hadoop or hdfs dfs commands to write data larger than ~12 GB results in an error. For more information, see the storage exception for write on blob section in this article.

Graphical clients

There are also several applications that provide a graphical interface for working with Azure Storage. The following is a list of a few of these applications:

  • Microsoft Visual Studio Tools for HDInsight
  • Azure Storage Explorer
  • Cerulea
  • CloudXplorer
  • CloudBerry Explorer for Microsoft Azure
  • Cyberduck

Mount Azure Storage as a local drive

See Mount Azure Storage as Local Drive.

Upload using services

Azure Data Factory

The Azure Data Factory service is a fully managed service for composing data storage, processing, and movement services into streamlined, adaptable, and reliable data production pipelines.

Documentation is available by storage type:

  • Azure Blob storage: Copy data to or from Azure Blob storage by using Azure Data Factory
  • Azure Data Lake Storage Gen2: Load data into Azure Data Lake Storage Gen2 with Azure Data Factory

Apache Sqoop

Sqoop is a tool designed to transfer data between Hadoop and relational databases. Use it to import data from a relational database management system (RDBMS), such as SQL Server, MySQL, or Oracle, into the Hadoop distributed file system (HDFS), transform the data in Hadoop with MapReduce or Hive, and then export the data back into an RDBMS. A sample import command follows.
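
For example, a minimal sketch of an import from SQL Server, where the server, database, credentials, table, and target directory are all placeholder assumptions:

# Import a SQL Server table into HDFS (all connection details are placeholders)
sqoop import \
    --connect "jdbc:sqlserver://myserver.database.chinacloudapi.cn:1433;databaseName=mydb" \
    --username myadmin \
    --password 'MyP@ssw0rd' \
    --table mytable \
    --target-dir /example/data/mytable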

For more information, see Use Sqoop with HDInsight.

Development SDKs

Azure Storage can also be accessed by using an Azure SDK from the following programming languages:

  • .NET
  • Java
  • Node.js
  • PHP
  • Python
  • Ruby

For more information on installing the Azure SDKs, see Azure downloads.
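
As a quick illustration, installing the Python SDK for Blob storage from a shell might look like the following sketch (azure-storage-blob is assumed to be the current Python package for Blob storage):

# Install the Azure Blob storage SDK for Python (assumed package name)
pip install azure-storage-blob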

Next steps

Now that you understand how to get data into HDInsight, read the following articles to learn how to analyze it: