Upload data for Apache Hadoop jobs in HDInsight
HDInsight provides a Hadoop distributed file system (HDFS) over Azure Storage and Azure Data Lake Storage. Azure Storage and Data Lake Storage Gen2 are designed as HDFS extensions, which lets the full set of components in the Hadoop environment operate directly on the data they manage. Azure Storage and Data Lake Storage Gen2 are distinct file systems that are optimized for storage of data and computations on that data. For information about the benefits of using Azure Storage, see Use Azure Storage with HDInsight. See also Use Data Lake Storage Gen2 with HDInsight.
Prerequisites
Note the following requirements before you begin:
- An Azure HDInsight cluster. For instructions, see Get started with Azure HDInsight.
- Knowledge of the following articles:
  - Use Azure Storage with HDInsight
  - Use Data Lake Storage Gen2 with HDInsight
Upload data to Azure Storage
Utilities
Microsoft provides the following utilities to work with Azure Storage:
Tool | Linux | OS X | Windows |
---|---|---|---|
Azure portal | ✔ | ✔ | ✔ |
Azure CLI | ✔ | ✔ | ✔ |
Azure PowerShell | | | ✔ |
AzCopy | ✔ | | ✔ |
Hadoop command | ✔ | ✔ | ✔ |
Note
The Hadoop command is only available on the HDInsight cluster. The command only allows loading data from the local file system into Azure Storage.
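For example, here is a minimal sketch of uploading a local file to the cluster's default container with the Azure CLI. The account, container, and key values are placeholders, not values from this article:

```bash
# Upload data.txt to the cluster's default container (all <...> values are placeholders)
az storage blob upload \
    --account-name <StorageAccountName> \
    --container-name <ContainerName> \
    --name example/data/data.txt \
    --file data.txt \
    --account-key <StorageAccountKey>
```

If the container is the cluster's default container, the uploaded blob is visible to the cluster at /example/data/data.txt, the same path used in the Hadoop examples later in this article.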
Hadoop command line
The Hadoop command line is only useful for storing data into Azure Storage blobs when the data is already present on the cluster head node.
To use the Hadoop command, you must first connect to the head node using SSH or PuTTY. Once connected, you can use the following syntax to upload a file to storage.
hadoop fs -copyFromLocal <localFilePath> <storageFilePath>
For example: hadoop fs -copyFromLocal data.txt /example/data/data.txt
Because the default file system for HDInsight is in Azure Storage, /example/data/data.txt is actually in Azure Storage. You can also refer to the file as:
wasbs:///example/data/data.txt
or
wasbs://<ContainerName>@<StorageAccountName>.blob.core.chinacloudapi.cn/example/data/data.txt
For a list of other Hadoop commands that work with files, see https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/FileSystemShell.html.
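As a quick sketch, here are a few of those shell commands run against the same /example/data path used above:

```bash
# List the contents of a directory in the cluster's default storage
hadoop fs -ls /example/data

# The same listing, expressed with a full wasbs:// URI
hadoop fs -ls wasbs:///example/data

# Create a directory, then copy a file back to the local file system on the head node
hadoop fs -mkdir -p /example/output
hadoop fs -copyToLocal /example/data/data.txt ./data-copy.txt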
Warning
On Apache HBase clusters, the default block size used when writing data is 256 KB. While this works fine when using HBase APIs or REST APIs, using the hadoop or hdfs dfs commands to write data larger than ~12 GB results in an error. For more information, see storage exception for write on blob.
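If you hit this limit, one possible mitigation, assuming your cluster uses the WASB driver's fs.azure.write.request.size setting discussed in the linked article, is to raise the write block size for the copy. This is a sketch, not the only fix:

```bash
# Override the WASB write block size (in bytes) to 4 MB for this command only
hadoop fs -D fs.azure.write.request.size=4194304 -copyFromLocal <largeLocalFile> /example/data/
```

The same property can also be set cluster-wide in the core-site configuration through Ambari.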
Graphical clients
There are also several applications that provide a graphical interface for working with Azure Storage. The following table lists a few of these applications:
Client | Linux | OS X | Windows |
---|---|---|---|
Microsoft Visual Studio Tools for HDInsight | ✔ | ✔ | ✔ |
Azure Storage Explorer | ✔ | ✔ | ✔ |
Cerulea | | | ✔ |
CloudXplorer | | | ✔ |
CloudBerry Explorer for Microsoft Azure | | | ✔ |
Cyberduck | | ✔ | ✔ |
Mount Azure Storage as Local Drive
See Mount Azure Storage as Local Drive.
Upload using services
Azure Data Factory
The Azure Data Factory service is a fully managed service for composing data storage, processing, and movement services into streamlined, adaptable, and reliable data production pipelines.
Storage type | Documentation |
---|---|
Azure Blob storage | Copy data to or from Azure Blob storage by using Azure Data Factory |
Azure Data Lake Storage Gen2 | Load data into Azure Data Lake Storage Gen2 with Azure Data Factory |
Apache Sqoop
Sqoop is a tool designed to transfer data between Hadoop and relational databases. Use it to import data from a relational database management system (RDBMS), such as SQL Server, MySQL, or Oracle, into the Hadoop distributed file system (HDFS), transform the data in Hadoop with MapReduce or Hive, and then export the data back into an RDBMS.
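For example, here is a minimal sketch of importing one SQL Server table into the cluster's default storage. The server, database, credentials, and table name are placeholders:

```bash
# Run from the cluster head node: import one table into /example/data/importeddata
sqoop import \
    --connect "jdbc:sqlserver://<serverName>.database.chinacloudapi.cn:1433;database=<databaseName>" \
    --username <sqlUser> --password <sqlPassword> \
    --table <tableName> \
    --target-dir 'wasbs:///example/data/importeddata' \
    -m 1
```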
For more information, see Use Sqoop with HDInsight.
Development SDKs
Azure Storage can also be accessed using an Azure SDK from the following programming languages:
- .NET
- Java
- Node.js
- PHP
- Python
- Ruby
For more information on installing the Azure SDKs, see Azure downloads.
Next steps
Now that you understand how to get data into HDInsight, read the following articles to learn how to perform analysis: