Use Apache Sqoop with Hadoop in HDInsight

Learn how to use Apache Sqoop in HDInsight to import and export data between an HDInsight cluster and an Azure SQL database.

Although Apache Hadoop is a natural choice for processing unstructured and semi-structured data, such as logs and files, there may also be a need to process structured data that is stored in relational databases.

Apache Sqoop is a tool designed to transfer data between Hadoop clusters and relational databases. You can use it to import data from a relational database management system (RDBMS) such as SQL Server, MySQL, or Oracle into the Hadoop distributed file system (HDFS), transform the data in Hadoop with MapReduce or Apache Hive, and then export the data back into an RDBMS. In this article, you use a SQL Server database as your relational database.
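This round trip can be sketched with two commands. These run on the cluster head node, and every value here (server, database, credentials, table, and directory names) is a placeholder for illustration, not something defined by this article:

```shell
# Sketch: import a SQL table into HDFS (all names below are placeholders).
sqoop import \
  --connect "jdbc:sqlserver://<ServerName>.database.chinacloudapi.cn:1433;database=<DatabaseName>" \
  --username sqluser --password '<SqlPassword>' \
  --table mytable \
  --target-dir /tutorials/usesqoop/import \
  -m 1

# Sketch: export transformed data from HDFS back into a SQL table.
sqoop export \
  --connect "jdbc:sqlserver://<ServerName>.database.chinacloudapi.cn:1433;database=<DatabaseName>" \
  --username sqluser --password '<SqlPassword>' \
  --table mytable_out \
  --export-dir /tutorials/usesqoop/output \
  -m 1
```

The `-m 1` flag limits the job to a single mapper, which keeps small transfers simple; larger tables can use more mappers with a `--split-by` column.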

Important

This article sets up a test environment to perform the data transfer. You then choose a data transfer method for this environment from one of the methods in the section Run Sqoop jobs, further below.

For Sqoop versions that are supported on HDInsight clusters, see What's new in the cluster versions provided by HDInsight?

Understand the scenario

The HDInsight cluster comes with some sample data. You use the following two samples:

  • An Apache Log4j log file, which is located at /example/data/sample.log. The following log lines are extracted from the file:
2012-02-03 18:35:34 SampleClass6 [INFO] everything normal for id 577725851
2012-02-03 18:35:34 SampleClass4 [FATAL] system problem at id 1991281254
2012-02-03 18:35:34 SampleClass3 [DEBUG] detail for id 1304807656
...
  • A Hive table named hivesampletable, which references the data file located at /hive/warehouse/hivesampletable. The table contains some mobile device data.

    Field                  Data type
    clientid               string
    querytime              string
    market                 string
    deviceplatform         string
    devicemake             string
    devicemodel            string
    state                  string
    country                string
    querydwelltime         double
    sessionid              bigint
    sessionpagevieworder   bigint

In this article, you use these two datasets to test Sqoop import and export.
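For example, exporting the Hive table's data files into the SQL database could look like the following sketch, run over SSH on the cluster once the test environment below exists. The destination table name mobiledata and the tab field delimiter are assumptions about the deployed schema, not values guaranteed by this article:

```shell
# Sketch: export the hivesampletable data files to a SQL table.
# <ClusterName> and <SqlPassword> are placeholders; mobiledata is an assumed table name.
sqoop export \
  --connect "jdbc:sqlserver://<ClusterName>dbserver.database.chinacloudapi.cn:1433;database=<ClusterName>db" \
  --username sqluser --password '<SqlPassword>' \
  --table mobiledata \
  --export-dir /hive/warehouse/hivesampletable \
  --fields-terminated-by '\t' \
  -m 1
```

The `--fields-terminated-by` value must match the delimiter of the files under the export directory, or the export fails with parse errors.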

Set up the test environment

The cluster, SQL database, and other objects are created through the Azure portal by using an Azure Resource Manager template. The template can be found in Azure Quickstart Templates. The Resource Manager template calls a bacpac package to deploy the table schemas to a SQL database. The bacpac package is located in a public blob container, https://hditutorialdata.blob.core.chinacloudapi.cn/usesqoop/SqoopTutorial-2016-2-23-11-2.bacpac. If you want to use a private container for the bacpac files, use the following values in the template:

"storageKeyType": "Primary",
"storageKey": "<TheAzureStorageAccountKey>",

Note

Import using a template or the Azure portal only supports importing a BACPAC file from Azure Blob storage.
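If you prefer the command line over the portal button in the next step, the same template deployment can be sketched with the Azure CLI. The resource group name, a locally saved copy of the template as azuredeploy.json, and the parameter names shown are assumptions for illustration:

```shell
# Sketch: deploy the Resource Manager template from a local file.
# All names and parameter values are placeholders, not documented defaults.
az group create --name sqooptutorial-rg --location chinanorth
az deployment group create \
  --resource-group sqooptutorial-rg \
  --template-file azuredeploy.json \
  --parameters clusterName=mycluster \
               clusterLoginPassword='<ClusterPassword>' \
               sshPassword='<SshPassword>' \
               sqlAdminPassword='<SqlPassword>'
```

Check the template's own parameters section for the exact parameter names before running a deployment like this.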

  1. Select the following image to open the Resource Manager template in the Azure portal.

    Deploy to Azure

  2. Enter the following properties:

    Field                          Value
    Subscription                   Select your Azure subscription from the drop-down list.
    Resource group                 Select your resource group from the drop-down list, or create a new one.
    Location                       Select a region from the drop-down list.
    Cluster Name                   Enter a name for the Hadoop cluster. Use lowercase letters only.
    Cluster Login User Name        Keep the pre-populated value admin.
    Cluster Login Password         Enter a password.
    Ssh User Name                  Keep the pre-populated value sshuser.
    Ssh Password                   Enter a password.
    Sql Admin Login                Keep the pre-populated value sqluser.
    Sql Admin Password             Enter a password.
    _artifacts Location            Use the default value unless you want to use your own bacpac file in a different location.
    _artifacts Location Sas Token  Leave blank.
    Bacpac File Name               Use the default value unless you want to use your own bacpac file.
    Location                       Use the default value.

    The Azure SQL Server name will be <ClusterName>dbserver. The database name will be <ClusterName>db. The default storage account name will be e6qhezrh2pdqu.

  3. Select I agree to the terms and conditions stated above.

  4. Select Purchase. You see a new tile titled Submitting deployment for Template deployment. It takes about 20 minutes to create the cluster and SQL database.

Run Sqoop jobs

HDInsight can run Sqoop jobs by using a variety of methods. Use the following table to decide which method is right for you, then follow the link for a walkthrough.

Use this if you want...     ...an interactive shell   ...batch processing   ...from this client operating system
SSH                         ✔                         ✔                     Linux, Unix, macOS, or Windows
.NET SDK for Hadoop                                   ✔                     Windows (for now)
Azure PowerShell                                      ✔                     Windows
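For instance, the SSH route can be sketched as follows. The host name assumes the Azure China HDInsight pattern (adjust the suffix for your cloud), and sqluser matches the template default from the setup section:

```shell
# Sketch: connect to the cluster head node over SSH (placeholder cluster name).
ssh sshuser@<ClusterName>-ssh.azurehdinsight.cn

# On the head node, confirm Sqoop is on the path and can reach the database server.
sqoop version
sqoop list-databases \
  --connect "jdbc:sqlserver://<ClusterName>dbserver.database.chinacloudapi.cn:1433" \
  --username sqluser --password '<SqlPassword>'
```

If `list-databases` returns your database name, the JDBC connection string and credentials are correct and you can proceed to import and export commands.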

Limitations

  • Bulk export - With Linux-based HDInsight, the Sqoop connector used to export data to Microsoft SQL Server or Azure SQL Database doesn't currently support bulk inserts.
  • Batching - With Linux-based HDInsight, when you use the -batch switch to perform inserts, Sqoop performs multiple inserts instead of batching the insert operations.

Next steps

Now you have learned how to use Sqoop. To learn more, see the following articles: