Use Apache Sqoop with Hadoop in HDInsight
Learn how to use Apache Sqoop in HDInsight to import and export data between an HDInsight cluster and Azure SQL Database.
Although Apache Hadoop is a natural choice for processing unstructured and semi-structured data, such as logs and files, there may also be a need to process structured data that is stored in relational databases.
Apache Sqoop is a tool designed to transfer data between Hadoop clusters and relational databases. You can use it to import data from a relational database management system (RDBMS) such as SQL Server, MySQL, or Oracle into the Hadoop distributed file system (HDFS), transform the data in Hadoop with MapReduce or Apache Hive, and then export the data back into an RDBMS. In this article, you're using Azure SQL Database for your relational database.
Important
This article sets up a test environment to perform the data transfer. You then choose a data transfer method for this environment from one of the methods in section Run Sqoop jobs.
For Sqoop versions that are supported on HDInsight clusters, see What's new in the cluster versions provided by HDInsight?
Understand the scenario
HDInsight cluster comes with some sample data. You use the following two samples:
- An Apache
Log4j
log file, which is located at/example/data/sample.log
. The following logs are extracted from the file:
2012-02-03 18:35:34 SampleClass6 [INFO] everything normal for id 577725851
2012-02-03 18:35:34 SampleClass4 [FATAL] system problem at id 1991281254
2012-02-03 18:35:34 SampleClass3 [DEBUG] detail for id 1304807656
...
A Hive table named
hivesampletable
, which references the data file located at/hive/warehouse/hivesampletable
. The table contains some mobile device data.Field Data type clientid string querytime string market string deviceplatform string devicemake string devicemodel string state string country string querydwelltime double sessionid
bigint sessionpagevieworder bigint
In this article, you use these two datasets to test Sqoop import and export.
Set up test environment
The cluster, SQL database, and other objects are created through the Azure portal using an Azure Resource Manager template. The template can be found in Azure quickstart templates. The Resource Manager template calls a bacpac package to deploy the table schemas to a SQL database. If you want to use a private container for the bacpac files, use the following values in the template:
"storageKeyType": "Primary",
"storageKey": "<TheAzureStorageAccountKey>",
Note
Import using a template or the Azure portal only supports importing a BACPAC file from Azure blob storage.
Select the following image to open the Resource Manager template in the Azure portal.
Enter the following properties:
Field Value Subscription Select your Azure subscription from the drop-down list. Resource group Select your resource group from the drop-down list, or create a new one Location Select a region from the drop-down list. Cluster Name Enter a name for the Hadoop cluster. Use lowercase letter only. Cluster sign-in User Name Keep the prepopulated value admin
.Cluster sign in Password Enter a password. Ssh User Name Keep the prepopulated value sshuser
.Ssh Password Enter a password. Sql Admin sign-in Keep the prepopulated value sqluser
.Sql Admin Password Enter a password. _artifacts Location Use the default value unless you want to use your own bacpac file in a different location. _artifacts Location Sas Token Leave blank. Bacpac File Name Use the default value unless you want to use your own bacpac file. Location Use the default value. The logical SQL server name is
<ClusterName>dbserver
. The database name is<ClusterName>db
. The default storage account name ise6qhezrh2pdqu
.Select I agree to the terms and conditions stated above.
Select Purchase. You see a new tile titled Submitting deployment for Template deployment. It takes about around 20 minutes to create the cluster and SQL database.
Run Sqoop jobs
HDInsight can run Sqoop jobs by using various methods. Use the following table to decide which method is right for you, then follow the link for a walkthrough.
Use this if you want... | ...an interactive shell | ...batch processing | ...from this client operating system |
---|---|---|---|
SSH | ✔ | ✔ | Linux, Unix, Mac OS X, or Windows |
.NET SDK for Hadoop | ✔ | Windows (for now) | |
Azure PowerShell | ✔ | Windows |
Limitations
- Bulk export - With Linux-based HDInsight, the Sqoop connector used to export data to Microsoft SQL Server or SQL Database doesn't currently support bulk inserts.
- Batching - With Linux-based HDInsight, When using the
-batch
switch when performing inserts, Sqoop performs multiple inserts instead of batching the insert operations.
Next steps
Now you've learned how to use Sqoop. To learn more, see:
- Upload data to HDInsight: Find other methods for uploading data to HDInsight/Azure Blob storage.
- Use Apache Sqoop to import and export data between Apache Hadoop on HDInsight and SQL Database