Run Apache Sqoop jobs with Azure PowerShell in HDInsight
Learn how to use Azure PowerShell to run Apache Sqoop jobs in Azure HDInsight to import and export data between an HDInsight cluster and Azure SQL Database or SQL Server. This article is a continuation of Use Apache Sqoop with Hadoop in HDInsight.
Prerequisites
A workstation with the Azure PowerShell Az module installed. (A quick way to verify the module is shown after this list.)
Completion of Set up test environment from Use Apache Sqoop with Hadoop in HDInsight.
Familiarity with Sqoop. For more information, see Sqoop User Guide.
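If you're not sure whether the Az module is present, a minimal check such as the following can help. This is a sketch that assumes the PowerShell Gallery is reachable; Az.Accounts is the module's core submodule.
# Check for the Az module and install it for the current user if it's missing
if (-not (Get-Module -ListAvailable -Name Az.Accounts)) {
    Install-Module -Name Az -Repository PSGallery -Scope CurrentUser
}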
Sqoop export
From Hive to SQL.
This example exports data from the Hive `hivesampletable` table to the `mobiledata` table in SQL. Set the values for the variables below, and then execute the command.
$hdinsightClusterName = ""
$httpPassword = ''
$sqlDatabasePassword = ''
# These values only need to be changed if the template was not followed.
$httpUserName = "admin"
$sqlServerLogin = "sqluser"
$sqlServerName = $hdinsightClusterName + "dbserver"
$sqlDatabaseName = $hdinsightClusterName + "db"
$pw = ConvertTo-SecureString -String $httpPassword -AsPlainText -Force
$httpCredential = New-Object System.Management.Automation.PSCredential($httpUserName,$pw)
# Connection string
$connectionString = "jdbc:sqlserver://$sqlServerName.database.chinacloudapi.cn;user=$sqlServerLogin@$sqlServerName;password=$sqlDatabasePassword;database=$sqlDatabaseName"
# start export
New-AzHDInsightSqoopJobDefinition `
-Command "export --connect $connectionString --table mobiledata --hcatalog-table hivesampletable" `
| Start-AzHDInsightJob `
-ClusterName $hdinsightClusterName `
-HttpCredential $httpCredential
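Start-AzHDInsightJob returns as soon as the job is submitted, not when it finishes. If you want a script to block until the export completes, one option is Wait-AzHDInsightJob. This is a minimal sketch that reuses the variables defined above:
# Capture the job object, then block until the job completes
$sqoopJob = New-AzHDInsightSqoopJobDefinition `
        -Command "export --connect $connectionString --table mobiledata --hcatalog-table hivesampletable" |
    Start-AzHDInsightJob `
        -ClusterName $hdinsightClusterName `
        -HttpCredential $httpCredential

Wait-AzHDInsightJob `
    -ClusterName $hdinsightClusterName `
    -HttpCredential $httpCredential `
    -JobId $sqoopJob.JobId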
Alternative execution
The following code performs the same export, but it captures the job object so that you can read the output logs. Execute the code to begin the export.
$sqoopCommand = "export --connect $connectionString --table mobiledata --hcatalog-table hivesampletable"
$sqoopDef = New-AzHDInsightSqoopJobDefinition `
    -Command $sqoopCommand

$sqoopJob = Start-AzHDInsightJob `
    -ClusterName $hdinsightClusterName `
    -HttpCredential $httpCredential `
    -JobDefinition $sqoopDef
The following code displays the job's output logs:
Get-AzHDInsightJobOutput `
    -ClusterName $hdinsightClusterName `
    -HttpCredential $httpCredential `
    -JobId $sqoopJob.JobId `
    -DisplayOutputType StandardError

Get-AzHDInsightJobOutput `
    -ClusterName $hdinsightClusterName `
    -HttpCredential $httpCredential `
    -JobId $sqoopJob.JobId `
    -DisplayOutputType StandardOutput
If you receive the error message `The specified blob does not exist`, try again after a few minutes.
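To confirm that the export reached the database, you can query the table directly. This is an optional sketch, not part of the original walkthrough, and it assumes the SqlServer PowerShell module is installed (Install-Module -Name SqlServer).
# Count the rows Sqoop exported to the mobiledata table
Invoke-Sqlcmd -ServerInstance "$sqlServerName.database.chinacloudapi.cn" `
    -Database $sqlDatabaseName `
    -Username $sqlServerLogin `
    -Password $sqlDatabasePassword `
    -Query "SELECT COUNT(*) AS RowsExported FROM mobiledata;"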
Sqoop import
From SQL to Azure Storage.
This example imports data from the `mobiledata` table in SQL to the `wasb:///tutorials/usesqoop/importeddata` directory on HDInsight. The fields in the data are separated by a tab character, and the lines are terminated by a newline character. This example assumes you've completed the prior example.
$sqoopCommand = "import --connect $connectionString --table mobiledata --target-dir wasb:///tutorials/usesqoop/importeddata --fields-terminated-by '\t' --lines-terminated-by '\n' -m 1"
$sqoopDef = New-AzHDInsightSqoopJobDefinition `
-Command $sqoopCommand
$sqoopJob = Start-AzHDInsightJob `
-ClusterName $hdinsightClusterName `
-HttpCredential $httpCredential `
-JobDefinition $sqoopDef
Get-AzHDInsightJobOutput `
-ClusterName $hdinsightClusterName `
-HttpCredential $httpCredential `
-JobId $sqoopJob.JobId `
-DisplayOutputType StandardError
Get-AzHDInsightJobOutput `
-ClusterName $hdinsightClusterName `
-HttpCredential $httpCredential `
-JobId $sqoopJob.JobId `
-DisplayOutputType StandardOutput
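To browse the files that the import created, you can list the blobs under the target directory in the cluster's default storage account. This sketch isn't part of the original walkthrough; it derives the storage details from the cluster the same way the script in the next section does.
# Resolve the cluster's default storage account and list the imported files
$cluster = Get-AzHDInsightCluster -ClusterName $hdinsightClusterName
$storageAccountName = $cluster.DefaultStorageAccount -replace '.blob.core.chinacloudapi.cn'
$storageKey = (Get-AzStorageAccountKey `
    -ResourceGroupName $cluster.ResourceGroup `
    -Name $storageAccountName)[0].Value
$storageContext = New-AzStorageContext `
    -StorageAccountName $storageAccountName `
    -StorageAccountKey $storageKey `
    -Environment AzureChinaCloud
Get-AzStorageBlob -Container $cluster.DefaultStorageContainer `
    -Prefix "tutorials/usesqoop/importeddata" `
    -Context $storageContext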
Additional Sqoop export example
This is a more involved example. It takes data from `/tutorials/usesqoop/data/sample.log` in the default storage account and exports it to a table called `log4jlogs` in a SQL Server database. This example doesn't depend on the prior examples.
The following PowerShell script pre-processes the source file and then exports it to the `log4jlogs` table. Replace `CLUSTERNAME`, `CLUSTERPASSWORD`, and `SQLPASSWORD` with the values you used from the prerequisite.
<#------ BEGIN USER INPUT ------#>
$hdinsightClusterName = "CLUSTERNAME"
$httpUserName = "admin" #default is admin, update as needed
$httpPassword = 'CLUSTERPASSWORD'
$sqlDatabasePassword = 'SQLPASSWORD'
<#------- END USER INPUT -------#>
# Other fixed variables that should be used as-is
$sqlServerName = $hdinsightClusterName + "dbserver"
$sqlDatabaseName = $hdinsightClusterName + "db"
$tableName_log4j = "log4jlogs"
$exportDir_log4j = "/tutorials/usesqoop/data"
$sourceBlobName = "example/data/sample.log"
$destBlobName = "tutorials/usesqoop/data/sample.log"
$sqljdbcdriver = "/user/oozie/share/lib/sqoop/mssql-jdbc-7.0.0.jre8.jar"
$cluster = Get-AzHDInsightCluster -ClusterName $hdinsightClusterName
$defaultStorageAccountName = $cluster.DefaultStorageAccount -replace '.blob.core.chinacloudapi.cn'
$defaultStorageContainer = $cluster.DefaultStorageContainer
$resourceGroup = $cluster.ResourceGroup
$sqlServer = Get-AzSqlServer -ResourceGroupName $resourceGroup -ServerName $sqlServerName
$sqlServerLogin = $sqlServer.SqlAdministratorLogin
$sqlServerFQDN = $sqlServer.FullyQualifiedDomainName
#Connect to Azure subscription
Write-Host "`nConnecting to your Azure subscription ..." -ForegroundColor Green
$azContext = Get-AzContext
if (-not $azContext) {
    Connect-AzAccount -Environment AzureChinaCloud
}
#pre-process the source file
Write-Host "`nPreprocessing the source file ..." -ForegroundColor Green
# This procedure creates a new file named $destBlobName
# Get the default storage account key
$defaultStorageAccountKey = (Get-AzStorageAccountKey `
-ResourceGroupName $resourceGroup `
-Name $defaultStorageAccountName)[0].Value
# Create block blob objects referencing the source and destination blob.
$storageAccount = Get-AzStorageAccount `
-ResourceGroupName $resourceGroup `
-Name $defaultStorageAccountName
$storageContainer = ($storageAccount |Get-AzStorageContainer -Name $defaultStorageContainer).CloudBlobContainer
$sourceBlob = $storageContainer.GetBlockBlobReference($sourceBlobName)
$destBlob = $storageContainer.GetBlockBlobReference($destBlobName)
# Open a stream for reading from the source blob
$stream = $sourceBlob.OpenRead()
$sReader = New-Object System.IO.StreamReader($stream)
$sReader = New-Object System.IO.StreamReader($stream)
# Define a MemoryStream and a StreamWriter for writing into the destination file
$memStream = New-Object System.IO.MemoryStream
$writeStream = New-Object System.IO.StreamWriter $memStream
# Pre-process the source blob
$exString = "java.lang.Exception:"
while(-Not $sReader.EndOfStream){
$line = $sReader.ReadLine()
$split = $line.Split(" ")
# remove the "java.lang.Exception" from the first element of the array
# for example: java.lang.Exception: 2012-02-03 19:11:02 SampleClass8 [WARN] problem finding id 153454612
if ($split[0] -eq $exString){
#create a new ArrayList to remove $split[0]
$newArray = [System.Collections.ArrayList] $split
$newArray.Remove($exString)
# update $split and $line
$split = $newArray
$line = $newArray -join(" ")
}
# remove the lines that have fewer than 7 elements
if ($split.count -ge 7){
write-host $line
$writeStream.WriteLine($line)
}
}
# Write to the destination blob
$writeStream.Flush()
$memStream.Seek(0, "Begin")
$destBlob.UploadFromStream($memStream)
#export the log file from the cluster to SQL
Write-Host "Exporting the log file ..." -ForegroundColor Green
$pw = ConvertTo-SecureString -String $httpPassword -AsPlainText -Force
$httpCredential = New-Object System.Management.Automation.PSCredential($httpUserName,$pw)
# Connection string
$connectionString = "jdbc:sqlserver://$sqlServerFQDN;user=$sqlServerLogin@$sqlServerName;password=$sqlDatabasePassword;database=$sqlDatabaseName"
# Submit a Sqoop job
$sqoopDef = New-AzHDInsightSqoopJobDefinition `
-Command "export --connect $connectionString --table $tableName_log4j --export-dir $exportDir_log4j --input-fields-terminated-by \0x20 -m 1" `
-Files $sqljdbcdriver
$sqoopJob = Start-AzHDInsightJob `
-ClusterName $hdinsightClusterName `
-HttpCredential $httpCredential `
-JobDefinition $sqoopDef
Wait-AzHDInsightJob `
-ResourceGroupName $resourceGroup `
-ClusterName $hdinsightClusterName `
-HttpCredential $httpCredential `
-JobId $sqoopJob.JobId
Write-Host "Standard Error" -BackgroundColor Green
Get-AzHDInsightJobOutput `
-ResourceGroupName $resourceGroup `
-ClusterName $hdinsightClusterName `
-DefaultStorageAccountName $defaultStorageAccountName `
-DefaultStorageAccountKey $defaultStorageAccountKey `
-DefaultContainer $defaultStorageContainer `
-HttpCredential $httpCredential `
-JobId $sqoopJob.JobId `
-DisplayOutputType StandardError
Write-Host "Standard Output" -BackgroundColor Green
Get-AzHDInsightJobOutput `
-ResourceGroupName $resourceGroup `
-ClusterName $hdinsightClusterName `
-DefaultStorageAccountName $defaultStorageAccountName `
-DefaultStorageAccountKey $defaultStorageAccountKey `
-DefaultContainer $defaultStorageContainer `
-HttpCredential $httpCredential `
-JobId $sqoopJob.JobId `
-DisplayOutputType StandardOutput
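If you want to inspect the pre-processed file that the script uploaded, you can download it by using the storage variables that are already defined. This is an optional sketch, not part of the original walkthrough.
# Download the pre-processed blob for local inspection
$storageContext = New-AzStorageContext `
    -StorageAccountName $defaultStorageAccountName `
    -StorageAccountKey $defaultStorageAccountKey `
    -Environment AzureChinaCloud
Get-AzStorageBlobContent `
    -Container $defaultStorageContainer `
    -Blob $destBlobName `
    -Destination "./sample.log" `
    -Context $storageContext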
Limitations
Linux-based HDInsight presents the following limitations:
- Bulk export: The Sqoop connector that's used to export data to SQL doesn't currently support bulk inserts.
- Batching: When you use the `--batch` switch to perform inserts, Sqoop performs multiple inserts instead of batching the insert operations.
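For reference, the batch switch is passed as part of the Sqoop command string. This sketch reuses the connection string from the first export example; because of the limitation above, it still results in multiple inserts on Linux-based HDInsight rather than truly batched operations.
# Sketch only: Sqoop's batch switch appended to the export command
New-AzHDInsightSqoopJobDefinition `
    -Command "export --connect $connectionString --table mobiledata --hcatalog-table hivesampletable --batch"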
Next steps
Now you've learned how to use Sqoop. To learn more, see:
- Use Apache Oozie with HDInsight: Use Sqoop action in an Oozie workflow.
- Upload data to HDInsight: Find other methods for uploading data to HDInsight or Azure Blob storage.