Run MapReduce jobs with Apache Hadoop on HDInsight using PowerShell

2025/04/07

This document provides an example of using Azure PowerShell to run a MapReduce job in a Hadoop cluster on HDInsight.

Prerequisites

An Apache Hadoop cluster on HDInsight. See Create Apache Hadoop clusters using the Azure portal.
The PowerShell Az module, installed.

Run a MapReduce job

Azure PowerShell provides cmdlets that let you run MapReduce jobs remotely on HDInsight. Internally, PowerShell makes REST calls to WebHCat (formerly called Templeton) running on the HDInsight cluster.
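
To make the REST layer concrete, the following is a minimal sketch of calling WebHCat directly. The helper names `Get-WebHCatUri` and `Test-WebHCatStatus` are hypothetical and not part of the Az module. The `azurehdinsight.cn` DNS suffix is an assumption based on the AzureChinaCloud environment used in this article (global Azure uses `azurehdinsight.net`), so adjust it for your cluster.

```powershell
# Hypothetical helper: build a WebHCat (Templeton) endpoint URL for a cluster.
# Assumption: the cluster endpoint is https://<name>.<suffix>, where the suffix
# is azurehdinsight.cn in Azure China (azurehdinsight.net in global Azure).
function Get-WebHCatUri {
    param(
        [Parameter(Mandatory)] [string] $ClusterName,
        [string] $DnsSuffix = 'azurehdinsight.cn',
        [string] $Resource = 'status'
    )
    "https://${ClusterName}.${DnsSuffix}/templeton/v1/${Resource}"
}

# Hypothetical helper: ask WebHCat whether it is reachable, using the cluster login.
# Defined here but not called, because it needs a live cluster.
function Test-WebHCatStatus {
    param(
        [Parameter(Mandatory)] [string] $ClusterName,
        [Parameter(Mandatory)] [pscredential] $Credential
    )
    $response = Invoke-RestMethod `
        -Uri (Get-WebHCatUri -ClusterName $ClusterName) `
        -Credential $Credential
    # WebHCat reports a "status" field; "ok" means the service is up.
    $response.status -eq 'ok'
}
```

For example, `Test-WebHCatStatus -ClusterName $clusterName -Credential $creds` could serve as a quick connectivity check before submitting a job. The cmdlets described in this article handle these calls for you, so the sketch is only illustrative.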

The following cmdlets are used when running MapReduce jobs on a remote HDInsight cluster.

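Cmdlet	Description
Connect-AzAccount -Environment AzureChinaCloud	Authenticates Azure PowerShell to your Azure subscription.
New-AzHDInsightMapReduceJobDefinition	Creates a new job definition from the specified MapReduce information.
Start-AzHDInsightJob	Sends the job definition to HDInsight and starts the job. Returns a job object.
Wait-AzHDInsightJob	Uses the job object to check the status of the job. Waits until the job completes or the wait time is exceeded.
Get-AzHDInsightJobOutput	Retrieves the output of the job.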

The following steps demonstrate how to use these cmdlets to run a job on your HDInsight cluster.

Using an editor, save the following code as mapreducejob.ps1.

# Log in to your Azure subscription if no context is cached
$context = Get-AzContext
if ($null -eq $context)
{
    Connect-AzAccount -Environment AzureChinaCloud
}
$context

# Get cluster info
$clusterName = Read-Host -Prompt "Enter the HDInsight cluster name"
$creds=Get-Credential -Message "Enter the login for the cluster"

#Get the cluster info so we can get the resource group, storage, etc.
$clusterInfo = Get-AzHDInsightCluster -ClusterName $clusterName
$resourceGroup = $clusterInfo.ResourceGroup
$storageAccountName=$clusterInfo.DefaultStorageAccount.split('.')[0]
$container=$clusterInfo.DefaultStorageContainer
#NOTE: This assumes that the storage account is in the same resource
# group as the cluster. If it is not, change the
# -ResourceGroupName parameter to the group that contains storage.
$storageAccountKey=(Get-AzStorageAccountKey `
    -Name $storageAccountName `
    -ResourceGroupName $resourceGroup)[0].Value

#Create a storage context
$context = New-AzStorageContext `
    -StorageAccountName $storageAccountName `
    -StorageAccountKey $storageAccountKey

#Define the MapReduce job
#NOTE: If using an HDInsight 2.0 cluster, use hadoop-examples.jar instead.
# -JarFile = the JAR containing the MapReduce application
# -ClassName = the class of the application
# -Arguments = The input file, and the output directory
$wordCountJobDefinition = New-AzHDInsightMapReduceJobDefinition `
    -JarFile "/example/jars/hadoop-mapreduce-examples.jar" `
    -ClassName "wordcount" `
    -Arguments `
        "/example/data/gutenberg/davinci.txt", `
        "/example/data/WordCountOutput"

#Submit the job to the cluster
Write-Host "Start the MapReduce job..." -ForegroundColor Green
$wordCountJob = Start-AzHDInsightJob `
    -ClusterName $clusterName `
    -JobDefinition $wordCountJobDefinition `
    -HttpCredential $creds

#Wait for the job to complete
Write-Host "Wait for the job to complete..." -ForegroundColor Green
Wait-AzHDInsightJob `
    -ClusterName $clusterName `
    -JobId $wordCountJob.JobId `
    -HttpCredential $creds

# Download the output
Get-AzStorageBlobContent `
    -Blob 'example/data/WordCountOutput/part-r-00000' `
    -Container $container `
    -Destination output.txt `
    -Context $context

# Print the output of the job.
Get-AzHDInsightJobOutput `
    -ClusterName $clusterName `
    -JobId $wordCountJob.JobId `
    -HttpCredential $creds

Open a new Azure PowerShell command prompt. Change directories to the location of the mapreducejob.ps1 file, then use the following command to run the script:
```
.\mapreducejob.ps1
```
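When you run the script, you're prompted for the name of the HDInsight cluster and the cluster login. If no Azure context is cached, you're also prompted to authenticate to your Azure subscription.

When the job completes, you receive output similar to the following text: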

Cluster         : CLUSTERNAME
ExitCode        : 0
Name            : wordcount
PercentComplete : map 100% reduce 100%
Query           :
State           : Completed
StatusDirectory : f1ed2028-afe8-402f-a24b-13cc17858097
SubmissionTime  : 12/5/2014 8:34:09 PM
JobId           : job_1415949758166_0071

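This output indicates that the job completed successfully.

Note

If the ExitCode is a value other than 0, see Troubleshooting.

This example also stores the downloaded file as output.txt in the directory from which you run the script.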

View output

To see the words and counts produced by the job, open the output.txt file in a text editor.
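
As an alternative to a text editor, the following minimal sketch lists the most frequent words from PowerShell. It assumes each line of output.txt has the form `word<TAB>count`, which is the default text output format for the wordcount example. The helper name `ConvertFrom-WordCountLine` is hypothetical.

```powershell
# Hypothetical helper: parse "word<TAB>count" lines into objects.
# Lines that do not split into exactly two fields are skipped.
function ConvertFrom-WordCountLine {
    param([Parameter(ValueFromPipeline)] [string] $Line)
    process {
        $parts = $Line -split "`t"
        if ($parts.Count -eq 2) {
            [pscustomobject]@{ Word = $parts[0]; Count = [int]$parts[1] }
        }
    }
}

# Show the ten most frequent words, if output.txt is present in the current directory.
if (Test-Path -Path 'output.txt') {
    Get-Content -Path 'output.txt' |
        ConvertFrom-WordCountLine |
        Sort-Object -Property Count -Descending |
        Select-Object -First 10
}
```

Sorting happens locally on the downloaded file, so this does not touch the cluster or the storage account.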

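Note

The output files of a MapReduce job are immutable, and the job fails if its output directory already exists. If you rerun this example, change the output path passed in -Arguments (and the matching blob name in the download step).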
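
One way to avoid the collision is to generate a new output path for every run. The following is a minimal sketch under that assumption. The helper name `New-WordCountOutputPath` and the timestamp suffix are illustrative choices, not part of the original script.

```powershell
# Hypothetical helper: build an output path that is unique per run by
# appending a timestamp to the base path used in this article.
function New-WordCountOutputPath {
    param([string] $BasePath = '/example/data/WordCountOutput')
    $stamp = Get-Date -Format 'yyyyMMdd-HHmmss'
    "${BasePath}-${stamp}"
}

$outputPath = New-WordCountOutputPath

# The blob name used for the download has no leading slash.
$blobName = $outputPath.TrimStart('/') + '/part-r-00000'
```

In mapreducejob.ps1, `$outputPath` would replace the second value of -Arguments and `$blobName` would replace the literal passed to -Blob. Two runs started within the same second would still collide, which is acceptable for an interactive example.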

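Troubleshooting

If no information is returned when the job completes, view the errors for the job. To view error information for this job, add the following command to the end of the mapreducejob.ps1 file. Then save the file and rerun the script.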

# Print the standard error output of the WordCount job.
Write-Host "Display the standard error output ..." -ForegroundColor Green
Get-AzHDInsightJobOutput `
    -ClusterName $clusterName `
    -JobId $wordCountJob.JobId `
    -HttpCredential $creds `
    -DisplayOutputType StandardError

This cmdlet returns the information that was written to STDERR while the job ran.
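
Instead of always printing STDERR, the script could fetch it only when the job fails. The following is a minimal sketch, assuming the object returned by Wait-AzHDInsightJob exposes the ExitCode field shown in the sample output. The helper names `Test-JobSucceeded` and `Show-JobErrorOnFailure` are hypothetical.

```powershell
# Hypothetical helper: treat an exit code of 0 as success.
function Test-JobSucceeded {
    param([Parameter(Mandatory)] $Job)
    $Job.ExitCode -eq 0
}

# Hypothetical helper: wait for the job, then print STDERR only on failure.
# Defined here but not called, because it needs a live cluster.
function Show-JobErrorOnFailure {
    param(
        [Parameter(Mandatory)] [string] $ClusterName,
        [Parameter(Mandatory)] [string] $JobId,
        [Parameter(Mandatory)] [pscredential] $Credential
    )
    $finished = Wait-AzHDInsightJob `
        -ClusterName $ClusterName `
        -JobId $JobId `
        -HttpCredential $Credential
    if (-not (Test-JobSucceeded -Job $finished)) {
        Get-AzHDInsightJobOutput `
            -ClusterName $ClusterName `
            -JobId $JobId `
            -HttpCredential $Credential `
            -DisplayOutputType StandardError
    }
}
```

In mapreducejob.ps1 this could be invoked as `Show-JobErrorOnFailure -ClusterName $clusterName -JobId $wordCountJob.JobId -Credential $creds`.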

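Next steps

As you can see, Azure PowerShell provides an easy way to run MapReduce jobs on an HDInsight cluster, monitor the job status, and retrieve the output. For information about other ways you can work with Hadoop on HDInsight: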

通过