Run MapReduce jobs with Apache Hadoop on HDInsight using PowerShell

This document provides an example of using Azure PowerShell to run a MapReduce job on an Apache Hadoop cluster in HDInsight.

Prerequisites
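This walkthrough assumes you already have an Apache Hadoop cluster on HDInsight and a workstation with the Azure PowerShell Az module (including Az.HDInsight and Az.Storage). As a quick, optional sanity check (a minimal sketch, not part of the original article), you can confirm the module is available before you start:

    # Optional check: confirm the Az.HDInsight module is installed.
    # Assumption: PowerShellGet is available for Install-Module.
    if (-not (Get-Module -ListAvailable -Name Az.HDInsight)) {
        Install-Module -Name Az -Scope CurrentUser
    }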

Run a MapReduce job

Azure PowerShell provides cmdlets that allow you to remotely run MapReduce jobs on HDInsight. Internally, PowerShell makes REST calls to WebHCat (formerly called Templeton) running on the HDInsight cluster.
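For illustration only, the sketch below shows roughly what such a REST call looks like when made directly with Invoke-RestMethod against the standard WebHCat mapreduce/jar endpoint. It is an approximation, not a replacement for the cmdlets: CLUSTERNAME is a placeholder, the .azurehdinsight.cn domain applies to Azure China, and the statusdir value is arbitrary.

    # Rough sketch of the underlying WebHCat (Templeton) call; the cmdlets below do this for you.
    $clusterName = "CLUSTERNAME"   # placeholder
    $creds = Get-Credential -Message "Enter the login for the cluster"
    $body = "user.name=$($creds.UserName)" +
            "&jar=/example/jars/hadoop-mapreduce-examples.jar" +
            "&class=wordcount" +
            "&arg=/example/data/gutenberg/davinci.txt" +
            "&arg=/example/data/WordCountOutput" +
            "&statusdir=/example/status"
    $resp = Invoke-RestMethod -Method Post `
        -Uri "https://$clusterName.azurehdinsight.cn/templeton/v1/mapreduce/jar" `
        -Credential $creds `
        -ContentType "application/x-www-form-urlencoded" `
        -Body $body
    # The response carries the job id, for example: $resp.id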

The following cmdlets are used when running MapReduce jobs in a remote HDInsight cluster.

Connect-AzAccount: Authenticates Azure PowerShell to your Azure subscription.
New-AzHDInsightMapReduceJobDefinition: Creates a new job definition by using the specified MapReduce information.
Start-AzHDInsightJob: Sends the job definition to HDInsight and starts the job. A job object is returned.
Wait-AzHDInsightJob: Uses the job object to check the status of the job. It waits until the job completes or the wait time is exceeded.
Get-AzHDInsightJobOutput: Used to retrieve the output of the job.

The following steps demonstrate how to use these cmdlets to run a job in your HDInsight cluster.

  1. Using an editor, save the following code as mapreducejob.ps1.

    # Log in to your Azure subscription
    # Is there an active Azure subscription?
    $sub = Get-AzSubscription -ErrorAction SilentlyContinue
    if(-not($sub))
    {
        Connect-AzAccount -Environment AzureChinaCloud
    }

    # Get cluster info
    $clusterName = Read-Host -Prompt "Enter the HDInsight cluster name"
    $creds = Get-Credential -Message "Enter the login for the cluster"

    #Get the cluster info so we can get the resource group, storage, etc.
    $clusterInfo = Get-AzHDInsightCluster -ClusterName $clusterName
    $resourceGroup = $clusterInfo.ResourceGroup
    $storageAccountName = $clusterInfo.DefaultStorageAccount.split('.')[0]
    $container = $clusterInfo.DefaultStorageContainer
    #NOTE: This assumes that the storage account is in the same resource
    #      group as the cluster. If it is not, change the
    #      -ResourceGroupName parameter to the group that contains storage.
    $storageAccountKey = (Get-AzStorageAccountKey `
        -Name $storageAccountName `
        -ResourceGroupName $resourceGroup)[0].Value

    #Create a storage context
    $context = New-AzStorageContext `
        -StorageAccountName $storageAccountName `
        -StorageAccountKey $storageAccountKey

    #Define the MapReduce job
    #NOTE: If using an HDInsight 2.0 cluster, use hadoop-examples.jar instead.
    # -JarFile = the JAR containing the MapReduce application
    # -ClassName = the class of the application
    # -Arguments = the input file and the output directory
    $wordCountJobDefinition = New-AzHDInsightMapReduceJobDefinition `
        -JarFile "/example/jars/hadoop-mapreduce-examples.jar" `
        -ClassName "wordcount" `
        -Arguments `
            "/example/data/gutenberg/davinci.txt", `
            "/example/data/WordCountOutput"

    #Submit the job to the cluster
    Write-Host "Start the MapReduce job..." -ForegroundColor Green
    $wordCountJob = Start-AzHDInsightJob `
        -ClusterName $clusterName `
        -JobDefinition $wordCountJobDefinition `
        -HttpCredential $creds

    #Wait for the job to complete
    Write-Host "Wait for the job to complete..." -ForegroundColor Green
    Wait-AzHDInsightJob `
        -ClusterName $clusterName `
        -JobId $wordCountJob.JobId `
        -HttpCredential $creds

    # Download the output
    Get-AzStorageBlobContent `
        -Blob 'example/data/WordCountOutput/part-r-00000' `
        -Container $container `
        -Destination output.txt `
        -Context $context

    # Print the output of the job.
    Get-AzHDInsightJobOutput `
        -ClusterName $clusterName `
        -JobId $wordCountJob.JobId `
        -HttpCredential $creds
    
  2. Open a new Azure PowerShell command prompt. Change directories to the location of the mapreducejob.ps1 file, then use the following command to run the script:

    .\mapreducejob.ps1
    

    When you run the script, you are prompted for the name of the HDInsight cluster and the cluster login. You may also be prompted to authenticate to your Azure subscription.

  3. When the job completes, you receive output similar to the following text:

    Cluster         : CLUSTERNAME
    ExitCode        : 0
    Name            : wordcount
    PercentComplete : map 100% reduce 100%
    Query           :
    State           : Completed
    StatusDirectory : f1ed2028-afe8-402f-a24b-13cc17858097
    SubmissionTime  : 12/5/2014 8:34:09 PM
    JobId           : job_1415949758166_0071
    

    This output indicates that the job completed successfully.

    Note

    If the ExitCode is a value other than 0, see Troubleshooting.

    This example also stores the downloaded file to an output.txt file in the directory that you run the script from.
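    If you want the script to catch a failure automatically, one option (a small sketch that uses only values the script already defines; ExitCode is the property shown in the sample output above) is to capture the object returned by Wait-AzHDInsightJob and test it before downloading the output:

        $jobStatus = Wait-AzHDInsightJob `
            -ClusterName $clusterName `
            -JobId $wordCountJob.JobId `
            -HttpCredential $creds
        # Stop early if the job did not succeed.
        if ($jobStatus.ExitCode -ne 0) {
            Write-Host "The job failed; see the Troubleshooting section." -ForegroundColor Red
            return
        }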

View output

To see the words and counts produced by the job, open the output.txt file in a text editor.
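If you prefer a quick look from PowerShell rather than a text editor, something like the following works; it assumes the file is in the current directory and relies on the wordcount output being tab-separated word/count pairs:

    # Show the ten most frequent words from the downloaded output.
    Get-Content .\output.txt |
        ForEach-Object { $word, $count = $_ -split "`t"; [pscustomobject]@{ Word = $word; Count = [int]$count } } |
        Sort-Object -Property Count -Descending |
        Select-Object -First 10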

Note

The output files of a MapReduce job are immutable, and the job fails if the output directory already exists. So if you rerun this sample, you need to change the output path or remove the previous output first.
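For example, one way to rerun the sample without editing the job definition is to delete the previous output blobs. This sketch reuses the $container and $context values from the script above and assumes the cluster's default storage is an Azure Storage account:

    # Remove the previous job output so the same output path can be reused.
    Get-AzStorageBlob -Container $container -Context $context -Prefix "example/data/WordCountOutput" |
        Remove-AzStorageBlob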

Troubleshooting

If no information is returned when the job completes, view the errors for the job. To view error information for this job, add the following command to the end of the mapreducejob.ps1 file, then save the file and rerun the script.

# Print the output of the WordCount job.
Write-Host "Display the standard output ..." -ForegroundColor Green
Get-AzHDInsightJobOutput `
    -ClusterName $clusterName `
    -JobId $wordCountJob.JobId `
    -HttpCredential $creds `
    -DisplayOutputType StandardError

This cmdlet returns the information that was written to STDERR while the job ran.

Next steps

As you can see, Azure PowerShell provides an easy way to run MapReduce jobs on an HDInsight cluster, monitor the job status, and retrieve the output. For information about other ways you can work with Hadoop on HDInsight: