Run Apache Hive queries using PowerShell

This document provides an example of using Azure PowerShell to run Apache Hive queries in an Apache Hadoop on HDInsight cluster.

Note

This document does not provide a detailed description of what the HiveQL statements used in the examples do. For information on the HiveQL used in this example, see Use Apache Hive with Apache Hadoop on HDInsight.

Prerequisites

Run a Hive query

Azure PowerShell provides cmdlets that allow you to remotely run Hive queries on HDInsight. Internally, the cmdlets make REST calls to WebHCat on the HDInsight cluster.
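
To illustrate that interaction, here is a minimal sketch that submits a Hive job to the WebHCat endpoint directly with Invoke-RestMethod. The cluster name is hypothetical and the host name assumes a cluster in the Azure China cloud; the cmdlets described below handle these details for you.

# A rough sketch of the WebHCat REST call the cmdlets make on your behalf.
# "mycluster" is a hypothetical name; the host assumes an Azure China cluster.
$clusterName = "mycluster"
$creds = Get-Credential -Message "Enter the login for the cluster"

$response = Invoke-RestMethod -Method Post `
    -Uri "https://$clusterName.azurehdinsight.cn/templeton/v1/hive" `
    -Credential $creds `
    -Body @{
        "user.name" = $creds.UserName     # user submitting the job
        "execute"   = "SELECT 1;"         # the HiveQL to run
        "statusdir" = "/example/status"   # folder where WebHCat writes job status
    }

# WebHCat responds with the ID of the submitted job.
$response.id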

The following cmdlets are used when running Hive queries in a remote HDInsight cluster (a short sketch after this list shows how to check that the module providing them is installed):

  • Connect-AzAccount: Authenticates Azure PowerShell to your Azure subscription.
  • New-AzHDInsightHiveJobDefinition: Creates a job definition by using the specified HiveQL statements.
  • Start-AzHDInsightJob: Sends the job definition to HDInsight and starts the job. A job object is returned.
  • Wait-AzHDInsightJob: Uses the job object to check the status of the job. It waits until the job completes or the wait time is exceeded.
  • Get-AzHDInsightJobOutput: Used to retrieve the output of the job.
  • Invoke-AzHDInsightHiveJob: Used to run HiveQL statements. This cmdlet blocks until the query completes, then returns the results.
  • Use-AzHDInsightCluster: Sets the current cluster to use for the Invoke-AzHDInsightHiveJob command.
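
These cmdlets are provided by the Az.HDInsight module, which is included in the Az PowerShell module. As a minimal sketch, you can confirm that the module is available before running the steps below; the install step assumes you are allowed to install modules for the current user.

# Check for the Az.HDInsight module; install the Az rollup module if it is missing.
if (-not (Get-Module -ListAvailable -Name Az.HDInsight)) {
    Install-Module -Name Az -Scope CurrentUser -Repository PSGallery
}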

The following steps demonstrate how to use these cmdlets to run a job in your HDInsight cluster:

  1. Using an editor, save the following code as hivejob.ps1.

    # Login to your Azure subscription
    # Is there an active Azure subscription?
    $sub = Get-AzSubscription -ErrorAction SilentlyContinue
    if(-not($sub))
    {
        Connect-AzAccount -Environment AzureChinaCloud
    }
    
    #Get cluster info
    $clusterName = Read-Host -Prompt "Enter the HDInsight cluster name"
    $creds=Get-Credential -Message "Enter the login for the cluster"
    
    #HiveQL
    #Note: set hive.execution.engine=tez; is not required for
    #      Linux-based HDInsight
    $queryString = "set hive.execution.engine=tez;" +
                "DROP TABLE log4jLogs;" +
                "CREATE EXTERNAL TABLE log4jLogs(t1 string, t2 string, t3 string, t4 string, t5 string, t6 string, t7 string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ' STORED AS TEXTFILE LOCATION 'wasbs:///example/data/';" +
                "SELECT * FROM log4jLogs WHERE t4 = '[ERROR]';"
    
    #Create an HDInsight Hive job definition
    $hiveJobDefinition = New-AzHDInsightHiveJobDefinition -Query $queryString
    
    #Submit the job to the cluster
    Write-Host "Start the Hive job..." -ForegroundColor Green
    
    $hiveJob = Start-AzHDInsightJob -ClusterName $clusterName -JobDefinition $hiveJobDefinition -HttpCredential $creds
    
    #Wait for the Hive job to complete
    Write-Host "Wait for the job to complete..." -ForegroundColor Green
    Wait-AzHDInsightJob -ClusterName $clusterName -JobId $hiveJob.JobId -HttpCredential $creds
    
    # Print the output
    Write-Host "Display the standard output..." -ForegroundColor Green
    Get-AzHDInsightJobOutput `
        -ClusterName $clusterName `
        -JobId $hiveJob.JobId `
        -HttpCredential $creds
    
  2. Open a new Azure PowerShell command prompt. Change directories to the location of the hivejob.ps1 file, then use the following command to run the script:

     .\hivejob.ps1
    

    When the script runs, you are prompted to enter the cluster name and the HTTPS/Cluster Admin account credentials. You may also be prompted to sign in to your Azure subscription.

  3. When the job completes, it returns information similar to the following text:

     Display the standard output...
     2012-02-03      18:35:34        SampleClass0    [ERROR] incorrect       id
     2012-02-03      18:55:54        SampleClass1    [ERROR] incorrect       id
     2012-02-03      19:25:27        SampleClass4    [ERROR] incorrect       id
    
  4. As mentioned earlier, Invoke-AzHDInsightHiveJob can be used to run a query and wait for the response. Use the following script to see how it works:

    # Login to your Azure subscription
    # Is there an active Azure subscription?
    $sub = Get-AzSubscription -ErrorAction SilentlyContinue
    if(-not($sub))
    {
        Connect-AzAccount -Environment AzureChinaCloud
    }
    
    #Get cluster info
    $clusterName = Read-Host -Prompt "Enter the HDInsight cluster name"
    $creds=Get-Credential -Message "Enter the login for the cluster"
    
    # Set the cluster to use
    Use-AzHDInsightCluster -ClusterName $clusterName -HttpCredential $creds
    
    $queryString = "set hive.execution.engine=tez;" +
                "DROP TABLE log4jLogs;" +
                "CREATE EXTERNAL TABLE log4jLogs(t1 string, t2 string, t3 string, t4 string, t5 string, t6 string, t7 string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ' STORED AS TEXTFILE LOCATION '/example/data/';" +
                "SELECT * FROM log4jLogs WHERE t4 = '[ERROR]';"
    Invoke-AzHDInsightHiveJob `
        -StatusFolder "statusout" `
        -Query $queryString
    

    The output looks like the following text:

     2012-02-03    18:35:34    SampleClass0    [ERROR]    incorrect    id
     2012-02-03    18:55:54    SampleClass1    [ERROR]    incorrect    id
     2012-02-03    19:25:27    SampleClass4    [ERROR]    incorrect    id
    

    Note

    For longer HiveQL queries, you can use PowerShell here-strings or HiveQL script files (a here-string sketch follows at the end of this note). The following snippet shows how to use the Invoke-AzHDInsightHiveJob cmdlet to run a HiveQL script file. The HiveQL script file must be uploaded to wasbs://.

    Invoke-AzHDInsightHiveJob -File "wasbs://<ContainerName>@<StorageAccountName>/<Path>/query.hql"

    For more information about here-strings, see Using Windows PowerShell Here-Strings.
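
    As a minimal here-string sketch, assuming the cluster has already been selected with Use-AzHDInsightCluster as in the previous script:

     # A here-string keeps a multi-line query readable without string concatenation.
     $queryString = @"
     DROP TABLE log4jLogs;
     CREATE EXTERNAL TABLE log4jLogs(t1 string, t2 string, t3 string, t4 string, t5 string, t6 string, t7 string)
     ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
     STORED AS TEXTFILE LOCATION '/example/data/';
     SELECT * FROM log4jLogs WHERE t4 = '[ERROR]';
     "@

     Invoke-AzHDInsightHiveJob -Query $queryString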

Troubleshooting

If no information is returned when the job completes, view the error logs. To view error information for this job, add the following to the end of the hivejob.ps1 file, save it, and then run it again.

# Print the output of the Hive job.
Get-AzHDInsightJobOutput `
        -ClusterName $clusterName `
        -JobId $hiveJob.JobId `
        -HttpCredential $creds `
        -DisplayOutputType StandardError

This cmdlet returns the information that is written to STDERR during job processing.
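
It can also help to check the job's final state before digging through the logs. The following is a small sketch, assuming the variables from hivejob.ps1 are still in scope, that uses the Get-AzHDInsightJob cmdlet:

# Retrieve the job details and inspect the final state of the job.
$jobDetails = Get-AzHDInsightJob `
    -ClusterName $clusterName `
    -JobId $hiveJob.JobId `
    -HttpCredential $creds

$jobDetails.State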

Summary

As you can see, Azure PowerShell provides an easy way to run Hive queries in an HDInsight cluster, monitor the job status, and retrieve the output.

Next steps

For general information about Hive in HDInsight:

For information about other ways you can work with Hadoop on HDInsight: