Use a Java UDF with Apache Hive in HDInsight

Learn how to create a Java-based user-defined function (UDF) that works with Apache Hive. The Java UDF in this example converts a table of text strings to all-lowercase characters.

Requirements

  • An HDInsight cluster

    Important

    Linux is the only operating system used on HDInsight version 3.4 or greater. Most steps in this document work on both Windows- and Linux-based clusters. However, the steps used to upload the compiled UDF to the cluster and run it are specific to Linux-based clusters. Links are provided to information that can be used with Windows-based clusters.

  • Java JDK 8 or later (or an equivalent, such as OpenJDK)

  • Apache Maven

  • A text editor or Java IDE

    Important

    If you create the source files on a Windows client, you must use an editor that uses LF as the line ending. If you are not sure whether your editor uses LF or CRLF, see the Troubleshooting section for steps on removing the CR characters.

Create an example Java UDF

  1. From a command line, use the following command to create a new Maven project:

    mvn archetype:generate -DgroupId=com.microsoft.examples -DartifactId=ExampleUDF -DarchetypeArtifactId=maven-archetype-quickstart -DinteractiveMode=false
    

    Note

    If you are using PowerShell, you must put quotes around the parameters. For example: mvn archetype:generate "-DgroupId=com.microsoft.examples" "-DartifactId=ExampleUDF" "-DarchetypeArtifactId=maven-archetype-quickstart" "-DinteractiveMode=false"

    This command creates a directory named ExampleUDF, which contains the Maven project.

  2. Once the project has been created, delete the ExampleUDF/src/test directory that was created as part of the project.

    Then edit ExampleUDF/pom.xml and replace the existing <dependencies> entry with the following XML:

    <dependencies>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>2.7.3</version>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.hive</groupId>
            <artifactId>hive-exec</artifactId>
            <version>1.2.1</version>
            <scope>provided</scope>
        </dependency>
    </dependencies>
    

    These entries specify the versions of Hadoop and Hive included with HDInsight 3.6. You can find information on the versions of Hadoop and Hive provided with HDInsight in the HDInsight component versioning document.

    Add a <build> section before the </project> line at the end of the file. This section should contain the following XML:

    <build>
        <plugins>
            <!-- build for Java 1.8. This is required by HDInsight 3.6  -->
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.3</version>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                </configuration>
            </plugin>
            <!-- build an uber jar -->
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-shade-plugin</artifactId>
                <version>3.2.1</version>
                <configuration>
                    <!-- Keep us from getting a can't overwrite file error -->
                    <transformers>
                        <transformer
                                implementation="org.apache.maven.plugins.shade.resource.ApacheLicenseResourceTransformer">
                        </transformer>
                        <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer">
                        </transformer>
                    </transformers>
                    <!-- Keep us from getting a bad signature error -->
                    <filters>
                        <filter>
                            <artifact>*:*</artifact>
                            <excludes>
                                <exclude>META-INF/*.SF</exclude>
                                <exclude>META-INF/*.DSA</exclude>
                                <exclude>META-INF/*.RSA</exclude>
                            </excludes>
                        </filter>
                    </filters>
                </configuration>
                <executions>
                    <execution>
                        <phase>package</phase>
                        <goals>
                            <goal>shade</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
    

    These entries define how to build the project. Specifically, they set the version of Java that the project uses and how to build an uberjar for deployment to the cluster.

    Save the file once the changes have been made.

  3. Enter the command below to create and open a new file, ExampleUDF.java:

    notepad src/main/java/com/microsoft/examples/ExampleUDF.java
    

    Copy and paste the Java code below into the new file, then save and close it.

    package com.microsoft.examples;
    
    import org.apache.hadoop.hive.ql.exec.Description;
    import org.apache.hadoop.hive.ql.exec.UDF;
    import org.apache.hadoop.io.*;
    
    // Description of the UDF
    @Description(
        name="ExampleUDF",
        value="returns a lower case version of the input string.",
        extended="select ExampleUDF(deviceplatform) from hivesampletable limit 10;"
    )
    public class ExampleUDF extends UDF {
        // Accept a string input
        public String evaluate(String input) {
            // If the value is null, return a null
            if(input == null)
                return null;
            // Lowercase the input string and return it
            return input.toLowerCase();
        }
    }
    

    This code implements a UDF that accepts a string value and returns a lowercase version of the string.
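
    Because the evaluate method has no cluster dependencies beyond the UDF base class, you can smoke-test it locally before deploying. The following standalone class is a minimal sketch, not part of the generated project; the LocalTest name is hypothetical, and hive-exec must be on the classpath to load the UDF superclass.

    package com.microsoft.examples;

    // Minimal local smoke test for ExampleUDF (hypothetical helper class).
    public class LocalTest {
        public static void main(String[] args) {
            ExampleUDF udf = new ExampleUDF();
            // Expected output: "california"
            System.out.println(udf.evaluate("California"));
            // Null input should pass through as null
            System.out.println(udf.evaluate(null));
        }
    }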

Build and install the UDF

In the commands below, replace sshuser with the actual username if different. Replace mycluster with the actual cluster name.

  1. Compile and package the UDF by entering the following command:

    mvn compile package
    

    This command builds and packages the UDF into the ExampleUDF/target/ExampleUDF-1.0-SNAPSHOT.jar file.

  2. Use the scp command to copy the file to the HDInsight cluster by entering the following command:

    scp ./target/ExampleUDF-1.0-SNAPSHOT.jar sshuser@mycluster-ssh.azurehdinsight.cn:
    
  3. Connect to the cluster using SSH by entering the following command:

    ssh sshuser@mycluster-ssh.azurehdinsight.cn
    
  4. From the open SSH session, copy the jar file to HDInsight storage:

    hdfs dfs -put ExampleUDF-1.0-SNAPSHOT.jar /example/jars
    

Use the UDF from Hive

  1. Start the Beeline client from the SSH session by entering the following command:

    beeline -u 'jdbc:hive2://localhost:10001/;transportMode=http'
    

    This command assumes that you used the default login account of admin for your cluster.

  2. Once you arrive at the jdbc:hive2://localhost:10001/> prompt, enter the following to add the UDF to Hive and expose it as a function:

    ADD JAR wasbs:///example/jars/ExampleUDF-1.0-SNAPSHOT.jar;
    CREATE TEMPORARY FUNCTION tolower as 'com.microsoft.examples.ExampleUDF';
    
  3. Use the UDF to convert values retrieved from a table to lowercase strings:

    SELECT tolower(state) AS ExampleUDF, state FROM hivesampletable LIMIT 10;
    

    This query selects the state from the table, converts the string to lowercase, and then displays it along with the unmodified value. The output appears similar to the following text:

     +---------------+---------------+--+
     |  exampleudf   |     state     |
     +---------------+---------------+--+
     | california    | California    |
     | pennsylvania  | Pennsylvania  |
     | pennsylvania  | Pennsylvania  |
     | pennsylvania  | Pennsylvania  |
     | colorado      | Colorado      |
     | colorado      | Colorado      |
     | colorado      | Colorado      |
     | utah          | Utah          |
     | utah          | Utah          |
     | colorado      | Colorado      |
     +---------------+---------------+--+
    

Troubleshooting

When running the Hive job, you may come across an error similar to the following text:

Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: [Error 20001]: An error occurred while reading or writing to your custom script. It may have crashed with an error.

This problem may be caused by the line endings in the source file. Many Windows editors default to CRLF line endings, but Linux applications usually expect LF.

You can use the following PowerShell statements to remove the CR characters before uploading the file to HDInsight:

# Set $original_file to the source file path
$text = [IO.File]::ReadAllText($original_file) -replace "`r`n", "`n"
[IO.File]::WriteAllText($original_file, $text)
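
If you prefer to do the same normalization in Java, the following short program is a minimal sketch; the FixLineEndings class name is hypothetical, and the file path is passed as the first command-line argument.

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Replaces CRLF line endings with LF in the file named by args[0].
// Assumes the file is UTF-8 encoded; adjust the charset if needed.
public class FixLineEndings {
    public static void main(String[] args) throws IOException {
        Path file = Paths.get(args[0]);
        String text = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
        Files.write(file, text.replace("\r\n", "\n").getBytes(StandardCharsets.UTF_8));
    }
}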

Next steps

For other ways to work with Hive, see Use Apache Hive with HDInsight.

For more information on Hive user-defined functions, see the Apache Hive Operators and User-Defined Functions section of the Hive wiki at apache.org.