Use a Java UDF with Apache Hive in HDInsight
Learn how to create a Java-based user-defined function (UDF) that works with Apache Hive. The Java UDF in this example converts a table of text strings to all-lowercase characters.
Prerequisites
A Hadoop cluster on HDInsight. See Get Started with HDInsight on Linux.
Apache Maven properly installed according to Apache. Maven is a project build system for Java projects.
The URI scheme for your clusters primary storage. This would be wasb:// for Azure Storage, abfs:// for Azure Data Lake Storage Gen2. If secure transfer is enabled for Azure Storage or Data Lake Storage Gen2, the URI would be wasbs:// or abfss://, respectively See also, secure transfer.
A text editor or Java IDE
Important
If you create the Python files on a Windows client, you must use an editor that uses LF as a line ending. If you are not sure whether your editor uses LF or CRLF, see the Troubleshooting section for steps on removing the CR character.
Test environment
The environment used for this article was a computer running Windows 10. The commands were executed in a command prompt, and the various files were edited with Notepad. Modify accordingly for your environment.
From a command prompt, enter the commands below to create a working environment:
IF NOT EXIST C:\HDI MKDIR C:\HDI
cd C:\HDI
Create an example Java UDF
Create a new Maven project by entering the following command:
mvn archetype:generate -DgroupId=com.microsoft.examples -DartifactId=ExampleUDF -DarchetypeArtifactId=maven-archetype-quickstart -DinteractiveMode=false
This command creates a directory named
exampleudf
, which contains the Maven project.Once the project has been created, delete the
exampleudf/src/test
directory that was created as part of the project by entering the following command:cd ExampleUDF rmdir /S /Q "src/test"
Open
pom.xml
by entering the command below:notepad pom.xml
Then replace the existing
<dependencies>
entry with the following XML:<dependencies> <dependency> <groupId>org.apache.hadoop</groupId> <artifactId>hadoop-client</artifactId> <version>2.7.3</version> <scope>provided</scope> </dependency> <dependency> <groupId>org.apache.hive</groupId> <artifactId>hive-exec</artifactId> <version>1.2.1</version> <scope>provided</scope> </dependency> </dependencies>
These entries specify the version of Hadoop and Hive included with HDInsight 3.6. You can find information on the versions of Hadoop and Hive provided with HDInsight from the HDInsight component versioning document.
Add a
<build>
section before the</project>
line at the end of the file. This section should contain the following XML:<build> <plugins> <!-- build for Java 1.8. This is required by HDInsight 3.6 --> <plugin> <groupId>org.apache.maven.plugins</groupId> <artifactId>maven-compiler-plugin</artifactId> <version>3.3</version> <configuration> <source>1.8</source> <target>1.8</target> </configuration> </plugin> <!-- build an uber jar --> <plugin> <groupId>org.apache.maven.plugins</groupId> <artifactId>maven-shade-plugin</artifactId> <version>3.2.1</version> <configuration> <!-- Keep us from getting a can't overwrite file error --> <transformers> <transformer implementation="org.apache.maven.plugins.shade.resource.ApacheLicenseResourceTransformer"> </transformer> <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"> </transformer> </transformers> <!-- Keep us from getting a bad signature error --> <filters> <filter> <artifact>*:*</artifact> <excludes> <exclude>META-INF/*.SF</exclude> <exclude>META-INF/*.DSA</exclude> <exclude>META-INF/*.RSA</exclude> </excludes> </filter> </filters> </configuration> <executions> <execution> <phase>package</phase> <goals> <goal>shade</goal> </goals> </execution> </executions> </plugin> </plugins> </build>
These entries define how to build the project. Specifically, the version of Java that the project uses and how to build an uberjar for deployment to the cluster.
Save the file once the changes have been made.
Enter the command below to create and open a new file
ExampleUDF.java
:notepad src/main/java/com/microsoft/examples/ExampleUDF.java
Then copy and paste the Java code below into the new file. Then close the file.
package com.microsoft.examples; import org.apache.hadoop.hive.ql.exec.Description; import org.apache.hadoop.hive.ql.exec.UDF; import org.apache.hadoop.io.*; // Description of the UDF @Description( name="ExampleUDF", value="returns a lower case version of the input string.", extended="select ExampleUDF(deviceplatform) from hivesampletable limit 10;" ) public class ExampleUDF extends UDF { // Accept a string input public String evaluate(String input) { // If the value is null, return a null if(input == null) return null; // Lowercase the input string and return it return input.toLowerCase(); } }
This code implements a UDF that accepts a string value, and returns a lowercase version of the string.
Build and install the UDF
In the commands below, replace sshuser
with the actual username if different. Replace mycluster
with the actual cluster name.
Compile and package the UDF by entering the following command:
mvn compile package
This command builds and packages the UDF into the
exampleudf/target/ExampleUDF-1.0-SNAPSHOT.jar
file.Use the
scp
command to copy the file to the HDInsight cluster by entering the following command:scp ./target/ExampleUDF-1.0-SNAPSHOT.jar sshuser@mycluster-ssh.azurehdinsight.cn:
Connect to the cluster using SSH by entering the following command:
ssh sshuser@mycluster-ssh.azurehdinsight.cn
From the open SSH session, copy the jar file to HDInsight storage.
hdfs dfs -put ExampleUDF-1.0-SNAPSHOT.jar /example/jars
Use the UDF from Hive
Start the Beeline client from the SSH session by entering the following command:
beeline -u 'jdbc:hive2://localhost:10001/;transportMode=http'
This command assumes that you used the default of admin for the login account for your cluster.
Once you arrive at the
jdbc:hive2://localhost:10001/>
prompt, enter the following to add the UDF to Hive and expose it as a function.ADD JAR wasbs:///example/jars/ExampleUDF-1.0-SNAPSHOT.jar; CREATE TEMPORARY FUNCTION tolower as 'com.microsoft.examples.ExampleUDF';
Use the UDF to convert values retrieved from a table to lower case strings.
SELECT tolower(state) AS ExampleUDF, state FROM hivesampletable LIMIT 10;
This query selects the state from the table, convert the string to lower case, and then display them along with the unmodified name. The output appears similar to the following text:
+---------------+---------------+--+ | exampleudf | state | +---------------+---------------+--+ | california | California | | pennsylvania | Pennsylvania | | pennsylvania | Pennsylvania | | pennsylvania | Pennsylvania | | colorado | Colorado | | colorado | Colorado | | colorado | Colorado | | utah | Utah | | utah | Utah | | colorado | Colorado | +---------------+---------------+--+
Troubleshooting
When running the hive job, you may come across an error similar to the following text:
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: [Error 20001]: An error occurred while reading or writing to your custom script. It may have crashed with an error.
This problem may be caused by the line endings in the Python file. Many Windows editors default to using CRLF as the line ending, but Linux applications usually expect LF.
You can use the following PowerShell statements to remove the CR characters before uploading the file to HDInsight:
# Set $original_file to the Python file path
$text = [IO.File]::ReadAllText($original_file) -replace "`r`n", "`n"
[IO.File]::WriteAllText($original_file, $text)
Next steps
For other ways to work with Hive, see Use Apache Hive with HDInsight.
For more information on Hive User-Defined Functions, see Apache Hive Operators and User-Defined Functions section of the Hive wiki at apache.org.