Use a JAR in an Azure Databricks job
The Java archive or JAR file format is based on the popular ZIP file format and is used for aggregating many Java or Scala files into one. Using the JAR task, you can ensure fast and reliable installation of Java or Scala code in your Azure Databricks jobs. This article provides an example of creating a JAR and a job that runs the application packaged in the JAR. In this example, you will:
- Create the JAR project defining an example application.
- Bundle the example files into a JAR.
- Create a job to run the JAR.
- Run the job and view the results.
Before you begin
You need the following to complete this example:
- For Java JARs, the Java Development Kit (JDK).
- For Scala JARs, the JDK and sbt.
Step 1: Create a local directory for the example
Create a local directory to hold the example code and generated artifacts, for example, databricks_jar_test.
Step 2: Create the JAR
Complete the following instructions to use Java or Scala to create the JAR.
Create a Java JAR
From the databricks_jar_test folder, create a file named PrintArgs.java with the following contents:
import java.util.Arrays;

public class PrintArgs {
  public static void main(String[] args) {
    System.out.println(Arrays.toString(args));
  }
}
Compile the PrintArgs.java file, which creates the file PrintArgs.class:
javac PrintArgs.java
(Optional) Run the compiled program:
java PrintArgs Hello World! # [Hello, World!]
In the same folder as the PrintArgs.java and PrintArgs.class files, create a folder named META-INF.
In the META-INF folder, create a file named MANIFEST.MF with the following contents. Be sure to add a newline at the end of this file:
Main-Class: PrintArgs
From the root of the databricks_jar_test folder, create a JAR named PrintArgs.jar:
jar cvfm PrintArgs.jar META-INF/MANIFEST.MF *.class
(Optional) To test it, from the root of the databricks_jar_test folder, run the JAR:
java -jar PrintArgs.jar Hello World! # [Hello, World!]
Note
If you get the error no main manifest attribute, in PrintArgs.jar, be sure to add a newline to the end of the MANIFEST.MF file, and then try creating and running the JAR again.
Upload PrintArgs.jar to a volume. See Upload files to a Unity Catalog volume.
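Before uploading, the Main-Class entry described in the note above can also be checked programmatically. The following is a minimal sketch that uses the standard java.util.jar API to read the attribute from a JAR. The class name ManifestCheck and its helper methods are hypothetical and not part of the original example; the manifest-only JAR it writes exists only so that the sketch has something to inspect when no path is given.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.jar.Attributes;
import java.util.jar.JarFile;
import java.util.jar.JarOutputStream;
import java.util.jar.Manifest;

// Hypothetical helper for illustration; not part of the PrintArgs example.
class ManifestCheck {

    // Returns the Main-Class value from the JAR's manifest, or null if the
    // JAR has no manifest or the attribute is missing.
    static String readMainClass(Path jarPath) throws IOException {
        try (JarFile jar = new JarFile(jarPath.toFile())) {
            Manifest manifest = jar.getManifest();
            if (manifest == null) {
                return null;
            }
            return manifest.getMainAttributes().getValue(Attributes.Name.MAIN_CLASS);
        }
    }

    // Writes a temporary JAR that contains only a manifest. Pass null to
    // omit the Main-Class attribute.
    static Path writeManifestOnlyJar(String mainClass) throws IOException {
        Manifest manifest = new Manifest();
        Attributes attrs = manifest.getMainAttributes();
        // Manifest-Version must be set, otherwise the main attributes are not written.
        attrs.put(Attributes.Name.MANIFEST_VERSION, "1.0");
        if (mainClass != null) {
            attrs.put(Attributes.Name.MAIN_CLASS, mainClass);
        }
        Path jarPath = Files.createTempFile("manifest-check", ".jar");
        try (OutputStream out = Files.newOutputStream(jarPath);
             JarOutputStream jar = new JarOutputStream(out, manifest)) {
            // The constructor already wrote META-INF/MANIFEST.MF; no other entries are added.
        }
        return jarPath;
    }

    public static void main(String[] args) throws IOException {
        // Inspect the JAR given on the command line, or a generated sample JAR.
        Path jarPath = args.length > 0 ? Paths.get(args[0]) : writeManifestOnlyJar("PrintArgs");
        String mainClass = readMainClass(jarPath);
        System.out.println(mainClass == null
                ? "no main manifest attribute"
                : "Main-Class: " + mainClass);
    }
}
```

Running the sketch with the path to PrintArgs.jar as its only argument should print the Main-Class value if the manifest was packaged correctly.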
Create a Scala JAR
From the databricks_jar_test folder, create a file named build.sbt with the following contents:
ThisBuild / scalaVersion := "2.12.14"
ThisBuild / organization := "com.example"

lazy val PrintArgs = (project in file("."))
  .settings(
    name := "PrintArgs"
  )
From the databricks_jar_test folder, create the folder structure src/main/scala/example.
In the example folder, create a file named PrintArgs.scala with the following contents:
package example

object PrintArgs {
  def main(args: Array[String]): Unit = {
    println(args.mkString(", "))
  }
}
Compile the program:
sbt compile
(Optional) Run the compiled program:
sbt "run Hello World\!" # Hello, World!
In the databricks_jar_test/project folder, create a file named assembly.sbt with the following contents:
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "2.0.0")
From the root of the databricks_jar_test folder, run the assembly command, which generates a JAR under the target folder:
sbt assembly
(Optional) To test it, from the root of the databricks_jar_test folder, run the JAR:
java -jar target/scala-2.12/PrintArgs-assembly-0.1.0-SNAPSHOT.jar Hello World! # Hello, World!
Upload PrintArgs-assembly-0.1.0-SNAPSHOT.jar to a volume. See Upload files to a Unity Catalog volume.
Step 3: Create an Azure Databricks job to run the JAR
- Go to your Azure Databricks landing page and do one of the following:
  - In the sidebar, click Workflows, and then click the button to create a job.
  - In the sidebar, click New and select Job from the menu.
- In the task dialog box that appears on the Tasks tab, replace Add a name for your job… with your job name, for example JAR example.
- For Task name, enter a name for the task, for example java_jar_task for Java, or scala_jar_task for Scala.
- For Type, select JAR.
- For Main class, for this example, enter PrintArgs for Java, or example.PrintArgs for Scala.
- For Cluster, select a compatible cluster. See Java and Scala library support.
- For Dependent libraries, click + Add.
- In the Add dependent library dialog, with Volumes selected, enter the location where you uploaded the JAR (PrintArgs.jar or PrintArgs-assembly-0.1.0-SNAPSHOT.jar) in the previous step into Volumes File Path, or filter or browse to find the JAR. Select it.
- Click Add.
- For Parameters, for this example, enter ["Hello", "World!"].
- Click Add.
Step 4: Run the job and view the job run details
Click Run now to run the workflow. To view details for the run, click View run in the Triggered run pop-up or click the link in the Start time column for the run in the job runs view.
When the run completes, the output displays in the Output panel, including the arguments passed to the task.
Output size limits for JAR jobs
Job output, such as log output emitted to stdout, is subject to a 20MB size limit. If the total output has a larger size, the run is canceled and marked as failed.
To avoid encountering this limit, you can prevent stdout from being returned from the driver to Azure Databricks by setting the spark.databricks.driver.disableScalaOutput Spark configuration to true. By default, the flag value is false. The flag controls cell output for Scala JAR jobs and Scala notebooks. If the flag is enabled, Spark does not return job execution results to the client. The flag does not affect the data that is written in the cluster's log files. Databricks recommends setting this flag only for job clusters for JAR jobs because it disables notebook results.
Recommendation: Use the shared SparkContext
Because Azure Databricks is a managed service, some code changes might be necessary to ensure that your Apache Spark jobs run correctly. JAR job programs must use the shared SparkContext API to get the SparkContext. Because Azure Databricks initializes the SparkContext, programs that invoke new SparkContext() will fail. To get the SparkContext, use only the shared SparkContext created by Azure Databricks:
val goodSparkContext = SparkContext.getOrCreate()
val goodSparkSession = SparkSession.builder().getOrCreate()
There are also several methods you should avoid when using the shared SparkContext.
- Do not call SparkContext.stop().
- Do not call System.exit(0) or sc.stop() at the end of your Main program. This can cause undefined behavior.
Recommendation: Use try-finally blocks for job clean up
Consider a JAR that consists of two parts:
- jobBody(), which contains the main part of the job.
- jobCleanup(), which has to be executed after jobBody(), whether that function succeeded or threw an exception.
For example, jobBody() creates tables and jobCleanup() drops those tables.
The safe way to ensure that the clean-up method is called is to put a try-finally block in the code:
try {
jobBody()
} finally {
jobCleanup()
}
You should not try to clean up using sys.addShutdownHook(jobCleanup) or the following code:
val cleanupThread = new Thread { override def run = jobCleanup() }
Runtime.getRuntime.addShutdownHook(cleanupThread)
Because of the way the lifetime of Spark containers is managed in Azure Databricks, the shutdown hooks are not run reliably.
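For JARs written in Java, the same try-finally pattern applies. The following is a minimal sketch, not Databricks-specific code: the class CleanupExample, the fail flag, and the events list are hypothetical and exist only to make the call order observable. In a real job, jobBody() and jobCleanup() would create and drop tables as described above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the try-finally recommendation in Java.
class CleanupExample {

    // Records the order of calls so the behavior can be inspected.
    static final List<String> events = new ArrayList<>();

    // Stand-in for the main part of the job; fails on request.
    static void jobBody(boolean fail) {
        events.add("jobBody");
        if (fail) {
            throw new IllegalStateException("jobBody failed");
        }
    }

    // Stand-in for the clean-up step, for example dropping tables.
    static void jobCleanup() {
        events.add("jobCleanup");
    }

    // Runs the job so that clean-up happens whether or not jobBody throws.
    static void runJob(boolean fail) {
        try {
            jobBody(fail);
        } finally {
            jobCleanup();
        }
    }

    public static void main(String[] args) {
        runJob(false);
        try {
            runJob(true);
        } catch (IllegalStateException e) {
            events.add("caught: " + e.getMessage());
        }
        // Prints [jobBody, jobCleanup, jobBody, jobCleanup, caught: jobBody failed]
        System.out.println(events);
    }
}
```

In both runs jobCleanup() is recorded immediately after jobBody(), and the exception from the failing run still propagates to the caller after clean-up completes.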
Configuring JAR job parameters
You pass parameters to JAR jobs with a JSON string array. See the spark_jar_task object in the request body passed to the Create a new job operation (POST /jobs/create) in the Jobs API. To access these parameters, inspect the String array passed into your main function.
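As an illustration of inspecting that String array, the following sketch parses job parameters into a map. The --key=value convention and the class name JobArgs are assumptions made for this example; Azure Databricks passes the array elements through as plain strings and does not require any particular format.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper; the --key=value format is an assumed convention.
class JobArgs {

    // Parses arguments of the form --key=value into an ordered map and
    // rejects anything that does not match that form.
    static Map<String, String> parse(String[] args) {
        Map<String, String> options = new LinkedHashMap<>();
        for (String arg : args) {
            int eq = arg.indexOf('=');
            // Require the "--" prefix and at least one character for the key.
            if (!arg.startsWith("--") || eq < 3) {
                throw new IllegalArgumentException("Expected --key=value but got: " + arg);
            }
            options.put(arg.substring(2, eq), arg.substring(eq + 1));
        }
        return options;
    }

    public static void main(String[] args) {
        // The job's Parameters array arrives here as args.
        System.out.println(parse(args));
    }
}
```

With the job parameters set to ["--date=2024-01-01", "--mode=full"], this sketch would print {date=2024-01-01, mode=full}.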
Manage library dependencies
The Spark driver has certain library dependencies that cannot be overridden. If your job adds conflicting libraries, the Spark driver library dependencies take precedence.
To get the full list of the driver library dependencies, run the following command in a notebook attached to a cluster configured with the same Spark version (or the cluster with the driver you want to examine):
%sh
ls /databricks/jars
When you define library dependencies for JARs, Databricks recommends listing Spark and Hadoop as provided dependencies. In Maven, add Spark and Hadoop as provided dependencies:
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>2.3.0</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-core</artifactId>
<version>1.2.1</version>
<scope>provided</scope>
</dependency>
In sbt, add Spark and Hadoop as provided dependencies. Hadoop is a Java library without a Scala version suffix, so it uses % rather than %%:
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.3.0" % "provided"
libraryDependencies += "org.apache.hadoop" % "hadoop-core" % "1.2.1" % "provided"
Tip
Specify the correct Scala version for your dependencies based on the version you are running.
Next steps
To learn more about creating and running Azure Databricks jobs, see Schedule and orchestrate workflows.