Create an Azure Databricks compatible JAR

This page describes how to create a JAR with Scala or Java code that is compatible with your Azure Databricks workspace.

At a high level, your JAR must meet the following requirements for compatibility:

  • Your Java Development Kit (JDK) version matches the JDK version on your Databricks cluster compute.

  • For Scala, your version of Scala matches the Scala version on your Databricks cluster compute.

  • Databricks Connect is added as a dependency and matches the version running on your Databricks cluster, or your Spark dependencies are compatible with your Databricks environment.

  • The local project you are compiling is packaged as a single JAR and includes all unprovided dependencies. Alternatively, you can add them to your environment or cluster.

  • The code in your JAR file correctly interacts with the Spark session or context.

  • For standard compute, all JARs used are added to the allowlist.

Databricks Connect and Databricks Runtime versioning

When creating a JAR to run in Azure Databricks, it is helpful to understand how you call Spark APIs and what version of the APIs you are calling.

Databricks Connect

Databricks Connect implements the Spark Connect architecture, which separates client and server components. This separation allows you to efficiently share clusters while fully enforcing Unity Catalog governance with measures such as row filters and column masks. However, Unity Catalog clusters in standard access mode have some limitations, for example, a lack of support for APIs such as Spark Context and RDDs. These limitations are listed in Standard compute requirements and limitations.

Databricks Connect gives you access to all Spark functionality, including Spark Connect, and is included with standard compute. For this compute type, Databricks Connect is required because it provides all necessary Spark APIs.
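For example, a minimal sketch in Scala of how code built with Databricks Connect obtains a session, using the DatabricksSession builder (the same builder appears in the full example later on this page):

import com.databricks.connect.DatabricksSession
import org.apache.spark.sql.SparkSession

// DatabricksSession connects to your remote compute over Spark Connect
// and returns a standard SparkSession.
val spark: SparkSession = DatabricksSession.builder().getOrCreate()
println(spark.range(5).count())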

Databricks Runtime

The Databricks Runtime runs on compute managed by Azure Databricks. It is based on Spark but includes performance improvements and other enhancements for ease of use.

On standard compute, Databricks Connect provides APIs that call into the Databricks Runtime running on the compute. On dedicated compute, you compile against the Spark APIs, which are backed by the Databricks Runtime on the compute.

Find the correct versions for your compute

To compile a compatible JAR file, you must know the version of Databricks Connect and Databricks Runtime that your compute is running.

Compute in standard mode: Uses Databricks Runtime and provides Databricks Connect to call APIs. To find the Databricks Runtime version for your compute:

  • In the workspace, click Compute in the sidebar, and select your compute. The Databricks Runtime version is displayed in the configuration details. The major and minor version values of the Databricks Connect version match the major and minor version values of the Databricks Runtime version.

Compute in dedicated mode: Uses Databricks Runtime and allows you to compile against Spark APIs directly. To find the Databricks Runtime version for your compute cluster:

  • In the workspace, click Compute in the sidebar, and select your compute. The Databricks Runtime version is displayed in the configuration details.
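As an alternative to the UI, you can retrieve a cluster's Databricks Runtime version with the Databricks CLI; the spark_version field in the output contains it. The cluster ID shown here is a placeholder:

databricks clusters get 0123-456789-abcdef12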

JDK and Scala versions

When you build a JAR, the Java Development Kit (JDK) and Scala versions that you use to compile your code must match the versions running on your compute.

For standard compute, use the Databricks Connect version to find the compatible JDK and Scala versions. See the version support matrix.

If you are using dedicated compute, you must match the JDK and Scala versions of the Databricks Runtime on the compute. The System environment section of Databricks Runtime release notes versions and compatibility lists the correct Java and Scala versions for each Databricks Runtime version. For example, for Databricks Runtime 17.3 LTS, see Databricks Runtime 17.3 LTS.

Note

Using a JDK or Scala version that doesn't match your compute's JDK or Scala versions may cause unexpected behavior or prevent your code from running.
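For example, a minimal build.sbt sketch that pins the compiler targets, assuming a Databricks Runtime that uses Scala 2.13 and JDK 17; replace the versions with the ones that match your compute:

// Match the Scala version on the compute.
scalaVersion := "2.13.16"

// Compile Java and Scala sources for the JDK version on the compute.
javacOptions ++= Seq("--release", "17")
scalacOptions ++= Seq("-release", "17")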

Dependencies

You must set up your dependencies correctly in your build file.

Databricks Connect or Apache Spark

For standard compute, Databricks recommends adding a dependency on Databricks Connect instead of Spark to build JARs. Databricks Runtime is not identical to Spark, and includes performance and stability improvements. Databricks Connect provides the Spark APIs that are available in Azure Databricks. To include Databricks Connect, add a dependency:

Java

In the Maven pom.xml file:

<dependency>
  <groupId>com.databricks</groupId>
  <artifactId>databricks-connect_2.13</artifactId>
  <version>17.0.2</version>
</dependency>

Scala

In the build.sbt file:

libraryDependencies += "com.databricks" %% "databricks-connect" % "17.0.2"

Note

The Databricks Connect version must match the version included in the Databricks Runtime of your cluster.

Databricks recommends depending on Databricks Connect. If you do not want to use Databricks Connect, compile against spark-sql-api. Add this specific Spark library to your dependencies, but do not include the library in your JAR. In the build file, configure the scope for the dependency as provided:

Java

In the Maven pom.xml file:

<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-sql-api</artifactId>
  <version>4.0.1</version>
  <scope>provided</scope>
</dependency>

Scala

In the build.sbt file:

libraryDependencies += "org.apache.spark" %% "spark-sql-api" % "4.0.1" % "provided"

Spark dependencies

For standard compute, do not include any other Spark dependencies in your project. Using Databricks Connect provides all of the necessary Spark session APIs.

Classic compute and Databricks Runtime provided libraries

If you are running on classic compute (in either dedicated or standard mode), the Databricks Runtime includes many common libraries. Find the libraries and versions that are included in the System Environment section of the Databricks Runtime release notes for your Databricks Runtime version. For example, the Databricks Runtime 17.3 LTS System Environment section lists the versions of each library available in the Databricks Runtime.

To compile against one of these libraries, add it as a dependency with the provided option. For example, in Databricks Runtime 17.3 LTS, the protobuf-java library is provided, and you can compile against it with the following configuration:

Java

In the Maven pom.xml:

<dependency>
  <groupId>com.google.protobuf</groupId>
  <artifactId>protobuf-java</artifactId>
  <version>3.25.5</version>
  <scope>provided</scope>
</dependency>

Scala

In build.sbt:

libraryDependencies += "com.google.protobuf" %% "protobuf-java" % "3.25.5" % "provided"

Non-provided libraries

For libraries that aren't available in the Databricks Runtime, you can include them in your JAR yourself. For example, to include circe-core, add the following line to your build.sbt file:

libraryDependencies += "io.circe" %% "circe-core" % "0.14.10"

Package as a single JAR

Databricks recommends packaging your application and all dependencies into a single JAR file, also known as an über or fat JAR. For sbt, use sbt-assembly, and for Maven, use maven-shade-plugin. See the official Maven Shade Plugin and sbt-assembly documentation for details.
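If the assembled JAR fails to build because of duplicate files contributed by different dependencies, a common starting point is an sbt-assembly merge strategy such as the following sketch in build.sbt; the rules shown are assumptions to adapt for your project:

// Discard dependency metadata that commonly conflicts and keep the first copy of anything else.
assembly / assemblyMergeStrategy := {
  case PathList("META-INF", _*) => MergeStrategy.discard
  case _                        => MergeStrategy.first
}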

Alternatively, you can install dependencies as cluster-scoped libraries. See compute-scoped libraries for more information. If you install libraries on your cluster, declare those dependencies with provided scope in your build file so they aren't packaged into your JAR.

Note

For Scala JARs installed as libraries on Unity Catalog standard clusters, classes in the JAR libraries must be in a named package, such as com.databricks.MyClass, or errors will occur when importing the library.

Using the Spark session in your code

When you are running a JAR within a job, you must use the Spark session that is provided by Azure Databricks for the job. The following code shows how to access the session from your code:

Java

SparkSession spark = SparkSession.builder().getOrCreate();

Scala

val spark = SparkSession.builder().getOrCreate()

Ensure your JAR is allowlisted (standard compute)

For security reasons, standard access mode requires an administrator to add Maven coordinates and paths for JAR libraries to an allowlist. See Allowlist libraries and init scripts on compute with standard access mode (formerly shared access mode).

Recommendation: Use try-finally blocks for job cleanup

If you want code that reliably runs at the end of your job, for example, to clean up temporary files that were created during the job, use a try-finally block. Do not use a shutdown hook, because shutdown hooks do not run reliably in jobs.

Consider a JAR that consists of two parts:

  • jobBody() which contains the main part of the job.
  • jobCleanup() which has to be executed after jobBody(), whether that function succeeded or threw an exception.

For example, jobBody() creates tables and jobCleanup() drops those tables.

The safe way to ensure that the clean-up method is called is to put a try-finally block in the code:

try {
  jobBody()
} finally {
  jobCleanup()
}
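For illustration, the two functions might look like the following sketch, assuming a SparkSession named spark is in scope; the table name is a hypothetical placeholder:

// Hypothetical example: jobBody() creates a working table and jobCleanup() drops it.
def jobBody(): Unit = {
  spark.sql("CREATE TABLE IF NOT EXISTS main.default.tmp_job_results AS SELECT * FROM range(100)")
  // ... main job logic ...
}

def jobCleanup(): Unit = {
  // Runs in the finally block, whether or not jobBody() threw an exception.
  spark.sql("DROP TABLE IF EXISTS main.default.tmp_job_results")
}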

You should not try to clean up using sys.addShutdownHook(jobCleanup) or the following code:

// Do NOT clean up with a shutdown hook like this. This will fail.
val cleanupThread = new Thread { override def run = jobCleanup() }
Runtime.getRuntime.addShutdownHook(cleanupThread)

Because of the way the lifetime of Spark containers is managed in Azure Databricks, the shutdown hooks are not run reliably.

Reading job parameters

Parameters are passed to your JAR job as a JSON string array. To access these parameters, inspect the String array passed into your main function.

For more details on parameters, see Parameterize jobs.
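For example, a minimal sketch of a main entry point that reads positional job parameters; the parameter meanings and defaults are hypothetical:

object ParamsExample {
  def main(args: Array[String]): Unit = {
    // Job parameters arrive as a plain String array, in the order configured on the job.
    val inputPath = if (args.length > 0) args(0) else "/tmp/input" // hypothetical first parameter
    val limit = if (args.length > 1) args(1).toInt else 10 // hypothetical second parameter
    println(s"inputPath=$inputPath, limit=$limit")
  }
}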

Build a JAR

The following steps take you through creating and compiling a simple JAR file using Scala or Java to work in Azure Databricks.

Requirements

Your local development environment must have the following:

  • Java Development Kit (JDK) 17.
  • sbt (for Scala JARs).
  • Databricks CLI version 0.218.0 or above. To check your installed version of the Databricks CLI, run the command databricks -v. To install the Databricks CLI, see Install or update the Databricks CLI.
  • Databricks CLI authentication is configured with a DEFAULT profile. To configure authentication, see Configure access to your workspace.

Create a Scala JAR

  1. Run the following command to create a new Scala project:

    > sbt new scala/scala-seed.g8
    

    When prompted, enter a project name, for example, my-spark-app.

  2. Replace the contents of your build.sbt file with the following. Choose the Scala and Databricks Connect versions that are needed for your compute. See Dependencies.

    scalaVersion := "2.13.16"
    libraryDependencies += "com.databricks" %% "databricks-connect" % "17.0.1"
    // other dependencies go here...
    
    // Forking is required so the javaOptions below are applied; otherwise the code runs with the same JVM options as the sbt process.
    fork := true
    javaOptions += "--add-opens=java.base/java.nio=ALL-UNNAMED"
    
  3. Edit or create a project/assembly.sbt file, and add this line:

    addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "2.3.1")
    
  4. Create your main class in src/main/scala/com/examples/DatabricksExample.scala:

    package com.examples
    
    import com.databricks.connect.DatabricksSession
    import org.apache.spark.sql.SparkSession
    
    object SparkJar {
      def main(args: Array[String]): Unit = {
        val spark: SparkSession = DatabricksSession.builder().getOrCreate()
    
        // Prints the arguments to the class, which
        // are job parameters when run as a job:
        println(args.mkString(", "))
    
        // Shows using spark:
        println(spark.version)
        println(spark.range(10).limit(3).collect().mkString(" "))
      }
    }
    
  5. To build your JAR file, run the following command:

    > sbt assembly
    

Create a Java JAR

  1. Create a folder for your JAR.

  2. In the folder, create a file named PrintArgs.java with the following contents:

    import java.util.Arrays;
    
    public class PrintArgs {
      public static void main(String[] args) {
        System.out.println(Arrays.toString(args));
      }
    }
    
  3. Compile the PrintArgs.java file, which creates the file PrintArgs.class:

    javac PrintArgs.java
    
  4. (Optional) Run the compiled program:

    java PrintArgs Hello World!
    
    # [Hello, World!]
    
  5. In the folder, create a pom.xml file, and add the following code to enable the Maven Shade plugin.

    <build>
      <plugins>
        <plugin>
          <groupId>org.apache.maven.plugins</groupId>
          <artifactId>maven-shade-plugin</artifactId>
          <version>3.6.0</version>
          <executions>
            <execution>
              <phase>package</phase>
              <goals><goal>shade</goal></goals>
            </execution>
          </executions>
        </plugin>
      </plugins>
    </build>
    
  6. In the JAR folder, create a folder named META-INF.

  7. In the META-INF folder, create a file named MANIFEST.MF with the following contents. Be sure to add a newline at the end of this file:

    Main-Class: PrintArgs
    
  8. From your JAR folder, create a JAR named PrintArgs.jar:

    jar cvfm PrintArgs.jar META-INF/MANIFEST.MF *.class
    
  9. (Optional) To test it, run the JAR:

    java -jar PrintArgs.jar Hello World!
    
    # [Hello, World!]
    

    Note

    If you get the error no main manifest attribute, in PrintArgs.jar, be sure to add a newline to the end of the MANIFEST.MF file, and then try creating and running the JAR again.

Next steps