Install Databricks Connect for Scala
Note
This article covers Databricks Connect for Databricks Runtime 13.3 LTS and above.
This article describes how to install Databricks Connect for Scala. See What is Databricks Connect?. For the Python version of this article, see Install Databricks Connect for Python.
Requirements
- Your target Azure Databricks workspace and cluster must meet the requirements for Cluster configuration for Databricks Connect.
- The Java Development Kit (JDK) installed on your development machine. Databricks recommends that the version of your JDK installation matches the JDK version on your Azure Databricks cluster. To find the JDK version on your cluster, refer to the "System environment" section of the Databricks Runtime release notes for your cluster. For instance, `Zulu 8.70.0.23-CA-linux64` corresponds to JDK 8. See Databricks Runtime release notes versions and compatibility.
- Scala installed on your development machine. Databricks recommends that the version of your Scala installation matches the Scala version on your Azure Databricks cluster. To find the Scala version on your cluster, refer to the "System environment" section of the Databricks Runtime release notes for your cluster. See Databricks Runtime release notes versions and compatibility.
- A Scala build tool on your development machine, such as `sbt`.
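To compare your local toolchain against the cluster, you can print the local JDK and Scala versions from Scala itself. This is a minimal sketch, not part of Databricks Connect; the version strings in the comments are only examples:

```scala
// Minimal sketch: print the local JDK and Scala versions so you can compare
// them against the "System environment" section of your cluster's
// Databricks Runtime release notes.
object VersionCheck extends App {
  // For example, "1.8.0_372" corresponds to JDK 8.
  println(s"JDK version:   ${System.getProperty("java.version")}")
  // For example, "version 2.12.15".
  println(s"Scala version: ${scala.util.Properties.versionString}")
}
```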
Set up the client
After you meet the requirements for Databricks Connect, complete the following steps to set up the Databricks Connect client.
Step 1: Add a reference to the Databricks Connect client
In your Scala project's build file, such as `build.sbt` for sbt, `pom.xml` for Maven, or `build.gradle` for Gradle, add the following reference to the Databricks Connect client:

Sbt

```scala
libraryDependencies += "com.databricks" % "databricks-connect" % "14.0.0"
```

Maven

```xml
<dependency>
  <groupId>com.databricks</groupId>
  <artifactId>databricks-connect</artifactId>
  <version>14.0.0</version>
</dependency>
```

Gradle

```groovy
implementation 'com.databricks:databricks-connect:14.0.0'
```
Replace `14.0.0` with the version of the Databricks Connect library that matches the Databricks Runtime version on your cluster. You can find the Databricks Connect library version numbers in the Maven Central Repository.
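For sbt, the dependency line above sits inside a complete `build.sbt`. The following is a hedged minimal sketch; the `scalaVersion` shown is a placeholder that you should match to the Scala version on your cluster:

```scala
// Hypothetical minimal build.sbt for a Databricks Connect project.
// Match scalaVersion to the Scala version on your cluster, and the
// databricks-connect version to your cluster's Databricks Runtime version.
scalaVersion := "2.12.15"

libraryDependencies += "com.databricks" % "databricks-connect" % "14.0.0"
```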
Step 2: Configure connection properties
In this section, you configure properties to establish a connection between Databricks Connect and your remote Azure Databricks cluster. These properties include settings to authenticate Databricks Connect with your cluster.
For Databricks Connect for Databricks Runtime 13.3 LTS and above, for Scala, Databricks Connect includes the Databricks SDK for Java. This SDK implements the Databricks client unified authentication standard, a consolidated and consistent architectural and programmatic approach to authentication. This approach makes setting up and automating authentication with Azure Databricks more centralized and predictable. It enables you to configure Azure Databricks authentication once and then use that configuration across multiple Azure Databricks tools and SDKs without further authentication configuration changes.
Note
- The Databricks SDK for Java has not yet implemented Azure managed identities authentication.
Collect the following configuration properties.
- The Azure Databricks workspace instance name. This is the same as the Server Hostname value for your cluster; see Get connection details for an Azure Databricks compute resource.
- The ID of your cluster. You can obtain the cluster ID from the URL. See Cluster URL and ID.
- Any other properties that are necessary for the supported Databricks authentication type. These properties are described throughout this section.
Configure the connection within your code. Databricks Connect searches for configuration properties in the following order until it finds them. Once it finds them, it stops searching through the remaining options. The details for each option appear after the following table:
| Configuration properties option | Applies to |
| --- | --- |
| 1. The `DatabricksSession` class's `remote()` method | Azure Databricks personal access token authentication only |
| 2. An Azure Databricks configuration profile | All Azure Databricks authentication types |
| 3. The `SPARK_REMOTE` environment variable | Azure Databricks personal access token authentication only |
| 4. The `DATABRICKS_CONFIG_PROFILE` environment variable | All Azure Databricks authentication types |
| 5. An environment variable for each configuration property | All Azure Databricks authentication types |
| 6. An Azure Databricks configuration profile named `DEFAULT` | All Azure Databricks authentication types |

The `DatabricksSession` class's `remote()` method

For this option, which applies to Azure Databricks personal access token authentication only, specify the workspace instance name, the Azure Databricks personal access token, and the ID of the cluster.
You can initialize the `DatabricksSession` class in several ways, as follows:

- Set the `host`, `token`, and `clusterId` fields in `DatabricksSession.builder`.
- Use the Databricks SDK's `Config` class.
- Specify a Databricks configuration profile along with the `clusterId` field.

Databricks does not recommend that you directly specify these connection properties in your code. Instead, Databricks recommends configuring properties through environment variables or configuration files, as described throughout this section. The following code examples assume that you provide some implementation of the proposed `retrieve*` functions yourself to get the necessary properties from the user or from some other configuration store, such as Azure KeyVault.

The code for each of these approaches is as follows:
```scala
// Set the host, token, and clusterId fields in DatabricksSession.builder.
// If you have already set the DATABRICKS_CLUSTER_ID environment variable with the
// cluster's ID, you do not also need to set the clusterId field here.
import com.databricks.connect.DatabricksSession

val spark = DatabricksSession.builder()
  .host(retrieveWorkspaceInstanceName())
  .token(retrieveToken())
  .clusterId(retrieveClusterId())
  .getOrCreate()
```

```scala
// Use the Databricks SDK's Config class.
// If you have already set the DATABRICKS_CLUSTER_ID environment variable with the
// cluster's ID, you do not also need to set the clusterId field here.
import com.databricks.connect.DatabricksSession
import com.databricks.sdk.core.DatabricksConfig

val config = new DatabricksConfig()
  .setHost(retrieveWorkspaceInstanceName())
  .setToken(retrieveToken())
val spark = DatabricksSession.builder()
  .sdkConfig(config)
  .clusterId(retrieveClusterId())
  .getOrCreate()
```

```scala
// Specify a Databricks configuration profile along with the clusterId field.
// If you have already set the DATABRICKS_CLUSTER_ID environment variable with the
// cluster's ID, you do not also need to set the clusterId field here.
import com.databricks.connect.DatabricksSession
import com.databricks.sdk.core.DatabricksConfig

val config = new DatabricksConfig()
  .setProfile("<profile-name>")
val spark = DatabricksSession.builder()
  .sdkConfig(config)
  .clusterId(retrieveClusterId())
  .getOrCreate()
```
An Azure Databricks configuration profile
For this option, create or identify an Azure Databricks configuration profile containing the field `cluster_id` and any other fields that are necessary for the supported Databricks authentication type that you want to use.

The required configuration profile fields for each authentication type are as follows:
- For Azure Databricks personal access token authentication: `host` and `token`.
- For Microsoft Entra ID (formerly Azure Active Directory) service principal authentication: `host`, `azure_tenant_id`, `azure_client_id`, `azure_client_secret`, and possibly `azure_workspace_resource_id`.
- For Azure CLI authentication: `host`.
- For Azure managed identities authentication (where supported): `host`, `azure_use_msi`, `azure_client_id`, and possibly `azure_workspace_resource_id`.
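Putting these fields together, a profile in your `.databrickscfg` file for personal access token authentication might look like the following sketch. All values are placeholders to replace with your own:

```
[<profile-name>]
host       = https://<workspace-instance-name>
token      = <access-token-value>
cluster_id = <cluster-id>
```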
Then set the name of this configuration profile through the `DatabricksConfig` class.

You can specify `cluster_id` in a few ways, as follows:

- Include the `cluster_id` field in your configuration profile, and then just specify the configuration profile's name.
- Specify the configuration profile name along with the `clusterId` field.

If you have already set the `DATABRICKS_CLUSTER_ID` environment variable with the cluster's ID, you do not also need to specify the `cluster_id` or `clusterId` fields.

The code for each of these approaches is as follows:
```scala
// Include the cluster_id field in your configuration profile, and then
// just specify the configuration profile's name:
import com.databricks.connect.DatabricksSession
import com.databricks.sdk.core.DatabricksConfig

val config = new DatabricksConfig()
  .setProfile("<profile-name>")
val spark = DatabricksSession.builder()
  .sdkConfig(config)
  .getOrCreate()
```

```scala
// Specify the configuration profile name along with the clusterId field.
// In this example, retrieveClusterId() assumes some custom implementation that
// you provide to get the cluster ID from the user or from some other
// configuration store:
import com.databricks.connect.DatabricksSession
import com.databricks.sdk.core.DatabricksConfig

val config = new DatabricksConfig()
  .setProfile("<profile-name>")
val spark = DatabricksSession.builder()
  .sdkConfig(config)
  .clusterId(retrieveClusterId())
  .getOrCreate()
```
The `SPARK_REMOTE` environment variable

For this option, which applies to Azure Databricks personal access token authentication only, set the `SPARK_REMOTE` environment variable to the following string, replacing the placeholders with the appropriate values:

```
sc://<workspace-instance-name>:443/;token=<access-token-value>;x-databricks-cluster-id=<cluster-id>
```
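The connection string has three parts: the workspace instance name, the access token, and the cluster ID. The following sketch only assembles and prints the string from hypothetical placeholder values, to make the structure explicit:

```scala
// Sketch: assemble the SPARK_REMOTE connection string from its three parts.
// The values below are hypothetical placeholders, not real credentials.
object BuildSparkRemote extends App {
  val workspaceInstanceName = "<workspace-instance-name>"
  val accessToken           = "<access-token-value>"
  val clusterId             = "<cluster-id>"

  val sparkRemote =
    s"sc://$workspaceInstanceName:443/;token=$accessToken;x-databricks-cluster-id=$clusterId"
  println(sparkRemote)
}
```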
Then initialize the `DatabricksSession` class as follows:

```scala
import com.databricks.connect.DatabricksSession

val spark = DatabricksSession.builder().getOrCreate()
```
To set environment variables, see your operating system's documentation.
The `DATABRICKS_CONFIG_PROFILE` environment variable

For this option, create or identify an Azure Databricks configuration profile containing the field `cluster_id` and any other fields that are necessary for the supported Databricks authentication type that you want to use.

If you have already set the `DATABRICKS_CLUSTER_ID` environment variable with the cluster's ID, you do not also need to specify `cluster_id`.

The required configuration profile fields for each authentication type are as follows:
- For Azure Databricks personal access token authentication: `host` and `token`.
- For Microsoft Entra ID (formerly Azure Active Directory) service principal authentication: `host`, `azure_tenant_id`, `azure_client_id`, `azure_client_secret`, and possibly `azure_workspace_resource_id`.
- For Azure CLI authentication: `host`.
- For Azure managed identities authentication (where supported): `host`, `azure_use_msi`, `azure_client_id`, and possibly `azure_workspace_resource_id`.
Set the `DATABRICKS_CONFIG_PROFILE` environment variable to the name of this configuration profile. Then initialize the `DatabricksSession` class as follows:

```scala
import com.databricks.connect.DatabricksSession

val spark = DatabricksSession.builder().getOrCreate()
```
To set environment variables, see your operating system's documentation.
An environment variable for each configuration property
For this option, set the `DATABRICKS_CLUSTER_ID` environment variable and any other environment variables that are necessary for the supported Databricks authentication type that you want to use.

The required environment variables for each authentication type are as follows:
- For Azure Databricks personal access token authentication: `DATABRICKS_HOST` and `DATABRICKS_TOKEN`.
- For Microsoft Entra ID (formerly Azure Active Directory) service principal authentication: `DATABRICKS_HOST`, `ARM_TENANT_ID`, `ARM_CLIENT_ID`, `ARM_CLIENT_SECRET`, and possibly `DATABRICKS_AZURE_RESOURCE_ID`.
- For Azure CLI authentication: `DATABRICKS_HOST`.
- For Azure managed identities authentication (where supported): `DATABRICKS_HOST`, `ARM_USE_MSI`, `ARM_CLIENT_ID`, and possibly `DATABRICKS_AZURE_RESOURCE_ID`.
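Because a missing environment variable surfaces only as an authentication failure at connect time, it can help to check up front that the expected variables are set. This is a hypothetical sanity check, not part of Databricks Connect, shown here for personal access token authentication:

```scala
// Hypothetical sanity check: report any environment variables required for
// Azure Databricks personal access token authentication that are not set.
object CheckEnv extends App {
  val required = Seq("DATABRICKS_HOST", "DATABRICKS_TOKEN", "DATABRICKS_CLUSTER_ID")
  val missing  = required.filterNot(sys.env.contains)

  if (missing.isEmpty) println("All required environment variables are set.")
  else println(s"Missing environment variables: ${missing.mkString(", ")}")
}
```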
Then initialize the `DatabricksSession` class as follows:

```scala
import com.databricks.connect.DatabricksSession

val spark = DatabricksSession.builder().getOrCreate()
```
To set environment variables, see your operating system's documentation.
An Azure Databricks configuration profile named `DEFAULT`

For this option, create or identify an Azure Databricks configuration profile containing the field `cluster_id` and any other fields that are necessary for the supported Databricks authentication type that you want to use.

If you have already set the `DATABRICKS_CLUSTER_ID` environment variable with the cluster's ID, you do not also need to specify `cluster_id`.

The required configuration profile fields for each authentication type are as follows:
- For Azure Databricks personal access token authentication: `host` and `token`.
- For Microsoft Entra ID (formerly Azure Active Directory) service principal authentication: `host`, `azure_tenant_id`, `azure_client_id`, `azure_client_secret`, and possibly `azure_workspace_resource_id`.
- For Azure CLI authentication: `host`.
- For Azure managed identities authentication (where supported): `host`, `azure_use_msi`, `azure_client_id`, and possibly `azure_workspace_resource_id`.
Name this configuration profile `DEFAULT`.

Then initialize the `DatabricksSession` class as follows:

```scala
import com.databricks.connect.DatabricksSession

val spark = DatabricksSession.builder().getOrCreate()
```