Service principals for CI/CD

This article describes how to use service principals for CI/CD with Azure Databricks. A service principal is an identity created for use with automated tools and applications, including CI/CD platforms such as GitHub Actions, Azure Pipelines, and GitLab CI/CD.

As a security best practice, Databricks recommends that you give CI/CD platforms access to Azure Databricks resources through a service principal and its token, instead of through your Azure Databricks user or the Databricks personal access token for your workspace user. Benefits of this approach include the following:

  • You can grant and restrict access to Azure Databricks resources for a service principal independently of a user. For instance, this allows you to prohibit a service principal from acting as an admin in your Azure Databricks workspace while still allowing other specific users in your workspace to continue to act as admins.
  • Users do not have to expose their own access tokens to CI/CD platforms.
  • You can temporarily disable or permanently delete a service principal without impacting other users. For instance, this allows you to pause or remove access from a service principal that you suspect is being used in a malicious way.
  • If a user leaves your organization, you can remove that user without impacting any service principal.

To give a CI/CD platform access to your Azure Databricks workspace, do the following:

Choose one of the following supported Microsoft Entra authentication mechanisms with a service connection:

  • Microsoft Entra workload identity federation, using the Azure CLI as the authentication mechanism.
  • A Microsoft Entra service principal, using a Microsoft Entra client secret as the authentication mechanism.
  • A Microsoft Entra ID managed identity.

Requirements

  • The Azure Databricks OAuth token or Microsoft Entra ID token for your service principal, which can be either an Azure Databricks managed service principal or a Microsoft Entra ID managed service principal. To create the service principal and its token, see Manage service principals.
  • An account with your Git provider.

Set up GitHub Actions

GitHub Actions must be able to access your Azure Databricks workspace. If you want to use Azure Databricks Git folders, your workspace must also be able to access GitHub.

To enable GitHub Actions to access your Azure Databricks workspace, you must provide information about your Azure Databricks managed service principal or Microsoft Entra ID managed service principal to GitHub Actions. Depending on the GitHub Action's requirements, this information can include the Application (client) ID, the Directory (tenant) ID for a Microsoft Entra ID managed service principal, the service principal's client secret, or the access_token value for an Azure Databricks managed service principal. For more information, see Manage service principals and the GitHub Action's documentation.

If you also want to enable your Azure Databricks workspace to access GitHub when you use Azure Databricks Git folders, you must add the GitHub personal access token for a GitHub machine user to your workspace.

Provide information about your service principal to GitHub Actions

This section describes how to enable GitHub Actions to access your Azure Databricks workspace.

As a security best practice, Databricks recommends that you do not enter information about your service principal directly into the body of a GitHub Actions file. You should provide this information to GitHub Actions by using GitHub encrypted secrets instead.

GitHub Actions, such as the ones that Databricks lists in Continuous integration and delivery using GitHub Actions, rely on various GitHub encrypted secrets such as:

  • DATABRICKS_HOST, which is the value https:// followed by your workspace instance name, for example adb-1234567890123456.7.databricks.azure.cn.
  • AZURE_CREDENTIALS, which is a JSON document that represents the output of running the Azure CLI to get information about a Microsoft Entra ID managed service principal. For more information, see the documentation for the GitHub Action.
  • AZURE_SP_APPLICATION_ID, which is the value of the Application (client) ID for a Microsoft Entra ID managed service principal.
  • AZURE_SP_TENANT_ID, which is the value of the Directory (tenant) ID for a Microsoft Entra ID managed service principal.
  • AZURE_SP_CLIENT_SECRET, which is the value of the client secret's Value for a Microsoft Entra ID managed service principal.
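
The secrets listed above usually reach a workflow step as environment variables. The following is a minimal sketch of a pre-flight check that a step could run before calling Azure Databricks. It assumes that the workflow maps all four of the listed secrets to environment variables with the same names; the function names are hypothetical, and the set of secrets that you actually need depends on the GitHub Action.

```python
# Secret names come from the list above. Which of them a given GitHub Action
# needs depends on that Action (assumption: this workflow uses these four).
REQUIRED_SECRETS = (
    "DATABRICKS_HOST",
    "AZURE_SP_APPLICATION_ID",
    "AZURE_SP_TENANT_ID",
    "AZURE_SP_CLIENT_SECRET",
)


def missing_secrets(env, required=REQUIRED_SECRETS):
    """Return the names in `required` that are unset or blank in `env`."""
    return [name for name in required if not env.get(name, "").strip()]


def host_looks_valid(host):
    """Check that the host is https:// followed by a non-empty instance name."""
    prefix = "https://"
    return host.startswith(prefix) and len(host) > len(prefix)
```

In a workflow step, you might call `missing_secrets(os.environ)` after `import os` and stop the job if the returned list is not empty, so that a misconfigured secret fails early instead of surfacing later as an authentication error.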

For more information about which GitHub encrypted secrets are required for a GitHub Action, see Manage service principals and the documentation for that GitHub Action.

To add these GitHub encrypted secrets to your GitHub repository, see Creating encrypted secrets for a repository in the GitHub documentation. For other approaches to add these GitHub repository secrets, see Encrypted secrets in the GitHub documentation.

Add the GitHub personal access token for a GitHub machine user to your Azure Databricks workspace

This section describes how to enable your Azure Databricks workspace to access GitHub with Azure Databricks Git folders. This is an optional task in CI/CD scenarios.

As a security best practice, Databricks recommends that you use GitHub machine users instead of GitHub personal accounts, for many of the same reasons that you should use a service principal instead of an Azure Databricks user. To add the GitHub personal access token for a GitHub machine user to your Azure Databricks workspace, do the following:

  1. Create a GitHub machine user, if you do not already have one available. A GitHub machine user is a GitHub personal account, separate from your own GitHub personal account, that you can use to automate activity on GitHub.

    Note

    When you create a new separate GitHub account as a GitHub machine user, you cannot associate it with the email address for your own GitHub personal account. Instead, see your organization's email administrator about getting a separate email address that you can associate with this new separate GitHub account as a GitHub machine user.

    See your organization's account administrator about managing the separate email address and its associated GitHub machine user and its GitHub personal access tokens within your organization.

  2. Give the GitHub machine user access to your GitHub repository. See Inviting a team or person in the GitHub documentation. To accept the invitation, you may first need to sign out of your GitHub personal account, and then sign back in as the GitHub machine user.

  3. Sign in to GitHub as the machine user, and then create a GitHub personal access token for that machine user. See Create a personal access token in the GitHub documentation. Be sure to give the GitHub personal access token repo access.

  4. Gather the Microsoft Entra ID token for your service principal and the username of your GitHub machine user, and then Add Git provider credentials to an Azure Databricks workspace.

Set up Azure Pipelines

Azure Pipelines must be able to access your Azure Databricks workspace. If you also want to use Azure Databricks Git folders, your workspace must be able to access Azure Pipelines.

Azure Pipelines YAML pipeline files rely on environment variables to access your Azure Databricks workspace. These environment variables include ones such as:

  • DATABRICKS_HOST, which is the value https:// followed by your workspace instance name, for example adb-1234567890123456.7.databricks.azure.cn.
  • DATABRICKS_TOKEN, which is the token_value value that you copied after you created the Microsoft Entra ID token for the Microsoft Entra ID managed service principal.
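
To illustrate how a pipeline step might use these two variables, the following sketch builds (but does not send) an authenticated request to the workspace. It relies on the Azure Databricks REST API accepting the token as a bearer token in the Authorization header. The function name and the endpoint path in the usage note are illustrative assumptions, not part of any pipeline template.

```python
import urllib.request


def build_request(host, token, path):
    """Build, without sending, a GET request that carries a bearer token.

    `host` is the DATABRICKS_HOST value and `token` is the DATABRICKS_TOKEN
    value; `path` is whichever REST API path the pipeline step needs.
    """
    url = host.rstrip("/") + path
    return urllib.request.Request(
        url,
        headers={"Authorization": f"Bearer {token}"},
        method="GET",
    )
```

A step could read both values with `os.environ["DATABRICKS_HOST"]` and `os.environ["DATABRICKS_TOKEN"]`, call `build_request(host, token, "/api/2.0/clusters/list")` (an example path chosen for illustration), and pass the result to `urllib.request.urlopen`.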

Optional for CI/CD scenarios: If your workspace uses Azure Databricks Git folders, and you want to enable your workspace to access Azure Pipelines, gather:

  • The Microsoft Entra ID token for your service principal
  • Your Azure Pipelines username

Then, Add Git provider credentials to an Azure Databricks workspace.

Set up GitLab CI/CD

GitLab CI/CD must be able to access your Azure Databricks workspace. If you also want to use Azure Databricks Git folders, your workspace must be able to access GitLab CI/CD.

To access your Azure Databricks workspace, GitLab CI/CD .gitlab-ci.yml files, such as the one in the Basic Python Template in dbx, rely on custom CI/CD variables such as:

  • DATABRICKS_HOST, which is the value https:// followed by your workspace instance name, for example adb-1234567890123456.7.databricks.azure.cn.
  • DATABRICKS_TOKEN, which is the token_value value that you copied after you created the Microsoft Entra ID token for the service principal.

To add these custom variables to your GitLab CI/CD project, see Add a CI/CD variable to a project in the GitLab CI/CD documentation.

If your workspace uses Azure Databricks Git folders, and you want to enable your workspace to access GitLab CI/CD, gather:

  • The Microsoft Entra ID token for your service principal
  • Your GitLab CI/CD username

Then Add Git provider credentials to an Azure Databricks workspace.

Add Git provider credentials to an Azure Databricks workspace

This section describes how to enable your Azure Databricks workspace to access a Git provider for Azure Databricks Git folders. This is optional in CI/CD scenarios. For example, you might want your Git provider to access your Azure Databricks workspace without also using Azure Databricks Git folders in your workspace with your Git provider. If so, skip this section.

Before you begin, gather the following information and tools:

  • The Microsoft Entra ID token for your service principal.
  • The username associated with your Git provider.
  • The access token associated with the user for your Git provider.

    Note

    For Azure Pipelines, see Use personal access tokens on the Azure website.

  • Databricks CLI version 0.205 or above. See What is the Databricks CLI?. You cannot use the Azure Databricks user interface for this task.
  • An Azure Databricks configuration profile in your .databrickscfg file, with the host field set to your Azure Databricks per-workspace URL, for example https://adb-1234567890123456.7.databricks.azure.cn, and the token field set to the Microsoft Entra ID token for your service principal. (Do not use the Databricks personal access token for your workspace user.) See Azure Databricks personal access token authentication.
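
As a rough illustration of the configuration profile described in the last item above, the following sketch renders the text of a single profile with the host and token fields. The function name and the profile name in the usage note are hypothetical. The sketch only produces the text; where and how you store it, for example in a CI/CD job's home directory, is left to you.

```python
import configparser
import io


def render_profile(profile_name, host, token):
    """Return .databrickscfg text for one profile with host and token fields."""
    # Interpolation is disabled so that a '%' character in a token is kept as-is.
    config = configparser.ConfigParser(interpolation=None)
    config[profile_name] = {"host": host, "token": token}
    buffer = io.StringIO()
    config.write(buffer)
    return buffer.getvalue()
```

For example, `render_profile("CICD", host, token)` returns a `[CICD]` section that contains a `host = ...` line and a `token = ...` line, which you could then reference with `-p CICD` in Databricks CLI commands.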

Use the Databricks CLI to run the following command:

databricks git-credentials create <git-provider-short-name> --git-username <git-provider-user-name> --personal-access-token <git-provider-access-token> -p <profile-name>

  • Use one of the following for <git-provider-short-name>:
    • For GitHub, use GitHub.
    • For Azure Pipelines, use AzureDevOpsServices.
    • For GitLab CI/CD, use GitLab.
  • Replace <git-provider-user-name> with the username associated with your Git provider.
  • Replace <git-provider-access-token> with the access token associated with the user for your Git provider.
  • Replace <profile-name> with the name of the Azure Databricks configuration profile in your .databrickscfg file.
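
If you script this step, the placeholder substitutions above can be sketched as a small helper that assembles the argument list for the command. The helper name and the mapping keys are hypothetical; the short names and flags are the ones shown in this section. The sketch only builds the arguments and does not run the Databricks CLI.

```python
# Short names from the list above, keyed by the platform name used in this article.
PROVIDER_SHORT_NAMES = {
    "GitHub": "GitHub",
    "Azure Pipelines": "AzureDevOpsServices",
    "GitLab CI/CD": "GitLab",
}


def git_credentials_create_args(provider, username, access_token, profile):
    """Return the argument list for `databricks git-credentials create`."""
    if provider not in PROVIDER_SHORT_NAMES:
        raise ValueError(f"Unsupported Git provider: {provider!r}")
    return [
        "databricks", "git-credentials", "create",
        PROVIDER_SHORT_NAMES[provider],
        "--git-username", username,
        "--personal-access-token", access_token,
        "-p", profile,
    ]
```

You could pass the returned list to `subprocess.run(args, check=True)` on a machine where the Databricks CLI is installed. Building a list instead of a single string avoids shell quoting problems when a username or token contains special characters.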

Tip

To confirm that the call was successful, you can run one of the following Databricks CLI commands, and review the output:

databricks git-credentials list -p <profile-name>
databricks git-credentials get <credential-id> -p <profile-name>