Set up private Git connectivity for Azure Databricks Git folders (Repos)

Learn about and configure Git server proxy for Databricks Git folders, a configurable service that enables you to proxy Git commands from Databricks workspace Git folders to your on-premises repos served by GitHub Enterprise Server, Azure DevOps Server, Bitbucket Server, and GitLab self-managed.

Note

Users with a Databricks Git server proxy configured during preview should upgrade cluster permissions for best performance. See Remove global CAN_ATTACH_TO permissions.

The Databricks Git server proxy is designed to work with the Databricks Runtime (DBR) version included in the configuration notebook. Users are discouraged from updating the DBR version of the proxy cluster.

What is Git server proxy for Databricks Git folders?

Databricks Git server proxy for Git folders is a feature that allows you to proxy Git commands from your Azure Databricks workspace to an on-premises Git server.

Databricks Git folders (formerly Repos) represents your connected Git repos as folders. The contents of these folders are version-controlled by syncing them to the connected Git repository. By default, Git folders can synchronize only with public Git providers (like public GitHub, GitLab, Azure DevOps, and others). However, if you host your own on-premises Git server (such as GitHub Enterprise Server, Bitbucket Server, or GitLab self-managed), you must use Git server proxy with Git folders to provide Databricks access to your Git server. Your Git server must be accessible from your Azure Databricks data plane (driver node).

If your corporate network is private (VPN) access only (no public access), you must run a Git server proxy to access Git repositories located outside of it and to add Git folders to your workspaces.

How does Git Server Proxy for Databricks Git folders work?

Git Server Proxy for Databricks Git folders proxies Git commands from the Databricks control plane to a "proxy cluster" running in your Databricks workspace's compute plane. In this context, the proxy cluster is a cluster configured to run a proxy service for Git commands from Databricks Git folders to your self-hosted Git repo. This proxy service receives Git commands from the Databricks control plane and forwards them to your Git server instance.

The diagram below illustrates the overall system architecture:

Diagram that shows how Git Server Proxy for Databricks Git folders is configured to run from a customer's compute plane

A Git server proxy no longer requires CAN_ATTACH_TO permission for all users. Admins with existing proxy clusters can modify the cluster ACL permission to enable this behavior. To enable it:

  1. Select Compute from the sidebar, and then click the kebab menu next to the Compute entry for the Git Server Proxy you're running:

    Select Compute from the sidebar, select the kebab to the right of your Git proxy server compute resource

  2. From the dialog, remove the Can Attach To entry for All Users:

    In the modal dialog box that pops up, click X to the right of All Users, Can Attach To

How do I set up Git Server Proxy for Databricks Git folders?

This section describes how to prepare your Git server instance for Git Server Proxy for Databricks Git folders, create the proxy, and validate your configuration.

Before you begin

Before enabling the proxy, consider the following prerequisites and planning tasks:

  • Your workspace has the Databricks Git folders feature enabled.
  • Your Git server instance is accessible from your Azure Databricks workspace's compute plane VPC, and has both HTTPS and personal access tokens (PATs) enabled.

Note

Git server proxy for Databricks works in all regions supported by your VPC.

Step 1: Prepare your Git server instance

Important

You must be an admin on the workspace with access rights to create a compute resource and complete this task.

To configure your Git server instance:

  1. Give the proxy cluster's driver node access to your Git server.

    Your enterprise Git server can have an allowlist of IP addresses from which access is permitted.

    1. Associate a static outbound IP address for traffic that originates from your proxy cluster. You can do this by using Azure Firewall or an egress appliance.
    2. Add the IP address from the previous step to your Git server's allowlist.
  2. Set your Git server instance to allow HTTPS transport.

    • For GitHub Enterprise, see Which remote URL should I use in the GitHub Enterprise help.
    • For Bitbucket, go to the Bitbucket server administration page and select server settings. In the HTTP(S) SCM hosting section, enable the HTTP(S) enabled checkbox.
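
Before relying on the allowlist, it can help to confirm that the static outbound IP you assigned actually falls inside the ranges your Git server permits. The following is a minimal sketch that uses only Python's standard `ipaddress` module. The `is_allowlisted` helper and every address in it are hypothetical placeholders, not values from any Databricks or Git server configuration.

```python
import ipaddress


def is_allowlisted(egress_ip, allowlist):
    """Return True if egress_ip falls inside any allowlist entry.

    Entries may be single addresses ("198.51.100.7") or CIDR ranges
    ("203.0.113.0/24"). All values used below are illustrative only.
    """
    address = ipaddress.ip_address(egress_ip)
    return any(
        address in ipaddress.ip_network(entry, strict=False)
        for entry in allowlist
    )


# Hypothetical allowlist exported from a Git server's admin settings.
example_allowlist = ["198.51.100.7", "203.0.113.0/24"]

# Hypothetical static outbound IP assigned to the proxy cluster's traffic.
print(is_allowlisted("203.0.113.10", example_allowlist))
print(is_allowlisted("192.0.2.55", example_allowlist))
```

The check is purely local. It does not prove that traffic leaves through the expected address, only that the address you intend to use is covered by the allowlist.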

Step 2: Run the enablement notebook

To enable the proxy:

  1. Log into your Azure Databricks workspace as a workspace admin with access rights to create a cluster.

  2. Import this notebook, which chooses the smallest instance type available from your cloud provider to run the Git proxy:

    Notebook: Enable Git server proxy for Databricks Git folders for private Git server connectivity in Git folders.

  3. Click "Run All" to run the notebook, which performs the following tasks:

    • Creates a single node compute resource named "Databricks Git Proxy", which does not auto-terminate. This is the "Git proxy service" that will process and forward Git commands from your Azure Databricks workspace to your on-premises Git server.
    • Enables a feature flag that controls whether Git requests in Databricks Git folders are proxied via the compute instance.

    As a best practice, consider creating a simple job to run the Git proxy compute resource. This can be a simple notebook that prints or logs status such as "The Git proxy service is running." Set the job to run on regular time intervals to ensure the Git proxy service is always available for your users.
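
The status notebook mentioned in the best practice above could be as small as the sketch below. It assumes the job is configured to run on the Git proxy compute resource and that the schedule is set separately in the job settings. The function name `proxy_status_message` is hypothetical.

```python
from datetime import datetime, timezone


def proxy_status_message(now=None):
    """Build the status line that the scheduled job logs on each run."""
    now = now or datetime.now(timezone.utc)
    return f"{now.isoformat()} The Git proxy service is running."


# Running this cell on the proxy compute resource is the whole job:
# the run only succeeds if the compute resource is up.
print(proxy_status_message())
```

The message itself carries no diagnostic value. What matters is that a failed or delayed job run tells you the proxy compute resource was unavailable.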

Note

Running an additional long-running compute resource to host the proxy software incurs extra DBUs. To minimize costs, the notebook configures the proxy to use a single-node compute resource with an inexpensive node type. However, you might want to modify the compute options to suit your needs. For more information on compute instance pricing, see the Databricks pricing calculator.

Step 3: Validate your Git server configuration

To validate your Git server configuration, try to clone a repo hosted on your private Git server via the proxy cluster. A successful clone means that you have successfully enabled Git server proxy for your workspace.
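
If you prefer to script the validation clone rather than use the UI, one option is the Databricks Repos REST API. The sketch below only builds the request and does not send it. It assumes the `POST /api/2.0/repos` endpoint and the `gitHubEnterprise` provider value. Verify both against the REST API reference for your workspace. The workspace URL, token, and repo path are placeholders.

```python
import json
import urllib.request


def build_clone_request(workspace_url, token, repo_url, provider, path):
    """Build (but do not send) a request that creates a Git folder.

    Assumption: the Repos API accepts a JSON body with "url",
    "provider", and "path" at POST /api/2.0/repos.
    """
    payload = {"url": repo_url, "provider": provider, "path": path}
    return urllib.request.Request(
        url=f"{workspace_url.rstrip('/')}/api/2.0/repos",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_clone_request(
    workspace_url="https://adb-1234567890123456.7.azuredatabricks.net",  # placeholder
    token="<databricks-personal-access-token>",  # placeholder
    repo_url="https://git.company.com/org/repo-name.git",
    provider="gitHubEnterprise",  # assumed provider identifier
    path="/Repos/someone@example.com/repo-name",  # placeholder
)
print(request.get_method(), request.full_url)

# To send it for real: urllib.request.urlopen(request)
```

A successful response to the real request would indicate the same thing as a successful clone in the UI: the control plane reached your Git server through the proxy cluster.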

Step 4: Create proxy-enabled repos

After users configure their Git credentials, no further steps are required to create or synchronize your repos. To configure credentials and create a repo in Databricks Git folders, see Configure Git credentials & connect a remote repo to Azure Databricks.

Remove global CAN_ATTACH_TO permissions

Admins with existing proxy clusters can now modify the cluster ACL permission to leverage generally available Git server proxy behavior.

If you previously configured Databricks Git server proxy with CAN_ATTACH_TO privileges, use the following steps to remove these permissions:

  1. Select Compute from the sidebar, and then click the kebab menu next to the Compute entry for the Git server proxy you're running:

    Select Compute from the sidebar, select the kebab to the right of your Git proxy server compute resource

  2. From the dialog, remove the Can Attach To entry for All Users:

    In the modal dialog box that pops up, click X to the right of All Users, Can Attach To
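
The UI steps above are the documented path. If you manage cluster permissions in scripts, the same change amounts to writing back the access control list without the All Users entry. The sketch below shows only that filtering step. It assumes entries shaped as `group_name` and `permission_level` pairs and that All Users is represented by a group named `users`. Both are assumptions to check against your workspace before use.

```python
def without_global_attach(acl_entries):
    """Drop the All Users CAN_ATTACH_TO entry and keep everything else.

    Assumption: "users" is the group name that stands for All Users.
    """
    return [
        entry
        for entry in acl_entries
        if not (
            entry.get("group_name") == "users"
            and entry.get("permission_level") == "CAN_ATTACH_TO"
        )
    ]


# Hypothetical ACL for an existing proxy cluster.
current_acl = [
    {"group_name": "users", "permission_level": "CAN_ATTACH_TO"},
    {"group_name": "admins", "permission_level": "CAN_MANAGE"},
]
print(without_global_attach(current_acl))
```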

Troubleshooting

Did you encounter an error while configuring Git server proxy for Databricks Git folders? Here are some common issues and ways to diagnose them more effectively.

Checklist for common problems

Before you start diagnosing an error, confirm that you've completed the following steps:

  • Confirm that your proxy cluster is running.
  • Confirm that you are a workspace administrator.
  • Run the enablement notebook again and capture the results, if you haven't already. If you are unable to debug the issue, Databricks Support can review the results. You can export and send the enablement notebook as a DBC archive.

Change your Git proxy configuration

If your Git proxy service is not working with the default configuration, you can set specific environment variables to make changes to it to better support your network infrastructure.

Use the following environment variables to update the configuration for your Git proxy service:

Environment variable | Format | Description
GIT_PROXY_ENABLE_SSL_VERIFICATION | true/false | Set this to false if you are using a self-signed certificate for your private Git server.
GIT_PROXY_CA_CERT_PATH | File path (string) | Set this to the path to a CA certificate file used for SSL verification. Example: /FileStore/myCA.pem
GIT_PROXY_HTTP_PROXY | https://<hostname>:<port #> | Set this to the HTTPS URL for your network's firewall proxy for HTTP traffic.
GIT_PROXY_CUSTOM_HTTP_PORT | Port number (integer) | Set this to the port number assigned to your Git server's HTTP port.

To set these environment variables, go to the Compute tab in your Azure Databricks workspace and select the compute configuration for your Git proxy service. At the bottom of the Configuration pane, expand Advanced options and select the Spark tab under it. Set one or more of these environment variables by adding them to the Environment variables text area.

The Databricks compute configuration page where you set environment variables for a Git proxy
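
The Environment variables text area is assumed here to take one `NAME=value` pair per line. The sketch below renders such lines from a dictionary and rejects names that are not in the table above, which guards against typos before you paste the result. The `render_env_block` helper and the port value 8443 are hypothetical. The certificate path is the example path from the table.

```python
ALLOWED_VARIABLES = {
    "GIT_PROXY_ENABLE_SSL_VERIFICATION",
    "GIT_PROXY_CA_CERT_PATH",
    "GIT_PROXY_HTTP_PROXY",
    "GIT_PROXY_CUSTOM_HTTP_PORT",
}


def render_env_block(settings):
    """Render settings as NAME=value lines for the text area."""
    unknown = sorted(set(settings) - ALLOWED_VARIABLES)
    if unknown:
        raise ValueError(f"Unknown Git proxy variables: {unknown}")
    lines = []
    for name, value in settings.items():
        # The table lists true/false in lowercase for boolean settings.
        if isinstance(value, bool):
            value = "true" if value else "false"
        lines.append(f"{name}={value}")
    return "\n".join(lines)


print(
    render_env_block(
        {
            "GIT_PROXY_CA_CERT_PATH": "/FileStore/myCA.pem",
            "GIT_PROXY_CUSTOM_HTTP_PORT": 8443,  # hypothetical port
        }
    )
)
```

Changes to environment variables typically take effect only after the compute resource restarts.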

Inspect logs on the proxy cluster

The file at /databricks/git-proxy/git-proxy.log on the proxy cluster contains logs that are useful for debugging purposes.

The log file should start with the line Data-plane proxy server binding to ('', 8000)… If it does not, this means that the proxy server did not start properly. Try restarting the cluster, or delete the cluster you created and run the enablement notebook again.
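
The startup check described above can be sketched as a small helper that looks at the first non-empty line of the log text. The helper name `proxy_started` is hypothetical, and skipping blank leading lines is a convenience assumption rather than documented behavior.

```python
EXPECTED_PREFIX = "Data-plane proxy server binding to"


def proxy_started(log_text):
    """Return True if the first non-empty log line is the startup line."""
    for line in log_text.splitlines():
        if line.strip():
            return line.strip().startswith(EXPECTED_PREFIX)
    return False  # An empty log means the server never started.


sample_log = (
    "Data-plane proxy server binding to ('', 8000)\n"
    "do_GET: https://server-address/path/to/repo/info/refs?service=git-upload-pack\n"
)
print(proxy_started(sample_log))

# On the proxy cluster, the text would come from the log file itself:
# with open("/databricks/git-proxy/git-proxy.log") as log_file:
#     print(proxy_started(log_file.read()))
```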

If the log file does start with this line, review the log statements that follow it for each Git request initiated by a Git operation in Databricks Git folders.

For example:

  do_GET: https://server-address/path/to/repo/info/refs?service=git-upload-pack
  10.139.0.25 - - [09/Jun/2021 06:53:02] "GET /server-address/path/to/repo/info/refs?service=git-upload-pack HTTP/1.1" 200

Error logs written to this file can be useful to help you or Databricks Support debug issues.
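
When reading these entries, it helps to see how the request path relates to the URL on the Git server. In the sample log, the path `/server-address/path/to/repo/...` corresponds to `https://server-address/path/to/repo/...`. The sketch below reproduces that mapping as inferred from the sample only. It is not the proxy's actual implementation, and the `upstream_url` name is hypothetical.

```python
def upstream_url(request_path):
    """Map a proxied request path to the HTTPS URL on the Git server.

    Inferred from the sample log: the first path segment is the Git
    server address and the rest is the path on that server.
    """
    return "https://" + request_path.lstrip("/")


print(upstream_url("/server-address/path/to/repo/info/refs?service=git-upload-pack"))
```

Comparing the reconstructed URL with the repo URL that the user entered is a quick way to spot a mistyped server address or path.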

Common error messages and their resolution

  • Secure connection could not be established because of SSL problems

    You might see the following error:

      https://git.consult-prodigy.com/Prodigy/databricks_test: Secure connection to https://git.consult-prodigy.com/Prodigy/databricks_test could not be established because of SSL problems
    

    Often this means that you are using a repository that requires special SSL certificates. Check the content of the /databricks/git-proxy/git-proxy.log file on the proxy cluster. If it says that certificate validation failed, you must add the certificate authority (CA) certificate to the system certificate chain. First, extract the root certificate (using your browser or another tool) and upload it to DBFS. Then, edit the Git proxy cluster and set the GIT_PROXY_CA_CERT_PATH environment variable to the path of the root certificate file. For more information about editing cluster environment variables, see Environment variables.

    After you have completed that step, restart the cluster.

  • Failure to clone repository with error "Missing/Invalid Git credentials"

    First, check that you have configured your Git credentials in User Settings.

    You might encounter this error:

      Error: Invalid Git credentials. Go to User Settings -> Git Integration and check that your personal access token or app password has the correct repo access.
    

    If your organization is using SAML SSO, make sure the token has been authorized (this can be done from your Git server's Personal Access Token (PAT) management page).

Frequently asked questions

What are the security implications of the Git server proxy?

The most important things to know are:

  • Proxying does not affect the security architecture of your Databricks control plane.
  • You can only have one Git proxy server cluster per workspace.

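Is all Git traffic for Databricks Git folders routed through the proxy cluster once the feature is enabled, even for public Git repos?

Yes. In the current release, your Azure Databricks workspace does not differentiate between proxied and non-proxied repos.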

Does the Git proxy feature work with other Git enterprise server providers?

Databricks Git folders supports GitHub Enterprise, Bitbucket Server, Azure DevOps Server, and GitLab self-managed. Other enterprise Git server providers should work as well if they conform to common Git specifications.

Do Databricks Git folders support GPG signing of commits?

No.

Do Databricks Git folders support SSH transport for Git operations?

No. Only HTTPS is supported.

Is the use of a non-default HTTPS port on the Git server supported?

Currently, the enablement notebook assumes that your Git server uses the default HTTPS port 443. You can set the environment variable GIT_PROXY_CUSTOM_HTTP_PORT to override the port value with a preferred one.

Can you share one proxy for multiple workspaces or do you need one proxy cluster per workspace?

You need one proxy cluster per Azure Databricks workspace.

Does the proxy work with legacy single-notebook versioning?

No, the proxy does not work with legacy single-notebook versioning. Users must migrate to Databricks Git folders versioning.

Can Databricks hide Git server URLs that are proxied? Could users enter the original Git server URLs rather than proxied URLs?

Yes to both questions. Users do not need to adjust their behavior for the proxy. With the current proxy implementation, all Git traffic for Databricks Git folders is routed through the proxy. Users enter the normal Git repo URL such as https://git.company.com/org/repo-name.git.

How often will users work with the Git URLs?

Typically a user would just add the Git URL when they create a new repo or check out an existing repo that they have not already checked out.

Does the feature transparently proxy authentication data to the Git server?

Yes, the proxy uses the user account's Git server token to authenticate to the Git server.

Is there Databricks access to Git server code?

The Azure Databricks proxy service accesses the Git repository on the Git server using the user-provided credential and synchronizes any code files in the repository with the Git folder. Access is restricted by the permissions specified in the user-provided personal access token (PAT).