Connect to and manage Azure Databricks Unity Catalog in Microsoft Purview

This article outlines how to register Azure Databricks, and how to authenticate and interact with Azure Databricks Unity Catalog in Microsoft Purview. For more information about Microsoft Purview, see the introductory article.

Supported capabilities

Scanning capabilities

Metadata Extraction | Full Scan | Incremental Scan | Scoped Scan
Yes                 | Yes       | Yes              | No

When scanning Azure Databricks Unity Catalog, Microsoft Purview supports:

  • Extracting technical metadata including:
    • Metastore
    • Catalogs
    • Schemas
    • Tables including the columns
    • Views including the columns
  • Fetching lineage for the relationships among tables, views, and columns that Azure Databricks records during notebook runs.

When setting up a scan, you can choose to scan the entire Unity Catalog, or scope the scan to a subset of catalogs.

Other capabilities

For classifications, policies, and live view, see the list of supported capabilities.

Note

This connector brings metadata from Azure Databricks Unity Catalog. To scan Azure Databricks workspace-scoped metadata, see Azure Databricks Hive Metastore connector.

Known limitations

  • In Microsoft Purview, Databricks notebook names appear as numeric IDs instead of readable names. This limitation exists because Databricks doesn't expose notebook names in the Unity Catalog system table.

  • You might encounter errors if scan results from Azure Databricks exceed 1 MB and Azure Databricks-managed blob storage denies public network access. To prevent this problem, ensure that Microsoft Purview has access to the internal DBFS storage location of the Azure Databricks workspace being scanned. To learn more, see Cloud fetch in JDBC.

  • Incremental scan is available only for the Azure Databricks Unity Catalog data source.

  • Scoped scan is available only for the Unity Catalog option under the Azure Databricks data source.

  • You can add managed private endpoints only for the Unity Catalog option under the Azure Databricks data source.

  • When you delete an object from the data source, the subsequent scan doesn't automatically remove the corresponding asset in Microsoft Purview.

  • Lineage information isn't available in Azure Databricks workspaces in the China region. This limitation exists because Azure Databricks system tables aren't supported in this region. Microsoft Purview uses these tables to extract lineage, so it can't retrieve lineage in this region.

  • Set the Databricks table column comment to an empty string if you don't want the column description displayed in Microsoft Purview.

  • For more information about other limitations related to native Azure Databricks lineage, see Azure Databricks documentation.
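The column-comment limitation in the list above can be handled with a single SQL statement per column. The following is a minimal sketch that only builds the statement text; it assumes the Databricks SQL form ALTER TABLE ... ALTER COLUMN ... COMMENT '', and the function names and the sample catalog, schema, table, and column names are hypothetical. Run the generated statement yourself in a notebook or SQL editor with sufficient privileges on the table.

```python
def quote_identifier(name: str) -> str:
    """Backtick-quote one identifier part, doubling any embedded backticks."""
    return "`" + name.replace("`", "``") + "`"


def clear_column_comment_sql(catalog: str, schema: str, table: str, column: str) -> str:
    """Build a statement that sets a column comment to an empty string.

    Assumption: the target is a Unity Catalog table addressed by a
    three-part name (catalog.schema.table).
    """
    full_name = ".".join(quote_identifier(part) for part in (catalog, schema, table))
    return f"ALTER TABLE {full_name} ALTER COLUMN {quote_identifier(column)} COMMENT ''"


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    print(clear_column_comment_sql("main", "sales", "orders", "customer_note"))
```

The change shows up in Microsoft Purview only after the next scan picks up the updated table metadata.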

Prerequisites

  • To fetch lineage from Azure Databricks using Microsoft Purview, the following prerequisites must be in place:

    • Enable the system schema: The system schema system.access must be enabled in your Unity Catalog. This requirement exists because lineage information is stored in system tables, and enabling this schema allows access to those tables. Learn more about monitoring usage with system tables.

    • User privileges:

      • The user account you use for scanning needs to have SELECT privileges on the following system tables:

        • system.access.table_lineage
        • system.access.column_lineage

        These permissions are required because lineage data is read directly from the system tables, and without the necessary access, Microsoft Purview can't retrieve the lineage information.
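The privileges above could be granted with statements like the ones this sketch generates. It only builds SQL text and doesn't connect to Azure Databricks. The function name and the sample principal are hypothetical. The USE CATALOG and USE SCHEMA grants are an assumption based on the general Unity Catalog privilege model (reading a table typically also requires usage on its parent catalog and schema); this article itself lists only SELECT, so set include_usage=False if the principal already has that access.

```python
LINEAGE_TABLES = (
    "system.access.table_lineage",
    "system.access.column_lineage",
)


def lineage_grant_statements(principal: str, include_usage: bool = True) -> list:
    """Return GRANT statements for the account that Microsoft Purview scans with."""
    # Principals such as user@example.com contain special characters,
    # so they are backtick-quoted.
    quoted = "`" + principal.replace("`", "``") + "`"
    statements = []
    if include_usage:
        # Assumption: usage on the parent catalog and schema is also needed.
        statements.append(f"GRANT USE CATALOG ON CATALOG system TO {quoted}")
        statements.append(f"GRANT USE SCHEMA ON SCHEMA system.access TO {quoted}")
    for table in LINEAGE_TABLES:
        statements.append(f"GRANT SELECT ON TABLE {table} TO {quoted}")
    return statements


if __name__ == "__main__":
    # Hypothetical scanning principal, for illustration only.
    for statement in lineage_grant_statements("purview-scan@contoso.com"):
        print(statement + ";")
```

Someone with permission to grant privileges on the system catalog needs to run the generated statements, for example from a notebook or the SQL editor.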

Register

This section describes how to register an Azure Databricks workspace in Microsoft Purview by using the classic Microsoft Purview governance portal.

  1. Go to your Microsoft Purview account.

  2. Select Data Map on the left pane.

  3. Select Register.

  4. In Register sources, select Azure Databricks Unity Catalog > Continue.

  5. On the Register sources (Azure Databricks Unity Catalog) screen, complete the following steps:

    1. For Name, enter a name that Microsoft Purview will list as the data source.

    2. For Metastore ID, provide the metastore ID for the Azure Databricks Unity Catalog metastore that you want to scan.

    3. Select a collection from the list.

    Screenshot of registering Azure Databricks Unity Catalog source.

  6. Select Finish.

Scan

Tip

To troubleshoot any problems with scanning:

  1. Confirm you meet all prerequisites.
  2. Review the scan troubleshooting documentation.

Use the following steps to scan Azure Databricks and automatically identify assets. For more information about scanning, see Scans and ingestion in Microsoft Purview.

  1. Go to Sources.

  2. Select the registered Azure Databricks Unity Catalog source.

  3. Select + New scan.

  4. Provide the following details:

    1. Name: Enter a name for the scan.

    2. Connect via integration runtime: Choose the default Azure integration runtime, a Managed Virtual Network IR, or a Kubernetes-supported self-hosted integration runtime that you created.

    3. Credential: Select the credential to connect to your data source.

    4. Workspace URL: Provide the URL for the workspace that you want to scan.

    5. HTTP path: Specify the HTTP path of the Databricks SQL warehouse that Microsoft Purview connects to in order to perform the scan; for example, /sql/1.0/endpoints/xxxxxxxxxxxxxxxx. You can find it in your Azure Databricks workspace under SQL Warehouses > your warehouse > Connection details > HTTP path.

    6. Lineage extraction: Toggle lineage extraction to On to fetch lineage of the scanned assets.

  5. Select Test connection to validate the settings.

    Screenshot of setting up Azure Databricks Unity Catalog scan.

  6. Select Continue.

  7. For Scan trigger, choose whether to set up a schedule or run the scan once.

  8. Review your scan and select Save and Run.

After the scan finishes successfully, see how to browse and search assets.
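Before you select Test connection, you can check the Workspace URL and HTTP path values locally for obvious typos. The following is a rough sketch, not a substitute for Test connection. The function name is hypothetical, and the accepted path shapes are an assumption: /sql/1.0/endpoints/<id> as shown in this article, plus /sql/1.0/warehouses/<id>, which some Azure Databricks workspaces display instead.

```python
import re
from urllib.parse import urlparse

# Assumption: the warehouse ID is a plain alphanumeric token.
HTTP_PATH_PATTERN = re.compile(r"^/sql/1\.0/(endpoints|warehouses)/[0-9A-Za-z]+$")


def check_scan_settings(workspace_url: str, http_path: str) -> list:
    """Return a list of human-readable problems; an empty list means none were found."""
    problems = []
    parsed = urlparse(workspace_url)
    if parsed.scheme != "https" or not parsed.netloc:
        problems.append("Workspace URL should be an https:// URL with a host name.")
    if not HTTP_PATH_PATTERN.match(http_path):
        problems.append("HTTP path should look like /sql/1.0/endpoints/<warehouse-id>.")
    return problems


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    issues = check_scan_settings(
        "https://adb-1234567890123456.7.azuredatabricks.net",
        "/sql/1.0/endpoints/xxxxxxxxxxxxxxxx",
    )
    print(issues or "No obvious problems found.")
```

A check like this catches only formatting mistakes, such as a missing leading slash or a URL without https://. It can't tell whether the warehouse exists or whether the credential has access to it.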

View your scans and scan runs

To view existing scans:

  1. Go to the Microsoft Purview portal. On the left pane, select Data Map.
  2. Select the data source. You can view a list of existing scans on that data source under Recent scans, or you can view all scans on the Scans tab.
  3. Select the scan that has results you want to view. The pane shows you all the previous scan runs, along with the status and metrics for each scan run.
  4. Select the run ID to check the scan run details.

Manage your scans

To edit, cancel, or delete a scan:

  1. Go to the Microsoft Purview portal. On the left pane, select Data Map.

  2. Select the data source. You can view a list of existing scans on that data source under Recent scans, or you can view all scans on the Scans tab.

  3. Select the scan that you want to manage. You can then:

    • Edit the scan by selecting Edit scan.
    • Cancel an in-progress scan by selecting Cancel scan run.
    • Delete your scan by selecting Delete scan.

Note

  • Deleting your scan does not delete catalog assets created from previous scans.

Frequently asked questions (FAQ)

Does Microsoft Purview capture column-level lineage from Unity Catalog?

Yes. Microsoft Purview captures lineage at the Unity Catalog table and view level and at the column level.

Why didn't Microsoft Purview fetch the lineage after I ran my notebook?

After you run your notebook, Databricks might take a few minutes to update the lineage information in its system tables. Microsoft Purview can fetch the lineage after the system tables are updated.
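To check whether Azure Databricks has already recorded lineage for a table, you can query the system table directly before you rerun the scan. The following sketch only builds the query text. The function name is hypothetical, and the column names (source_table_full_name, target_table_full_name, entity_type, event_time) are assumptions about the system.access.table_lineage schema, so verify them against your workspace before relying on the query.

```python
import re

# Accept only simple three-part names so the value can be embedded safely.
TABLE_NAME_PATTERN = re.compile(r"^[A-Za-z0-9_]+(\.[A-Za-z0-9_]+){2}$")


def recent_table_lineage_sql(target_table: str, lookback_hours: int = 1) -> str:
    """Build a query that lists recent lineage rows written for one target table."""
    if not TABLE_NAME_PATTERN.match(target_table):
        raise ValueError("Expected a catalog.schema.table name with simple identifiers.")
    if lookback_hours < 1:
        raise ValueError("lookback_hours must be at least 1.")
    return (
        "SELECT source_table_full_name, target_table_full_name, entity_type, event_time\n"
        "FROM system.access.table_lineage\n"
        f"WHERE target_table_full_name = '{target_table}'\n"
        f"  AND event_time >= current_timestamp() - INTERVAL {int(lookback_hours)} HOUR\n"
        "ORDER BY event_time DESC"
    )


if __name__ == "__main__":
    # Hypothetical table name, for illustration only.
    print(recent_table_lineage_sql("main.sales.orders", lookback_hours=2))
```

If the query returns no rows shortly after a notebook run, wait a few minutes and try again before you rerun the Microsoft Purview scan.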

Next steps

After you register your source, use the following guides to learn more about Microsoft Purview and your data: