Use Unity Catalog with your Delta Live Tables pipelines
Important
Delta Live Tables support for Unity Catalog is in Public Preview.
In addition to the existing support for persisting tables to the Hive metastore, you can use Unity Catalog with your Delta Live Tables pipelines to:
- Define a catalog in Unity Catalog where your pipeline will persist tables.
- Read data from Unity Catalog tables.
Your workspace can contain pipelines that use Unity Catalog or the Hive metastore. However, a single pipeline cannot write to both the Hive metastore and Unity Catalog and existing pipelines cannot be upgraded to use Unity Catalog. Your existing pipelines that do not use Unity Catalog are not affected by this preview, and will continue to persist data to the Hive metastore using the configured storage location.
Unless specified otherwise in this document, all existing data sources and Delta Live Tables functionality are supported with pipelines that use Unity Catalog. Both the Python and SQL interfaces are supported with pipelines that use Unity Catalog.
The tables created in your pipeline can also be queried from shared Unity Catalog clusters using Databricks Runtime 13.3 LTS and above or a SQL warehouse. Tables cannot be queried from assigned or no isolation clusters.
To manage permissions on the tables created by a Unity Catalog pipeline, use GRANT and REVOKE.
Requirements
The following are required to create tables in Unity Catalog from a Delta Live Tables pipeline:
- You must have
USE CATALOG
privileges on the target catalog. - You must have
CREATE MATERIALIZED VIEW
andUSE SCHEMA
privileges in the target schema if your pipeline creates materialized views. - You must have
CREATE TABLE
andUSE SCHEMA
privileges in the target schema if your pipeline creates streaming tables. - If a target schema is not specified in the pipeline settings, you must have
CREATE MATERIALIZED VIEW
orCREATE TABLE
privileges on at least one schema in the target catalog.
Limitations
The following are limitations when using Unity Catalog with Delta Live Tables:
By default, only the pipeline owner and workspace admins have permission to view the driver logs from the cluster that runs a Unity Catalog-enabled pipeline. To enable access for other users to view the driver logs, see Allow non-admin users to view the driver logs from a Unity Catalog-enabled pipeline.
Existing pipelines that use the Hive metastore cannot be upgraded to use Unity Catalog. To migrate an existing pipeline that writes to Hive metastore, you must create a new pipeline and re-ingest data from the data source(s).
You cannot create a Unity Catalog-enabled pipeline in a workspace attached to a metastore created during the Unity Catalog public preview. See Upgrade to privilege inheritance.
JARs are not supported. The installation of third-party libraries is supported for only Python libraries. See Manage Python dependencies for Delta Live Tables pipelines.
Data manipulation language (DML) queries that modify the schema of a streaming table are not supported.
A materialized view created in a Delta Live Tables pipeline cannot be used as a streaming source outside of that pipeline, for example, in another pipeline or in a downstream notebook.
Publishing to schemas that specify a managed storage location is supported only in the preview channel.
If a pipeline publishes to a schema with a managed storage location, the schema can be changed in a subsequent update, but only if the updated schema uses the same storage location as the previously specified schema.
If the target schema specifies a storage location, all tables are stored there. If a schema storage location is not specified, tables are stored in the catalog storage location if the target catalog specifies one. If schema and catalog storage locations are not specified, tables are stored in the root storage location of the metastore where the tables are published.
The History tab in Catalog Explorer does not show history for streaming tables and materialized views .
The
LOCATION
property is not supported when defining a table.Unity Catalog-enabled pipelines cannot publish to the Hive metastore.
Python UDF support is in Public Preview. To use Python UDFs, your pipeline must use the preview channel.
You cannot use Delta Sharing with a Delta Live Tables materialized view or streaming table published to Unity Catalog.
You cannot use the
event_log
table valued function in a pipeline or query to access the event logs of multiple pipelines.You cannot share a view created over the
event_log
table valued function with other users.Single-node clusters are not supported with Unity Catalog-enabled pipelines. Because Delta Live Tables might create a single-node cluster to run smaller pipelines, your pipeline might fail with an error message referencing
single-node mode
. If this occurs, make sure you specify at least one worker when you Configure your compute settings.Tables created in a Unity Catalog-enabled pipeline cannot be queried from assigned or no isolation clusters. To query tables created by a Delta Live Tables pipeline, you must use a shared access mode cluster using Databricks Runtime 13.3 LTS and above or a SQL warehouse.
Delta Live Tables uses a shared access mode cluster to run a Unity Catalog-enabled pipeline. A Unity Catalog-enabled pipeline cannot run on an assigned cluster. To learn about limitations of shared access mode with Unity Catalog, see Shared access mode limitations on Unity Catalog.
You cannot use row filters or column masks with materialized views or streaming tables published to Unity Catalog.
Note
The underlying files supporting materialized views might include data from upstream tables (including possible personally identifiable information) that do not appear in the materialized view definition. This data is automatically added to the underlying storage to support incremental refreshing of materialized views .
Because the underlying files of a materialized view might risk exposing data from upstream tables not part of the materialized view schema, Databricks recommends not sharing the underlying storage with untrusted downstream consumers.
For example, suppose the definition of a materialized view includes a COUNT(DISTINCT field_a)
clause. Even though the materialized view definition only includes the aggregate COUNT DISTINCT
clause, the underlying files will contain a list of the actual values of field_a
.
Changes to existing functionality
When Delta Live Tables is configured to persist data to Unity Catalog, the lifecycle of the table is managed by the Delta Live Tables pipeline. Because the pipeline manages the table lifecycle and permissions:
- When a table is removed from the Delta Live Tables pipeline definition, the corresponding materialized view or streaming table entry is removed from Unity Catalog on the next pipeline update. The actual data is retained for a period of time so that it can be recovered if it was deleted by mistake. The data can be recovered by adding the materialized view or streaming table back into the pipeline definition.
- Deleting the Delta Live Tables pipeline results in deletion of all tables defined in that pipeline. Because of this change, the Delta Live Tables UI is updated to prompt you to confirm deletion of a pipeline.
- Internal backing tables, including backing tables used to support
APPLY CHANGES INTO
, are not directly accessible by users.
Write tables to Unity Catalog from a Delta Live Tables pipeline
Note
If you do not select a catalog and target schema for a pipeline, tables are not published to Unity Catalog and can only be accessed by queries in the same pipeline.
To write your tables to Unity Catalog, when you create a pipeline, select Unity Catalog under Storage options, select a catalog in the Catalog drop-down menu, and select an existing schema or enter the name for a new schema in the Target schema drop-down menu. To learn about Unity Catalog catalogs, see What are catalogs in Azure Databricks?. To learn about schemas in Unity Catalog,see What are schemas in Azure Databricks?.
Ingest data into a Unity Catalog pipeline
Your pipeline configured to use Unity Catalog can read data from:
- Unity Catalog managed and external tables, views, materialized views and streaming tables.
- Hive metastore tables and views.
- Auto Loader using the
cloud_files()
function to read from Unity Catalog external locations. - Apache Kafka and Amazon Kinesis.
The following are examples of reading from Unity Catalog and Hive metastore tables.
Batch ingestion from a Unity Catalog table
SQL
CREATE OR REFRESH MATERIALIZED VIEW
table_name
AS SELECT
*
FROM
my_catalog.my_schema.table1;
Python
@dlt.table
def table_name():
return spark.table("my_catalog.my_schema.table")
Stream changes from a Unity Catalog table
SQL
CREATE OR REFRESH STREAMING TABLE
table_name
AS SELECT
*
FROM
STREAM(my_catalog.my_schema.table1);
Python
@dlt.table
def table_name():
return spark.readStream.table("my_catalog.my_schema.table")
Ingest data from Hive metastore
A pipeline that uses Unity Catalog can read data from Hive metastore tables using the hive_metastore
catalog:
SQL
CREATE OR REFRESH MATERIALIZED VIEW
table_name
AS SELECT
*
FROM
hive_metastore.some_schema.table;
Python
@dlt.table
def table3():
return spark.table("hive_metastore.some_schema.table")
Ingest data from Auto Loader
SQL
CREATE OR REFRESH STREAMING TABLE
table_name
AS SELECT
*
FROM
cloud_files(
<path-to-uc-external-location>,
"json"
)
Python
@dlt.table(table_properties={"quality": "bronze"})
def table_name():
return (
spark.readStream.format("cloudFiles")
.option("cloudFiles.format", "json")
.load(f"{path_to_uc_external_location}")
)
Share materialized views
By default, the tables created by a pipeline can be queried only by the pipeline owner. You can give other users the ability to query a table by using GRANT statements and you can revoke query access using REVOKE statements. For more information about privileges in Unity Catalog, see Manage privileges in Unity Catalog.
Grant select on a table
GRANT SELECT ON TABLE
my_catalog.my_schema.table_name
TO
`user@databricks.com`
Revoke select on a table
REVOKE SELECT ON TABLE
my_catalog.my_schema.table_name
FROM
`user@databricks.com`
Grant create table or create materialized view privileges
GRANT CREATE { MATERIALIZED VIEW | TABLE } ON SCHEMA
my_catalog.my_schema
TO
{ principal | user }
View lineage for a pipeline
Lineage for tables in a Delta Live Tables pipeline is visible in Catalog Explorer. For materialized views or streaming tables in a Unity Catalog-enabled pipeline, the Catalog Explorer lineage UI shows the upstream and downstream tables. To learn more about Unity Catalog lineage, see Capture and view data lineage using Unity Catalog.
For a materialized view or streaming table in a Unity Catalog-enabled Delta Live Tables pipeline, the Catalog Explorer lineage UI will also link to the pipeline that produced the materialized view or streaming table if the pipeline is accessible from the current workspace.
Add, change, or delete data in a streaming table
You can use data manipulation language (DML) statements, including insert, update, delete, and merge statements, to modify streaming tables published to Unity Catalog. Support for DML queries against streaming tables enables use cases such as updating tables for General Data Protection Regulation (GDPR) compliance.
Note
- DML statements that modify the table schema of a streaming table are not supported. Ensure that your DML statements do not attempt to evolve the table schema.
- DML statements that update a streaming table can be run only in a shared Unity Catalog cluster or a SQL warehouse using Databricks Runtime 13.3 LTS and above.
- Because streaming requires append-only data sources, if your processing requires streaming from a source streaming table with changes (for example, by DML statements), set the skipChangeCommits flag when reading the source streaming table. When
skipChangeCommits
is set, transactions that delete or modify records on the source table are ignored. If your processing does not require a streaming table, you can use a materialized view (which does not have the append-only restriction) as the target table.
The following are examples of DML statements to modify records in a streaming table.
Delete records with a specific ID:
DELETE FROM my_streaming_table WHERE id = 123;
Update records with a specific ID:
UPDATE my_streaming_table SET name = 'Jane Doe' WHERE id = 123;
Publish tables with row filters and column masks
Important
This feature is in Public Preview.
Row filters let you specify a function that applies as a filter whenever a table scan fetches rows. These filters ensure that subsequent queries only return rows for which the filter predicate evaluates to true.
Column masks let you mask a column's values whenever a table scan fetches rows. Future queries for that column return the evaluated function's result instead of the column's original value. For more information on how to use row filters and column masks, see Filter sensitive table data using row filters and column masks.
Managing Row Filters and Column Masks
Row filters and column masks on materialized views and streaming tables should be added, updated, or dropped through the CREATE OR REFRESH
statement.
For detailed syntax on defining tables with row filters and column masks, please refer to Delta Live Tables SQL language reference and Delta Live Tables Python language reference.
Behavior
The following are important details when using row filters or column masks in Delta Live Tables pipelines:
- Refresh as owner: When a pipeline update refreshes a materialized view or streaming table, row filter and column mask functions run with the pipeline owner's rights. This means the table refresh uses the security context of the user who created the pipeline. Functions that check user context (such as
CURRENT_USER
andIS_MEMBER
) are evaluated using the pipeline owner's user context. - Query: When querying a materialized view or streaming table, functions that check user context (such as
CURRENT_USER
andIS_MEMBER
) are evaluated using the invoker's user context. This approach enforces user-specific data security and access controls based on the current user's context. - When creating materialized views over source tables that contain row filters and column masks, the refresh of the materialized view is always a full refresh. A full refresh reprocesses all data available in the source with the latest definitions. This process checks that security policies on the source tables are evaluated and applied with the most up-to-date data and definitions.
Observability
Use DESCRIBE EXTENDED
, INFORMATION_SCHEMA
, or the Catalog Explorer to examine the existing row filters and column masks that apply to a given materialized view or streaming table. This functionality allows users to audit and review data access and protection measures on materialized views and streaming tables.