Lineage system tables reference

Important

This feature is in Public Preview.

This article provides an overview of the two lineage system tables. These system tables build on Unity Catalog's data lineage feature, allowing you to programmatically query lineage data to fuel decision making and reports.

There are two lineage system tables:

  • system.access.table_lineage
  • system.access.column_lineage

Note

Both lineage tables represent a subset of all read/write events, as it is not always possible to capture lineage. Records are only emitted when lineage can be inferred.

Table lineage table

The table lineage system table includes a record for each read or write event on a Unity Catalog table or path. This includes but is not limited to job runs, notebook runs, and dashboards updated with the read or write event.

Column lineage table

The column lineage table does not include events that do not have a source. For example, if you insert into a column using explicit values, it is not captured. If you read a column, it is captured whether or not you write the output. Column lineage is not supported for Delta Live Tables.

Lineage system table schema

The lineage system tables use the following schema. The table lineage schema doesn't include source_column_name and target_column_name.

Column name Data type Description Example
account_id string The ID of the Azure Databricks account. 7af234db-66d7-4db3-bbf0-956098224879
metastore_id string The ID of the Unity Catalog metastore. 5a31ba44-bbf4-4174-bf33-e1fa078e6765
workspace_id string The ID of the workspace 123456789012345
entity_type string The type of entity the lineage transaction was captured from. The value is NOTEBOOK, JOB, PIPELINE, DBSQL_DASHBOARD, DBSQL_QUERY, OR NULL. NOTEBOOK
entity_id string The ID of the entity the lineage transaction was captured from. If entity_type is NULL, entity_id is NULL. * Notebook: 23098402394234
* Job: 23098402394234
* Databricks SQL query: e9cd8a31-de2f-4206-adfa-4f6605d68d88
* Databricks SQL dashboard: e9cd8a31-de2f-4206-adfa-4f6605d68d88
* Pipeline: e9cd8a31-de2f-4206-adfa-4f6605d68d88
entity_run_id string ID to describe the unique run of the entity, or NULL. This differs for each entity type:

* Notebook: command_run_id
* Job: job_run_id
* Databricks SQL query: query_run_id
* Databricks SQL dashboard: query_run_id
* Pipeline: pipeline_update_id

If entity_type is NULL, entity_run_id is NULL.
* Notebook: 23098402394234
* Job: 23098402394234
* Databricks SQL query: e9cd8a31-de2f-4206-adfa-4f6605d68d88
* Databricks SQL dashboard: e9cd8a31-de2f-4206-adfa-4f6605d68d88
* Pipeline: e9cd8a31-de2f-4206-adfa-4f6605d68d88
source_table_full_name string Three-part name to identify the source table. catalog.schema.table
source_table_catalog string The catalog of the source table. catalog
source_table_schema string The schema of the source table. catalog.schema
source_table_name string The name of the source table. table
source_path string Location in cloud storage of the source table, or the path if it's reading from cloud storage directly. abfss://my-container-name@storage-account-name.dfs.core.chinacloudapi.cn/table1
source_type string The type of the source. The value is TABLE, PATH, VIEW, or STREAMING_TABLE. TABLE
source_column_name string The name of the source column. date
target_table_full_name string Three-part name to identify the target table. catalog.schema.table
target_table_catalog string The catalog of the target table. catalog
target_table_schema string The schema of the target table. catalog.schema
target_table_name string The name of the target table. table
target_path string Location in cloud storage of the target table. abfss://my-container-name@storage-account-name.dfs.core.chinacloudapi.cn/table1
target_type string The type of the target. The value is TABLE, PATH, VIEW,or STREAMING TABLE. TABLE
target_column_name string The name of the target column. date
created_by string The user who generated this lineage. This can be an Azure Databricks username, an Azure Databricks service principal ID, "System-User", or NULL if the user information cannot be captured. crampton.rods@email.com
event_time timestamp The timestamp when the lineage was generated. 2023-06-20T19:47:21.194+0000
event_date date The date when the lineage was generated. This is a partitioned column. 2023-06-20

Reading lineage system tables

Take notice of the following considerations when analyzing lineage system tables:

  • For the entity_type, Azure Databricks supports Delta Live Tables, notebooks, jobs, Databricks SQL queries, and dashboards. Events from other entities aren't supported.
  • If you see the entity_type as null, it means no Azure Databricks entity is involved in the event. For example, it could be the result of a JDBC query or from a user clicking into the Sample Data tab in the Azure Databricks UI.
  • To determine whether the event was a read or a write, you can view the source type and the target type.
    • Read-only: The source type is not null, but target type is null.
    • Write-only: The target type is not null, but the source type is null.
    • Read and write: The source type and the target type are not null.

Lineage system table example

As an example of how lineage is recorded in system tables, here is an example query followed by the lineage records the query creates:

CREATE OR REPLACE TABLE car_features
AS SELECT *,  in1+in2 as premium_feature_set
FROM car_features_exterior
JOIN car_features_interior
USING(id, model);

The record in system.access.table_lineage would look like:

entity_type entity_id source_table_name target_table_name created_by event_time
NOTEBOOK 27080565267 car_features_exterior car_features crampton@email.com 2023-01-25T16:19:58.908+0000
NOTEBOOK 27080565267 car_features_interior car_features crampton@email.com 2023-01-25T16:19:58.908+0000

The record in system.access.column_lineage would look like:

entity_type entity_id source_table_name target_table_name source_column_name target_column_name event_time
NOTEBOOK 27080565267 car_features_interior car_features in1 premium_feature_set 2023-01-25T16:19:58.908+0000
NOTEBOOK 27080565267 car_features_interior car_features in2 premium_feature_set 2023-01-25T16:19:58.908+0000

Note

Not all lineage columns are shown in above example. For the full schema, see the above lineage schema.

Troubleshooting external table queries

When you reference an external table using its cloud storage path, the associated lineage record only includes the path name and not the table name. As an example, the lineage record for this query would include the path name and not the table name:

SELECT * FROM delta.`abfss://my-container-name@storage-account-name.dfs.core.chinacloudapi.cn/table1`;

If you are trying to query lineage records for an external table referenced by path, you will need to filter the query using source_path or target_path instead of source_table_full_name or target_table_full_name. For example, the following query pulls all lineage records for an external table:

SELECT *
FROM system.access.table_lineage
WHERE
  source_path = "abfss://my-container-name@storage-account-name.dfs.core.chinacloudapi.cn/table1" OR
  target_path = "abfss://my-container-name@storage-account-name.dfs.core.chinacloudapi.cn/table1";

Example: Retrieve lineage records based on external table name

If you don't want to manually retrieve the cloud storage path to find lineage, you can use the following function to get lineage data using the table name. You can also replace system.access.table_lineage with system.access.column_lineage in the function if you want to query column lineage.

def getLineageForTable(table_name):
  table_path = spark.sql(f"describe detail {table_name}").select("location").head()[0]

  df = spark.read.table("system.access.table_lineage")
  return df.where(
    (df.source_table_full_name == table_name)
    | (df.target_table_full_name == table_name)
    | (df.source_path == table_path)
    | (df.target_path == table_path)
  )

Then use the following command to call the function and display lineage records for the external table:

display(getLineageForTable("table_name"))