Lineage system tables reference

Article
11/24/2023

Important

This feature is in Public Preview.

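This article provides an overview of the two lineage system tables. These system tables build on Unity Catalog's data lineage feature and let you query lineage data programmatically to support decision making and reporting.

There are two lineage system tables: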

system.access.table_lineage
system.access.column_lineage

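Note

Both lineage tables represent a subset of all read/write events, because lineage cannot always be captured. Records are emitted only when lineage can be inferred.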

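Table lineage table

The table lineage system table includes a record for each read or write event on a Unity Catalog table or path. This includes, but is not limited to, job runs, notebook runs, and dashboards updated with the read or write event.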

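Column lineage table

The column lineage table does not include events that have no source. For example, if you insert data into a column using explicit values, the event is not captured. If you read a column, the event is captured whether or not you write the output. Column lineage is not supported for Delta Live Tables.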

Lineage system table schema

The lineage system tables use the following schema. The table lineage schema does not include source_column_name and target_column_name.

Column name	Data type	Description	Example
`account_id`	string	The ID of the Azure Databricks account.	`7af234db-66d7-4db3-bbf0-956098224879`
`metastore_id`	string	The ID of the Unity Catalog metastore.	`5a31ba44-bbf4-4174-bf33-e1fa078e6765`
`workspace_id`	string	The ID of the workspace.	`123456789012345`
`entity_type`	string	The type of entity the lineage transaction was captured from. The value is `NOTEBOOK`, `JOB`, `PIPELINE`, `DBSQL_DASHBOARD`, `DBSQL_QUERY`, or `NULL`.	`NOTEBOOK`
`entity_id`	string	The ID of the entity the lineage transaction was captured from. If `entity_type` is `NULL`, `entity_id` is `NULL`.	Notebook: `23098402394234`; Job: `23098402394234`; Databricks SQL query: `e9cd8a31-de2f-4206-adfa-4f6605d68d88`; Databricks SQL dashboard: `e9cd8a31-de2f-4206-adfa-4f6605d68d88`; Pipeline: `e9cd8a31-de2f-4206-adfa-4f6605d68d88`
`entity_run_id`	string	The ID that describes the unique run of the entity, or `NULL`. The value differs by entity type: Notebook: command_run_id; Job: job_run_id; Databricks SQL query: query_run_id; Databricks SQL dashboard: query_run_id; Pipeline: pipeline_update_id. If `entity_type` is `NULL`, `entity_run_id` is `NULL`.	Notebook: `23098402394234`; Job: `23098402394234`; Databricks SQL query: `e9cd8a31-de2f-4206-adfa-4f6605d68d88`; Databricks SQL dashboard: `e9cd8a31-de2f-4206-adfa-4f6605d68d88`; Pipeline: `e9cd8a31-de2f-4206-adfa-4f6605d68d88`
`source_table_full_name`	string	The three-part name that identifies the source table.	`catalog.schema.table`
`source_table_catalog`	string	The catalog of the source table.	`catalog`
`source_table_schema`	string	The schema of the source table.	`catalog.schema`
`source_table_name`	string	The name of the source table.	`table`
`source_path`	string	The location of the source table in cloud storage, or the path if the data is read directly from cloud storage.	`abfss://my-container-name@storage-account-name.dfs.core.chinacloudapi.cn/table1`
`source_type`	string	The type of the source. The value is `TABLE`, `PATH`, `VIEW`, or `STREAMING_TABLE`.	`TABLE`
`source_column_name`	string	The name of the source column.	`date`
`target_table_full_name`	string	The three-part name that identifies the target table.	`catalog.schema.table`
`target_table_catalog`	string	The catalog of the target table.	`catalog`
`target_table_schema`	string	The schema of the target table.	`catalog.schema`
`target_table_name`	string	The name of the target table.	`table`
`target_path`	string	The location of the target table in cloud storage.	`abfss://my-container-name@storage-account-name.dfs.core.chinacloudapi.cn/table1`
`target_type`	string	The type of the target. The value is `TABLE`, `PATH`, `VIEW`, or `STREAMING_TABLE`.	`TABLE`
`target_column_name`	string	The name of the target column.	`date`
`created_by`	string	The user who generated this lineage. This can be an Azure Databricks username, an Azure Databricks service principal ID, "System-User", or `NULL` if the user information cannot be captured.	`crampton.rods@email.com`
`event_time`	timestamp	The timestamp when the lineage was generated.	`2023-06-20T19:47:21.194+0000`
`event_date`	date	The date when the lineage was generated. This is a partition column.	`2023-06-20`
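Because `event_date` is described as a partition column, restricting queries to a date range is a reasonable way to limit how much of the table gets scanned. The following is a minimal sketch that only builds the SQL text; the function name `recent_lineage_query` and its parameters are hypothetical and not part of any Databricks API, and the actual scan savings are an assumption rather than a documented guarantee.

```python
from datetime import date, timedelta


def recent_lineage_query(table="system.access.table_lineage", days=7, today=None):
    """Build a SQL string that filters lineage rows to recent event_date values.

    Hypothetical helper for illustration only; `today` can be injected for testing.
    """
    today = today or date.today()
    start = today - timedelta(days=days)
    # DATE '<yyyy-mm-dd>' is a typed date literal in Spark SQL.
    return (
        f"SELECT * FROM {table} "
        f"WHERE event_date >= DATE '{start.isoformat()}'"
    )
```

In a notebook, the returned string could be passed to `spark.sql(...)`. Swapping in `system.access.column_lineage` for the `table` argument would produce the equivalent column lineage query.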

Reading lineage system tables

Note the following when analyzing lineage system tables:

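- For entity_type, Azure Databricks supports Delta Live Tables, notebooks, jobs, Databricks SQL queries, and dashboards. Events from other entities are not supported.
- If entity_type is null, no Azure Databricks entity was involved in the event. For example, the event could be the result of a JDBC query or of a user clicking the Sample Data tab in the Azure Databricks UI.
- To determine whether an event was a read or a write, look at the source type and the target type.
  - Read-only: the source type is not null, but the target type is null.
  - Write-only: the target type is not null, but the source type is null.
  - Read and write: the source type and the target type are both not null.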
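The read/write rules above can be sketched as a small classifier. This is an illustrative helper, not part of any Databricks API: the function name `classify_event` and the returned labels are assumptions, and Python's `None` stands in for a SQL null.

```python
def classify_event(source_type, target_type):
    """Classify a lineage record from its source_type and target_type values.

    None represents a SQL NULL. The labels returned here are hypothetical.
    """
    has_source = source_type is not None
    has_target = target_type is not None
    if has_source and has_target:
        return "read-and-write"
    if has_source:
        return "read-only"
    if has_target:
        return "write-only"
    # Neither side is set; the rules above do not cover this case.
    return "unknown"
```

For example, a record with `source_type` of `TABLE` and a null `target_type` is classified as read-only.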

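Lineage system table example

To illustrate how lineage is recorded in the system tables, here is an example query followed by the lineage records that the query creates: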

CREATE OR REPLACE TABLE car_features
AS SELECT *,  in1+in2 as premium_feature_set
FROM car_features_exterior
JOIN car_features_interior
USING(id, model);

The records in system.access.table_lineage look like this:

`entity_type`	`entity_id`	`source_table_name`	`target_table_name`	`created_by`	`event_time`
`NOTEBOOK`	`27080565267`	`car_features_exterior`	`car_features`	`crampton@email.com`	`2023-01-25T16:19:58.908+0000`
`NOTEBOOK`	`27080565267`	`car_features_interior`	`car_features`	`crampton@email.com`	`2023-01-25T16:19:58.908+0000`

The records in system.access.column_lineage look like this:

`entity_type`	`entity_id`	`source_table_name`	`target_table_name`	`source_column_name`	`target_column_name`	`event_time`
`NOTEBOOK`	`27080565267`	`car_features_interior`	`car_features`	`in1`	`premium_feature_set`	`2023-01-25T16:19:58.908+0000`
`NOTEBOOK`	`27080565267`	`car_features_interior`	`car_features`	`in2`	`premium_feature_set`	`2023-01-25T16:19:58.908+0000`

Note

Not all lineage columns are shown in the example above. For the full schema, see the lineage schema above.
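To make the shape of these records concrete, the sketch below copies the example rows into plain Python dictionaries and groups them by target. It runs without Spark; the helper names `upstream_tables` and `upstream_columns` are hypothetical, and only the columns needed for the grouping are included.

```python
from collections import defaultdict

# Rows copied from the example records (subset of columns).
table_lineage = [
    {"source_table_name": "car_features_exterior", "target_table_name": "car_features"},
    {"source_table_name": "car_features_interior", "target_table_name": "car_features"},
]
column_lineage = [
    {"source_table_name": "car_features_interior", "source_column_name": "in1",
     "target_table_name": "car_features", "target_column_name": "premium_feature_set"},
    {"source_table_name": "car_features_interior", "source_column_name": "in2",
     "target_table_name": "car_features", "target_column_name": "premium_feature_set"},
]


def upstream_tables(records):
    """Map each target table to the sorted list of tables it was built from."""
    sources = defaultdict(set)
    for row in records:
        sources[row["target_table_name"]].add(row["source_table_name"])
    return {target: sorted(names) for target, names in sources.items()}


def upstream_columns(records):
    """Map each (target table, target column) to its sorted source columns."""
    sources = defaultdict(set)
    for row in records:
        key = (row["target_table_name"], row["target_column_name"])
        sources[key].add((row["source_table_name"], row["source_column_name"]))
    return {key: sorted(cols) for key, cols in sources.items()}
```

Here `upstream_tables(table_lineage)` shows that `car_features` depends on both source tables, while `upstream_columns(column_lineage)` shows that `premium_feature_set` is derived from `in1` and `in2`.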

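Troubleshooting external table queries

When you reference an external table by its cloud storage path, the associated lineage record includes only the path name, not the table name. For example, the lineage record for this query would include the path name and not the table name: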

SELECT * FROM delta.`abfss://my-container-name@storage-account-name.dfs.core.chinacloudapi.cn/table1`;

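To query lineage records for an external table that was referenced by path, filter on source_path or target_path instead of source_table_full_name or target_table_full_name. For example, the following query pulls all lineage records for an external table: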

SELECT *
FROM system.access.table_lineage
WHERE
  source_path = 'abfss://my-container-name@storage-account-name.dfs.core.chinacloudapi.cn/table1' OR
  target_path = 'abfss://my-container-name@storage-account-name.dfs.core.chinacloudapi.cn/table1';

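Example: Retrieve lineage records based on external table name

If you don't want to look up the cloud storage path manually to find lineage, you can use the following function to get lineage data by table name. To query column lineage, replace system.access.table_lineage with system.access.column_lineage in the function.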

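def getLineageForTable(table_name):
  """Return lineage rows that reference table_name by name or by storage path."""
  # `spark` is the SparkSession that Databricks notebooks provide.
  # Pass the three-part name (catalog.schema.table) so the name comparison can match.
  # DESCRIBE DETAIL exposes the table's cloud storage location.
  table_path = spark.sql(f"describe detail {table_name}").select("location").head()[0]

  df = spark.read.table("system.access.table_lineage")
  # Match on the table name and on the path, because path-based
  # references record only source_path or target_path.
  return df.where(
    (df.source_table_full_name == table_name)
    | (df.target_table_full_name == table_name)
    | (df.source_path == table_path)
    | (df.target_path == table_path)
  )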

Then use the following command to call the function and display the lineage records for the external table:

display(getLineageForTable("table_name"))
