“数据跳过”索引Data skipping index

重要

Databricks Runtime 7.0 中已经删除了 DATASKIPPING INDEXDATASKIPPING INDEX was removed in Databricks Runtime 7.0. 建议改用 Delta 表,它提供了经过改进的数据跳过功能We recommend that you use Delta tables instead, which offer improved data skipping capabilities.

说明Description

除了删除分区之外,Databricks Runtime 还有另一项功能可用于避免扫描不相关数据,即“数据跳过”索引。In addition to Partition pruning, Databricks Runtime includes another feature that is meant to avoid scanning irrelevant data, namely the Data Skipping Index. 该功能使用文件级统计信息,以便在文件粒度执行更多的跳过。It uses file-level statistics in order to perform additional skipping at file granularity. 此功能适用于(但不依赖于)Hive 样式分区。This works with, but does not depend on, Hive-style partitioning.

“数据跳过”的有效性如何,取决于数据的特征以及数据的物理布局。The effectiveness of data skipping depends on the characteristics of your data and its physical layout. 由于跳过是在文件粒度执行的,因此,将数据在多个文件之间水平进行分区是很重要的。As skipping is done at file granularity, it is important that your data is horizontally partitioned across multiple files. 发生这种情况通常是因为有多个追加作业、执行(随机)分区操作、执行 Bucket 存储操作,以及/或者使用了 spark.sql.files.maxRecordsPerFileThis will typically happen as a consequence of having multiple append jobs, (shuffle) partitioning, bucketing, and/or the use of spark.sql.files.maxRecordsPerFile. 此功能不仅最适用于具有已排序的 Bucket 的表 (df.write.bucketBy(...).sortBy(...).saveAsTable(...) / CREATE TABLE ... CLUSTERED BY ... SORTED BY ...) 或具有与分区键(例如 brandName - modelNamecompanyID - stockPrice)相关的列的表,而且也最适用于数据正好表现出某些排序性/群聚性(例如 orderIDbitcoinValue)的情况。It works best on tables with sorted buckets (df.write.bucketBy(...).sortBy(...).saveAsTable(...) / CREATE TABLE ... CLUSTERED BY ... SORTED BY ...), or with columns that are correlated with partition keys (for example, brandName - modelName, companyID - stockPrice), but also when your data just happens to exhibit some sortedness / clusteredness (for example, orderID, bitcoinValue).

备注

此 beta 版本功能具有许多重要限制:This beta feature has a number of important limitations :

  • 此功能是可选的:它需要基于单个表手动启用。It’s Opt-In: needs to be enabled manually, on a per-table basis.
  • 此功能仅适用于 SQL:没有适用于此功能的数据帧 API。It’s SQL only: there is no DataFrame API for it.
  • 为表编制索引后,在对索引显式执行 REFRESH 操作之前,无法保证后续 INSERTADD PARTITION 操作的效果是可见的。Once a table is indexed, the effects of subsequent INSERT or ADD PARTITION operations are not guaranteed to be visible until the index is explicitly REFRESHed.

SQL 语法SQL Syntax

创建索引Create Index

CREATE DATASKIPPING INDEX ON [TABLE] [db_name.]table_name

针对所提供的表为前(即最左侧)N 个受支持的列启用“数据跳过”,其中 N 由 spark.databricks.io.skipping.defaultNumIndexedCols 控制(默认值 :32)Enables Data Skipping on the given table for the first (i.e. left-most) N supported columns, where N is controlled by spark.databricks.io.skipping.defaultNumIndexedCols (default: 32)

partitionBy 列始终会被编制索引,不计入此 N 值。partitionBy columns are always indexed and do not count towards this N .

为列创建索引Create Index For Columns

CREATE DATASKIPPING INDEX ON [TABLE] [db_name.]table_name
    FOR COLUMNS (col1, ...)

针对所提供的表为指定的列的列表启用“数据跳过”。Enables Data Skipping on the given table for the specified list of columns. 与上述情形相同,除了指定的列之外,所有的 partitionBy 列也始终都会被编制索引。Same as above, all partitionBy columns will always be indexed in addition to the ones specified.

描述索引Describe Index

DESCRIBE DATASKIPPING INDEX [EXTENDED] ON [TABLE] [db_name.]table_name

显示所提供的表中哪些列已被编制索引,并且显示所收集的相应类型的文件级统计信息。Displays which columns of the given table are indexed, along with the corresponding types of file-level statistic that are collected.

如果指定了 EXTENDED,会显示名为“effectiveness_score”的第三列。该列提供了一个近似的度量,用于表示我们预期 DataSkipping 有益于相应列上筛选器的程度。If EXTENDED is specified, a third column called “effectiveness_score” is displayed that gives an approximate measure of how beneficial we expect DataSkipping to be for filters on the corresponding columns.

刷新完整索引Refresh Full Index

REFRESH DATASKIPPING INDEX ON [TABLE] [db_name.]table_name

重新生成整个索引。Rebuilds the whole index. 即,将为表的所有分区重新编制索引。I.e. all the table’s partitions will be re-indexed.

刷新分区Refresh Partitions

REFRESH DATASKIPPING INDEX ON [TABLE] [db_name.]table_name
    PARTITION (part_col_name1[=val1], part_col_name2[=val2], ...)

只为指定的分区重新编制索引。Re-indexes the specified partitions only. 此操作的速度通常应该会比完整索引刷新的速度快。This operation should generally be faster than full index refresh.

删除索引Drop Index

DROP DATASKIPPING INDEX ON [TABLE] [db_name.]table_name

对所提供的表禁用“数据跳过”,并删除所有索引数据。Disables Data Skipping on the given table and deletes all index data.