Data skipping index
Important
DATASKIPPING INDEX
was removed in Databricks Runtime 7.0. We recommend that you use Delta tables instead, which offer improved data skipping capabilities.
Important
This documentation has been retired and might not be updated. The products, services, or technologies mentioned in this content are no longer supported. See Data skipping for Delta Lake.
Description
In addition to partition pruning, Databricks Runtime includes another feature that is meant to avoid scanning irrelevant data, namely the Data Skipping Index. It uses file-level statistics in order to perform additional skipping at file granularity. This works with, but does not depend on, Hive-style partitioning.
The effectiveness of data skipping depends on the characteristics of your data and its physical layout. As skipping is done at file granularity, it is important that your data is horizontally partitioned across multiple files. This will typically happen as a consequence of having multiple append jobs, (shuffle) partitioning, bucketing, and/or the use of spark.sql.files.maxRecordsPerFile
. It works best on tables with sorted buckets (df.write.bucketBy(...).sortBy(...).saveAsTable(...)
/ CREATE TABLE ... CLUSTERED BY ... SORTED BY ...
), or with columns that are correlated with partition keys (for example, brandName - modelName
, companyID - stockPrice
), but also when your data just happens to exhibit some sortedness / clusteredness (for example, orderID
, bitcoinValue
).
Note
This beta feature has a number of important limitations:
- It's Opt-In: needs to be enabled manually, on a per-table basis.
- It's SQL only: there is no DataFrame API for it.
- Once a table is indexed, the effects of subsequent
INSERT
orADD PARTITION
operations are not guaranteed to be visible until the index is explicitly REFRESHed.
SQL Syntax
Create Index
CREATE DATASKIPPING INDEX ON [TABLE] [db_name.]table_name
Enables Data Skipping on the given table for the first (i.e. left-most) N supported columns, where N is controlled by spark.databricks.io.skipping.defaultNumIndexedCols
(default: 32)
partitionBy
columns are always indexed and do not count towards this N.
Create Index For Columns
CREATE DATASKIPPING INDEX ON [TABLE] [db_name.]table_name
FOR COLUMNS (col1, ...)
Enables Data Skipping on the given table for the specified list of columns. Same as above, all partitionBy
columns will always be indexed in addition to the ones specified.
Describe Index
DESCRIBE DATASKIPPING INDEX [EXTENDED] ON [TABLE] [db_name.]table_name
Displays which columns of the given table are indexed, along with the corresponding types of file-level statistic that are collected.
If EXTENDED
is specified, a third column called "effectiveness_score" is displayed that gives an approximate measure of how beneficial we expect DataSkipping to be for filters on the corresponding columns.
Refresh Full Index
REFRESH DATASKIPPING INDEX ON [TABLE] [db_name.]table_name
Rebuilds the whole index. I.e. all the table's partitions will be re-indexed.
Refresh Partitions
REFRESH DATASKIPPING INDEX ON [TABLE] [db_name.]table_name
PARTITION (part_col_name1[=val1], part_col_name2[=val2], ...)
Re-indexes the specified partitions only. This operation should generally be faster than full index refresh.
Drop Index
DROP DATASKIPPING INDEX ON [TABLE] [db_name.]table_name
Disables Data Skipping on the given table and deletes all index data.