Bloom filter indexes

A Bloom filter index is a space-efficient data structure that enables data skipping on chosen columns, particularly for fields containing arbitrary text. The Bloom filter operates by either stating that data is definitively not in the file, or that it is probably in the file, with a defined false positive probability (FPP).

Azure Databricks supports file-level Bloom filters; each data file can have a single Bloom filter index file associated with it. Before reading a file, Azure Databricks checks the index file, and the file is read only if the index indicates that the file might match a data filter. Azure Databricks always reads the data file if an index does not exist or if a Bloom filter is not defined for a queried column.

The size of a Bloom filter depends on the number of elements in the set for which the Bloom filter has been created and the required FPP. The lower the FPP, the higher the number of bits used per element and the more accurate the filter will be, at the cost of more disk space and slower downloads. For example, an FPP of 10% requires 5 bits per element.
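As a rough guide (this is the standard Bloom filter sizing relationship, not anything specific to Azure Databricks), the number of bits per element is approximately log2(1/FPP) / ln 2 ≈ 1.44 × log2(1/FPP). For FPP = 0.1 this gives roughly 1.44 × 3.32 ≈ 4.8 bits, which rounds up to the 5 bits per element quoted above.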

A Bloom filter index is an uncompressed Parquet file that contains a single row. Indexes are stored in the _delta_index subdirectory relative to the data file and use the same name as the data file, with the suffix index.v1.parquet. For example, the index for the data file dbfs:/db1/data.0001.parquet.snappy would be named dbfs:/db1/_delta_index/data.0001.parquet.snappy.index.v1.parquet.
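On storage, the example above corresponds to a layout along these lines:

dbfs:/db1/
  data.0001.parquet.snappy
  _delta_index/
    data.0001.parquet.snappy.index.v1.parquet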

Bloom filters support columns with the following (input) data types: byte, short, int, long, float, double, date, timestamp, and string. Nulls are not added to the Bloom filter, so any null-related filter requires reading the data file. Azure Databricks supports the following data source filters: and, or, in, equals, and equalsnullsafe. Bloom filters are not supported on nested columns.
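For example, assuming the bloom_test table and sha string column used in the CREATE BLOOMFILTER INDEX example below (hypothetical names), an equality filter on the indexed column can benefit from the index, while a NULL check on the same column cannot:

-- Equality filter on an indexed column: files whose Bloom filter reports
-- "definitely not present" can be skipped. 'abc123' is a placeholder value.
SELECT count(*) FROM bloom_test WHERE sha = 'abc123';

-- NULL values are not added to the Bloom filter, so this filter always
-- requires reading the data files.
SELECT count(*) FROM bloom_test WHERE sha IS NULL;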

Configuration

Bloom filters are enabled by default. To disable Bloom filters, set the session-level spark.databricks.io.skipping.bloomFilter.enabled configuration to false.
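For example, the following statement (a sketch using Spark SQL's SET syntax) turns Bloom filter skipping off for the current session:

-- Disable Bloom filter data skipping for this session only.
SET spark.databricks.io.skipping.bloomFilter.enabled = false;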

Create a Bloom filter index

To create a Bloom filter index on a table for all columns, or a subset of columns, for new or rewritten data, use the CREATE BLOOMFILTER INDEX DDL statement. For example, the following statement creates a Bloom filter index on the column sha with an FPP of 0.1 and 50,000,000 distinct items in the column.

CREATE BLOOMFILTER INDEX
ON TABLE bloom_test
FOR COLUMNS(sha OPTIONS (fpp=0.1, numItems=50000000))

Drop a Bloom filter index

To drop all Bloom filters from a table, or for a set of columns within a table, use the DROP BLOOMFILTER INDEX DDL statement. For example:

DROP BLOOMFILTER INDEX ON TABLE bloom_test FOR COLUMNS(sha);

Notebook

The following notebook demonstrates how defining a Bloom filter index speeds up “needle in a haystack” queries.

Bloom filter demo notebook
