Optimize performance with caching

The Delta cache accelerates data reads by creating copies of remote files in the nodes' local storage using a fast intermediate data format. Data is cached automatically whenever a file has to be fetched from a remote location. Successive reads of the same data are then served locally, which significantly improves read speed.

The Delta cache supports reading Parquet files in DBFS, HDFS, Azure Blob storage, and Azure Data Lake Storage Gen2. It does not support other storage formats such as CSV, JSON, and ORC.

Note

The Delta cache works for all Parquet files and is not limited to Delta Lake format files.

Delta and Apache Spark caching

There are two types of caching available in Azure Databricks: Delta caching and Apache Spark caching. Here are the characteristics of each type:

  • Type of stored data: The Delta cache contains local copies of remote data. It can improve the performance of a wide range of queries, but it cannot be used to store the results of arbitrary subqueries. The Spark cache can store the result of any subquery, as well as data stored in formats other than Parquet (such as CSV, JSON, and ORC).
  • Performance: The data stored in the Delta cache can be read and operated on faster than the data in the Spark cache. This is because the Delta cache uses efficient decompression algorithms and outputs data in the optimal format for further processing using whole-stage code generation.
  • Automatic vs manual control: When the Delta cache is enabled, data that has to be fetched from a remote source is automatically added to the cache. This process is fully transparent and does not require any action. However, to preload data into the cache beforehand, you can use the CACHE command (see Cache a subset of the data). When you use the Spark cache, you must manually specify the tables and queries to cache.
  • Disk vs memory-based: The Delta cache is stored entirely on the local disk, so memory is not taken away from other operations within Spark. Due to the high read speeds of modern SSDs, the Delta cache can be fully disk-resident without a negative impact on its performance. In contrast, the Spark cache uses memory.

Note

You can use Delta caching and Apache Spark caching at the same time.

Summary

The following table summarizes the key differences between Delta and Apache Spark caching so that you can choose the best tool for your workflow:

Feature | Delta cache | Apache Spark cache
Stored as | Local files on a worker node. | In-memory blocks, but it depends on storage level.
Applied to | Any Parquet table stored on WASB and other file systems. | Any RDD or DataFrame.
Triggered | Automatically, on the first read (if cache is enabled). | Manually, requires code changes.
Evaluated | Lazily. | Lazily.
Force cache | CACHE and SELECT | .cache + any action to materialize the cache and .persist.
Availability | Can be enabled or disabled with configuration flags, disabled on certain node types. | Always available.
Evicted | Automatically on any file change, manually when restarting a cluster. | Automatically in LRU fashion, manually with unpersist.

Delta cache consistency

The Delta cache automatically detects when data files are created or deleted and updates its content accordingly. You can write, modify, and delete table data with no need to explicitly invalidate cached data.

The Delta cache automatically detects files that have been modified or overwritten after being cached. Any stale entries are automatically invalidated and evicted from the cache.
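The invalidation behavior described above can be pictured with a toy model. The sketch below is not the Delta cache implementation. It assumes, purely for illustration, that a file's size and modification time identify its version; the names FileVersion and ToyFileCache are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileVersion:
    """Hypothetical fingerprint of a remote file: size plus modification time."""
    size: int
    mtime: float


class ToyFileCache:
    """Toy cache keyed by path; each entry remembers the version it was copied from."""

    def __init__(self):
        self._entries = {}  # path -> (FileVersion, cached bytes)

    def read(self, path, current_version, fetch):
        """Return cached bytes if still fresh, otherwise refetch and replace the entry."""
        entry = self._entries.get(path)
        if entry is not None and entry[0] == current_version:
            return entry[1]  # fresh hit: served from the local copy
        # Miss or stale entry: the old copy (if any) is replaced by a new fetch.
        data = fetch(path)
        self._entries[path] = (current_version, data)
        return data
```

A read with an unchanged version is served locally, while a read after the file changes triggers a new fetch, so the caller never has to invalidate anything explicitly.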

Use Delta caching

To use Delta caching, choose a Delta Cache Accelerated worker type when you configure your cluster.

[Image: Cache accelerated cluster]

The Delta cache is enabled by default and configured to use at most half of the space available on the local SSDs provided with the worker nodes.

For configuration options, see Configure the Delta cache.

Cache a subset of the data

To explicitly select a subset of data to be cached, use the following syntax:

CACHE SELECT column_name[, column_name, ...] FROM [db_name.]table_name [ WHERE boolean_expression ]

You don't need to use this command for the Delta cache to work correctly (the data is cached automatically when first accessed). However, it can be helpful when you require consistent query performance.
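As a minimal sketch of how the syntax above might be assembled from Python, the helper below builds the statement text. The function name build_cache_select and the table and column names are hypothetical, and the helper does no escaping or validation of identifiers.

```python
def build_cache_select(table, columns, where=None, db=None):
    """Return a CACHE SELECT statement following the syntax shown above."""
    if not columns:
        raise ValueError("at least one column name is required")
    target = f"{db}.{table}" if db else table
    statement = f"CACHE SELECT {', '.join(columns)} FROM {target}"
    if where:
        statement += f" WHERE {where}"
    return statement


# Hypothetical table and column names, for illustration only.
statement = build_cache_select(
    "events",
    ["event_date", "user_id"],
    where="event_date >= '2021-01-01'",
    db="analytics",
)
print(statement)
# In a notebook the statement could then be run with: spark.sql(statement)
```

Listing only the columns and rows that later queries touch keeps the preloaded data small, which matters because the cache shares a fixed amount of local disk.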

For examples and more details, see

Monitor the Delta cache

You can check the current state of the Delta cache on each of the executors in the Storage tab in the Spark UI.

[Image: Monitor Delta cache]

When a node reaches 100% disk usage, the cache manager discards the least recently used cache entries to make space for new data.
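Least-recently-used eviction can be illustrated with a small, self-contained model. This is a sketch of the general LRU technique, not the cache manager's actual code; the class ToyLruDiskCache and its byte-based capacity are assumptions made for the example.

```python
from collections import OrderedDict


class ToyLruDiskCache:
    """Toy model of a size-bounded cache that evicts least recently used entries."""

    def __init__(self, capacity_bytes):
        self.capacity_bytes = capacity_bytes
        self.used_bytes = 0
        self._entries = OrderedDict()  # key -> size in bytes, least recently used first

    def touch(self, key):
        """Mark an entry as recently used; return True if it was cached."""
        if key in self._entries:
            self._entries.move_to_end(key)
            return True
        return False

    def add(self, key, size_bytes):
        """Insert an entry and return the keys evicted to make room for it."""
        if size_bytes > self.capacity_bytes:
            return []  # larger than the whole cache, so it is not cached
        if key in self._entries:
            self.used_bytes -= self._entries.pop(key)
        evicted = []
        while self.used_bytes + size_bytes > self.capacity_bytes:
            old_key, old_size = self._entries.popitem(last=False)
            self.used_bytes -= old_size
            evicted.append(old_key)
        self._entries[key] = size_bytes
        self.used_bytes += size_bytes
        return evicted


cache = ToyLruDiskCache(capacity_bytes=100)
cache.add("a", 40)
cache.add("b", 40)
cache.touch("a")            # "a" is now more recently used than "b"
print(cache.add("c", 40))   # evicts the least recently used entry
```

In this run the cache is full when "c" arrives, and "b" is evicted because "a" was read more recently.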

Configure the Delta cache

Tip

Azure Databricks recommends that you choose cache-accelerated worker instance types for your clusters. Such instances are automatically configured optimally for the Delta cache.

Configure disk usage

To configure how the Delta cache uses the worker nodes' local storage, specify the following Spark configuration settings during cluster creation:

  • spark.databricks.io.cache.maxDiskUsage - disk space per node reserved for cached data, in bytes
  • spark.databricks.io.cache.maxMetaDataCache - disk space per node reserved for cached metadata, in bytes
  • spark.databricks.io.cache.compression.enabled - whether the cached data is stored in compressed format

Example configuration:

spark.databricks.io.cache.maxDiskUsage 50g
spark.databricks.io.cache.maxMetaDataCache 1g
spark.databricks.io.cache.compression.enabled false
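The setting descriptions give sizes in bytes, while the example uses suffixed values such as 50g. The sketch below converts such strings to byte counts, assuming the suffixes are binary units (1g = 1024^3 bytes) as in Spark's own size strings. That interpretation is an assumption worth checking against your runtime, and size_to_bytes is a hypothetical helper.

```python
import re

# Assumed binary multipliers for size suffixes.
_UNITS = {"": 1, "k": 1024, "m": 1024**2, "g": 1024**3, "t": 1024**4}


def size_to_bytes(value):
    """Convert a size string such as '50g' or '1024' to a number of bytes."""
    match = re.fullmatch(r"(\d+)([kmgt]?)b?", value.strip().lower())
    if match is None:
        raise ValueError(f"unrecognized size string: {value!r}")
    number, unit = match.groups()
    return int(number) * _UNITS[unit]


print(size_to_bytes("50g"))  # 53687091200
print(size_to_bytes("1g"))   # 1073741824
```

Under this assumption, the example configuration reserves 50 GiB per node for cached data and 1 GiB for cached metadata.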

Enable the Delta cache

To enable or disable the Delta cache, run:

spark.conf.set("spark.databricks.io.cache.enabled", "[true | false]")

Disabling the cache does not drop the data that is already in local storage. Instead, it prevents queries from adding new data to the cache and from reading data from the cache.
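In the command above, "[true | false]" stands for one of the two literal strings. A small wrapper can make the choice explicit. The helper below is a sketch: set_delta_cache_enabled is a hypothetical name, and it expects a session object that exposes conf.set, as the spark object in a Databricks notebook does.

```python
CACHE_ENABLED_KEY = "spark.databricks.io.cache.enabled"


def set_delta_cache_enabled(spark, enabled):
    """Set the Delta cache flag on the given session to "true" or "false"."""
    spark.conf.set(CACHE_ENABLED_KEY, "true" if enabled else "false")


# In a notebook, where `spark` is predefined:
#   set_delta_cache_enabled(spark, False)   # stop using the cache for new queries
#   set_delta_cache_enabled(spark, True)    # resume using it
```

Because disabling the cache leaves the cached files in place, turning the flag back on lets queries read the previously cached data again.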