Improve performance of Apache Spark workloads using Azure HDInsight IO Cache

IO Cache is a data caching service for Azure HDInsight that improves the performance of Apache Spark jobs. IO Cache also works with Apache Tez and Apache Hive workloads, which can be run on Apache Spark clusters. IO Cache uses an open-source caching component called RubiX. RubiX is a local disk cache for big data analytics engines that access data from cloud storage systems. RubiX is unique among caching systems because it uses solid-state drives (SSDs) rather than reserving operating memory for caching. The IO Cache service launches and manages RubiX Metadata Servers on each worker node of the cluster. It also configures all services of the cluster for transparent use of the RubiX cache.

Most SSDs provide more than 1 GB per second of bandwidth. This bandwidth, complemented by the operating system's in-memory file cache, is enough to feed big data compute engines such as Apache Spark. The operating memory is left available for Apache Spark to process heavily memory-dependent tasks, such as shuffles. Having exclusive use of operating memory allows Apache Spark to achieve optimal resource usage.


IO Cache currently uses RubiX as a caching component, but this may change in future versions of the service. Use the IO Cache interfaces and don't take any dependencies directly on the RubiX implementation. IO Cache is only supported with Azure Blob Storage at this time.

Benefits of Azure HDInsight IO Cache

Using IO Cache provides a performance increase for jobs that read data from Azure Blob Storage.

You don't have to make any changes to your Spark jobs to see performance increases when using IO Cache. When IO Cache is disabled, this Spark code reads data remotely from Azure Blob Storage: spark.read.load('wasbs:///myfolder/data.parquet').count(). When IO Cache is activated, the same line of code causes a cached read through IO Cache. On subsequent reads, the data is read locally from SSD. Worker nodes on an HDInsight cluster are equipped with locally attached, dedicated SSD drives. HDInsight IO Cache uses these local SSDs for caching, which provides the lowest latency and maximizes bandwidth.
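One way to observe the effect described above is to time the same read several times. The following is a minimal sketch, not part of the service: the helper name time_reads and the stand-in workload fake_read are hypothetical, and the Spark call appears only in a comment because it needs a live cluster and an existing spark session.

```python
import time
from typing import Callable, List


def time_reads(read_fn: Callable[[], object], repeats: int = 3) -> List[float]:
    """Call read_fn `repeats` times and return each call's wall-clock duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        read_fn()
        durations.append(time.perf_counter() - start)
    return durations


if __name__ == "__main__":
    # Hypothetical stand-in workload so the sketch runs without a cluster.
    # On an HDInsight Spark cluster, replace it with the read from the text:
    #   read_fn = lambda: spark.read.load('wasbs:///myfolder/data.parquet').count()
    def fake_read():
        return sum(range(100_000))

    for i, seconds in enumerate(time_reads(fake_read), start=1):
        print(f"read {i}: {seconds:.4f} s")
```

With IO Cache activated, the first call would be expected to include the remote fetch from Azure Blob Storage, while later calls should be served from the local SSD. Actual timings depend on the cluster, the data size, and any other caching in play.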

Getting started

Azure HDInsight IO Cache is deactivated by default in preview. IO Cache is available on Azure HDInsight 3.6+ Spark clusters, which run Apache Spark 2.3. To activate IO Cache on HDInsight 4.0, do the following steps:

  1. From a web browser, navigate to https://CLUSTERNAME.azurehdinsight.cn, where CLUSTERNAME is the name of your cluster.

  2. Select the IO Cache service on the left.

  3. Select Actions (Service Actions in HDI 3.6), and then select Activate.

    Enabling the IO Cache service in Ambari

  4. Confirm the restart of all affected services on the cluster.


Even though the progress bar shows the service as activated, IO Cache isn't actually enabled until you restart the other affected services.


You may get disk space errors when running Spark jobs after enabling IO Cache. These errors occur because Spark also uses local disk storage to hold data during shuffle operations. Once IO Cache is enabled, less space is left for Spark storage, and Spark may run out of SSD space. By default, IO Cache uses half of the total SSD space. The disk space usage for IO Cache is configurable in Ambari. If you get disk space errors, reduce the amount of SSD space used for IO Cache and restart the service. To change the space set for IO Cache, do the following steps:

  1. In Apache Ambari, select the HDFS service on the left.

  2. Select the Configs tab, and then the Advanced tab.

    Edit HDFS Advanced Configuration

  3. Scroll down and expand the Custom core-site area.

  4. Locate the property hadoop.cache.data.fullness.percentage.

  5. Change the value in the box.

    Edit IO Cache Fullness Percentage

  6. Select Save on the upper right.

  7. Select Restart > Restart All Affected.

    Apache Ambari restart all affected

  8. Select Confirm Restart All.

If reducing the cache space does not resolve the errors, disable IO Cache.
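To pick a value before editing the property, it can help to work out how much SSD space a given setting leaves for Spark. The sketch below is illustrative arithmetic only. It assumes that hadoop.cache.data.fullness.percentage is a percentage of the total SSD capacity and that the default is 50, which is inferred from the "half of the total SSD space" statement above. The function name io_cache_space_gb and the 400 GB disk size are hypothetical.

```python
def io_cache_space_gb(total_ssd_gb: float, fullness_percentage: float = 50.0) -> dict:
    """Split the SSD capacity between IO Cache and everything else (for example, Spark shuffle data).

    Assumption: fullness_percentage is the share of total SSD capacity given to IO Cache.
    """
    if total_ssd_gb < 0:
        raise ValueError("total_ssd_gb must be non-negative")
    if not 0 <= fullness_percentage <= 100:
        raise ValueError("fullness_percentage must be between 0 and 100")
    cache_gb = total_ssd_gb * fullness_percentage / 100.0
    return {"io_cache_gb": cache_gb, "remaining_gb": total_ssd_gb - cache_gb}


if __name__ == "__main__":
    # Hypothetical 400 GB SSD: compare the assumed default with a reduced setting.
    for pct in (50, 25):
        print(pct, io_cache_space_gb(400, pct))
```

Lowering the percentage trades cache capacity for Spark's local storage, so a smaller value is the first thing to try when shuffle-heavy jobs report disk space errors.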

Next Steps