SparkCruise on Azure HDInsight

This article discusses SparkCruise, an Azure HDInsight feature that automatically reuses Apache Spark computations to improve query efficiency.

Overview

The queries that you run on an analytics platform such as Apache Spark are decomposed into query plans containing smaller subqueries. These subqueries may show up repeatedly across the query plans of multiple queries. Each time they occur, they're re-executed to return results. Re-executing the same query, however, can be inefficient and lead to unnecessary computation costs.

SparkCruise is a workload optimization platform that reuses common computations, decreasing overall query execution time and data transfer costs. The platform uses the concept of a materialized view: a query whose results are stored in precomputed form. When the query shows up again later, those stored results can be reused instead of being recomputed from scratch.
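The idea can be sketched outside of Spark in a few lines of Python. This is a conceptual illustration only, not SparkCruise's implementation (SparkCruise works on subplans inside the Spark optimizer, and the signature scheme here is an assumption for the sketch): a result is computed once, stored under a signature of the query, and served from the store on repeat occurrences.

```python
# Conceptual sketch of computation reuse through a materialized-result store.
# (Illustration only; not the actual SparkCruise mechanism.)
import hashlib

store = {}  # signature -> materialized result


def signature(query: str) -> str:
    """Derive a stable identifier for a query's text."""
    return hashlib.sha256(query.encode()).hexdigest()


def run(query: str, compute):
    """Return a stored result if this query was seen before; else compute and store it."""
    sig = signature(query)
    if sig in store:
        return store[sig]  # reuse: no recomputation
    result = compute()     # materialize: run the expensive computation once
    store[sig] = result
    return result


first = run("select count(*) from hivesampletable", lambda: 42)   # computed (42 is a placeholder)
second = run("select count(*) from hivesampletable", lambda: -1)  # compute is skipped; stored result returned
```

Here the second call never invokes its `compute` function, which is the analogue of a repeated subquery being answered from a materialized view.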

Setup and installation

SparkCruise is available on all HDInsight 4.0 clusters with Spark 2.3 or 2.4. The SparkCruise library files are installed in the /opt/peregrine/ directory on your HDInsight cluster. To work properly, SparkCruise requires the following configuration properties, which are set by default:

  • spark.sql.queryExecutionListeners is set to com.microsoft.peregrine.spark.listeners.PlanLogListener, which enables logging of query plans.
  • spark.sql.extensions is set to com.microsoft.peregrine.spark.extensions.SparkExtensionsHdi, which enables the optimizer rules for online materialization and reuse.
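For reference, the two properties above correspond to entries like the following in the cluster's Spark configuration (shown here in spark-defaults style for illustration; HDInsight sets them for you, so you normally don't need to add them yourself):

```
spark.sql.queryExecutionListeners com.microsoft.peregrine.spark.listeners.PlanLogListener
spark.sql.extensions com.microsoft.peregrine.spark.extensions.SparkExtensionsHdi
```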

Computation Reuse in Spark SQL

The following sample scenario illustrates how to use SparkCruise to optimize Apache Spark queries.

  1. Use SSH to connect to the head node of your Spark cluster.

  2. Type spark-shell.

  3. Run the following code snippet, which runs a few basic queries using sample data on the cluster.

    spark.sql("select count(*) from hivesampletable").collect
    spark.sql("select count(*) from hivesampletable").collect
    spark.sql("select distinct market from hivesampletable where querytime like '11%'").show
    spark.sql("select distinct state, country from hivesampletable where querytime like '11%'").show
    :quit
    
  4. Use the SparkCruise query analysis tool to analyze the query plans of the previous queries, which are stored in the Spark application logs.

    sudo /opt/peregrine/analyze/peregrine.sh analyze views
    
  5. View the output of the analysis process, which is a feedback file. This feedback file contains annotations for future Spark SQL queries.

    sudo /opt/peregrine/analyze/peregrine.sh show
    

The analyze command parses the query plans and creates a tabular representation of the workload. This workload table can be queried with the WorkloadInsights notebook included in the HDInsight SparkCruise Samples repository. The views command then identifies common subplan expressions and selects interesting ones for future materialization and reuse. The output is a feedback file containing annotations for future Spark SQL queries.

The show command displays output similar to the following text:

Feedback file -->

1593761760087311271 Materialize /peregrine/views/1593761760087311271
1593761760087311271 Reuse /peregrine/views/1593761760087311271
18446744073621796959 Materialize /peregrine/views/18446744073621796959
18446744073621796959 Reuse /peregrine/views/18446744073621796959
11259615723090744908 Materialize /peregrine/views/11259615723090744908
11259615723090744908 Reuse /peregrine/views/11259615723090744908
9409467400931056980 Materialize /peregrine/views/9409467400931056980
9409467400931056980 Reuse /peregrine/views/9409467400931056980

Materialized subexpressions -->

Found 4 items
-rw-r--r--   1 sshuser sshuser     113445 2020-07-24 16:46 /peregrine/views/logical_ir.csv
-rw-r--r--   1 sshuser sshuser     169458 2020-07-24 16:46 /peregrine/views/physical_ir.csv
-rw-r--r--   1 sshuser sshuser      25730 2020-07-24 16:46 /peregrine/views/views.csv
-rw-r--r--   1 sshuser sshuser        536 2020-07-24 16:46 /peregrine/views/views.stp

The feedback file contains entries in the following format: subplan-identifier [Materialize|Reuse] input/path/to/action. In this example, there are two unique signatures: one representing the first two repeated queries, and another representing the filter predicate in the last two queries. With this feedback file in place, the following queries, when submitted again, automatically materialize and reuse common subplans.
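As a rough illustration of the entry format described above, the feedback file could be parsed with a few lines of Python. This is not part of SparkCruise (which consumes the file internally); the parser below is a hypothetical sketch that groups Materialize/Reuse actions by subplan identifier.

```python
# Hypothetical parser for feedback-file entries of the form:
#   <subplan-identifier> <Materialize|Reuse> <path>
# (Illustration only; SparkCruise reads this file itself.)

def parse_feedback(lines):
    """Group Materialize/Reuse actions by subplan identifier."""
    entries = {}
    for line in lines:
        parts = line.split()
        # Skip headers, blank lines, and anything that isn't a valid entry.
        if len(parts) != 3 or parts[1] not in ("Materialize", "Reuse"):
            continue
        identifier, action, path = parts
        entries.setdefault(identifier, []).append((action, path))
    return entries


sample = [
    "Feedback file -->",
    "",
    "1593761760087311271 Materialize /peregrine/views/1593761760087311271",
    "1593761760087311271 Reuse /peregrine/views/1593761760087311271",
]

result = parse_feedback(sample)
```

For the sample lines, `result` maps the single identifier to its Materialize and Reuse actions, while the header line is ignored.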

To test the optimizations, execute another set of sample queries.

  1. Type spark-shell.

  2. Execute the following code snippet:

    spark.sql("select count(*) from hivesampletable").collect
    spark.sql("select count(*) from hivesampletable").collect
    spark.sql("select distinct state, country from hivesampletable where querytime like '12%'").show
    spark.sql("select distinct market from hivesampletable where querytime like '12%'").show
    :quit
    
  3. View the feedback file again to see the signatures of the queries that have been reused.

    sudo /opt/peregrine/analyze/peregrine.sh show
    

The show command gives output similar to the following text:

Feedback file -->

1593761760087311271 Materialize /peregrine/views/1593761760087311271
1593761760087311271 Reuse /peregrine/views/1593761760087311271
18446744073621796959 Materialize /peregrine/views/18446744073621796959
18446744073621796959 Reuse /peregrine/views/18446744073621796959
11259615723090744908 Materialize /peregrine/views/11259615723090744908
11259615723090744908 Reuse /peregrine/views/11259615723090744908
9409467400931056980 Materialize /peregrine/views/9409467400931056980
9409467400931056980 Reuse /peregrine/views/9409467400931056980

Materialized subexpressions -->

Found 8 items
drwxr-xr-x   - root root          0 2020-07-24 17:21 /peregrine/views/11259615723090744908
drwxr-xr-x   - root root          0 2020-07-24 17:21 /peregrine/views/1593761760087311271
drwxr-xr-x   - root root          0 2020-07-24 17:22 /peregrine/views/18446744073621796959
drwxr-xr-x   - root root          0 2020-07-24 17:21 /peregrine/views/9409467400931056980
-rw-r--r--   1 root root     113445 2020-07-24 16:46 /peregrine/views/logical_ir.csv
-rw-r--r--   1 root root     169458 2020-07-24 16:46 /peregrine/views/physical_ir.csv
-rw-r--r--   1 root root      25730 2020-07-24 16:46 /peregrine/views/views.csv
-rw-r--r--   1 root root        536 2020-07-24 16:46 /peregrine/views/views.stp

Although the literal value in the query has changed from '11%' to '12%', SparkCruise can still match previous queries to new queries with slight variations, such as changes in literal values or dataset versions. If a query changes significantly, you can rerun the analysis tool to identify new queries whose results can be reused.

Behind the scenes, SparkCruise triggers a subquery that materializes each selected subplan from the first query that contains it. Later queries can then read the materialized subplans directly instead of recomputing them. In this workload, the subplans are materialized in an online fashion by the first and third queries. After the common subplans are materialized, you can see the plan change in the affected queries.

You can see that there are now four new materialized subexpressions available for reuse in subsequent queries.

Clean up

The feedback files, materialized subplans, and query logs persist across Spark sessions. To remove these files, run the following command:

sudo /opt/peregrine/analyze/peregrine.sh clean

Next steps