Cost-based optimizer

Spark SQL can use a cost-based optimizer (CBO) to improve query plans. This is especially useful for queries with multiple joins. For the CBO to be effective, it is critical to collect table and column statistics and keep them up to date.

Collect statistics

To get the full benefit of the CBO, it is important to collect both column statistics and table statistics. Statistics can be collected using the ANALYZE TABLE command.
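For example, both levels of statistics can be collected with ANALYZE TABLE. The table name below is a placeholder; the column names are taken from the example query plan later in this article:

```sql
-- Collect table-level statistics (row count, size in bytes)
ANALYZE TABLE store_sales COMPUTE STATISTICS;

-- Collect column-level statistics (min, max, distinct count, null count)
-- for the columns used in joins and filters
ANALYZE TABLE store_sales COMPUTE STATISTICS
  FOR COLUMNS ss_store_sk, ss_sold_date_sk;
```

Column statistics matter most for the columns that appear in join conditions and filter predicates, since those are what the CBO uses to estimate row counts.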

Tip

To keep the statistics up to date, run ANALYZE TABLE after writing to the table.

Verify query plans

There are several ways to verify the query plan.

EXPLAIN command

Use the SQL EXPLAIN command to check whether the plan uses statistics. If statistics are missing, the query plan might not be optimal.
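In Spark 3.0 and later, EXPLAIN COST prints the optimized logical plan together with its Statistics annotations. The query below is a sketch reconstructed from the example plan that follows (table names are assumed from the TPC-DS-style column names in the plan):

```sql
-- EXPLAIN COST shows Statistics(sizeInBytes=..., rowCount=...)
-- for each operator in the optimized logical plan
EXPLAIN COST
SELECT s_store_sk, count(1)
FROM store_sales
JOIN store ON s_store_sk = ss_store_sk
JOIN date_dim ON d_date_sk = ss_sold_date_sk
WHERE d_year = 2000 AND d_date = '2000-12-31'
GROUP BY s_store_sk;
```

The output looks like the plan below.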

== Optimized Logical Plan ==
Aggregate [s_store_sk], [s_store_sk, count(1) AS count(1)], Statistics(sizeInBytes=20.0 B, rowCount=1, hints=none)
+- Project [s_store_sk], Statistics(sizeInBytes=18.5 MB, rowCount=1.62E+6, hints=none)
   +- Join Inner, (d_date_sk = ss_sold_date_sk), Statistics(sizeInBytes=30.8 MB, rowCount=1.62E+6, hints=none)
      :- Project [ss_sold_date_sk, s_store_sk], Statistics(sizeInBytes=39.1 GB, rowCount=2.63E+9, hints=none)
      :  +- Join Inner, (s_store_sk = ss_store_sk), Statistics(sizeInBytes=48.9 GB, rowCount=2.63E+9, hints=none)
      :     :- Project [ss_store_sk, ss_sold_date_sk], Statistics(sizeInBytes=39.1 GB, rowCount=2.63E+9, hints=none)
      :     :  +- Filter (isnotnull(ss_store_sk) && isnotnull(ss_sold_date_sk)), Statistics(sizeInBytes=39.1 GB, rowCount=2.63E+9, hints=none)
      :     :     +- Relation[ss_store_sk,ss_sold_date_sk] parquet, Statistics(sizeInBytes=134.6 GB, rowCount=2.88E+9, hints=none)
      :     +- Project [s_store_sk], Statistics(sizeInBytes=11.7 KB, rowCount=1.00E+3, hints=none)
      :        +- Filter isnotnull(s_store_sk), Statistics(sizeInBytes=11.7 KB, rowCount=1.00E+3, hints=none)
      :           +- Relation[s_store_sk] parquet, Statistics(sizeInBytes=88.0 KB, rowCount=1.00E+3, hints=none)
      +- Project [d_date_sk], Statistics(sizeInBytes=12.0 B, rowCount=1, hints=none)
         +- Filter ((((isnotnull(d_year) && isnotnull(d_date)) && (d_year = 2000)) && (d_date = 2000-12-31)) && isnotnull(d_date_sk)), Statistics(sizeInBytes=38.0 B, rowCount=1, hints=none)
            +- Relation[d_date_sk,d_date,d_year] parquet, Statistics(sizeInBytes=1786.7 KB, rowCount=7.30E+4, hints=none)

Important

The rowCount statistic is especially important for queries with multiple joins. If rowCount is missing, it means there is not enough information to calculate it; that is, some required columns do not have statistics.
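You can check whether column-level statistics exist for a specific column with DESCRIBE EXTENDED (the table name is a placeholder):

```sql
-- Shows min, max, distinct_count, and num_nulls for the column;
-- if these are all NULL, run ANALYZE TABLE ... FOR COLUMNS first
DESCRIBE EXTENDED store_sales ss_store_sk;
```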

Spark SQL UI

Use the Spark SQL UI page to see the executed plan and the accuracy of the statistics.

Missing estimate

A line such as rows output: 2,451,005 est: N/A means that this operator produced approximately 2.4 million rows and no statistics were available.

Good estimate

A line such as rows output: 2,451,005 est: 1616404 (1X) means that this operator produced approximately 2.4 million rows, the estimate was approximately 1.6 million, and the estimation error factor was 1.

Bad estimate

A line such as rows output: 2,451,005 est: 2626656323 means that this operator produced approximately 2.4 million rows while the estimate was approximately 2.6 billion rows, so the estimation error factor was 1000.
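The two examples above are consistent with the error factor being the ratio of the larger to the smaller row count, rounded to the nearest power of ten. This is an assumption about how the UI rounds; a minimal sketch:

```python
import math

def estimation_error_factor(actual_rows: int, estimated_rows: int) -> int:
    """Error factor between actual and estimated row counts, rounded to
    the nearest power of ten (assumed rounding rule, not documented)."""
    ratio = max(actual_rows, estimated_rows) / min(actual_rows, estimated_rows)
    return 10 ** round(math.log10(ratio))

# Good estimate from above: 2,451,005 actual vs 1,616,404 estimated
print(estimation_error_factor(2_451_005, 1_616_404))      # -> 1
# Bad estimate from above: 2,451,005 actual vs 2,626,656,323 estimated
print(estimation_error_factor(2_451_005, 2_626_656_323))  # -> 1000
```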

Disable the cost-based optimizer

The CBO is enabled by default. You can disable the CBO by changing the spark.sql.cbo.enabled flag.

spark.conf.set("spark.sql.cbo.enabled", false)
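The same flag can also be set from SQL for the current session:

```sql
-- Disable the cost-based optimizer for this session
SET spark.sql.cbo.enabled = false;
```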