有关在 Synapse SQL 池中使用复制表的设计指南Design guidance for using replicated tables in Synapse SQL pool

本文提供了有关在 Synapse SQL 池架构中设计复制表的建议。This article gives recommendations for designing replicated tables in your Synapse SQL pool schema. 可以使用这些建议通过减少数据移动和降低查询复杂性来提高查询性能。Use these recommendations to improve query performance by reducing data movement and query complexity.

先决条件Prerequisites

本文假设读者熟悉 SQL 池中的数据分布和数据移动概念。This article assumes you are familiar with data distribution and data movement concepts in SQL pool.   有关详细信息,请参阅体系结构一文。For more information, see the architecture article.

作为表设计的一部分,请尽可能多地去了解你的数据及其查询方式。As part of table design, understand as much as possible about your data and how the data is queried. 例如,请考虑以下问题:For example, consider these questions:

  • 表有多大?How large is the table?
  • 表的刷新频率是多少?How often is the table refreshed?
  • SQL 池数据库中是否包含事实数据表和维度表?Do I have fact and dimension tables in a SQL pool database?

什么是复制的表?What is a replicated table?

复制的表具有可在每个计算节点上访问的完整表副本。A replicated table has a full copy of the table accessible on each Compute node. 复制表后,在执行联接或聚合前将无需在计算节点之间传输数据。Replicating a table removes the need to transfer data among Compute nodes before a join or aggregation. 由于表具有多个副本,因此当表压缩后的大小小于 2 GB 时,复制的表性能最佳。Since the table has multiple copies, replicated tables work best when the table size is less than 2 GB compressed. 2 GB 不是硬性限制。2 GB is not a hard limit. 如果数据为静态数据,不会更改,则可复制更大的表。If the data is static and does not change, you can replicate larger tables.

下图显示了可在每个计算节点上访问的复制表。The following diagram shows a replicated table that is accessible on each Compute node. 在 SQL 池中,复制表完整复制到每个计算节点上的分发数据库。In SQL pool, the replicated table is fully copied to a distribution database on each Compute node.

复制表Replicated table

复制表非常适合星型架构中的维度表。Replicated tables work well for dimension tables in a star schema. 维度表通常联接到事实数据表,后者的分发不同于维度表。Dimension tables are typically joined to fact tables which are distributed differently than the dimension table. 通常情况下,维度的大小让存储并维护多个副本变得可行。Dimensions are usually of a size that makes it feasible to store and maintain multiple copies. 维度存储着不常更改的描述性数据,例如,客户名称和地址以及产品详细信息。Dimensions store descriptive data that changes slowly, such as customer name and address, and product details. 该数据的缓变本性使复制的表不会经历太多的维护。The slowly changing nature of the data leads to less maintenance of the replicated table.

在下列情况下,请考虑使用复制的表:Consider using a replicated table when:

  • 磁盘上的表大小小于 2 GB,无论有多少行。The table size on disk is less than 2 GB, regardless of the number of rows. 若要查明表的大小,可以使用 DBCC PDW_SHOWSPACEUSED 命令:DBCC PDW_SHOWSPACEUSED('ReplTableCandidate')To find the size of a table, you can use the DBCC PDW_SHOWSPACEUSED command: DBCC PDW_SHOWSPACEUSED('ReplTableCandidate').
  • 表用于不采用复制的表时将要求移动数据的联接中。The table is used in joins that would otherwise require data movement. 连接未分布在同一列上的表(如将哈希分布式表连接到轮循机制表)时,需要进行数据移动才能完成此查询。When joining tables that are not distributed on the same column, such as a hash-distributed table to a round-robin table, data movement is required to complete the query. 如果其中一个表较小,请考虑使用复制表。If one of the tables is small, consider a replicated table. 大多数情况下,我们建议使用复制的表而非循环表。We recommend using replicated tables instead of round-robin tables in most cases. 若要查看查询计划中的数据移动操作,请使用 sys.dm_pdw_request_stepsTo view data movement operations in query plans, use sys.dm_pdw_request_steps. BroadcastMoveOperation 是典型的数据移动操作,可通过使用复制的表来消除。The BroadcastMoveOperation is the typical data movement operation that can be eliminated by using a replicated table.

在下列情况下,复制的表可能不会产生最佳查询性能:Replicated tables may not yield the best query performance when:

  • 表具有频繁的插入、更新和删除操作。The table has frequent insert, update, and delete operations. 这些数据操作语言 (DML) 操作要求重新生成复制表。The data manipulation language (DML) operations require a rebuild of the replicated table. 频繁地重新生成会导致性能降低。Rebuilding frequently can cause slower performance.
  • SQL 池数据库会频繁缩放。The SQL pool database is scaled frequently. 缩放 SQL 池数据库会更改计算节点数,这会重新生成复制表。Scaling a SQL pool database changes the number of Compute nodes, which incurs rebuilding the replicated table.
  • 表具有大量列,但数据操作通常仅访问少量的列。The table has a large number of columns, but data operations typically access only a small number of columns. 在这种情况下,与复制整个表相比,将表分发,然后对经常访问的列创建索引可能更为高效。In this scenario, instead of replicating the entire table, it might be more effective to distribute the table, and then create an index on the frequently accessed columns. 当查询需要进行数据移动时,SQL 池仅移动所请求列中的数据。When a query requires data movement, SQL pool only moves data for the requested columns.

将复制的表与简单的查询谓词一起使用Use replicated tables with simple query predicates

在选择分发或复制表之前,请考虑计划对表运行的查询类型。Before you choose to distribute or replicate a table, think about the types of queries you plan to run against the table. 只要可能,Whenever possible,

  • 对于具有简单查询谓词(例如等于或不等于)的查询,请使用复制的表。Use replicated tables for queries with simple query predicates, such as equality or inequality.
  • 对于具有复杂查询谓词(例如 LIKE 或 NOT LIKE)的查询,请使用分布式表。Use distributed tables for queries with complex query predicates, such as LIKE or NOT LIKE.

当工作分布在所有计算节点中时,CPU 密集型查询的性能最佳。CPU-intensive queries perform best when the work is distributed across all of the Compute nodes. 例如,对表的每个行运行计算的查询在分布式表上的性能要好于在复制的表上的性能。For example, queries that run computations on each row of a table perform better on distributed tables than replicated tables. 由于复制的表完整存储在每个计算节点上,因此,针对复制的表的 CPU 密集型查询将针对每个计算节点上的整个表运行。Since a replicated table is stored in full on each Compute node, a CPU-intensive query against a replicated table runs against the entire table on every Compute node. 此额外的计算会降低查询性能。The extra computation can slow query performance.

例如,以下查询具有复杂谓词。For example, this query has a complex predicate. 当数据位于分布式表而非复制的表中时,它运行更快。It runs faster when the data is in a distributed table instead of a replicated table. 在此示例中,数据可以是循环分布式的。In this example, the data can be round-robin distributed.


SELECT EnglishProductName
FROM DimProduct
WHERE EnglishDescription LIKE '%frame%comfortable%'

将现有的循环表转换为复制的表Convert existing round-robin tables to replicated tables

如果已经具有循环表,如果它们满足本文中列出的条件,建议将其转换为复制的表。If you already have round-robin tables, we recommend converting them to replicated tables if they meet the criteria outlined in this article. 与循环表相比,复制的表可以提高性能,因为它们不要求移动数据。Replicated tables improve performance over round-robin tables because they eliminate the need for data movement. 循环表始终要求为联接移动数据。A round-robin table always requires data movement for joins.

此示例使用 CTAS 将 DimSalesTerritory 表更改为复制的表。This example uses CTAS to change the DimSalesTerritory table to a replicated table. 无论 DimSalesTerritory 是哈希分布式表还是循环表,此示例都是可行的。This example works regardless of whether DimSalesTerritory is hash-distributed or round-robin.

CREATE TABLE [dbo].[DimSalesTerritory_REPLICATE]
WITH
  (
    CLUSTERED COLUMNSTORE INDEX,  
    DISTRIBUTION = REPLICATE  
  )  
AS SELECT * FROM [dbo].[DimSalesTerritory]
OPTION  (LABEL  = 'CTAS : DimSalesTerritory_REPLICATE')

-- Switch table names
RENAME OBJECT [dbo].[DimSalesTerritory] to [DimSalesTerritory_old];
RENAME OBJECT [dbo].[DimSalesTerritory_REPLICATE] TO [DimSalesTerritory];

DROP TABLE [dbo].[DimSalesTerritory_old];

循环表与复制的表的查询性能对比示例Query performance example for round-robin versus replicated

复制的表不要求为联接移动任何数据,因为整个表已存在于每个计算节点上。A replicated table does not require any data movement for joins because the entire table is already present on each Compute node. 如果维度表是循环分布式的,则联接会将维度表整个复制到每个计算节点。If the dimension tables are round-robin distributed, a join copies the dimension table in full to each Compute node. 为了移动数据,查询计划包含了一个名为 BroadcastMoveOperation 的操作。To move the data, the query plan contains an operation called BroadcastMoveOperation. 此类数据移动操作会降低查询性能,使用复制的表可以避免。This type of data movement operation slows query performance and is eliminated by using replicated tables. 若要查看查询计划步骤,请使用 sys.dm_pdw_request_steps 系统目录视图。To view query plan steps, use the sys.dm_pdw_request_steps system catalog view.

例如,在针对 AdventureWorks 架构的以下查询中,FactInternetSales 表是哈希分布式的。For example, in following query against the AdventureWorks schema, the FactInternetSales table is hash-distributed. DimDateDimSalesTerritory 表是较小的维度表。The DimDate and DimSalesTerritory tables are smaller dimension tables. 此查询返回 2004 会计年度在北美的总销售额:This query returns the total sales in North America for fiscal year 2004:

SELECT [TotalSalesAmount] = SUM(SalesAmount)
FROM dbo.FactInternetSales s
INNER JOIN dbo.DimDate d
  ON d.DateKey = s.OrderDateKey
INNER JOIN dbo.DimSalesTerritory t
  ON t.SalesTerritoryKey = s.SalesTerritoryKey
WHERE d.FiscalYear = 2004
  AND t.SalesTerritoryGroup = 'North America'

我们已将 DimDateDimSalesTerritory 重新创建为循环表。We re-created DimDate and DimSalesTerritory as round-robin tables. 因此,查询显示了以下查询计划,该计划具有多个广播移动操作:As a result, the query showed the following query plan, which has multiple broadcast move operations:

轮循查询计划

我们将 DimDateDimSalesTerritory 重新创建为复制的表,然后再次运行了查询。We re-created DimDate and DimSalesTerritory as replicated tables, and ran the query again. 得到的查询计划短了很多,而且没有任何广播移动。The resulting query plan is much shorter and does not have any broadcast moves.

复制查询计划

修改复制的表时的性能注意事项Performance considerations for modifying replicated tables

SQL 池通过维护表的主版本来实现复制表。SQL pool implements a replicated table by maintaining a master version of the table. 它将主版本复制到每个计算节点上的第一个分发数据库。It copies the master version to the first distribution database on each Compute node. 发生更改时,会先更新主版本,然后重新生成每个计算节点上的表。When there is a change, the master version is updated first, then the tables on each Compute node are rebuilt. 重新生成复制表包括将表复制到每个计算节点,然后生成索引。A rebuild of a replicated table includes copying the table to each Compute node and then building the indexes. 例如,DW2000c 上的复制表有 5 个数据副本。For example, a replicated table on a DW2000c has 5 copies of the data. 每个计算节点上均存在主控副本和完整副本。A master copy and a full copy on each Compute node. 所有数据均存储在分发数据库中。All data is stored in distribution databases. SQL 池使用此模型来支持更快的数据修改语句和灵活的缩放操作。SQL pool uses this model to support faster data modification statements and flexible scaling operations.

在发生以下情况后,异步重新生成由对复制表的第一次查询触发:Asynchronous rebuilds are triggered by the first query against the replicated table after:

  • 加载或修改了数据Data is loaded or modified
  • Synapse SQL 实例缩放到了其他级别The Synapse SQL instance is scaled to a different level
  • 更新了表定义Table definition is updated

发生下列情况后,不需要重新生成:Rebuilds are not required after:

  • 暂停操作Pause operation
  • 恢复操作Resume operation

在修改数据后,重新生成不会立即进行。The rebuild does not happen immediately after data is modified. 重新生成将在查询首次从表中选择数据时触发。Instead, the rebuild is triggered the first time a query selects from the table. 触发重新生成的查询将立即从表的主版本读取,同时将数据异步复制到每个计算节点。The query that triggered the rebuild reads immediately from the master version of the table while the data is asynchronously copied to each Compute node. 在数据完成复制之前,后续查询将继续使用表的主版本。Until the data copy is complete, subsequent queries will continue to use the master version of the table. 如果强制执行其他重新生成操作的复制表发生任何活动,则数据复制将失效,并且下一个 select 语句将触发再次复制数据操作。If any activity happens against the replicated table that forces another rebuild, the data copy is invalidated and the next select statement will trigger data to be copied again.

谨慎使用索引Use indexes conservatively

标准索引做法适用于复制的表。Standard indexing practices apply to replicated tables. SQL 池在重新生成的过程中重新生成每个复制表索引。SQL pool rebuilds each replicated table index as part of the rebuild. 只有当提升性能比重新生成索引的成本更重要时,才应使用索引。Only use indexes when the performance gain outweighs the cost of rebuilding the indexes.

批量数据加载Batch data load

向复制的表中加载数据时,请尝试通过批量处理加载来尽量减少重新生成。When loading data into replicated tables, try to minimize rebuilds by batching loads together. 请在运行 select 语句之前执行所有批量加载。Perform all the batched loads before running select statements.

例如,以下加载模式从四个源加载数据并调用四个重新生成。For example, this load pattern loads data from four sources and invokes four rebuilds.

  • 从源 1 进行加载。Load from source 1.
  • Select 语句触发重新生成 1。Select statement triggers rebuild 1.
  • 从源 2 进行加载。Load from source 2.
  • Select 语句触发重新生成 2。Select statement triggers rebuild 2.
  • 从源 3 进行加载。Load from source 3.
  • Select 语句触发重新生成 3。Select statement triggers rebuild 3.
  • 从源 4 进行加载。Load from source 4.
  • Select 语句触发重新生成 4。Select statement triggers rebuild 4.

例如,以下加载模式从四个源加载数据,但仅调用一个重新生成。For example, this load pattern loads data from four sources, but only invokes one rebuild.

  • 从源 1 进行加载。Load from source 1.
  • 从源 2 进行加载。Load from source 2.
  • 从源 3 进行加载。Load from source 3.
  • 从源 4 进行加载。Load from source 4.
  • Select 语句触发重新生成。Select statement triggers rebuild.

在批量加载后重新生成复制的表Rebuild a replicated table after a batch load

为了确保查询执行时间一致,建议在批量加载后强制生成复制表。To ensure consistent query execution times, consider forcing the build of the replicated tables after a batch load. 否则,第一个查询仍将使用数据移动来完成查询。Otherwise, the first query will still use data movement to complete the query.

以下查询使用 sys.pdw_replicated_table_cache_state DMV 列出已修改但未重新生成的复制的表。This query uses the sys.pdw_replicated_table_cache_state DMV to list the replicated tables that have been modified, but not rebuilt.

SELECT [ReplicatedTable] = t.[name]
  FROM sys.tables t  
  JOIN sys.pdw_replicated_table_cache_state c  
    ON c.object_id = t.object_id
  JOIN sys.pdw_table_distribution_properties p
    ON p.object_id = t.object_id
  WHERE c.[state] = 'NotReady'
    AND p.[distribution_policy_desc] = 'REPLICATE'

若要触发重新生成,请在上一个输出中的每个表上运行以下语句。To trigger a rebuild, run the following statement on each table in the preceding output.

SELECT TOP 1 * FROM [ReplicatedTable]

后续步骤Next steps

若要创建复制的表,请使用下列语句之一:To create a replicated table, use one of these statements:

有关分布式表的概述,请参阅分布式表For an overview of distributed tables, see distributed tables.