在 Synapse SQL 池中设计分布式表的指南Guidance for designing distributed tables in Synapse SQL pool

在 Synapse SQL 池中设计哈希分布式表和轮循机制分布式表的一些建议。Recommendations for designing hash-distributed and round-robin distributed tables in Synapse SQL pools.

本文要求读者熟悉 Synapse SQL 池中的数据分布和数据移动概念。This article assumes you are familiar with data distribution and data movement concepts in Synapse SQL pool.  有关详细信息,请参阅 Azure Synapse Analytics 大规模并行处理 (MPP) 体系结构  For more information, see Azure Synapse Analytics massively parallel processing (MPP) architecture.

什么是分布式表?What is a distributed table?

分布式表显示为单个表,但表中的行实际存储在 60 个分布区中。A distributed table appears as a single table, but the rows are actually stored across 60 distributions. 这些行使用哈希或轮循机制算法进行分布。The rows are distributed with a hash or round-robin algorithm.

哈希分布式表可提高大型事实数据表的查询性能,本文会重点进行介绍。Hash-distributed tables improve query performance on large fact tables, and are the focus of this article. 轮循机制表可用于提高加载速度。Round-robin tables are useful for improving loading speed. 这些设计选择对提高查询和加载性能具有重大影响。These design choices have a significant impact on improving query and loading performance.

另一个表存储选项是跨所有计算节点复制一个小型表。Another table storage option is to replicate a small table across all the Compute nodes. 有关详细信息,请参阅复制表的设计准则For more information, see Design guidance for replicated tables. 若要在这三个选项之间快速选择其一,请参阅表概述中的分布式表。To quickly choose among the three options, see Distributed tables in the tables overview.

在设计表的过程中,尽可能多地了解数据以及数据查询方式。As part of table design, understand as much as possible about your data and how the data is queried.  例如,考虑以下问题:  For example, consider these questions:

  • 表有多大?How large is the table?
  • 表的刷新频率是多少?How often is the table refreshed?
  • Synapse SQL 池中有事实数据表和维度表吗?Do I have fact and dimension tables in a Synapse SQL pool?

哈希分布Hash distributed

哈希分布表通过使用确定性的哈希函数将每一行分配给一个分布区,实现表行的跨计算节点分布。A hash-distributed table distributes table rows across the Compute nodes by using a deterministic hash function to assign each row to one distribution.

分布式表Distributed table

由于相同的值始终哈希处理到相同的分布区,因此,数据仓库本身就具有行位置方面的信息。Since identical values always hash to the same distribution, the data warehouse has built-in knowledge of the row locations. 在 Synapse SQL 池中,利用此信息可最大程度地减少查询期间的数据移动,从而提高查询性能。In Synapse SQL pool this knowledge is used to minimize data movement during queries, which improves query performance.

哈希分布表适用于星型架构中的大型事实数据表。Hash-distributed tables work well for large fact tables in a star schema. 它们可以包含大量行,但仍实现高性能。They can have very large numbers of rows and still achieve high performance. 当然,用户应该了解一些设计注意事项,它们有助于获得分布式系统本应具有的性能。There are, of course, some design considerations that help you to get the performance the distributed system is designed to provide. 本文所述的选择合适的分布列就是其中之一。Choosing a good distribution column is one such consideration that is described in this article.

在以下情况下,考虑使用哈希分布表:Consider using a hash-distributed table when:

  • 磁盘上的表大小超过 2 GB。The table size on disk is more than 2 GB.
  • 对表进行频繁的插入、更新和删除操作。The table has frequent insert, update, and delete operations.

轮循机制分布Round-robin distributed

轮循机制分布表将表行均衡分布在所有分布区中。A round-robin distributed table distributes table rows evenly across all distributions. 将行分配到分布区的过程是随机的。The assignment of rows to distributions is random. 与哈希分布表不同的是,值相等的行不一定分配到相同的分布区。Unlike hash-distributed tables, rows with equal values are not guaranteed to be assigned to the same distribution.

因此,系统有时候需要调用数据移动操作,以便在求解查询前更加合理地组织数据。As a result, the system sometimes needs to invoke a data movement operation to better organize your data before it can resolve a query. 此附加步骤可能使查询变慢。This extra step can slow down your queries. 例如,联接轮循机制表通常需要对行重新进行数据重组,这会给性能带来影响。For example, joining a round-robin table usually requires reshuffling the rows, which is a performance hit.

在以下情况下,考虑对表使用轮循机制分布:Consider using the round-robin distribution for your table in the following scenarios:

  • 在最开始将其用作一个简单的起点,因为该分布是默认选项When getting started as a simple starting point since it is the default
  • 没有明显的联接键时If there is no obvious joining key
  • 没有合适的候选列对表进行哈希分布时If there is no good candidate column for hash distributing the table
  • 表没有与其他表共享通用的联接键时If the table does not share a common join key with other tables
  • 该联接比查询中的其他联接更不重要时If the join is less significant than other joins in the query
  • 表是临时过渡表时When the table is a temporary staging table

教程加载纽约出租车数据提供了将数据加载到轮循机制临时表的示例。The tutorial Load New York taxicab data gives an example of loading data into a round-robin staging table.

选择分布列Choosing a distribution column

哈希分布表中有一个分布列是哈希键。A hash-distributed table has a distribution column that is the hash key. 例如,以下代码会创建一个以 ProductKey 作为分布列的哈希分布表。For example, the following code creates a hash-distributed table with ProductKey as the distribution column.

CREATE TABLE [dbo].[FactInternetSales]
(   [ProductKey]            int          NOT NULL
,   [OrderDateKey]          int          NOT NULL
,   [CustomerKey]           int          NOT NULL
,   [PromotionKey]          int          NOT NULL
,   [SalesOrderNumber]      nvarchar(20) NOT NULL
,   [OrderQuantity]         smallint     NOT NULL
,   [UnitPrice]             money        NOT NULL
,   [SalesAmount]           money        NOT NULL
,  DISTRIBUTION = HASH([ProductKey])

可以更新分布列中存储的数据。Data stored in the distribution column can be updated. 对分布列中的数据进行更新可能导致数据无序操作。Updates to data in the distribution column could result in data shuffle operation.

选择分布列是一个重要的设计决策,因为此列中的值确定行的分布方式。Choosing a distribution column is an important design decision since the values in this column determine how the rows are distributed. 最佳选择取决于多种因素,通常需要对各方面进行权衡。The best choice depends on several factors, and usually involves tradeoffs. 选择分布列后,将无法对其进行更改。Once a distribution column is chosen, you cannot change it.

如果第一次未选择最合适的列,可以使用 CREATE TABLE AS SELECT (CTAS) 重新创建具有不同分布列的表。If you didn't choose the best column the first time, you can use CREATE TABLE AS SELECT (CTAS) to re-create the table with a different distribution column.

选择数据均衡分布的分布列Choose a distribution column with data that distributes evenly

为了获得最佳性能,所有分布区都应当具有大致相同的行数。For best performance, all of the distributions should have approximately the same number of rows. 当一个或多个分布区的行数不相称时,某些分布区会先于其他分布区完成其并行查询部分。When one or more distributions have a disproportionate number of rows, some distributions finish their portion of a parallel query before others. 由于必须等到所有分布区都完成处理,才能完成查询,因此,每个查询的速度取决于最慢分布区的速度。Since the query can't complete until all distributions have finished processing, each query is only as fast as the slowest distribution.

  • 数据倾斜意味着数据未均衡分布在分布区中Data skew means the data is not distributed evenly across the distributions
  • 处理倾斜意味着在运行并行查询时,某些分布区所用的时间比其他分布区长。Processing skew means that some distributions take longer than others when running parallel queries. 数据倾斜时可能会出现这种情况。This can happen when the data is skewed.

若要平衡并行处理,请选择符合以下条件的分布列:To balance the parallel processing, select a distribution column that:

  • 具有许多唯一值。Has many unique values. 分布列可以有一些重复值。The column can have some duplicate values. 但具有相同值的所有行都分配到相同的分布区。However, all rows with the same value are assigned to the same distribution. 由于有 60 个分布区,分布列应具有至少 60 个唯一值。Since there are 60 distributions, the column should have at least 60 unique values. 通常情况下,唯一值的数量要多得多。Usually the number of unique values is much greater.
  • 没有 NULL 值,或者只有几个 NULL 值。Does not have NULLs, or has only a few NULLs. 在极端示例中,如果列中的所有值均为 NULL,则所有行都分配到相同的分布区。For an extreme example, if all values in the column are NULL, all the rows are assigned to the same distribution. 因此,查询处理会向一个分布区倾斜,无法从并行处理中受益。As a result, query processing is skewed to one distribution, and does not benefit from parallel processing.
  • 不是日期列Is not a date column. 同一日期的所有数据都落在相同的分布区。All data for the same date lands in the same distribution. 如果多个用户都筛选同一个日期,则会由 60 个分布区中的 1 个分布区单独执行所有处理工作。If several users are all filtering on the same date, then only 1 of the 60 distributions do all the processing work.

选择能最大程度减少数据移动的分布列Choose a distribution column that minimizes data movement

为了获取正确的查询结果,查询可能将数据从一个计算节点移至另一个计算节点。To get the correct query result queries might move data from one Compute node to another. 当查询对分布式表执行联接和聚合操作时,通常会发生数据移动。Data movement commonly happens when queries have joins and aggregations on distributed tables. 选择能最大程度减少数据移动的分布列,这是优化 Synapse SQL 池性能最重要的策略之一。Choosing a distribution column that helps minimize data movement is one of the most important strategies for optimizing performance of your Synapse SQL pool.

若要最大程度减少数据移动,请选择符合以下条件的分布列:To minimize data movement, select a distribution column that:

  • 用于 JOINGROUP BYDISTINCTOVERHAVING 子句。Is used in JOIN, GROUP BY, DISTINCT, OVER, and HAVING clauses. 当两个大型事实数据表频繁联接时,如果将这两个表分布在某个联接列上,查询性能将得到提升。When two large fact tables have frequent joins, query performance improves when you distribute both tables on one of the join columns. 如果某个表不进行联接操作,则考虑将该表分布在经常出现在 GROUP BY 子句中的列上。When a table is not used in joins, consider distributing the table on a column that is frequently in the GROUP BY clause.
  • 用于 WHERE 子句。Is not used in WHERE clauses. 这可以缩小查询范围,从而不必在所有分布区上运行查询。This could narrow the query to not run on all the distributions.
  • 是日期列。Is not a date column. WHERE 子句通常按日期进行筛选。WHERE clauses often filter by date. 在这种情况下,可能会在少数几个分布区上运行所有处理。When this happens, all the processing could run on only a few distributions.

没有合适分布列时怎么办What to do when none of the columns are a good distribution column

如果列没有足够的不同值用于分步列,可创建一个新列作为一个或多个值的复合。If none of your columns have enough distinct values for a distribution column, you can create a new column as a composite of one or more values. 要避免查询执行时出现数据移动,可在查询中使用复合分布列作为联接列。To avoid data movement during query execution, use the composite distribution column as a join column in queries.

完成哈希分布表的设计后,下一步就是将数据加载到表。Once you design a hash-distributed table, the next step is to load data into the table. 有关加载指南,请参阅加载概述For loading guidance, see Loading overview.

如何判断分布列是否合适How to tell if your distribution column is a good choice

将数据加载到哈希分布表之后,查看行在 60 个分布区中分布的均衡程度。After data is loaded into a hash-distributed table, check to see how evenly the rows are distributed across the 60 distributions. 如果每个分布区的行数相差不超过 10%,性能不会受到明显影响。The rows per distribution can vary up to 10% without a noticeable impact on performance.

确定表是否有数据倾斜现象Determine if the table has data skew

一种快速的数据倾斜检查方法是使用 DBCC PDW_SHOWSPACEUSEDA quick way to check for data skew is to use DBCC PDW_SHOWSPACEUSED. 以下 SQL 代码返回 60 个分布区中每个分布区存储的表行数。The following SQL code returns the number of table rows that are stored in each of the 60 distributions. 为了获得平衡的性能,分布式表中的行应该均衡分布在所有分布区中。For balanced performance, the rows in your distributed table should be spread evenly across all the distributions.

-- Find data skew for a distributed table
DBCC PDW_SHOWSPACEUSED('dbo.FactInternetSales');

若要确定哪些表的数据倾斜率超过 10%,请执行以下操作:To identify which tables have more than 10% data skew:

  1. 创建表概述一文中所示的 dbo.vTableSizes 视图。Create the view dbo.vTableSizes that is shown in the Tables overview article.
  2. 运行以下查询:Run the following query:
select *
from dbo.vTableSizes
where two_part_name in
    select two_part_name
    from dbo.vTableSizes
    where row_count > 0
    group by two_part_name
    having (max(row_count * 1.000) - min(row_count * 1.000))/max(row_count * 1.000) >= .10
order by two_part_name, row_count

检查查询计划中是否有数据移动Check query plans for data movement

合适的分布列可最大程度减少联接和聚合操作中的数据移动。A good distribution column enables joins and aggregations to have minimal data movement. 这会影响联接的编写方式。This affects the way joins should be written. 若要在联接两个哈希分布表时最大程度减少数据移动,联接列中必须有一个是分布列。To get minimal data movement for a join on two hash-distributed tables, one of the join columns needs to be the distribution column. 当两个哈希分布表在一个数据类型相同的分布列上联接时,该联接不需要移动数据。When two hash-distributed tables join on a distribution column of the same data type, the join does not require data movement. 联接可以使用附加列,而不发生数据移动。Joins can use additional columns without incurring data movement.

若要避免在联接过程中移动数据,应遵循以下做法:To avoid data movement during a join:

  • 参与联接的列的相关表必须哈希分布在一个联接列中。The tables involved in the join must be hash distributed on one of the columns participating in the join.
  • 两个表之间联接列的数据类型必须匹配。The data types of the join columns must match between both tables.
  • 必须使用 equals 运算符联接列。The columns must be joined with an equals operator.
  • 联接类型不能是 CROSS JOINThe join type may not be a CROSS JOIN.

若要了解查询是否正在移动数据,可以查看查询计划。To see if queries are experiencing data movement, you can look at the query plan.

解决分布列问题Resolve a distribution column problem

无需解决所有数据倾斜问题。It is not necessary to resolve all cases of data skew. 分布数据就是为了找出将数据倾斜降至最低与将数据移动降至最低两者之间的适当平衡。Distributing data is a matter of finding the right balance between minimizing data skew and data movement. 因为现实中无法每次都做到将数据倾斜和数据移动同时降至最低。It is not always possible to minimize both data skew and data movement. 有时,最少的数据移动带来的好处可能胜过数据倾斜造成的影响。Sometimes the benefit of having the minimal data movement might outweigh the impact of having data skew.

要确定是否应该解决表中的数据偏斜,应该尽可能了解工作负荷中的数据卷和查询。To decide if you should resolve data skew in a table, you should understand as much as possible about the data volumes and queries in your workload. 可以按照查询监视一文中的步骤监视倾斜对查询性能的影响。You can use the steps in the Query monitoring article to monitor the impact of skew on query performance. 具体而言,就是了解在各个分布区上完成大型查询所需的时间。Specifically, look for how long it takes large queries to complete on individual distributions.

由于不能更改现有表中的分布列,因此,解决数据倾斜的典型方法是重新创建具有不同分布列的表。Since you cannot change the distribution column on an existing table, the typical way to resolve data skew is to re-create the table with a different distribution column.

重新创建具有新分布列的表Re-create the table with a new distribution column

本示例使用 CREATE TABLE AS SELECT 重新创建具有不同哈希分布列的表。This example uses CREATE TABLE AS SELECT to re-create a table with a different hash distribution column.

CREATE TABLE [dbo].[FactInternetSales_CustomerKey]
     ,  DISTRIBUTION =  HASH([CustomerKey])
     ,  PARTITION       ( [OrderDateKey] RANGE RIGHT FOR VALUES (   20000101, 20010101, 20020101, 20030101
                                                                ,   20040101, 20050101, 20060101, 20070101
                                                                ,   20080101, 20090101, 20100101, 20110101
                                                                ,   20120101, 20130101, 20140101, 20150101
                                                                ,   20160101, 20170101, 20180101, 20190101
                                                                ,   20200101, 20210101, 20220101, 20230101
                                                                ,   20240101, 20250101, 20260101, 20270101
                                                                ,   20280101, 20290101
FROM    [dbo].[FactInternetSales]
OPTION  (LABEL  = 'CTAS : FactInternetSales_CustomerKey')

--Create statistics on new table
CREATE STATISTICS [ProductKey] ON [FactInternetSales_CustomerKey] ([ProductKey]);
CREATE STATISTICS [OrderDateKey] ON [FactInternetSales_CustomerKey] ([OrderDateKey]);
CREATE STATISTICS [CustomerKey] ON [FactInternetSales_CustomerKey] ([CustomerKey]);
CREATE STATISTICS [PromotionKey] ON [FactInternetSales_CustomerKey] ([PromotionKey]);
CREATE STATISTICS [SalesOrderNumber] ON [FactInternetSales_CustomerKey] ([SalesOrderNumber]);
CREATE STATISTICS [OrderQuantity] ON [FactInternetSales_CustomerKey] ([OrderQuantity]);
CREATE STATISTICS [UnitPrice] ON [FactInternetSales_CustomerKey] ([UnitPrice]);
CREATE STATISTICS [SalesAmount] ON [FactInternetSales_CustomerKey] ([SalesAmount]);

--Rename the tables
RENAME OBJECT [dbo].[FactInternetSales] TO [FactInternetSales_ProductKey];
RENAME OBJECT [dbo].[FactInternetSales_CustomerKey] TO [FactInternetSales];

后续步骤Next steps

若要创建分布式表,请使用以下语句之一:To create a distributed table, use one of these statements: