对 SQL 数据仓库中的表进行分区Partitioning tables in SQL Data Warehouse

在 Azure SQL 数据仓库中使用表分区的建议和示例。Recommendations and examples for using table partitions in Azure SQL Data Warehouse.

什么是表分区?What are table partitions?

表分区可将数据分成更小的数据组。Table partitions enable you to divide your data into smaller groups of data. 大多数情况下,表分区都是根据日期列进行创建。In most cases, table partitions are created on a date column. 所有 SQL 数据仓库表类型(包括聚集列存储、聚集索引和堆)都支持分区。Partitioning is supported on all SQL Data Warehouse table types; including clustered columnstore, clustered index, and heap. 所有分布类型(包括哈希分布或轮循机制分布)也都支持分区。Partitioning is also supported on all distribution types, including both hash or round robin distributed.

分区可能有利于数据维护和查询性能。Partitioning can benefit data maintenance and query performance. 分区是对二者都有利还是只对其中之一有利取决于数据加载方式,以及是否可以将同一个列用于两种目的,因为只能根据一个列来进行分区。Whether it benefits both or just one is dependent on how data is loaded and whether the same column can be used for both purposes, since partitioning can only be done on one column.

对加载的好处Benefits to loads

在 SQL 数据仓库中分区的主要好处是通过分区删除、切换和合并来提高数据加载效率和性能。The primary benefit of partitioning in SQL Data Warehouse is to improve the efficiency and performance of loading data by use of partition deletion, switching and merging. 大多数情况下,数据是根据日期列来分区的,而日期列与数据加载到数据库中的顺序密切相关。In most cases data is partitioned on a date column that is closely tied to the order in which the data is loaded into the database. 使用分区来维护数据的最大好处之一是可以避免事务日志记录。One of the greatest benefits of using partitions to maintain data it the avoidance of transaction logging. 虽然直接插入、更新或删除数据可能是最直接的方法,但如果在加载过程中使用分区,则只需付出一点点思考和努力就可以大大改进性能。While simply inserting, updating, or deleting data can be the most straightforward approach, with a little thought and effort, using partitioning during your load process can substantially improve performance.

可以使用分区切换来快速删除或替换表的一部分。Partition switching can be used to quickly remove or replace a section of a table. 例如,销售事实表可能仅包含过去 36 个月的数据。For example, a sales fact table might contain just data for the past 36 months. 在每个月月底,便从表删除最旧月份的销售数据。At the end of every month, the oldest month of sales data is deleted from the table. 删除该数据时,可以使用 delete 语句删除最旧月份的数据。This data could be deleted by using a delete statement to delete the data for the oldest month. 但是,使用 delete 语句逐行删除大量数据可能需要极长的时间,同时还会有执行大型事务的风险,这些大型事务在出现错误时进行回退的时间会很长。However, deleting a large amount of data row-by-row with a delete statement can take too much time, as well as create the risk of large transactions that take a long time to rollback if something goes wrong. 更理想的方式是删除最旧的数据分区。A more optimal approach is to drop the oldest partition of data. 如果在某种情况下删除各个行可能需要数小时,则删除整个分区可能只需数秒钟。Where deleting the individual rows could take hours, deleting an entire partition could take seconds.

对查询的好处Benefits to queries

分区还可用来提高查询性能。Partitioning can also be used to improve query performance. 对分区数据应用筛选器的查询可以将扫描限制在合格的分区上。A query that applies a filter to partitioned data can limit the scan to only the qualifying partitions. 此筛选方法可以避免全表扫描且仅扫描数据的一个较小子集。This method of filtering can avoid a full table scan and only scan a smaller subset of data. 引入聚集列存储索引以后,谓词消除的性能好处不再那么明显,但在某些情况下,可能会对查询有好处。With the introduction of clustered columnstore indexes, the predicate elimination performance benefits are less beneficial, but in some cases there can be a benefit to queries. 例如,如果使用销售日期字段将销售事实表分区成 36 个月,以销售日期进行筛选的查询便可以跳过对不符合筛选条件的分区的搜索。For example, if the sales fact table is partitioned into 36 months using the sales date field, then queries that filter on the sale date can skip searching in partitions that don't match the filter.

调整分区大小Sizing partitions

虽然在某些情况下可以使用分区来改进性能,但如果在创建表时使用 过多 分区,则在某些情况下可能会降低性能。While partitioning can be used to improve performance some scenarios, creating a table with too many partitions can hurt performance under some circumstances. 对于聚集列存储表,尤其要考虑到这一点。These concerns are especially true for clustered columnstore tables. 若要使数据分区有益于性能,务必了解使用数据分区的时机,以及要创建的分区的数目。For partitioning to be helpful, it is important to understand when to use partitioning and the number of partitions to create. 对于多少分区属于分区过多并没有简单的硬性规定,具体取决于数据,以及要同时加载多少分区。There is no hard fast rule as to how many partitions are too many, it depends on your data and how many partitions you loading simultaneously. 一个成功的分区方案通常只有数十到数百的分区,没有数千个。A successful partitioning scheme usually has tens to hundreds of partitions, not thousands.

在“聚集列存储”表上创建分区时,务请考虑每个分区可容纳的行数。When creating partitions on clustered columnstore tables, it is important to consider how many rows belong to each partition. 对于聚集列存储表来说,若要进行最合适的压缩并获得最佳性能,则每个分布和分区至少需要 1 百万行。For optimal compression and performance of clustered columnstore tables, a minimum of 1 million rows per distribution and partition is needed. 在创建分区之前,SQL 数据仓库已将每个表细分到 60 个分布式数据库中。Before partitions are created, SQL Data Warehouse already divides each table into 60 distributed databases. 向表添加的任何分区都是基于在后台创建的分布。Any partitioning added to a table is in addition to the distributions created behind the scenes. 根据此示例,如果销售事实表包含 36 个月的分区,并假设 SQL 数据仓库有 60 个分布区,则销售事实表每个月应包含 6000 万行,或者在填充所有月份时包含 21 亿行。Using this example, if the sales fact table contained 36 monthly partitions, and given that SQL Data Warehouse has 60 distributions, then the sales fact table should contain 60 million rows per month, or 2.1 billion rows when all months are populated. 如果表包含的行数少于每个分区行数的最小建议值,可考虑使用较少的分区,以增加每个分区的行数。If a table contains fewer than the recommended minimum number of rows per partition, consider using fewer partitions in order to increase the number of rows per partition. 有关详细信息,请参阅索引编制一文,其中包含的查询可用于评估群集列存储索引的质量。For more information, see the Indexing article, which includes queries that can assess the quality of cluster columnstore indexes.

与 SQL Server 的语法差异Syntax differences from SQL Server

SQL 数据仓库引入了一种定义比 SQL Server 简单的分区的方法。SQL Data Warehouse introduces a way to define partitions that is simpler than SQL Server. 在 SQL 数据仓库中不像在 SQL Server 中一样使用分区函数和方案。Partitioning functions and schemes are not used in SQL Data Warehouse as they are in SQL Server. 只需识别分区列和边界点。Instead, all you need to do is identify partitioned column and the boundary points. 尽管分区的语法可能与 SQL Server 稍有不同,但基本概念是相同的。While the syntax of partitioning may be slightly different from SQL Server, the basic concepts are the same. SQL Server 和 SQL 数据仓库支持一个表一个分区列,后者可以是范围分区。SQL Server and SQL Data Warehouse support one partition column per table, which can be ranged partition. 若要详细了解分区,请参阅已分区表和已分区索引To learn more about partitioning, see Partitioned Tables and Indexes.

下例使用 CREATE TABLE 语句根据 OrderDateKey 列对 FactInternetSales 表进行分区:The following example uses the CREATE TABLE statement to partition the FactInternetSales table on the OrderDateKey column:

CREATE TABLE [dbo].[FactInternetSales]
(
    [ProductKey]            int          NOT NULL
,   [OrderDateKey]          int          NOT NULL
,   [CustomerKey]           int          NOT NULL
,   [PromotionKey]          int          NOT NULL
,   [SalesOrderNumber]      nvarchar(20) NOT NULL
,   [OrderQuantity]         smallint     NOT NULL
,   [UnitPrice]             money        NOT NULL
,   [SalesAmount]           money        NOT NULL
)
WITH
(   CLUSTERED COLUMNSTORE INDEX
,   DISTRIBUTION = HASH([ProductKey])
,   PARTITION   (   [OrderDateKey] RANGE RIGHT FOR VALUES
                    (20000101,20010101,20020101
                    ,20030101,20040101,20050101
                    )
                )
)
;

从 SQL Server 迁移分区Migrating partitioning from SQL Server

要将 SQL Server 分区定义迁移到 SQL 数据仓库,只需执行以下操作即可:To migrate SQL Server partition definitions to SQL Data Warehouse simply:

如果要从 SQL Server 实例迁移分区的表,则可使用以下 SQL 来算出每个分区中的行数。If you are migrating a partitioned table from a SQL Server instance, the following SQL can help you to figure out the number of rows that in each partition. 请记住,如果在 SQL 数据仓库上使用相同的分区粒度,则每个分区的行数会下降到原来的 1/60。Keep in mind that if the same partitioning granularity is used on SQL Data Warehouse, the number of rows per partition decreases by a factor of 60.

-- Partition information for a SQL Server Database
SELECT      s.[name]                        AS      [schema_name]
,           t.[name]                        AS      [table_name]
,           i.[name]                        AS      [index_name]
,           p.[partition_number]            AS      [partition_number]
,           SUM(a.[used_pages]*8.0)         AS      [partition_size_kb]
,           SUM(a.[used_pages]*8.0)/1024    AS      [partition_size_mb]
,           SUM(a.[used_pages]*8.0)/1048576 AS      [partition_size_gb]
,           p.[rows]                        AS      [partition_row_count]
,           rv.[value]                      AS      [partition_boundary_value]
,           p.[data_compression_desc]       AS      [partition_compression_desc]
FROM        sys.schemas s
JOIN        sys.tables t                    ON      t.[schema_id]         = s.[schema_id]
JOIN        sys.partitions p                ON      p.[object_id]         = t.[object_id]
JOIN        sys.allocation_units a          ON      a.[container_id]      = p.[partition_id]
JOIN        sys.indexes i                   ON      i.[object_id]         = p.[object_id]
                                            AND     i.[index_id]          = p.[index_id]
JOIN        sys.data_spaces ds              ON      ds.[data_space_id]    = i.[data_space_id]
LEFT JOIN   sys.partition_schemes ps        ON      ps.[data_space_id]    = ds.[data_space_id]
LEFT JOIN   sys.partition_functions pf      ON      pf.[function_id]      = ps.[function_id]
LEFT JOIN   sys.partition_range_values rv   ON      rv.[function_id]      = pf.[function_id]
                                            AND     rv.[boundary_id]      = p.[partition_number]
WHERE       p.[index_id] <=1
GROUP BY    s.[name]
,           t.[name]
,           i.[name]
,           p.[partition_number]
,           p.[rows]
,           rv.[value]
,           p.[data_compression_desc]
;

分区切换Partition switching

SQL 数据仓库支持分区拆分、合并和切换。SQL Data Warehouse supports partition splitting, merging, and switching. 这些函数每个都使用 ALTER TABLE 语句执行。Each of these functions is executed using the ALTER TABLE statement.

若要在两个表之间切换分区,必须确保分区对齐其各自的边界,并且表定义匹配。To switch partitions between two tables, you must ensure that the partitions align on their respective boundaries and that the table definitions match. 由于检查约束不可用于强制实施表中的值范围,因此源表必须包含与目标表相同的分区边界。As check constraints are not available to enforce the range of values in a table, the source table must contain the same partition boundaries as the target table. 如果那时分区边界不相同,则分区切换会失败,因为分区元数据不会同步。If the partition boundaries are not then same, then the partition switch will fail as the partition metadata will not be synchronized.

如何拆分包含数据的分区How to split a partition that contains data

使用 CTAS 语句是拆分包含数据的分区的最有效方法。The most efficient method to split a partition that already contains data is to use a CTAS statement. 如果分区表是群集列存储,则表分区必须为空才能拆分。If the partitioned table is a clustered columnstore, then the table partition must be empty before it can be split.

下面的示例创建分区列存储表。The following example creates a partitioned columnstore table. 它将一行插入到每个分区:It inserts one row into each partition:

CREATE TABLE [dbo].[FactInternetSales]
(
        [ProductKey]            int          NOT NULL
    ,   [OrderDateKey]          int          NOT NULL
    ,   [CustomerKey]           int          NOT NULL
    ,   [PromotionKey]          int          NOT NULL
    ,   [SalesOrderNumber]      nvarchar(20) NOT NULL
    ,   [OrderQuantity]         smallint     NOT NULL
    ,   [UnitPrice]             money        NOT NULL
    ,   [SalesAmount]           money        NOT NULL
)
WITH
(   CLUSTERED COLUMNSTORE INDEX
,   DISTRIBUTION = HASH([ProductKey])
,   PARTITION   (   [OrderDateKey] RANGE RIGHT FOR VALUES
                    (20000101
                    )
                )
)
;

INSERT INTO dbo.FactInternetSales
VALUES (1,19990101,1,1,1,1,1,1);
INSERT INTO dbo.FactInternetSales
VALUES (1,20000101,1,1,1,1,1,1);

以下查询使用 sys.partitions 目录视图查找行计数:The following query finds the row count by using the sys.partitions catalog view:

SELECT  QUOTENAME(s.[name])+'.'+QUOTENAME(t.[name]) as Table_name
,       i.[name] as Index_name
,       p.partition_number as Partition_nmbr
,       p.[rows] as Row_count
,       p.[data_compression_desc] as Data_Compression_desc
FROM    sys.partitions p
JOIN    sys.tables     t    ON    p.[object_id]   = t.[object_id]
JOIN    sys.schemas    s    ON    t.[schema_id]   = s.[schema_id]
JOIN    sys.indexes    i    ON    p.[object_id]   = i.[object_Id]
                            AND   p.[index_Id]    = i.[index_Id]
WHERE t.[name] = 'FactInternetSales'
;

以下拆分命令会收到一条错误消息:The following split command receives an error message:

ALTER TABLE FactInternetSales SPLIT RANGE (20010101);

消息 35346,级别 15,状态 1,行 44: ALTER PARTITION 语句的 SPLIT 子句失败,因为分区不为空。Msg 35346, Level 15, State 1, Line 44 SPLIT clause of ALTER PARTITION statement failed because the partition is not empty. 仅当表上存在列存储索引时,才可以拆分空分区。Only empty partitions can be split in when a columnstore index exists on the table. 请考虑在发出 ALTER PARTITION 语句前禁用列存储索引,并在 ALTER PARTITION 完成后重建列存储索引。Consider disabling the columnstore index before issuing the ALTER PARTITION statement, then rebuilding the columnstore index after ALTER PARTITION is complete.

但是,可以使用 CTAS 创建新表以保存数据。However, you can use CTAS to create a new table to hold the data.

CREATE TABLE dbo.FactInternetSales_20000101
    WITH    (   DISTRIBUTION = HASH(ProductKey)
            ,   CLUSTERED COLUMNSTORE INDEX
            ,   PARTITION   (   [OrderDateKey] RANGE RIGHT FOR VALUES
                                (20000101
                                )
                            )
            )
AS
SELECT *
FROM    FactInternetSales
WHERE   1=2
;

由于分区边界已对齐,因此允许切换。As the partition boundaries are aligned, a switch is permitted. 这将给源表留下空白分区可供我们以后拆分。This will leave the source table with an empty partition that you can subsequently split.

ALTER TABLE FactInternetSales SWITCH PARTITION 2 TO  FactInternetSales_20000101 PARTITION 2;

ALTER TABLE FactInternetSales SPLIT RANGE (20010101);

接下来只需使用 CTAS 将数据对齐新的分区边界,然后将数据切换回到主表。All that is left is to align the data to the new partition boundaries using CTAS, and then switch the data back into the main table.

CREATE TABLE [dbo].[FactInternetSales_20000101_20010101]
    WITH    (   DISTRIBUTION = HASH([ProductKey])
            ,   CLUSTERED COLUMNSTORE INDEX
            ,   PARTITION   (   [OrderDateKey] RANGE RIGHT FOR VALUES
                                (20000101,20010101
                                )
                            )
            )
AS
SELECT  *
FROM    [dbo].[FactInternetSales_20000101]
WHERE   [OrderDateKey] >= 20000101
AND     [OrderDateKey] <  20010101
;

ALTER TABLE dbo.FactInternetSales_20000101_20010101 SWITCH PARTITION 2 TO dbo.FactInternetSales PARTITION 2;

完成数据移动后,最好刷新有关目标表的统计信息。Once you have completed the movement of the data, it is a good idea to refresh the statistics on the target table. 更新统计信息可确保统计信息准确地反映数据在其各自的分区中的新分布。Updating statistics ensures the statistics accurately reflect the new distribution of the data in their respective partitions.

UPDATE STATISTICS [dbo].[FactInternetSales];

通过一个步骤将新数据加载到包含数据的分区中Load new data into partitions that contain data in one step

将数据加载到使用分区开关的分区中可以很方便地将新数据暂存在对用户不可见的表中,然后将新数据切换进来。Loading data into partitions with partition switching is a convenient way stage new data in a table that is not visible to users the switch in the new data. 在繁忙且需要处理与分区切换相关联的锁定争用的系统中,这可能很具挑战性。It can be challenging on busy systems to deal with the locking contention associated with partition switching. 在过去,若要清除分区中的现有数据,需要使用 ALTER TABLE 将数据切换出来,To clear out the existing data in a partition, an ALTER TABLE used to be required to switch out the data. 然后需要使用另一 ALTER TABLE 将新数据切换进去。Then another ALTER TABLE was required to switch in the new data. 在 SQL 数据仓库中,支持在 ALTER TABLE 命令中使用 TRUNCATE_TARGET 选项。In SQL Data Warehouse, the TRUNCATE_TARGET option is supported in the ALTER TABLE command. TRUNCATE_TARGETALTER TABLE 命令会使用新数据覆盖分区中的现有数据。With TRUNCATE_TARGET the ALTER TABLE command overwrites existing data in the partition with new data. 下面是一个示例。此示例使用 CTAS 创建一个包含现有数据的新表,插入新数据,然后将所有数据切换回目标表中,覆盖现有的数据。Below is an example which uses CTAS to create a new table with the existing data, inserts new data, then switches all the data back into the target table, overwriting the existing data.

CREATE TABLE [dbo].[FactInternetSales_NewSales]
    WITH    (   DISTRIBUTION = HASH([ProductKey])
            ,   CLUSTERED COLUMNSTORE INDEX
            ,   PARTITION   (   [OrderDateKey] RANGE RIGHT FOR VALUES
                                (20000101,20010101
                                )
                            )
            )
AS
SELECT  *
FROM    [dbo].[FactInternetSales]
WHERE   [OrderDateKey] >= 20000101
AND     [OrderDateKey] <  20010101
;

INSERT INTO dbo.FactInternetSales_NewSales
VALUES (1,20000101,2,2,2,2,2,2);

ALTER TABLE dbo.FactInternetSales_NewSales SWITCH PARTITION 2 TO dbo.FactInternetSales PARTITION 2 WITH (TRUNCATE_TARGET = ON);  

表分区源代码管理Table partitioning source control

若要避免表定义在源代码管理系统中失效,可以考虑以下方法:To avoid your table definition from rusting in your source control system, you may want to consider the following approach:

  1. 将表创建为分区表,但不包含分区值Create the table as a partitioned table but with no partition values

    CREATE TABLE [dbo].[FactInternetSales]
    (
        [ProductKey]            int          NOT NULL
    ,   [OrderDateKey]          int          NOT NULL
    ,   [CustomerKey]           int          NOT NULL
    ,   [PromotionKey]          int          NOT NULL
    ,   [SalesOrderNumber]      nvarchar(20) NOT NULL
    ,   [OrderQuantity]         smallint     NOT NULL
    ,   [UnitPrice]             money        NOT NULL
    ,   [SalesAmount]           money        NOT NULL
    )
    WITH
    (   CLUSTERED COLUMNSTORE INDEX
    ,   DISTRIBUTION = HASH([ProductKey])
    ,   PARTITION   (   [OrderDateKey] RANGE RIGHT FOR VALUES () )
    )
    ;
    
  2. SPLIT 表:SPLIT the table as part of the deployment process:

     -- Create a table containing the partition boundaries
    
    CREATE TABLE #partitions
    WITH
    (
        LOCATION = USER_DB
    ,   DISTRIBUTION = HASH(ptn_no)
    )
    AS
    SELECT  ptn_no
    ,       ROW_NUMBER() OVER (ORDER BY (ptn_no)) as seq_no
    FROM    (
        SELECT CAST(20000101 AS INT) ptn_no
        UNION ALL
        SELECT CAST(20010101 AS INT)
        UNION ALL
        SELECT CAST(20020101 AS INT)
        UNION ALL
        SELECT CAST(20030101 AS INT)
        UNION ALL
        SELECT CAST(20040101 AS INT)
    ) a
    ;
    
     -- Iterate over the partition boundaries and split the table
    
    DECLARE @c INT = (SELECT COUNT(*) FROM #partitions)
    ,       @i INT = 1                                 --iterator for while loop
    ,       @q NVARCHAR(4000)                          --query
    ,       @p NVARCHAR(20)     = N''                  --partition_number
    ,       @s NVARCHAR(128)    = N'dbo'               --schema
    ,       @t NVARCHAR(128)    = N'FactInternetSales' --table
    ;
    
    WHILE @i <= @c
    BEGIN
        SET @p = (SELECT ptn_no FROM #partitions WHERE seq_no = @i);
        SET @q = (SELECT N'ALTER TABLE '+@s+N'.'+@t+N' SPLIT RANGE ('+@p+N');');
    
        -- PRINT @q;
        EXECUTE sp_executesql @q;
        SET @i+=1;
    END
    
     -- Code clean-up
    
    DROP TABLE #partitions;
    

使用这种方法时,源代码管理中的代码保持静态,允许动态的分区边界值,并不断地与仓库一起演进。With this approach the code in source control remains static and the partitioning boundary values are allowed to be dynamic; evolving with the warehouse over time.

后续步骤Next steps

有关开发表的详细信息,请参阅表概述上的文章。For more information about developing tables, see the articles on Table Overview.