在 Synapse SQL 中优化事务Optimizing transactions in Synapse SQL

了解如何在 Synapse SQL 中优化事务代码的性能,同时最大程度地减少长时间回退的风险。Learn how to optimize the performance of your transactional code in Synapse SQL while minimizing risk for long rollbacks.

事务和日志记录Transactions and logging

事务是关系数据库引擎的一个重要组成部分。Transactions are an important component of a relational database engine. 在数据修改期间使用事务。Transactions are used during data modification. 这些事务可以是显式或隐式。These transactions can be explicit or implicit. 单个 INSERT、UPDATE 和 DELETE 语句都是隐式事务的示例。Single INSERT, UPDATE, and DELETE statements are all examples of implicit transactions. 显式事务使用 BEGIN TRAN、COMMIT TRAN 或 ROLLBACK TRAN。Explicit transactions use BEGIN TRAN, COMMIT TRAN, or ROLLBACK TRAN. 显式事务通常用于多个修改语句需要绑定在单个原子单元的时候。Explicit transactions are typically used when multiple modification statements need to be tied together in a single atomic unit.

使用事务日志跟踪对数据库的更改。Changes to the database are tracked using transaction logs. 每个分布区都具有其自己的事务日志。Each distribution has its own transaction log. 事务日志写入都是自动的。Transaction log writes are automatic. 无需任何配置。There is no configuration required. 尽管此过程可保证写入,但它确在系统中引入一项开销。However, whilst this process guarantees the write it does introduce an overhead in the system. 编写事务性高效的代码,可以尽量减少这种影响。You can minimize this impact by writing transactionally efficient code. 事务性高效的代码大致分为两类。Transactionally efficient code broadly falls into two categories.

  • 尽可能使用最少日志记录构造Use minimal logging constructs whenever possible
  • 使用限定范围的批来处理数据,避免单数形式的长时运行事务Process data using scoped batches to avoid singular long running transactions
  • 对于给定分区的大型修改,采用分区切换模式Adopt a partition switching pattern for large modifications to a given partition

最少日志记录与完整日志记录Minimal vs. full logging

完整记录的操作使用事务日志来跟踪每个行更改,而最少记录的操作只跟踪扩展分配和元数据更改。Unlike fully logged operations, which use the transaction log to keep track of every row change, minimally logged operations keep track of extent allocations and meta-data changes only. 因此,最少日志记录涉及只记录失败或显式请求 (ROLLBACK TRAN) 后回退事务所需的信息。Therefore, minimal logging involves logging only the information that is required to roll back the transaction after a failure, or for an explicit request (ROLLBACK TRAN). 由于事务日志中跟踪的信息少得多,因此最少记录操作的执行性能优于同样大小的完整记录操作。As much less information is tracked in the transaction log, a minimally logged operation performs better than a similarly sized fully logged operation. 此外,事务日志中写入较少,因此生成的日志数据量也更少,并且 I/O 更高效。Furthermore, because fewer writes go the transaction log, a much smaller amount of log data is generated and so is more I/O efficient.

事务安全限制仅适用于完整记录的操作。The transaction safety limits only apply to fully logged operations.

备注

最少记录的操作可以参与显式事务。Minimally logged operations can participate in explicit transactions. 分配结构中的所有更改都被跟踪,因此实现回滚最少记录的操作变得可能。As all changes in allocation structures are tracked, it is possible to roll back minimally logged operations.

最少记录的操作Minimally logged operations

以下操作可以实现最少记录:The following operations are capable of being minimally logged:

  • CREATE TABLE AS SELECT (CTAS)CREATE TABLE AS SELECT (CTAS)
  • INSERT..SELECTINSERT..SELECT
  • CREATE INDEXCREATE INDEX
  • ALTER INDEX REBUILDALTER INDEX REBUILD
  • DROP INDEXDROP INDEX
  • TRUNCATE TABLETRUNCATE TABLE
  • DROP TABLEDROP TABLE
  • ALTER TABLE SWITCH PARTITIONALTER TABLE SWITCH PARTITION

备注

内部数据移动操作(例如 BROADCAST 和 SHUFFLE)不受事务安全限制影响。Internal data movement operations (such as BROADCAST and SHUFFLE) are not affected by the transaction safety limit.

带批量加载的最少日志记录Minimal logging with bulk load

CTAS 和 INSERT...SELECT 都是批量加载操作。CTAS and INSERT...SELECT are both bulk load operations. 但两者都受目标表定义影响,并且取决于加载方案。However, both are influenced by the target table definition and depend on the load scenario. 下表说明了什么时候批量操作是完整记录的或最少记录的:The following table explains when bulk operations are fully or minimally logged:

主索引Primary Index 加载方案Load Scenario 日志记录模式Logging Mode
Heap 任意Any 最少Minimal
聚集索引Clustered Index 空目标表Empty target table 最少Minimal
聚集索引Clustered Index 加载的行不与目标中现有页面重叠Loaded rows do not overlap with existing pages in target 最少Minimal
聚集索引Clustered Index 加载的行与目标中现有页面重叠Loaded rows overlap with existing pages in target 完整Full
聚集列存储索引Clustered Columnstore Index 批大小 >= 102,400/每分区对齐的分布区Batch size >= 102,400 per partition aligned distribution 最少Minimal
聚集列存储索引Clustered Columnstore Index 批大小 < 102,400/每分区对齐的分布区Batch size < 102,400 per partition aligned distribution 完整Full

值得注意的是,任何更新辅助或非聚集索引的写入都会始终是完整记录的操作。It is worth noting that any writes to update secondary or non-clustered indexes will always be fully logged operations.

重要

Synapse SQL 池数据库具有 60 个分布区。A Synapse SQL pool database has 60 distributions. 因此,假设所有行均匀分布且处于单个分区中,批在写入到聚集列存储索引时会需有 6,144,000 行(或更多)要按最少记录的方式记入日志。Therefore, assuming all rows are evenly distributed and landing in a single partition, your batch will need to contain 6,144,000 rows or larger to be minimally logged when writing to a Clustered Columnstore Index. 如果对表进行分区且正插入的行跨越分区边界,则每个分区边界都需 6,144,000 行,假定数据分布很均匀。If the table is partitioned and the rows being inserted span partition boundaries, then you will need 6,144,000 rows per partition boundary assuming even data distribution. 每个分布区的每个分区各自必须超过 102,400 行的阈值,从而使插入以最少记录的方式记录到分布区中。Each partition in each distribution must independently exceed the 102,400 row threshold for the insert to be minimally logged into the distribution.

将数据加载到含聚集索引的非空表通常可以包含完整记录和最少记录的行的组合。Loading data into a non-empty table with a clustered index can often contain a mixture of fully logged and minimally logged rows. 聚集索引是页面的平衡树 (b-tree)。A clustered index is a balanced tree (b-tree) of pages. 如果正写入的页面已包含其他事务中的行,则这些写入操作会被完整记录。If the page being written to already contains rows from another transaction, then these writes will be fully logged. 但如果该页面为空,则写入到该页面会按最少记录的方式记录。However, if the page is empty then the write to that page will be minimally logged.

优化删除Optimizing deletes

DELETE 是一个完整记录的操作。DELETE is a fully logged operation. 如果需要删除表或分区中的大量数据, SELECT 要保留的数据通常更有意义,其可作为最少记录的操作来运行。If you need to delete a large amount of data in a table or a partition, it often makes more sense to SELECT the data you wish to keep, which can be run as a minimally logged operation. 若要选择数据,可使用 CTAS 创建新表。To select the data, create a new table with CTAS. 创建后,可通过 RENAME 操作使用新创建的表将旧表交换出来。Once created, use RENAME to swap out your old table with the newly created table.

-- Delete all sales transactions for Promotions except PromotionKey 2.

--Step 01. Create a new table select only the records we want to kep (PromotionKey 2)
CREATE TABLE [dbo].[FactInternetSales_d]
WITH
(    CLUSTERED COLUMNSTORE INDEX
,    DISTRIBUTION = HASH([ProductKey])
,     PARTITION     (    [OrderDateKey] RANGE RIGHT
                                    FOR VALUES    (    20000101, 20010101, 20020101, 20030101, 20040101, 20050101
                                                ,    20060101, 20070101, 20080101, 20090101, 20100101, 20110101
                                                ,    20120101, 20130101, 20140101, 20150101, 20160101, 20170101
                                                ,    20180101, 20190101, 20200101, 20210101, 20220101, 20230101
                                                ,    20240101, 20250101, 20260101, 20270101, 20280101, 20290101
                                                )
)
AS
SELECT     *
FROM     [dbo].[FactInternetSales]
WHERE    [PromotionKey] = 2
OPTION (LABEL = 'CTAS : Delete')
;

--Step 02. Rename the Tables to replace the
RENAME OBJECT [dbo].[FactInternetSales]   TO [FactInternetSales_old];
RENAME OBJECT [dbo].[FactInternetSales_d] TO [FactInternetSales];

优化更新Optimizing updates

UPDATE 是一个完整记录的操作。UPDATE is a fully logged operation. 如果需要更新表或分区中的大量行,通常更有效的方法是使用最少记录的操作(如 CTAS)来实现。If you need to update a large number of rows in a table or a partition, it can often be far more efficient to use a minimally logged operation such as CTAS to do so.

在下面的示例中,完整的表更新已转换为 CTAS,以便使最少日志记录成为可能。In the example below a full table update has been converted to a CTAS so that minimal logging is possible.

在此示例中,我们回顾性地向表中的销售额添加折扣金额:In this case, we are retrospectively adding a discount amount to the sales in the table:

--Step 01. Create a new table containing the "Update".
CREATE TABLE [dbo].[FactInternetSales_u]
WITH
(    CLUSTERED INDEX
,    DISTRIBUTION = HASH([ProductKey])
,     PARTITION     (    [OrderDateKey] RANGE RIGHT
                                    FOR VALUES    (    20000101, 20010101, 20020101, 20030101, 20040101, 20050101
                                                ,    20060101, 20070101, 20080101, 20090101, 20100101, 20110101
                                                ,    20120101, 20130101, 20140101, 20150101, 20160101, 20170101
                                                ,    20180101, 20190101, 20200101, 20210101, 20220101, 20230101
                                                ,    20240101, 20250101, 20260101, 20270101, 20280101, 20290101
                                                )
                )
)
AS
SELECT
    [ProductKey]  
,    [OrderDateKey]
,    [DueDateKey]  
,    [ShipDateKey]
,    [CustomerKey]
,    [PromotionKey]
,    [CurrencyKey]
,    [SalesTerritoryKey]
,    [SalesOrderNumber]
,    [SalesOrderLineNumber]
,    [RevisionNumber]
,    [OrderQuantity]
,    [UnitPrice]
,    [ExtendedAmount]
,    [UnitPriceDiscountPct]
,    ISNULL(CAST(5 as float),0) AS [DiscountAmount]
,    [ProductStandardCost]
,    [TotalProductCost]
,    ISNULL(CAST(CASE WHEN [SalesAmount] <=5 THEN 0
         ELSE [SalesAmount] - 5
         END AS MONEY),0) AS [SalesAmount]
,    [TaxAmt]
,    [Freight]
,    [CarrierTrackingNumber]
,    [CustomerPONumber]
FROM    [dbo].[FactInternetSales]
OPTION (LABEL = 'CTAS : Update')
;

--Step 02. Rename the tables
RENAME OBJECT [dbo].[FactInternetSales]   TO [FactInternetSales_old];
RENAME OBJECT [dbo].[FactInternetSales_u] TO [FactInternetSales];

--Step 03. Drop the old table
DROP TABLE [dbo].[FactInternetSales_old]

备注

重新创建大型表可以受益于使用 Synapse SQL 池工作负载管理功能。Re-creating large tables can benefit from using Synapse SQL pool workload management features. 有关详细信息,请参阅用于工作负荷管理的资源类For more information, see Resource classes for workload management.

使用分区切换进行优化Optimizing with partition switching

面对表分区内较大规模修改时,分区切换模式非常有用。If faced with large-scale modifications inside a table partition, then a partition switching pattern makes sense. 如果数据修改非常重要且跨越多个分区,则遍历分区可获得相同的结果。If the data modification is significant and spans multiple partitions, then iterating over the partitions achieves the same result.

执行分区切换的步骤如下所示:The steps to perform a partition switch are as follows:

  1. 创建清空的分区Create an empty out partition
  2. 采用 CTAS 执行“update”Perform the 'update' as a CTAS
  3. 将现有数据转出到输出表Switch out the existing data to the out table
  4. 转入新数据Switch in the new data
  5. 清理数据Clean up the data

但是,若要帮助确定要切换的分区,请创建以下帮助器过程。However, to help identify the partitions to switch, create the following helper procedure.

CREATE PROCEDURE dbo.partition_data_get
    @schema_name           NVARCHAR(128)
,    @table_name               NVARCHAR(128)
,    @boundary_value           INT
AS
IF OBJECT_ID('tempdb..#ptn_data') IS NOT NULL
BEGIN
    DROP TABLE #ptn_data
END
CREATE TABLE #ptn_data
WITH    (    DISTRIBUTION = ROUND_ROBIN
        ,    HEAP
        )
AS
WITH CTE
AS
(
SELECT     s.name                            AS [schema_name]
,        t.name                            AS [table_name]
,         p.partition_number                AS [ptn_nmbr]
,        p.[rows]                        AS [ptn_rows]
,        CAST(r.[value] AS INT)            AS [boundary_value]
FROM        sys.schemas                    AS s
JOIN        sys.tables                    AS t    ON  s.[schema_id]        = t.[schema_id]
JOIN        sys.indexes                    AS i    ON     t.[object_id]        = i.[object_id]
JOIN        sys.partitions                AS p    ON     i.[object_id]        = p.[object_id]
                                                AND i.[index_id]        = p.[index_id]
JOIN        sys.partition_schemes        AS h    ON     i.[data_space_id]    = h.[data_space_id]
JOIN        sys.partition_functions        AS f    ON     h.[function_id]        = f.[function_id]
LEFT JOIN    sys.partition_range_values    AS r     ON     f.[function_id]        = r.[function_id]
                                                AND r.[boundary_id]        = p.[partition_number]
WHERE i.[index_id] <= 1
)
SELECT    *
FROM    CTE
WHERE    [schema_name]        = @schema_name
AND        [table_name]        = @table_name
AND        [boundary_value]    = @boundary_value
OPTION (LABEL = 'dbo.partition_data_get : CTAS : #ptn_data')
;
GO

此过程可最大程度地重复使用代码,并使分区切换示例更紧凑。This procedure maximizes code reuse and keeps the partition switching example more compact.

下面的代码演示上述步骤,以实现完整的分区切换例程。The following code demonstrates the steps mentioned previously to achieve a full partition switching routine.

--Create a partitioned aligned empty table to switch out the data
IF OBJECT_ID('[dbo].[FactInternetSales_out]') IS NOT NULL
BEGIN
    DROP TABLE [dbo].[FactInternetSales_out]
END

CREATE TABLE [dbo].[FactInternetSales_out]
WITH
(    DISTRIBUTION = HASH([ProductKey])
,    CLUSTERED COLUMNSTORE INDEX
,     PARTITION     (    [OrderDateKey] RANGE RIGHT
                                    FOR VALUES    (    20020101, 20030101
                                                )
                )
)
AS
SELECT *
FROM    [dbo].[FactInternetSales]
WHERE 1=2
OPTION (LABEL = 'CTAS : Partition Switch IN : UPDATE')
;

--Create a partitioned aligned table and update the data in the select portion of the CTAS
IF OBJECT_ID('[dbo].[FactInternetSales_in]') IS NOT NULL
BEGIN
    DROP TABLE [dbo].[FactInternetSales_in]
END

CREATE TABLE [dbo].[FactInternetSales_in]
WITH
(    DISTRIBUTION = HASH([ProductKey])
,    CLUSTERED COLUMNSTORE INDEX
,     PARTITION     (    [OrderDateKey] RANGE RIGHT
                                    FOR VALUES    (    20020101, 20030101
                                                )
                )
)
AS
SELECT
    [ProductKey]  
,    [OrderDateKey]
,    [DueDateKey]  
,    [ShipDateKey]
,    [CustomerKey]
,    [PromotionKey]
,    [CurrencyKey]
,    [SalesTerritoryKey]
,    [SalesOrderNumber]
,    [SalesOrderLineNumber]
,    [RevisionNumber]
,    [OrderQuantity]
,    [UnitPrice]
,    [ExtendedAmount]
,    [UnitPriceDiscountPct]
,    ISNULL(CAST(5 as float),0) AS [DiscountAmount]
,    [ProductStandardCost]
,    [TotalProductCost]
,    ISNULL(CAST(CASE WHEN [SalesAmount] <=5 THEN 0
         ELSE [SalesAmount] - 5
         END AS MONEY),0) AS [SalesAmount]
,    [TaxAmt]
,    [Freight]
,    [CarrierTrackingNumber]
,    [CustomerPONumber]
FROM    [dbo].[FactInternetSales]
WHERE    OrderDateKey BETWEEN 20020101 AND 20021231
OPTION (LABEL = 'CTAS : Partition Switch IN : UPDATE')
;

--Use the helper procedure to identify the partitions
--The source table
EXEC dbo.partition_data_get 'dbo','FactInternetSales',20030101
DECLARE @ptn_nmbr_src INT = (SELECT ptn_nmbr FROM #ptn_data)
SELECT @ptn_nmbr_src

--The "in" table
EXEC dbo.partition_data_get 'dbo','FactInternetSales_in',20030101
DECLARE @ptn_nmbr_in INT = (SELECT ptn_nmbr FROM #ptn_data)
SELECT @ptn_nmbr_in

--The "out" table
EXEC dbo.partition_data_get 'dbo','FactInternetSales_out',20030101
DECLARE @ptn_nmbr_out INT = (SELECT ptn_nmbr FROM #ptn_data)
SELECT @ptn_nmbr_out

--Switch the partitions over
DECLARE @SQL NVARCHAR(4000) = '
ALTER TABLE [dbo].[FactInternetSales]    SWITCH PARTITION '+CAST(@ptn_nmbr_src AS VARCHAR(20))    +' TO [dbo].[FactInternetSales_out] PARTITION '    +CAST(@ptn_nmbr_out AS VARCHAR(20))+';
ALTER TABLE [dbo].[FactInternetSales_in] SWITCH PARTITION '+CAST(@ptn_nmbr_in AS VARCHAR(20))    +' TO [dbo].[FactInternetSales] PARTITION '        +CAST(@ptn_nmbr_src AS VARCHAR(20))+';'
EXEC sp_executesql @SQL

--Perform the clean-up
TRUNCATE TABLE dbo.FactInternetSales_out;
TRUNCATE TABLE dbo.FactInternetSales_in;

DROP TABLE dbo.FactInternetSales_out
DROP TABLE dbo.FactInternetSales_in
DROP TABLE #ptn_data

使用小批量尽量减少日志记录Minimize logging with small batches

对于大型数据修改操作,将操作划分为区块或批次来界定工作单元很有效。For large data modification operations, it may make sense to divide the operation into chunks or batches to scope the unit of work.

以下代码是一个可行的示例。A following code is a working example. 批大小已设置为一个简单的数字来突显该方法。The batch size has been set to a trivial number to highlight the technique. 实际中批大小会变得非常大。In reality, the batch size would be significantly larger.

SET NO_COUNT ON;
IF OBJECT_ID('tempdb..#t') IS NOT NULL
BEGIN
    DROP TABLE #t;
    PRINT '#t dropped';
END

CREATE TABLE #t
WITH    (    DISTRIBUTION = ROUND_ROBIN
        ,    HEAP
        )
AS
SELECT    ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS seq_nmbr
,        SalesOrderNumber
,        SalesOrderLineNumber
FROM    dbo.FactInternetSales
WHERE    [OrderDateKey] BETWEEN 20010101 and 20011231
;

DECLARE    @seq_start        INT = 1
,        @batch_iterator    INT = 1
,        @batch_size        INT = 50
,        @max_seq_nmbr    INT = (SELECT MAX(seq_nmbr) FROM dbo.#t)
;

DECLARE    @batch_count    INT = (SELECT CEILING((@max_seq_nmbr*1.0)/@batch_size))
,        @seq_end        INT = @batch_size
;

SELECT COUNT(*)
FROM    dbo.FactInternetSales f

PRINT 'MAX_seq_nmbr '+CAST(@max_seq_nmbr AS VARCHAR(20))
PRINT 'MAX_Batch_count '+CAST(@batch_count AS VARCHAR(20))

WHILE    @batch_iterator <= @batch_count
BEGIN
    DELETE
    FROM    dbo.FactInternetSales
    WHERE EXISTS
    (
            SELECT    1
            FROM    #t t
            WHERE    seq_nmbr BETWEEN  @seq_start AND @seq_end
            AND        FactInternetSales.SalesOrderNumber        = t.SalesOrderNumber
            AND        FactInternetSales.SalesOrderLineNumber    = t.SalesOrderLineNumber
    )
    ;

    SET @seq_start = @seq_end
    SET @seq_end = (@seq_start+@batch_size);
    SET @batch_iterator +=1;
END

暂停和缩放指南Pause and scaling guidance

Synapse SQL 允许按需暂停、恢复和缩放 SQL 池。Synapse SQL lets you pause, resume, and scale your SQL pool on demand. 暂停或缩放 SQL 池时,必须了解任何正在运行的事务都会立即终止;这会导致任何打开的事务回退。When you pause or scale your SQL pool, it is important to understand that any in-flight transactions are terminated immediately; causing any open transactions to be rolled back. 如果工作负荷在暂停或缩放操作前已发出数据修改在长时间运行之后仍未完成的指示,则需要撤消此项工作。If your workload had issued a long running and incomplete data modification prior to the pause or scale operation, then this work will need to be undone. 此撤消操作可能会影响暂停或缩放SQL 池所需的时间。This undoing might impact the time it takes to pause or scale your SQL pool.

重要

UPDATEDELETE 都是完整记录的操作,因此这些撤消/重做操作相比同等最少记录的操作可能要花费更长的时间。Both UPDATE and DELETE are fully logged operations and so these undo/redo operations can take significantly longer than equivalent minimally logged operations.

最佳方案是在暂停或缩放 SQL 池之前完成正在执行的数据修改事务。The best scenario is to let in flight data modification transactions complete prior to pausing or scaling SQL pool. 但是,此方案不一定始终可行。However, this scenario might not always be practical. 若要降低长时间回退的风险,请考虑以下选项之一:To mitigate the risk of a long rollback, consider one of the following options:

  • 使用 CTAS 重新编写长时间运行的操作Rewrite long running operations using CTAS
  • 将该操作分解为多个块;针对行的子集进行操作Break the operation into chunks; operating on a subset of the rows

后续步骤Next steps

请参阅 Synapse SQL 中的事务,详细了解隔离级别和事务限制。See Transactions in Synapse SQL to learn more about isolation levels and transactional limits. 有关其他最佳实践的概述,请参阅 SQL 数据仓库最佳实践For an overview of other Best Practices, see SQL Data Warehouse Best Practices.