Synapse SQL 池开发最佳做法Development best practices for Synapse SQL pool

本文介绍在开发 SQL 池解决方案时的指导和最佳做法。This article describes guidance and best practices as you develop your SQL pool solution.

使用新的产品增强功能优化查询性能Tune query performance with new product enhancements

使用暂停和缩放来降低成本Reduce cost with pause and scale

若要了解如何通过暂停和缩放来降低成本,请参阅管理计算一文。For more information about reducing costs through pausing and scaling, see the Manage compute article.

维护统计信息Maintain statistics

可将 SQL 池配置为自动检测和创建有关列的统计信息。SQL pool can be configured to automatically detect and create statistics on columns. 优化工具创建的查询计划质量以可用的统计信息为限。The query plans created by the optimizer are only as good as the available statistics.

建议为数据库启用 AUTO_CREATE_STATISTICS,并使统计信息每日更新或者在每次加载后更新,确保查询中使用的有关列的统计信息始终保持最新。We recommend that you enable AUTO_CREATE_STATISTICS for your databases and keep the statistics updated daily or after each load to ensure that statistics on columns used in your queries are always up to date.

如果你发现更新所有统计信息需要太长的时间,可以尝试更精心地选择哪些列的统计信息需要频繁更新。If you find it is taking too long to update all of your statistics, you may want to try to be more selective about which columns need frequent statistics updates. 例如,可能想要更新每天都要添加新值的日期列。For example, you might want to update date columns, where new values may be added, daily.


对于涉及联接的列、WHERE 子句中使用的列、在 GROUP BY 中找到的列,更新其统计信息可以获得最大效益。You will gain the most benefit by having updated statistics on columns involved in joins, columns used in the WHERE clause, and columns found in GROUP BY.


哈希分布大型表Hash distribute large tables

默认情况下,表是以轮循机制分布的。By default, tables are Round Robin distributed. 此设计可让用户更容易开始创建表,而不必确定应该如何分布其表。This design makes it easy for users to get started creating tables without having to decide how their tables should be distributed.

轮循机制表在某些工作负荷中执行良好,但大多数情况下,选择分布列的执行性能将更好。Round Robin tables may perform sufficiently for some workloads, but in most cases selecting a distribution column will perform much better. 按列分布表的性能远远高于轮循机制表的最常见例子是联接两个大型事实表。The most common example of when a table distributed by a column will far outperform a Round Robin table is when two large fact tables are joined.

例如,如果有一个依 order_id 分布的订单表,以及一个也是依 order_id 分布的事务表,如果将订单数据联接到事务表上的 order_id,此查询将变成传递查询,也就是数据移动操作将被消除。For example, if you have an orders table, which is distributed by order_id, and a transactions table, which is also distributed by order_id, when you join your orders table to your transactions table on order_id, this query becomes a pass-through query, which means we eliminate data movement operations. 减少步骤意味着加快查询速度。Fewer steps mean a faster query. 更少的数据移动也会让查询更快。Less data movement also makes for faster queries.

加载分布的表时,请确保传入数据的分布键没有排序,因为这会拖慢加载速度。When loading a distributed table, be sure that your incoming data is not sorted on the distribution key as this will slow down your loads. 以下文章进一步详述了如何选择分布列来提升性能,以及如何在 CREATE TABLES 语句的 WITH 子句中定义分布式表。The articles that follow give further details on improving performance by selecting a distribution column and how to define a distributed table in the WITH clause of your CREATE TABLES statement.

另请参阅表概述表分布选择表分布CREATE TABLECREATE TABLE AS SELECTSee also Table overview, Table distribution, Selecting table distribution, CREATE TABLE, and CREATE TABLE AS SELECT

不要过度分区Do not over-partition

尽管数据分区可以让数据维护变得有效率(通过分区切换或优化扫描将分区消除),太多的分区将让查询变慢。While partitioning data can be effective for maintaining your data through partition switching or optimizing scans by with partition elimination, having too many partitions can slow down your queries.

在 SQL Server 上运行良好的高数据粒度分区策略经常无法在 SQL 池中正常使用。Frequently, a high granularity partitioning strategy that may work well on SQL Server may not work well on SQL pool. 如果每个分区的行数少于 1 百万,太多分区还会降低聚集列存储索引的效率。Having too many partitions can also reduce the effectiveness of clustered columnstore indexes if each partition has fewer than 1 million rows.

请记住,SQL 池在幕后将数据分区成 60 个数据库,因此如果创建包含 100 个分区的表,实际上会生成 6000 个分区。Keep in mind that behind the scenes, SQL pool partitions your data for you into 60 databases, so if you create a table with 100 partitions, this actually results in 6000 partitions. 每个工作负荷都不同,因此最佳建议是尝试不同的分区,找出最适合工作负荷的分区。Each workload is different so the best advice is to experiment with partitioning to see what works best for your workload.


请考虑使用比在 SQL Server 上运行良好的数据粒度更低的粒度。Consider using a lower granularity than what may have worked for you in SQL Server. 例如,考虑使用每周或每月分区,而不是每日分区。For example, consider using weekly or monthly partitions rather than daily partitions.

另请参阅表分区See also Table partitioning.

最小化事务大小Minimize transaction sizes

在事务中运行的 INSERT、UPDATE、DELETE 语句,失败时必须回滚。INSERT, UPDATE, and DELETE statements run in a transaction and when they fail they must be rolled back. 为了将长时间回滚的可能性降到最低,请尽可能将事务大小最小化。To minimize the potential for a long rollback, minimize transaction sizes whenever possible. 这可以通过将 INSERT、UPDATE、DELETE 语句分成小部分来达成。This can be done by dividing INSERT, UPDATE, and DELETE statements into parts.

例如,如果预期 INSERT 需要 1 小时,可能的话,将 INSERT 分成 4 个部分,每个部分运行 15 分钟。For example, if you have an INSERT, which you expect to take 1 hour, if possible, break up the INSERT into four parts, which will each run in 15 minutes. 使用特殊的最低限度日志记录方案,像是 CTAS、TRUNCATE、DROP TABLE 或 INSERT 空表,来降低回滚的风险。Leverage special Minimal Logging cases, like CTAS, TRUNCATE, DROP TABLE, or INSERT to empty tables, to reduce rollback risk.

另一个消除回滚的作法是使用“仅元数据”操作(像是分区切换)进行数据管理。Another way to eliminate rollbacks is to use Metadata Only operations like partition switching for data management. 例如,不要运行 DELETE 语句来删除表中所有 order_date 为 2001 年 10 月的行,而是将数据每月分区后,再从另一个表将有空分区之数据的分区调动出来(请参阅 ALTER TABLE 示例)。For example, rather than execute a DELETE statement to delete all rows in a table where the order_date was in October of 2001, you could partition your data monthly and then switch out the partition with data for an empty partition from another table (see ALTER TABLE examples).

针对未分区的表,请考虑使用 CTAS 将想要保留的数据写入表中,而不是使用 DELETE。For unpartitioned tables, consider using a CTAS to write the data you want to keep in a table rather than using DELETE. 如果 CTAS 需要的时间一样长,则较安全的操作,是在它具有极小事务记录的条件下运行它,且必要时可以快速地取消。If a CTAS takes the same amount of time, it is a much safer operation to run as it has minimal transaction logging and can be canceled quickly if needed.

另请参阅了解事务优化事务表分区TRUNCATE TABLEALTER TABLECreate table as select (CTAS)See also Understanding transactions, Optimizing transactions, Table partitioning, TRUNCATE TABLE, ALTER TABLE, and Create table as select (CTAS).

使用最小可能的列大小Use the smallest possible column size

在定义 DDL 时,使用可支持数据的最小数据类型,能够改善查询性能。When defining your DDL, using the smallest data type that will support your data will improve query performance. 此方法对 CHAR 和 VARCHAR 列特别重要。This approach is especially important for CHAR and VARCHAR columns.

如果列中最长的值是 25 个字符,请将列定义为 VARCHAR(25)。If the longest value in a column is 25 characters, then define your column as VARCHAR(25). 避免将所有字符列定义为较大的默认长度。Avoid defining all character columns to a large default length. 此外,将列定义为 VARCHAR(当它只需要这样的大小时)而非 NVARCHAR。In addition, define columns as VARCHAR when that is all that is needed rather than use NVARCHAR.

另请参阅表概述表数据类型CREATE TABLESee also Table overview, Table data types, and CREATE TABLE.

优化聚集列存储表Optimize clustered columnstore tables

聚集列存储索引是将数据存储在 SQL 池中最有效率的方式之一。Clustered columnstore indexes are one of the most efficient ways you can store your data in SQL pool. 默认情况下,SQL 池中的表会创建为聚集列存储。By default, tables in SQL pool are created as Clustered ColumnStore.


为了让列存储表的查询实现最佳性能,良好的分段质量很重要。To achieve optimal performance for queries on columnstore tables, having good segment quality is important.

当行在内存不足的状态下写入列存储表时,列存储分段质量可能降低。When rows are written to columnstore tables under memory pressure, columnstore segment quality may suffer. 压缩行组中的行数可以测量分段质量。Segment quality can be measured by number of rows in a compressed Row Group.

有关检测和改善聚集列存储表分段质量的分步说明,请参阅表索引一文中的列存储索引质量不佳的原因See the Causes of poor columnstore index quality in the Table indexes article for step-by-step instructions on detecting and improving segment quality for clustered columnstore tables.

由于高质量列存储段很重要,因此可以考虑使用用来加载数据的中型或大型资源类中的用户 ID。Because high-quality columnstore segments are important, it's a good idea to use users IDs that are in the medium or large resource class for loading data. 使用较低的数据仓库单位值意味着需要向加载用户分配较大的资源类。Using lower data warehouse units means you want to assign a larger resource class to your loading user.

列存储表通常要等到每个表有超过 1 百万个行之后才会数据推送到压缩的列存储区段,并且每个 SQL 池表会分区成 60 个表。因此,一般情况下,列存储表对于查询没有好处,除非表有超过 6 千万行。Since columnstore tables generally won't push data into a compressed columnstore segment until there are more than 1 million rows per table and each SQL pool table is partitioned into 60 tables, generally, columnstore tables won't benefit a query unless the table has more than 60 million rows.

小于 6 千万列的表使用列存储索引似乎不太合理,For a table with less than 60 million rows, it may not make any sense to have a columnstore index. 但也无伤大雅。It also may not hurt.

此外,如果要进行数据分区,则要考虑的是每个分区必须有 1 百万个行,使用聚集列存储索引才有益。Furthermore, if you partition your data, then you will want to consider that each partition will need to have 1 million rows to benefit from a clustered columnstore index. 如果表有 100 个分区,则它至少必须拥有 60 亿个行才会受益于聚集列存储(60 个分布区 100 个分区 1 百万行)。If a table has 100 partitions, then it will need to have at least 6 billion rows to benefit from a clustered columns store (60 distributions 100 partitions 1 million rows).

如果表在本示例中并没有 60 亿个行,请减少分区数目,或考虑改用堆表。If your table does not have 6 billion rows in this example, either reduce the number of partitions or consider using a heap table instead. 使用辅助索引配合堆表而不是列存储表,也可能是值得进行的实验,看看是否可以获得较佳的性能。It also may be worth experimenting to see if better performance can be gained with a heap table with secondary indexes rather than a columnstore table.


查询列存储表时,如果只选择需要的列,查询运行速度更快。When querying a columnstore table, queries will run faster if you select only the columns you need.

另请参阅表索引列存储索引指南重新生成列存储索引See also Table indexes, Columnstore indexes guide, and Rebuilding columnstore indexes.

后续步骤Next steps

如果在本文中没有找到所需内容,可尝试使用本页左侧的“搜索文档”来搜索所有 Azure Synapse 文档。If you don't find what you are looking for in this article, try using the "Search for docs" on the left side of this page to search all of the Azure Synapse documents.

Azure Synapse 论坛中,可以向其他用户和 Azure Synapse 产品小组提问。The Azure Synapse Forum is a place for you to post questions to other users and to the Azure Synapse Product Group. 我们会主动观察此论坛,确保用户的问题获得其他用户或我们的回答。We actively monitor this forum to ensure that your questions are answered either by another user or one of us.