Best practices for Synapse SQL pool in Azure Synapse Analytics (formerly SQL DW)

This article is a collection of best practices to help you achieve optimal performance from your SQL pool deployment. Its purpose is to give you some basic guidance and highlight important areas of focus.

Reduce cost with pause and scale

For more information about reducing costs through pausing and scaling, see Manage compute.

Maintain statistics

SQL pool can be configured to automatically detect and create statistics on columns. The query plans created by the optimizer are only as good as the available statistics.

We recommend that you enable AUTO_CREATE_STATISTICS for your databases and keep statistics updated daily or after each load to ensure that statistics on columns used in your queries are always up to date.

If you find it is taking too long to update all of your statistics, you may want to be more selective about which columns need frequent statistics updates. For example, you might want to update date columns, where new values may be added, daily.
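As a brief sketch (the database, table, and statistics names are hypothetical), enabling automatic statistics creation and selectively refreshing a frequently changing column might look like this:

```sql
-- Let SQL pool create single-column statistics automatically.
ALTER DATABASE MySqlPool SET AUTO_CREATE_STATISTICS ON;

-- After each load, refresh only the statistics that go stale fastest,
-- such as those on a date column that receives new values daily.
UPDATE STATISTICS dbo.FactOrders (stats_order_date);

-- Or refresh every statistics object on the table when time allows.
UPDATE STATISTICS dbo.FactOrders;
```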

Tip

You will gain the most benefit by having updated statistics on columns involved in joins, columns used in the WHERE clause, and columns found in GROUP BY.

See also Manage table statistics, CREATE STATISTICS, and UPDATE STATISTICS.

Use DMVs to monitor and optimize your queries

SQL pool has several DMVs that can be used to monitor query execution. The Monitor your workload using DMVs article details step-by-step instructions on how to look at the details of an executing query.

To quickly find queries in these DMVs, it helps to use the LABEL option with your queries.
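For example (the label text and table name are illustrative), tag a query with a label and then look it up by that label:

```sql
-- Tag the query so it is easy to find in the DMVs.
SELECT COUNT(*)
FROM dbo.FactOrders
OPTION (LABEL = 'Query01: count fact rows');

-- Locate the tagged request and check its status.
SELECT request_id, status, total_elapsed_time
FROM sys.dm_pdw_exec_requests
WHERE [label] = 'Query01: count fact rows';
```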

See also Monitor your workload using DMVs, LABEL, OPTION, sys.dm_exec_sessions, sys.dm_pdw_exec_requests, sys.dm_pdw_request_steps, sys.dm_pdw_sql_requests, sys.dm_pdw_dms_workers, DBCC PDW_SHOWEXECUTIONPLAN, and sys.dm_pdw_waits.

Tune query performance with new product enhancements

Group INSERT statements into batches

A one-time load to a small table with an INSERT statement, or even a periodic reload of a lookup, may perform well for your needs with a statement like INSERT INTO MyLookup VALUES (1, 'Type 1').

However, if you need to load thousands or millions of rows throughout the day, you might find that singleton INSERTs just can't keep up. Instead, develop your processes so that they write to a file, and another process periodically comes along and loads this file.
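As one hedged sketch of the file-based pattern (the storage account, container, and file path are hypothetical, and this assumes the COPY statement is available in your environment), the periodic loader could pick up the staged files like this:

```sql
-- Load every staged CSV file in one set-based operation instead of
-- issuing thousands of singleton INSERTs.
COPY INTO dbo.MyLookup
FROM 'https://myaccount.blob.core.windows.net/staged/mylookup/*.csv'
WITH (
    FILE_TYPE = 'CSV',
    FIRSTROW  = 2   -- skip the header row
);
```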

See also INSERT.

Use PolyBase to load and export data quickly

SQL pool supports loading and exporting data through several tools, including Azure Data Factory, PolyBase, and BCP. For small amounts of data where performance isn't critical, any tool may be sufficient for your needs. However, when you are loading or exporting large volumes of data, or when fast performance is needed, PolyBase is the best choice.

PolyBase is designed to leverage the MPP (Massively Parallel Processing) architecture and will load and export data magnitudes faster than any other tool. PolyBase loads can be run using CTAS or INSERT INTO.
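As a hedged sketch of a PolyBase load (the storage account, container, and object names are hypothetical, and the database scoped credential a secured container would require is omitted for brevity):

```sql
-- Point at the files in Azure blob storage.
CREATE EXTERNAL DATA SOURCE AzureStorage
WITH (
    TYPE = HADOOP,
    LOCATION = 'wasbs://datacontainer@myaccount.blob.core.windows.net'
);

CREATE EXTERNAL FILE FORMAT TextFileFormat
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (FIELD_TERMINATOR = '|')
);

-- Describe the file layout as an external table.
CREATE EXTERNAL TABLE dbo.FactSales_ext (
    sale_id    BIGINT,
    order_date DATE,
    amount     DECIMAL(18, 2)
)
WITH (
    LOCATION    = '/factsales/',
    DATA_SOURCE = AzureStorage,
    FILE_FORMAT = TextFileFormat
);

-- CTAS reads the files in parallel across all distributions.
CREATE TABLE dbo.FactSales
WITH (DISTRIBUTION = HASH(sale_id), CLUSTERED COLUMNSTORE INDEX)
AS SELECT * FROM dbo.FactSales_ext;
```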

Tip

Using CTAS will minimize transaction logging and is the fastest way to load your data.

Azure Data Factory also supports PolyBase loads and can achieve performance similar to CTAS. PolyBase supports a variety of file formats, including Gzip files.

Note

To maximize throughput when using gzip text files, break them up into 60 or more files to maximize the parallelism of your load. For faster total throughput, consider loading data concurrently.

See also Load data, Guide for using PolyBase, SQL pool loading patterns and strategies, Load data with Azure Data Factory, CREATE EXTERNAL FILE FORMAT, and Create table as select (CTAS).

Load then query external tables

While PolyBase, also known as external tables, can be the fastest way to load data, it is not optimal for queries. PolyBase tables currently only support Azure blob files. These files do not have any compute resources backing them.

As a result, SQL pool cannot offload this work and must read the entire file by loading it into tempdb. Therefore, if you have several queries that will query this data, it is better to load the data once and have the queries use the local table.

See also Guide for using PolyBase.

Hash distribute large tables

By default, tables are Round Robin distributed. This makes it easy for users to get started creating tables without having to decide how their tables should be distributed. Round Robin tables may perform sufficiently for some workloads, but in most cases selecting a distribution column will perform much better.

The most common example of a table distributed by a column far outperforming a Round Robin table is when two large fact tables are joined.

For example, if you have an orders table distributed by order_id and a transactions table also distributed by order_id, when you join the orders table to the transactions table on order_id, the query becomes a pass-through query, which means data movement operations are eliminated. Fewer steps mean a faster query. Less data movement also makes for faster queries.

Tip

When loading a distributed table, be sure that your incoming data is not sorted on the distribution key, as this will slow down your loads.

See the following links for more details on how selecting a distribution column can improve performance, as well as how to define a distributed table in the WITH clause of your CREATE TABLE statement.
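As a minimal sketch matching the example above (the table definitions are hypothetical), co-locating both tables on order_id looks like this:

```sql
CREATE TABLE dbo.Orders (
    order_id   BIGINT NOT NULL,
    order_date DATE   NOT NULL
)
WITH (DISTRIBUTION = HASH(order_id), CLUSTERED COLUMNSTORE INDEX);

CREATE TABLE dbo.Transactions (
    transaction_id BIGINT NOT NULL,
    order_id       BIGINT NOT NULL,
    amount         DECIMAL(18, 2)
)
WITH (DISTRIBUTION = HASH(order_id), CLUSTERED COLUMNSTORE INDEX);

-- Joining on the shared distribution column avoids data movement.
SELECT o.order_id, SUM(t.amount) AS order_total
FROM dbo.Orders AS o
JOIN dbo.Transactions AS t ON o.order_id = t.order_id
GROUP BY o.order_id;
```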

See also Table overview, Table distribution, Selecting table distribution, CREATE TABLE, and CREATE TABLE AS SELECT.

Do not over-partition

While partitioning data can be effective for maintaining your data through partition switching or optimizing scans with partition elimination, having too many partitions can slow down your queries. Often a high-granularity partitioning strategy that works well on SQL Server may not work well in SQL pool.

Having too many partitions can also reduce the effectiveness of clustered columnstore indexes if each partition has fewer than 1 million rows. Keep in mind that behind the scenes, SQL pool partitions your data into 60 databases, so if you create a table with 100 partitions, this actually results in 6,000 partitions.

Each workload is different, so the best advice is to experiment with partitioning to see what works best for your workload. Consider lower granularity than what may have worked for you in SQL Server. For example, consider using weekly or monthly partitions rather than daily partitions.
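As a hedged sketch (the table name and boundary dates are illustrative), a monthly partitioning scheme is declared directly in the WITH clause:

```sql
CREATE TABLE dbo.FactOrders (
    order_id   BIGINT NOT NULL,
    order_date DATE   NOT NULL,
    amount     DECIMAL(18, 2)
)
WITH (
    DISTRIBUTION = HASH(order_id),
    CLUSTERED COLUMNSTORE INDEX,
    -- One boundary per month keeps the partition count modest.
    PARTITION (order_date RANGE RIGHT FOR VALUES
        ('2001-09-01', '2001-10-01', '2001-11-01'))
);
```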

See also Table partitioning.

Minimize transaction sizes

INSERT, UPDATE, and DELETE statements run in a transaction, and when they fail they must be rolled back. To minimize the potential for a long rollback, minimize transaction sizes whenever possible. This can be done by dividing INSERT, UPDATE, and DELETE statements into parts.

For example, if you have an INSERT that you expect to take 1 hour, if possible break it up into four parts that each run in 15 minutes. Leverage special minimal logging cases, like CTAS, TRUNCATE, DROP TABLE, or INSERT to empty tables, to reduce rollback risk.

Another way to eliminate rollbacks is to use metadata-only operations like partition switching for data management. For example, rather than execute a DELETE statement to delete all rows in a table where the order_date was in October of 2001, you could partition your data monthly and then switch out the partition with data for an empty partition from another table (see ALTER TABLE examples).
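A minimal sketch of the switch-out pattern (the partition number and staging table are hypothetical; both tables must share the same schema and partition boundaries), using the monthly partitioned table from the example above:

```sql
-- Move the October 2001 partition (partition 3 under the boundaries
-- above) into an empty, identically partitioned staging table;
-- the switch itself is a metadata-only operation.
ALTER TABLE dbo.FactOrders
    SWITCH PARTITION 3 TO dbo.FactOrders_Staging PARTITION 3;

-- Dropping the staging table discards the old rows without a
-- long-running, fully logged DELETE.
DROP TABLE dbo.FactOrders_Staging;
```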

For unpartitioned tables, consider using a CTAS to write the data you want to keep to a table rather than using DELETE. If a CTAS takes the same amount of time, it is a much safer operation to run as it has minimal transaction logging and can be canceled quickly if needed.
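A hedged sketch of the CTAS-instead-of-DELETE pattern (the table names are illustrative):

```sql
-- Write only the rows you want to keep, with minimal logging.
CREATE TABLE dbo.FactSales_New
WITH (DISTRIBUTION = HASH(sale_id), CLUSTERED COLUMNSTORE INDEX)
AS
SELECT *
FROM dbo.FactSales
WHERE order_date < '2001-10-01' OR order_date >= '2001-11-01';

-- Swap the new table into place and drop the original.
RENAME OBJECT dbo.FactSales TO FactSales_Old;
RENAME OBJECT dbo.FactSales_New TO FactSales;
DROP TABLE dbo.FactSales_Old;
```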

See also Understanding transactions, Optimizing transactions, Table partitioning, TRUNCATE TABLE, ALTER TABLE, and Create table as select (CTAS).

Reduce query result sizes

This step helps you avoid client-side issues caused by large query results. You can edit your query to reduce the number of rows returned. Some query generation tools allow you to add "top N" syntax to each query. You can also CETAS the query result to a temporary table and then use PolyBase export for the downstream processing.
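As a hedged sketch of the CETAS route (this reuses the hypothetical AzureStorage data source and TextFileFormat from the PolyBase example above):

```sql
-- Persist a large result set to storage rather than returning
-- millions of rows to the client.
CREATE EXTERNAL TABLE dbo.SalesSummary_ext
WITH (
    LOCATION    = '/results/sales-summary/',
    DATA_SOURCE = AzureStorage,
    FILE_FORMAT = TextFileFormat
)
AS
SELECT order_date, SUM(amount) AS total_amount
FROM dbo.FactSales
GROUP BY order_date;
```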

Use the smallest possible column size

When defining your DDL, using the smallest data type that will support your data will improve query performance. This approach is particularly important for CHAR and VARCHAR columns.

If the longest value in a column is 25 characters, define the column as VARCHAR(25). Avoid defining all character columns with a large default length. In addition, define columns as VARCHAR rather than NVARCHAR when that is all the data needs.
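For instance (a hypothetical dimension table), prefer narrow, non-Unicode types when the data allows:

```sql
CREATE TABLE dbo.DimProduct (
    product_code VARCHAR(25)   NOT NULL,  -- longest observed value is 25 chars
    category     VARCHAR(50)   NOT NULL,  -- VARCHAR, not NVARCHAR, for ASCII data
    list_price   DECIMAL(9, 2) NOT NULL
)
WITH (DISTRIBUTION = ROUND_ROBIN, CLUSTERED COLUMNSTORE INDEX);
```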

See also Table overview, Table data types, and CREATE TABLE.

Use temporary heap tables for transient data

When you are temporarily landing data, you may find that using a heap table makes the overall process faster. If you are loading data only to stage it before running more transformations, loading it into a heap table will be much faster than loading it into a clustered columnstore table.

In addition, loading data to a temp table is much faster than loading it to permanent storage. Temporary tables start with "#" and are only accessible by the session that created them, so they may only work in limited scenarios.

Heap tables are defined in the WITH clause of a CREATE TABLE. If you do use a temporary table, remember to create statistics on that temporary table too.
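A short sketch of staging into a temporary heap table (the source is the hypothetical external table from the PolyBase example above):

```sql
-- Land raw data quickly in a session-scoped heap table.
CREATE TABLE #StageSales
WITH (DISTRIBUTION = ROUND_ROBIN, HEAP)
AS
SELECT sale_id, order_date, amount
FROM dbo.FactSales_ext;

-- Statistics help the optimizer with the transformations that follow.
CREATE STATISTICS stats_stage_sale_id ON #StageSales (sale_id);
```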

See also Temporary tables, CREATE TABLE, and CREATE TABLE AS SELECT.

Optimize clustered columnstore tables

Clustered columnstore indexes are one of the most efficient ways you can store your data in SQL pool. By default, tables in SQL pool are created as Clustered ColumnStore. To get the best performance for queries on columnstore tables, having good segment quality is important.

When rows are written to columnstore tables under memory pressure, columnstore segment quality may suffer. Segment quality can be measured by the number of rows in a compressed row group. See Causes of poor columnstore index quality in the Table indexes article for step-by-step instructions on detecting and improving segment quality for clustered columnstore tables.

Because high-quality columnstore segments are important, it's a good idea to use user IDs that are in the medium or large resource class for loading data. Using lower data warehouse units means you want to assign a larger resource class to your loading user.
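As a quick, hedged spot check (the DMV below is the one the Table indexes article builds its quality views on; the ~1 million row target is the rule of thumb from this section):

```sql
-- Compressed row groups far below ~1 million rows suggest poor
-- segment quality, often caused by memory pressure during loads.
SELECT state_desc,
       COUNT(*)        AS row_group_count,
       AVG(total_rows) AS avg_rows_per_row_group
FROM sys.dm_pdw_nodes_db_column_store_row_group_physical_stats
GROUP BY state_desc;
```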

Since columnstore tables generally won't push data into a compressed columnstore segment until there are more than 1 million rows per table, and each SQL pool table is partitioned into 60 tables, as a rule of thumb, columnstore tables won't benefit a query unless the table has more than 60 million rows. For tables with fewer than 60 million rows, a columnstore index may not make sense, but it also may not hurt.

Furthermore, if you partition your data, each partition will need 1 million rows to benefit from a clustered columnstore index. If a table has 100 partitions, it will need at least 6 billion rows to benefit from a clustered columnstore (60 distributions × 100 partitions × 1 million rows).

If your table does not have 6 billion rows in this example, either reduce the number of partitions or consider using a heap table instead. It may also be worth experimenting to see whether better performance can be gained with a heap table with secondary indexes rather than a columnstore table.

Tip

When querying a columnstore table, queries will run faster if you select only the columns you need.

See also Table indexes, Columnstore indexes guide, and Rebuilding columnstore indexes.

Use larger resource classes to improve query performance

SQL pool uses resource groups as a way to allocate memory to queries. Out of the box, all users are assigned to the small resource class, which grants 100 MB of memory per distribution. Since there are always 60 distributions and each distribution is given a minimum of 100 MB, system-wide the total memory allocation is 6,000 MB, or just under 6 GB.

Certain queries, like large joins or loads to clustered columnstore tables, will benefit from larger memory allocations. Some queries, like pure scans, will see no benefit. However, utilizing larger resource classes reduces concurrency, so you will want to take this impact into consideration before moving all of your users to a large resource class.
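As a brief sketch (the user name is hypothetical; smallrc, mediumrc, largerc, and xlargerc are the built-in dynamic resource class roles), you grant a larger class by adding the user to the corresponding role:

```sql
-- Give the load user more memory per query...
EXEC sp_addrolemember 'largerc', 'LoadUser';

-- ...and remove it afterwards to restore concurrency.
EXEC sp_droprolemember 'largerc', 'LoadUser';
```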

See also Resource classes for workload management.

Use smaller resource classes to increase concurrency

If you notice that user queries seem to have a long delay, it could be that your users are running in larger resource classes and are consuming many concurrency slots, causing other queries to queue up. To see whether users' queries are queued, run SELECT * FROM sys.dm_pdw_waits and check whether any rows are returned.

See also Resource classes for workload management and sys.dm_pdw_waits.

Other resources

Also see our Troubleshooting article for common issues and solutions.