Maximizing rowgroup quality for columnstore

Rowgroup quality is determined by the number of rows in a rowgroup. Increasing the available memory can maximize the number of rows a columnstore index compresses into each rowgroup. Use these methods to improve compression rates and query performance for columnstore indexes.

Why the rowgroup size matters

Since a columnstore index scans a table by scanning column segments of individual rowgroups, maximizing the number of rows in each rowgroup enhances query performance.

When rowgroups have a high number of rows, data compression improves, which means there is less data to read from disk.

For more information about rowgroups, see the Columnstore Indexes Guide.

Target size for rowgroups

For best query performance, the goal is to maximize the number of rows per rowgroup in a columnstore index. A rowgroup can have a maximum of 1,048,576 rows.

It's okay to not have the maximum number of rows per rowgroup. Columnstore indexes achieve good performance when rowgroups have at least 100,000 rows.

Rowgroups can get trimmed during compression

During a bulk load or columnstore index rebuild, sometimes there isn't enough memory available to compress all the rows designated for each rowgroup. When memory pressure is present, columnstore indexes trim the rowgroup sizes so compression into the columnstore can succeed.

When there is insufficient memory to compress at least 10,000 rows into each rowgroup, an error is generated.

For more information on bulk loading, see Bulk load into a clustered columnstore index.

How to monitor rowgroup quality

The DMV sys.dm_pdw_nodes_db_column_store_row_group_physical_stats (sys.dm_db_column_store_row_group_physical_stats contains the matching view definition for SQL Database) exposes useful information, such as the number of rows in a rowgroup and the reason for trimming, if any trimming occurred.

You can create the following view as a handy way to query this DMV and get information on rowgroup trimming.

CREATE VIEW dbo.vCS_rg_physical_stats
AS
WITH cte
AS
(
    SELECT   tb.[name]                  AS [logical_table_name]
    ,        rg.[row_group_id]          AS [row_group_id]
    ,        rg.[state]                 AS [state]
    ,        rg.[state_desc]            AS [state_desc]
    ,        rg.[total_rows]            AS [total_rows]
    ,        rg.[trim_reason_desc]      AS [trim_reason_desc]
    ,        mp.[physical_name]         AS [physical_name]
    FROM     sys.[schemas] sm
    JOIN     sys.[tables] tb             ON  sm.[schema_id]       = tb.[schema_id]
    JOIN     sys.[pdw_table_mappings] mp ON  tb.[object_id]       = mp.[object_id]
    JOIN     sys.[pdw_nodes_tables] nt   ON  nt.[name]            = mp.[physical_name]
    JOIN     sys.[dm_pdw_nodes_db_column_store_row_group_physical_stats] rg
                                          ON  rg.[object_id]       = nt.[object_id]
                                          AND rg.[pdw_node_id]     = nt.[pdw_node_id]
                                          AND rg.[distribution_id] = nt.[distribution_id]
)
SELECT *
FROM cte;

The trim_reason_desc column tells whether the rowgroup was trimmed (trim_reason_desc = NO_TRIM means there was no trimming and the rowgroup is of optimal quality). The following trim reasons indicate premature trimming of the rowgroup; a sample query for finding trimmed rowgroups follows the list:

  • BULKLOAD: This trim reason is used when the incoming batch of rows for the load has fewer than 1 million rows. The engine creates compressed rowgroups if more than 100,000 rows are being inserted (as opposed to inserting into the delta store), but sets the trim reason to BULKLOAD. In this scenario, consider increasing your batch load to include more rows. Also, reevaluate your partitioning scheme to ensure it's not too granular, because rowgroups can't span partition boundaries.
  • MEMORY_LIMITATION: To create rowgroups with 1 million rows, the engine requires a certain amount of working memory. When the available memory of the loading session is less than the required working memory, rowgroups get prematurely trimmed. The following sections explain how to estimate the required memory and allocate more memory.
  • DICTIONARY_SIZE: This trim reason indicates that rowgroup trimming occurred because there was at least one string column with wide and/or high-cardinality strings. The dictionary size is limited to 16 MB in memory, and once this limit is reached the rowgroup is compressed. If you do run into this situation, consider isolating the problematic column into a separate table.
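
The following query is one way to use the view defined above to find rowgroups that were trimmed before reaching full size; it relies only on the columns exposed by that view.

SELECT   logical_table_name
,        row_group_id
,        total_rows
,        trim_reason_desc
FROM     dbo.vCS_rg_physical_stats
WHERE    trim_reason_desc <> 'NO_TRIM'
ORDER BY logical_table_name, row_group_id;

If most rowgroups for a table report MEMORY_LIMITATION or DICTIONARY_SIZE, apply the techniques in the following sections.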

How to estimate memory requirements

The maximum required memory to compress one rowgroup is approximately:

  • 72 MB +
  • #rows * #columns * 8 bytes +
  • #rows * #short-string-columns * 32 bytes +
  • #long-string-columns * 16 MB for the compression dictionary

Note

Short-string columns use string data types of <= 32 bytes, and long-string columns use string data types of > 32 bytes.

Long strings are compressed with a compression method designed for compressing text. This compression method uses a dictionary to store text patterns. The maximum size of a dictionary is 16 MB. There is only one dictionary for each long-string column in the rowgroup.
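
As an illustration only (the column counts are hypothetical), consider compressing a full rowgroup of 1,048,576 rows for a table with 13 columns, of which 2 are short-string columns and 1 is a long-string column:

  • 72 MB +
  • 1,048,576 rows * 13 columns * 8 bytes ≈ 104 MB +
  • 1,048,576 rows * 2 short-string-columns * 32 bytes = 64 MB +
  • 1 long-string-column * 16 MB = 16 MB

That is roughly 256 MB of working memory for a single rowgroup, before counting any memory consumed by other operators in the load query.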

Ways to reduce memory requirements

Use the following techniques to reduce the memory requirements for compressing rowgroups into columnstore indexes.

Use fewer columns

If possible, design the table with fewer columns. When a rowgroup is compressed into the columnstore, the columnstore index compresses each column segment separately.

As such, the memory requirements for compressing a rowgroup increase as the number of columns increases.

Use fewer string columns

Columns of string data types require more memory than numeric and date data types. To reduce memory requirements, consider removing string columns from fact tables and putting them in smaller dimension tables; a sketch of this pattern follows the list below.

Additional memory requirements for string compression:

  • String data types of up to 32 characters can require 32 additional bytes per value.
  • String data types with more than 32 characters are compressed using dictionary methods. Each column in the rowgroup can require up to an additional 16 MB to build the dictionary.
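
As a minimal sketch of moving a wide string column out of a fact table (all table and column names here are hypothetical, and distribution and indexing choices depend on your workload), you could build a small dimension from the distinct string values and keep only a surrogate key in the fact table:

-- Placeholder names: dbo.FactSales is the fact table, CommentText the wide string column.
CREATE TABLE dbo.DimComment
WITH (DISTRIBUTION = ROUND_ROBIN, HEAP)
AS
SELECT  ROW_NUMBER() OVER (ORDER BY CommentText) AS CommentKey
,       CommentText
FROM    (SELECT DISTINCT CommentText FROM dbo.FactSales) AS c;

The fact table then stores only the narrow CommentKey column, and queries join back to dbo.DimComment when the text is needed, which keeps the long-string dictionary out of the fact table's rowgroups.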

Avoid over-partitioning

Columnstore indexes create one or more rowgroups per partition. For SQL pool in Azure Synapse Analytics, the number of partitions grows quickly because the data is distributed and each distribution is partitioned.

If the table has too many partitions, there might not be enough rows to fill the rowgroups. The lack of rows doesn't create memory pressure during compression, but it leads to rowgroups that don't achieve the best columnstore query performance.

Another reason to avoid over-partitioning is that there is a memory overhead for loading rows into a columnstore index on a partitioned table.

During a load, many partitions could receive incoming rows, which are held in memory until each partition has enough rows to be compressed. Having too many partitions creates additional memory pressure.
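
For a sense of scale, a SQL pool distributes every table across 60 distributions. If a table is also partitioned by month into 12 partitions (an illustrative number), it has 60 * 12 = 720 partition slices. Filling every rowgroup to 1,048,576 rows would take roughly 755 million rows, and even reaching the 100,000-row minimum for good rowgroup quality requires about 72 million rows.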

Simplify the load query

The database shares a query's memory grant among all the operators in the query. When a load query has complex sorts and joins, the memory available for compression is reduced.

Design the load query to focus only on loading the data. If you need to run transformations on the data, run them separately from the load query. For example, stage the data in a heap table, run the transformations, and then load the staging table into the columnstore index.
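
A minimal sketch of this pattern, assuming hypothetical table names (ext.FactSales_Raw for the incoming data and dbo.FactSales for the columnstore target), might look like the following:

-- Stage the incoming data in a heap so the load itself stays simple.
CREATE TABLE dbo.Stage_FactSales
WITH (DISTRIBUTION = ROUND_ROBIN, HEAP)
AS SELECT * FROM ext.FactSales_Raw;

-- Run any transformations against the heap table as separate statements here.

-- Load the columnstore table with a plain, transformation-free query.
INSERT INTO dbo.FactSales
SELECT * FROM dbo.Stage_FactSales;

Because the final INSERT has no sorts or joins, more of its memory grant is left for rowgroup compression.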

Tip

You can also load the data first and then use the MPP system to transform the data.

Adjust MAXDOP

Each distribution compresses rowgroups into the columnstore in parallel when more than one CPU core is available per distribution.

The parallelism requires additional memory resources, which can lead to memory pressure and rowgroup trimming.

To reduce memory pressure, you can use the MAXDOP query hint to force the load operation to run in serial mode within each distribution.

CREATE TABLE MyFactSalesQuota
WITH (DISTRIBUTION = ROUND_ROBIN)
AS SELECT * FROM FactSalesQuota
OPTION (MAXDOP 1);

Ways to allocate more memory

DWU size and the user resource class together determine how much memory is available for a user query.

To increase the memory grant for a load query, you can either increase the number of DWUs or increase the resource class.
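
For example, static resource classes are implemented as database roles, so one way to raise the memory grant is to add the loading user to a larger resource class; LoadUser below is a placeholder name:

-- Add the loading user to the largerc resource class for larger memory grants.
EXEC sp_addrolemember 'largerc', 'LoadUser';

-- Remove the user from the resource class when the larger grant is no longer needed.
EXEC sp_droprolemember 'largerc', 'LoadUser';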

Next steps

To find more ways to improve performance for SQL pool, see the Performance overview.