Indexing tables in SQL Data Warehouse

Recommendations and examples for indexing tables in Azure SQL Data Warehouse.

What are the index choices?

SQL Data Warehouse offers several indexing options, including clustered columnstore indexes, clustered indexes and nonclustered indexes, and a non-index option also known as a heap.

To create a table with an index, see the CREATE TABLE (Azure SQL Data Warehouse) documentation.

Clustered columnstore indexes

By default, SQL Data Warehouse creates a clustered columnstore index when no index options are specified on a table. Clustered columnstore tables offer both the highest level of data compression and the best overall query performance. They generally outperform clustered index or heap tables and are usually the best choice for large tables. For these reasons, clustered columnstore is the best place to start when you are unsure how to index your table.

To create a clustered columnstore table, simply specify CLUSTERED COLUMNSTORE INDEX in the WITH clause, or leave the WITH clause off:

CREATE TABLE myTable   
  (  
    id int NOT NULL,  
    lastName varchar(20),  
    zipCode varchar(6)  
  )  
WITH ( CLUSTERED COLUMNSTORE INDEX );

There are a few scenarios where clustered columnstore may not be a good option:

  • Columnstore tables do not support varchar(max), nvarchar(max), or varbinary(max). Consider a heap or clustered index instead.
  • Columnstore tables may be less efficient for transient data. Consider a heap, and perhaps even a temporary table.
  • Small tables with fewer than 60 million rows. Consider heap tables.
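
As a sketch of the first bullet (the table and column names here are hypothetical), a table that needs a large object column can be stored as a heap instead of a columnstore:

```sql
-- Hypothetical table with an nvarchar(max) column, which a clustered
-- columnstore index does not support; declare it as a heap instead.
CREATE TABLE myDocuments
  (
    id int NOT NULL,
    title varchar(100),
    body nvarchar(max)
  )
WITH ( HEAP );
```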

Heap tables

When you are temporarily landing data in SQL Data Warehouse, you may find that using a heap table makes the overall process faster. This is because loads to heaps are faster than loads to index tables, and in some cases the subsequent read can be done from cache. If you are loading data only to stage it before running more transformations, loading it to a heap table is much faster than loading it to a clustered columnstore table. In addition, loading data to a temporary table is faster than loading to permanent storage.

For small lookup tables of fewer than 60 million rows, heap tables often make sense. Clustered columnstore tables begin to achieve optimal compression once there are more than 60 million rows.

To create a heap table, simply specify HEAP in the WITH clause:

CREATE TABLE myTable   
  (  
    id int NOT NULL,  
    lastName varchar(20),  
    zipCode varchar(6)  
  )  
WITH ( HEAP );
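
Heaps pair naturally with temporary tables in staging scenarios. As a sketch (the table and column names are illustrative), a session-scoped temporary table can also be declared as a heap:

```sql
-- Illustrative staging pattern: land raw rows in a temporary heap table,
-- then transform and insert them into the permanent columnstore table.
CREATE TABLE #stageSales
  (
    id int NOT NULL,
    amount decimal(18,2)
  )
WITH ( HEAP );
```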

Clustered and nonclustered indexes

Clustered indexes may outperform clustered columnstore tables when a single row needs to be quickly retrieved. For queries where a lookup of a single row, or very few rows, must execute with extreme speed, consider a clustered index or a nonclustered secondary index. The disadvantage of a clustered index is that only queries using a highly selective filter on the clustered index column benefit from it. To improve filtering on other columns, nonclustered indexes can be added to those columns. However, each index added to a table adds both space and load-processing time.

To create a clustered index table, simply specify CLUSTERED INDEX in the WITH clause:

CREATE TABLE myTable   
  (  
    id int NOT NULL,  
    lastName varchar(20),  
    zipCode varchar(6)  
  )  
WITH ( CLUSTERED INDEX (id) );

To add a nonclustered index on a table, use the following syntax:

CREATE INDEX zipCodeIndex ON myTable (zipCode);

Optimizing clustered columnstore indexes

Clustered columnstore tables organize data into segments. Having high segment quality is critical to achieving optimal query performance on a columnstore table. Segment quality can be measured by the number of rows in a compressed row group. Segment quality is optimal when there are at least 100K rows per compressed row group, and performance improves as the number of rows per row group approaches 1,048,576 rows, which is the most rows a row group can contain.

The below view can be created and used on your system to compute the average rows per row group and identify any sub-optimal clustered columnstore indexes. The last column of this view generates a SQL statement that can be used to rebuild your indexes.

CREATE VIEW dbo.vColumnstoreDensity
AS
SELECT
        GETDATE()                                                               AS [execution_date]
,       DB_Name()                                                               AS [database_name]
,       s.name                                                                  AS [schema_name]
,       t.name                                                                  AS [table_name]
,       COUNT(DISTINCT rg.[partition_number])                                   AS [table_partition_count]
,       SUM(rg.[total_rows])                                                    AS [row_count_total]
,       SUM(rg.[total_rows])/COUNT(DISTINCT rg.[distribution_id])               AS [row_count_per_distribution_MAX]
,       CEILING((SUM(rg.[total_rows])*1.0/COUNT(DISTINCT rg.[distribution_id]))/1048576) AS [rowgroup_per_distribution_MAX]
,       SUM(CASE WHEN rg.[State] = 0 THEN 1                   ELSE 0    END)    AS [INVISIBLE_rowgroup_count]
,       SUM(CASE WHEN rg.[State] = 0 THEN rg.[total_rows]     ELSE 0    END)    AS [INVISIBLE_rowgroup_rows]
,       MIN(CASE WHEN rg.[State] = 0 THEN rg.[total_rows]     ELSE NULL END)    AS [INVISIBLE_rowgroup_rows_MIN]
,       MAX(CASE WHEN rg.[State] = 0 THEN rg.[total_rows]     ELSE NULL END)    AS [INVISIBLE_rowgroup_rows_MAX]
,       AVG(CASE WHEN rg.[State] = 0 THEN rg.[total_rows]     ELSE NULL END)    AS [INVISIBLE_rowgroup_rows_AVG]
,       SUM(CASE WHEN rg.[State] = 1 THEN 1                   ELSE 0    END)    AS [OPEN_rowgroup_count]
,       SUM(CASE WHEN rg.[State] = 1 THEN rg.[total_rows]     ELSE 0    END)    AS [OPEN_rowgroup_rows]
,       MIN(CASE WHEN rg.[State] = 1 THEN rg.[total_rows]     ELSE NULL END)    AS [OPEN_rowgroup_rows_MIN]
,       MAX(CASE WHEN rg.[State] = 1 THEN rg.[total_rows]     ELSE NULL END)    AS [OPEN_rowgroup_rows_MAX]
,       AVG(CASE WHEN rg.[State] = 1 THEN rg.[total_rows]     ELSE NULL END)    AS [OPEN_rowgroup_rows_AVG]
,       SUM(CASE WHEN rg.[State] = 2 THEN 1                   ELSE 0    END)    AS [CLOSED_rowgroup_count]
,       SUM(CASE WHEN rg.[State] = 2 THEN rg.[total_rows]     ELSE 0    END)    AS [CLOSED_rowgroup_rows]
,       MIN(CASE WHEN rg.[State] = 2 THEN rg.[total_rows]     ELSE NULL END)    AS [CLOSED_rowgroup_rows_MIN]
,       MAX(CASE WHEN rg.[State] = 2 THEN rg.[total_rows]     ELSE NULL END)    AS [CLOSED_rowgroup_rows_MAX]
,       AVG(CASE WHEN rg.[State] = 2 THEN rg.[total_rows]     ELSE NULL END)    AS [CLOSED_rowgroup_rows_AVG]
,       SUM(CASE WHEN rg.[State] = 3 THEN 1                   ELSE 0    END)    AS [COMPRESSED_rowgroup_count]
,       SUM(CASE WHEN rg.[State] = 3 THEN rg.[total_rows]     ELSE 0    END)    AS [COMPRESSED_rowgroup_rows]
,       SUM(CASE WHEN rg.[State] = 3 THEN rg.[deleted_rows]   ELSE 0    END)    AS [COMPRESSED_rowgroup_rows_DELETED]
,       MIN(CASE WHEN rg.[State] = 3 THEN rg.[total_rows]     ELSE NULL END)    AS [COMPRESSED_rowgroup_rows_MIN]
,       MAX(CASE WHEN rg.[State] = 3 THEN rg.[total_rows]     ELSE NULL END)    AS [COMPRESSED_rowgroup_rows_MAX]
,       AVG(CASE WHEN rg.[State] = 3 THEN rg.[total_rows]     ELSE NULL END)    AS [COMPRESSED_rowgroup_rows_AVG]
,       'ALTER INDEX ALL ON ' + s.name + '.' + t.NAME + ' REBUILD;'             AS [Rebuild_Index_SQL]
FROM    sys.[pdw_nodes_column_store_row_groups] rg
JOIN    sys.[pdw_nodes_tables] nt                   ON  rg.[object_id]          = nt.[object_id]
                                                    AND rg.[pdw_node_id]        = nt.[pdw_node_id]
                                                    AND rg.[distribution_id]    = nt.[distribution_id]
JOIN    sys.[pdw_table_mappings] mp                 ON  nt.[name]               = mp.[physical_name]
JOIN    sys.[tables] t                              ON  mp.[object_id]          = t.[object_id]
JOIN    sys.[schemas] s                             ON t.[schema_id]            = s.[schema_id]
GROUP BY
        s.[name]
,       t.[name]
;

Now that you have created the view, run this query to identify tables with row groups having fewer than 100K rows. You may want to increase the 100K threshold if you are looking for even better segment quality.

SELECT    *
FROM    [dbo].[vColumnstoreDensity]
WHERE    COMPRESSED_rowgroup_rows_AVG < 100000
        OR INVISIBLE_rowgroup_rows_AVG < 100000

Once you have run the query, you can begin to look at the data and analyze your results. This table explains what to look for in your row group analysis.

[table_partition_count]: If the table is partitioned, you may expect to see higher OPEN row group counts. In theory, each partition in each distribution could have an open row group associated with it. Factor this into your analysis. A small table that has been partitioned could be optimized by removing the partitioning altogether, as this would improve compression.
[row_count_total]: Total row count for the table. For example, you can use this value to calculate the percentage of rows in the compressed state.
[row_count_per_distribution_MAX]: If all rows are evenly distributed, this value would be the target number of rows per distribution. Compare this value with the compressed_rowgroup_count.
[COMPRESSED_rowgroup_rows]: Total number of rows in columnstore format for the table.
[COMPRESSED_rowgroup_rows_AVG]: If the average number of rows is significantly less than the maximum number of rows for a row group, consider using CTAS or ALTER INDEX REBUILD to recompress the data.
[COMPRESSED_rowgroup_count]: Number of row groups in columnstore format. If this number is very high in relation to the table, it is an indicator that columnstore density is low.
[COMPRESSED_rowgroup_rows_DELETED]: Number of rows logically deleted in columnstore format. If this number is high relative to table size, consider recreating the partition or rebuilding the index, as this removes the rows physically.
[COMPRESSED_rowgroup_rows_MIN]: Use this in conjunction with the AVG and MAX columns to understand the range of values for the row groups in your columnstore. A low number just over the load threshold (102,400 rows per partition-aligned distribution) suggests that optimizations are available in the data load.
[COMPRESSED_rowgroup_rows_MAX]: As above.
[OPEN_rowgroup_count]: Open row groups are normal. One would reasonably expect one OPEN row group per table distribution (60). Excessive numbers suggest data loading across partitions. Double-check the partitioning strategy to make sure it is sound.
[OPEN_rowgroup_rows]: Each row group can have a maximum of 1,048,576 rows. Use this value to see how full the open row groups currently are.
[OPEN_rowgroup_rows_MIN]: Open row groups indicate that data is either being trickle loaded into the table or that a previous load spilled its remaining rows over into this row group. Use the MIN, MAX, and AVG columns to see how much data sits in OPEN row groups. For small tables, it could be 100% of all the data! In that case, use ALTER INDEX REBUILD to force the data to columnstore.
[OPEN_rowgroup_rows_MAX]: As above.
[OPEN_rowgroup_rows_AVG]: As above.
[CLOSED_rowgroup_rows]: Look at the closed row group rows as a sanity check.
[CLOSED_rowgroup_count]: The number of closed row groups should be low, if any are seen at all. Closed row groups can be converted to compressed row groups using the ALTER INDEX ... REORGANIZE command. However, this is not normally required; closed groups are automatically converted to columnstore row groups by the background "tuple mover" process.
[CLOSED_rowgroup_rows_MIN]: Closed row groups should have a very high fill rate. If the fill rate for a closed row group is low, further analysis of the columnstore is required.
[CLOSED_rowgroup_rows_MAX]: As above.
[CLOSED_rowgroup_rows_AVG]: As above.
[Rebuild_Index_SQL]: SQL to rebuild the columnstore index for a table.

Causes of poor columnstore index quality

If you have identified tables with poor segment quality, you want to identify the root cause. Below are some common causes of poor segment quality:

  1. Memory pressure when the index was built
  2. High volume of DML operations
  3. Small or trickle load operations
  4. Too many partitions

These factors can cause a columnstore index to have significantly fewer than the optimal 1 million rows per row group. They can also cause rows to go to a delta row group instead of a compressed row group.

Memory pressure when the index was built

The number of rows per compressed row group is directly related to the width of the row and the amount of memory available to process the row group. When rows are written to columnstore tables under memory pressure, columnstore segment quality may suffer. Therefore, the best practice is to give the session that is writing to your columnstore index tables access to as much memory as possible. Since there is a trade-off between memory and concurrency, the guidance on the right memory allocation depends on the data in each row of your table, the data warehouse units allocated to your system, and the number of concurrency slots you can give to the session that is writing data to your table.

High volume of DML operations

A high volume of DML operations that update and delete rows can introduce inefficiency into the columnstore. This is especially true when the majority of the rows in a row group are modified.

  • Deleting a row from a compressed row group only logically marks the row as deleted. The row remains in the compressed row group until the partition or table is rebuilt.
  • Inserting a row adds it to an internal rowstore table called a delta row group. The inserted row is not converted to columnstore until the delta row group is full and is marked as closed. Row groups are closed once they reach the maximum capacity of 1,048,576 rows.
  • Updating a row in columnstore format is processed as a logical delete followed by an insert. The inserted row may be stored in the delta store.

Batched update and insert operations that exceed the bulk threshold of 102,400 rows per partition-aligned distribution are written directly to columnstore format. However, assuming an even distribution, you would need to modify more than 6.144 million rows in a single operation for this to occur. If the number of rows for a given partition-aligned distribution is less than 102,400, the rows go to the delta store and stay there until enough rows have been inserted or modified to close the row group, or until the index has been rebuilt.
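
To gauge how much of a table's data is sitting in the delta store (OPEN row groups) rather than in compressed columnstore format, one option is to query the vColumnstoreDensity view defined earlier; the 10 percent threshold here is an arbitrary illustration:

```sql
-- Tables where a large share of rows still sit in open (delta) row groups.
SELECT  [table_name]
,       [OPEN_rowgroup_rows]
,       [row_count_total]
FROM    [dbo].[vColumnstoreDensity]
WHERE   [OPEN_rowgroup_rows] * 100.0 / [row_count_total] > 10  -- over 10% in delta store
ORDER BY [OPEN_rowgroup_rows] DESC;
```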

Small or trickle load operations

Small loads that flow into SQL Data Warehouse are also sometimes known as trickle loads. They typically represent a near-constant stream of data being ingested by the system. However, as this stream is near continuous, the volume of rows is not particularly large. More often than not, the data is significantly under the threshold required for a direct load to columnstore format.

In these situations, it is often better to land the data first in Azure blob storage and let it accumulate prior to loading. This technique is often known as micro-batching.

Too many partitions

Another thing to consider is the impact of partitioning on your clustered columnstore tables. Before partitioning, SQL Data Warehouse already divides your data into 60 databases. Partitioning divides your data further. If you partition your data, consider that each partition needs at least 1 million rows to benefit from a clustered columnstore index. If you partition your table into 100 partitions, your table needs at least 6 billion rows to benefit from a clustered columnstore index (60 distributions * 100 partitions * 1 million rows). If your 100-partition table does not have 6 billion rows, either reduce the number of partitions or consider using a heap table instead.
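
As a rough check of this arithmetic, the vColumnstoreDensity view defined earlier can report the average rows per partition-aligned distribution; values far below 1,048,576 suggest over-partitioning:

```sql
-- Average rows per partition-aligned distribution. Values well below
-- 1,048,576 suggest the table is partitioned too finely for columnstore.
SELECT  [table_name]
,       [table_partition_count]
,       [row_count_total]
,       [row_count_total] / (60.0 * [table_partition_count]) AS [rows_per_partition_distribution]
FROM    [dbo].[vColumnstoreDensity];
```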

Once your tables have been loaded with some data, follow the below steps to identify and rebuild tables with sub-optimal clustered columnstore indexes.

Rebuilding indexes to improve segment quality

Step 1: Identify or create a user that uses the right resource class

One quick way to immediately improve segment quality is to rebuild the index. The view above returns an ALTER INDEX REBUILD statement that can be used to rebuild your indexes. When rebuilding your indexes, be sure to allocate enough memory to the session that rebuilds the index. To do this, increase the resource class of a user that has permission to rebuild the index on this table to the recommended minimum.

Below is an example of how to allocate more memory to a user by increasing their resource class. To work with resource classes, see Resource classes for workload management.

EXEC sp_addrolemember 'xlargerc', 'LoadUser'

Step 2: Rebuild clustered columnstore indexes with the higher resource class user

Log in as the user from step 1 (e.g. LoadUser), which is now using a higher resource class, and execute the ALTER INDEX statements. Be sure that this user has ALTER permission on the tables where the index is being rebuilt. These examples show how to rebuild the entire columnstore index or how to rebuild a single partition. On large tables, it is more practical to rebuild indexes a single partition at a time.

Alternatively, instead of rebuilding the index, you could copy the table to a new table using CTAS. Which way is best? For large volumes of data, CTAS is usually faster than ALTER INDEX. For smaller volumes of data, ALTER INDEX is easier to use and won't require you to swap out the table.

-- Rebuild the entire clustered index
ALTER INDEX ALL ON [dbo].[DimProduct] REBUILD
-- Rebuild a single partition
ALTER INDEX ALL ON [dbo].[FactInternetSales] REBUILD Partition = 5
-- Rebuild a single partition with archival compression
ALTER INDEX ALL ON [dbo].[FactInternetSales] REBUILD Partition = 5 WITH (DATA_COMPRESSION = COLUMNSTORE_ARCHIVE)
-- Rebuild a single partition with columnstore compression
ALTER INDEX ALL ON [dbo].[FactInternetSales] REBUILD Partition = 5 WITH (DATA_COMPRESSION = COLUMNSTORE)

Rebuilding an index in SQL Data Warehouse is an offline operation. For more information about rebuilding indexes, see the ALTER INDEX REBUILD section in Columnstore Indexes Defragmentation, and ALTER INDEX.

Step 3: Verify clustered columnstore segment quality has improved

Rerun the query that identified tables with poor segment quality and verify that segment quality has improved. If segment quality did not improve, it could be that the rows in your table are extra wide. Consider using a higher resource class or DWU when rebuilding your indexes.
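
For example, to re-check a single table after its rebuild (the table name here is illustrative), filter the vColumnstoreDensity view:

```sql
-- Re-check segment quality for one table after rebuilding its index.
SELECT  [COMPRESSED_rowgroup_rows_AVG]
,       [OPEN_rowgroup_rows]
FROM    [dbo].[vColumnstoreDensity]
WHERE   [table_name] = 'FactInternetSales';  -- illustrative table name
```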

Rebuilding indexes with CTAS and partition switching

This example uses the CREATE TABLE AS SELECT (CTAS) statement and partition switching to rebuild a table partition.

-- Step 1: Select the partition of data and write it out to a new table using CTAS
CREATE TABLE [dbo].[FactInternetSales_20000101_20010101]
    WITH    (   DISTRIBUTION = HASH([ProductKey])
            ,   CLUSTERED COLUMNSTORE INDEX
            ,   PARTITION   (   [OrderDateKey] RANGE RIGHT FOR VALUES
                                (20000101,20010101
                                )
                            )
            )
AS
SELECT  *
FROM    [dbo].[FactInternetSales]
WHERE   [OrderDateKey] >= 20000101
AND     [OrderDateKey] <  20010101
;

-- Step 2: Switch IN the rebuilt data with TRUNCATE_TARGET option
ALTER TABLE [dbo].[FactInternetSales_20000101_20010101] SWITCH PARTITION 2 TO  [dbo].[FactInternetSales] PARTITION 2 WITH (TRUNCATE_TARGET = ON);

For more details about re-creating partitions using CTAS, see Using partitions in SQL Data Warehouse.

Next steps

For more information about developing tables, see Developing tables.