在 Azure SQL 数据仓库中设计表Designing tables in Azure SQL Data Warehouse

了解有关在 Azure SQL 数据仓库中设计表的重要概念。Learn key concepts for designing tables in Azure SQL Data Warehouse.

确定表类别Determine table category

星型架构将数据组织成事实数据表和维度表。A star schema organizes data into fact and dimension tables. 某些表在转移到事实数据表或维度表之前已用于集成或暂存数据。Some tables are used for integration or staging data before it moves to a fact or dimension table. 设计某个表时,请确定该表的数据是属于事实数据表、维度表还是集成表。As you design a table, decide whether the table data belongs in a fact, dimension, or integration table. 此项决策可以明确相应的表结构和分布方式。This decision informs the appropriate table structure and distribution.

  • 事实数据表包含定量数据,这些数据通常在事务系统中生成,然后加载到数据仓库中。Fact tables contain quantitative data that are commonly generated in a transactional system, and then loaded into the data warehouse. 例如,零售企业每天会生成销售事务,然后将数据载入数据仓库事实数据表进行分析。For example, a retail business generates sales transactions every day, and then loads the data into a data warehouse fact table for analysis.

  • 维度表包含属性数据,这些数据可能会更改,但一般不会经常更改。Dimension tables contain attribute data that might change but usually changes infrequently. 例如,客户的姓名和地址存储在维度表中,仅当客户的个人资料发生更改时,这些数据才会更新。For example, a customer's name and address are stored in a dimension table and updated only when the customer's profile changes. 为了尽量缩小大型事实数据表的大小,不需要将客户的姓名和地址输入到事实数据表的每一行中。To minimize the size of a large fact table, the customer's name and address do not need to be in every row of a fact table. 事实数据表和维度表可以共享一个客户 ID。Instead, the fact table and the dimension table can share a customer ID. 查询可以联接两个表,以关联客户的个人资料和事务。A query can join the two tables to associate a customer's profile and transactions.

  • 集成表为集成或暂存数据提供位置。Integration tables provide a place for integrating or staging data. 可以将集成表创建为常规表、外部表或临时表。You can create an integration table as a regular table, an external table, or a temporary table. 例如,可将数据加载到临时表,在暂存位置对数据执行转换,然后将数据插入生产表中。For example, you can load data to a staging table, perform transformations on the data in staging, and then insert the data into a production table.

架构和表名称Schema and table names

可通过架构将以相似方式使用的表组合在一起。Schemas are a good way to group tables, used in a similar fashion, together. 若要将多个数据库从本地解决方案迁移到 SQL 数据仓库,最好是将所有事实数据表、维度表和集成表迁移到 SQL 数据仓库中的一个架构内。If you are migrating multiple databases from an on-prem solution to SQL Data Warehouse, it works best to migrate all of the fact, dimension, and integration tables to one schema in SQL Data Warehouse. 例如,可将所有表存储在 WideWorldImportersDW 示例数据仓库中一个名为 wwi 的架构内。For example, you could store all the tables in the WideWorldImportersDW sample data warehouse within one schema called wwi. 以下代码创建名为 wwi 的用户定义的架构The following code creates a user-defined schema called wwi.

CREATE SCHEMA wwi;

若要在 SQL 数据仓库中显示表的组织方式,可以使用 fact、dim 和 int 作为表名称的前缀。To show the organization of the tables in SQL Data Warehouse, you could use fact, dim, and int as prefixes to the table names. 下表显示了 WideWorldImportersDW 的一些架构和表名称。The following table shows some of the schema and table names for WideWorldImportersDW.

WideWorldImportersDW 表WideWorldImportersDW table 表类型Table type SQL 数据仓库SQL Data Warehouse
城市City 维度Dimension wwi.DimCitywwi.DimCity
顺序Order FactFact wwi.FactOrderwwi.FactOrder

表暂留Table persistence

表将数据永久或临时存储在 Azure 存储中,或者存储在数据仓库外部的数据存储中。Tables store data either permanently in Azure Storage, temporarily in Azure Storage, or in a data store external to data warehouse.

常规表Regular table

常规表将 Azure 存储中的数据存储为数据仓库的一部分。A regular table stores data in Azure Storage as part of the data warehouse. 不管是否打开了会话,表和数据都会持久保留。The table and the data persist regardless of whether a session is open. 此示例创建一个包含两个列的常规表。This example creates a regular table with two columns.

CREATE TABLE MyTable (col1 int, col2 int );  

临时表Temporary table

临时表只在会话持续期间存在。A temporary table only exists for the duration of the session. 可以使用临时表来防止其他用户查看临时结果,以及减少清理需求。You can use a temporary table to prevent other users from seeing temporary results and also to reduce the need for cleanup. 临时表利用本地存储来提供快速操作的性能。Temporary tables utilize local storage to offer fast performance. 有关详细信息,请参阅临时表For more information, see Temporary tables.

外部表External table

外部表指向位于 Azure 存储 Blob 中的数据。An external table points to data located in Azure Storage blob. 与 CREATE TABLE AS SELECT 语句结合使用时,从外部表中选择数据可将数据导入到 SQL 数据仓库。When used in conjunction with the CREATE TABLE AS SELECT statement, selecting from an external table imports data into SQL Data Warehouse. 因此,外部表可用于加载数据。External tables are therefore useful for loading data. 有关加载教程,请参阅使用 PolyBase 从 Azure Blob 存储加载数据For a loading tutorial, see Use PolyBase to load data from Azure blob storage.

数据类型Data types

SQL 数据仓库支持最常用的数据类型。SQL Data Warehouse supports the most commonly used data types. 有关受支持数据类型的列表,请参阅 CREATE TABLE 语句中的 CREATE TABLE 引用中的数据类型For a list of the supported data types, see data types in CREATE TABLE reference in the CREATE TABLE statement. 有关使用数据类型的指导,请参阅数据类型For guidance on using data types, see Data types.

分布式表Distributed tables

SQL 数据仓库的一个基本功能是它可以跨分布区以特定方式对表进行存储和运算。A fundamental feature of SQL Data Warehouse is the way it can store and operate on tables across distributions. SQL 数据仓库支持使用以下三种方法来分配数据:轮询机制(默认)、哈希和复制。SQL Data Warehouse supports three methods for distributing data, round-robin (default), hash and replicated.

哈希分布表Hash-distributed tables

哈希分布表根据分布列中的值来分布行。A hash distributed table distributes rows based on the value in the distribution column. 根据设计,在对大型表进行查询时,哈希分布表可以实现高性能。A hash distributed table is designed to achieve high performance for queries on large tables. 选择分布列时,需考虑多项因素。There are several factors to consider when choosing a distribution column.

有关详细信息,请参阅分布式表的设计准则For more information, see Design guidance for distributed tables.

复制表Replicated tables

复制表在每个计算节点上提供表的完整副本。A replicated table has a full copy of the table available on every Compute node. 对复制表运行的查询速度较快,因为复制表中的联接不需要移动数据。Queries run fast on replicated tables since joins on replicated tables do not require data movement. 不过,复制需要额外的存储,并且对于大型表不可行。Replication requires extra storage, though, and is not practical for large tables.

有关详细信息,请参阅复制表的设计准则For more information, see Design guidance for replicated tables.

循环表Round-robin tables

循环表将表行均匀地分布到所有分布区中。A round-robin table distributes table rows evenly across all distributions. 行将随机分布。The rows are distributed randomly. 将数据加载到循环表中的速度很快。Loading data into a round-robin table is fast. 不过,与其他分布方法相比,查询可能需要进行更多的数据移动。However, queries can require more data movement than the other distribution methods.

有关详细信息,请参阅分布式表的设计准则For more information, see Design guidance for distributed tables.

表的常用分布方法Common distribution methods for tables

表类别通常确定了要选择哪个选项来分布表。The table category often determines which option to choose for distributing the table.

表类别Table category 建议的分布选项Recommended distribution option
FactFact 结合聚集列存储索引使用哈希分布。Use hash-distribution with clustered columnstore index. 在同一个分布列中联接两个哈希表时,可以提高性能。Performance improves when two hash tables are joined on the same distribution column.
维度Dimension 对小型表使用复制表。Use replicated for smaller tables. 如果表太大,以致无法在每个计算节点上存储,可以使用哈希分布式表。If tables are too large to store on each Compute node, use hash-distributed.
过渡Staging 对临时表使用轮循机制表。Use round-robin for the staging table. 使用 CTAS 执行加载的速度较快。The load with CTAS is fast. 将数据存储到临时表后,可以使用 INSERT...SELECT 将数据移到生产表。Once the data is in the staging table, use INSERT...SELECT to move the data to production tables.

表分区Table partitions

分区表存储根据数据范围存储表行并对其执行操作。A partitioned table stores and performs operations on the table rows according to data ranges. 例如,可以按日、月或年将某个表分区。For example, a table could be partitioned by day, month, or year. 可以通过分区消除来提高查询性能,否则查询扫描范围将限制为分区中的数据。You can improve query performance through partition elimination, which limits a query scan to data within a partition. 还可以通过分区切换来维护数据。You can also maintain the data through partition switching. 由于 SQL 数据仓库中的数据已经是分布式的,过多的分区可能会降低查询性能。Since the data in SQL Data Warehouse is already distributed, too many partitions can slow query performance. 有关详细信息,请参阅分区指南For more information, see Partitioning guidance. 以分区切换的方式切换成不为空的表分区时,若要截断现有数据,可考虑在 ALTER TABLE 语句中使用 TRUNCATE_TARGET 选项。When partition switching into table partitions that are not empty, consider using the TRUNCATE_TARGET option in your ALTER TABLE statement if the existing data is to be truncated. 以下代码将已转换的日常数据切换成 SalesFact,覆盖任何现有的数据。The below code switches in the transformed daily data into the SalesFact overwriting any existing data.

ALTER TABLE SalesFact_DailyFinalLoad SWITCH PARTITION 256 TO SalesFact PARTITION 256 WITH (TRUNCATE_TARGET = ON);  

列存储索引Columnstore indexes

默认情况下,SQL 数据仓库将表存储为聚集列存储索引。By default, SQL Data Warehouse stores a table as a clustered columnstore index. 对于大型表而言,这种数据存储形式可以实现较高的数据压缩率和查询性能。This form of data storage achieves high data compression and query performance on large tables. 聚集列存储索引通常是最佳选择,但在某些情况下,聚集索引或堆是适当的存储结构。The clustered columnstore index is usually the best choice, but in some cases a clustered index or a heap is the appropriate storage structure. 堆表可能特别适用于加载临时数据,例如将转换成最终表的临时表。A heap table can be especially useful for loading transient data, such as a staging table which is transformed into a final table.

有关列存储功能的列表,请参阅列存储索引的新增功能For a list of columnstore features, see What's new for columnstore indexes. 若要提高列存储索引性能,请参阅最大化列存储索引的行组质量To improve columnstore index performance, see Maximizing rowgroup quality for columnstore indexes.

统计信息Statistics

查询优化器在创建用于执行查询的计划时,使用列级统计信息。The query optimizer uses column-level statistics when it creates the plan for executing a query. 若要提高查询性能,必须有基于各个列(尤其是查询联接中使用的列)的统计信息。To improve query performance, it's important to have statistics on individual columns, especially columns used in query joins. 创建统计信息的过程是自动发生的。Creating statistics happens automatically. 但是,更新统计信息的过程不会自动发生。However, updating statistics does not happen automatically. 添加或更改了大量的行之后更新统计信息。Update statistics after a significant number of rows are added or changed. 例如,在执行加载后更新统计信息。For example, update statistics after a load. 有关详细信息,请参阅统计信息指南For more information, see Statistics guidance.

用于创建表的命令Commands for creating tables

可以创建一个新的空表。You can create a table as a new empty table. 还可以创建一个表并在其中填充 select 语句的结果。You can also create and populate a table with the results of a select statement. 下面是用于创建表的 T-SQL 命令。The following are the T-SQL commands for creating a table.

T-SQL 语句T-SQL Statement 说明Description
CREATE TABLECREATE TABLE 通过定义所有表列和选项来创建空表。Creates an empty table by defining all the table columns and options.
CREATE EXTERNAL TABLECREATE EXTERNAL TABLE 创建外部表。Creates an external table. 表定义存储在 SQL 数据仓库中。The definition of the table is stored in SQL Data Warehouse. 表数据存储在 Azure Blob 存储中。The table data is stored in Azure Blob storage.
CREATE TABLE AS SELECTCREATE TABLE AS SELECT 在新表中填充 select 语句的结果。Populates a new table with the results of a select statement. 表列和数据类型基于 select 语句的结果。The table columns and data types are based on the select statement results. 若要导入数据,此语句可从外部表中进行选择。To import data, this statement can select from an external table.
CREATE EXTERNAL TABLE AS SELECTCREATE EXTERNAL TABLE AS SELECT 通过将 select 语句的结果导出到外部位置,来创建新的外部表。Creates a new external table by exporting the results of a select statement to an external location. 该位置为 Azure Blob 存储。The location is Azure Blob storage.

使源数据与数据仓库相符Aligning source data with the data warehouse

从其他数据源加载数据可以填充数据仓库表。Data warehouse tables are populated by loading data from another data source. 若要成功执行加载操作,源数据中列的数目和数据类型必须与数据仓库中的表定义相符。To perform a successful load, the number and data types of the columns in the source data must align with the table definition in the data warehouse. 使数据相符可能是设计表时的最难部分。Getting the data to align might be the hardest part of designing your tables.

如果数据来自多个数据存储,可将数据载入数据仓库,并将其存储在集成表中。If data is coming from multiple data stores, you can bring the data into the data warehouse and store it in an integration table. 将数据存储到集成表中后,可以使用 SQL 数据仓库的功能来执行转换操作。Once data is in the integration table, you can use the power of SQL Data Warehouse to perform transformation operations. 准备好数据后,可以将其插入到生产表中。Once the data is prepared, you can insert it into production tables.

不支持的表功能Unsupported table features

SQL 数据仓库支持其他数据库所提供的许多(但不是全部)表功能。SQL Data Warehouse supports many, but not all, of the table features offered by other databases. 以下列表显示了 SQL 数据仓库不支持的一些表功能。The following list shows some of the table features that are not supported in SQL Data Warehouse.

表大小查询Table size queries

若要确定这 60 个分布中每个分布的表所占用的空间和行,一个简单的方法是使用 DBCC PDW_SHOWSPACEUSEDOne simple way to identify space and rows consumed by a table in each of the 60 distributions, is to use DBCC PDW_SHOWSPACEUSED.

DBCC PDW_SHOWSPACEUSED('dbo.FactInternetSales');

但是,使用 DBCC 命令可能会受到很大限制。However, using DBCC commands can be quite limiting. 动态管理视图 (DMV) 显示的信息比 DBCC 命令更详细。Dynamic management views (DMVs) show more detail than DBCC commands. 请先创建此视图。Start by creating this view.

CREATE VIEW dbo.vTableSizes
AS
WITH base
AS
(
SELECT 
 GETDATE()                                                             AS  [execution_time]
, DB_NAME()                                                            AS  [database_name]
, s.name                                                               AS  [schema_name]
, t.name                                                               AS  [table_name]
, QUOTENAME(s.name)+'.'+QUOTENAME(t.name)                              AS  [two_part_name]
, nt.[name]                                                            AS  [node_table_name]
, ROW_NUMBER() OVER(PARTITION BY nt.[name] ORDER BY (SELECT NULL))     AS  [node_table_name_seq]
, tp.[distribution_policy_desc]                                        AS  [distribution_policy_name]
, c.[name]                                                             AS  [distribution_column]
, nt.[distribution_id]                                                 AS  [distribution_id]
, i.[type]                                                             AS  [index_type]
, i.[type_desc]                                                        AS  [index_type_desc]
, nt.[pdw_node_id]                                                     AS  [pdw_node_id]
, pn.[type]                                                            AS  [pdw_node_type]
, pn.[name]                                                            AS  [pdw_node_name]
, di.name                                                              AS  [dist_name]
, di.position                                                          AS  [dist_position]
, nps.[partition_number]                                               AS  [partition_nmbr]
, nps.[reserved_page_count]                                            AS  [reserved_space_page_count]
, nps.[reserved_page_count] - nps.[used_page_count]                    AS  [unused_space_page_count]
, nps.[in_row_data_page_count] 
    + nps.[row_overflow_used_page_count] 
    + nps.[lob_used_page_count]                                        AS  [data_space_page_count]
, nps.[reserved_page_count] 
 - (nps.[reserved_page_count] - nps.[used_page_count]) 
 - ([in_row_data_page_count] 
         + [row_overflow_used_page_count]+[lob_used_page_count])       AS  [index_space_page_count]
, nps.[row_count]                                                      AS  [row_count]
from 
    sys.schemas s
INNER JOIN sys.tables t
    ON s.[schema_id] = t.[schema_id]
INNER JOIN sys.indexes i
    ON  t.[object_id] = i.[object_id]
    AND i.[index_id] <= 1
INNER JOIN sys.pdw_table_distribution_properties tp
    ON t.[object_id] = tp.[object_id]
INNER JOIN sys.pdw_table_mappings tm
    ON t.[object_id] = tm.[object_id]
INNER JOIN sys.pdw_nodes_tables nt
    ON tm.[physical_name] = nt.[name]
INNER JOIN sys.dm_pdw_nodes pn
    ON  nt.[pdw_node_id] = pn.[pdw_node_id]
INNER JOIN sys.pdw_distributions di
    ON  nt.[distribution_id] = di.[distribution_id]
INNER JOIN sys.dm_pdw_nodes_db_partition_stats nps
    ON nt.[object_id] = nps.[object_id]
    AND nt.[pdw_node_id] = nps.[pdw_node_id]
    AND nt.[distribution_id] = nps.[distribution_id]
LEFT OUTER JOIN (select * from sys.pdw_column_distribution_properties where distribution_ordinal = 1) cdp
    ON t.[object_id] = cdp.[object_id]
LEFT OUTER JOIN sys.columns c
    ON cdp.[object_id] = c.[object_id]
    AND cdp.[column_id] = c.[column_id]
)
, size
AS
(
SELECT
   [execution_time]
,  [database_name]
,  [schema_name]
,  [table_name]
,  [two_part_name]
,  [node_table_name]
,  [node_table_name_seq]
,  [distribution_policy_name]
,  [distribution_column]
,  [distribution_id]
,  [index_type]
,  [index_type_desc]
,  [pdw_node_id]
,  [pdw_node_type]
,  [pdw_node_name]
,  [dist_name]
,  [dist_position]
,  [partition_nmbr]
,  [reserved_space_page_count]
,  [unused_space_page_count]
,  [data_space_page_count]
,  [index_space_page_count]
,  [row_count]
,  ([reserved_space_page_count] * 8.0)                                 AS [reserved_space_KB]
,  ([reserved_space_page_count] * 8.0)/1000                            AS [reserved_space_MB]
,  ([reserved_space_page_count] * 8.0)/1000000                         AS [reserved_space_GB]
,  ([reserved_space_page_count] * 8.0)/1000000000                      AS [reserved_space_TB]
,  ([unused_space_page_count]   * 8.0)                                 AS [unused_space_KB]
,  ([unused_space_page_count]   * 8.0)/1000                            AS [unused_space_MB]
,  ([unused_space_page_count]   * 8.0)/1000000                         AS [unused_space_GB]
,  ([unused_space_page_count]   * 8.0)/1000000000                      AS [unused_space_TB]
,  ([data_space_page_count]     * 8.0)                                 AS [data_space_KB]
,  ([data_space_page_count]     * 8.0)/1000                            AS [data_space_MB]
,  ([data_space_page_count]     * 8.0)/1000000                         AS [data_space_GB]
,  ([data_space_page_count]     * 8.0)/1000000000                      AS [data_space_TB]
,  ([index_space_page_count]  * 8.0)                                   AS [index_space_KB]
,  ([index_space_page_count]  * 8.0)/1000                              AS [index_space_MB]
,  ([index_space_page_count]  * 8.0)/1000000                           AS [index_space_GB]
,  ([index_space_page_count]  * 8.0)/1000000000                        AS [index_space_TB]
FROM base
)
SELECT * 
FROM size
;

表空间摘要Table space summary

此查询返回行以及按表划分的空间。This query returns the rows and space by table. 使用此查询可以查看哪些表是最大的表,以及这些表是按轮循机制分布的、按复制分布的还是按哈希分布的。It allows you to see which tables are your largest tables and whether they are round-robin, replicated, or hash -distributed. 对于哈希分布式表,此查询会显示分布列。For hash-distributed tables, the query shows the distribution column.

SELECT 
     database_name
,    schema_name
,    table_name
,    distribution_policy_name
,      distribution_column
,    index_type_desc
,    COUNT(distinct partition_nmbr) as nbr_partitions
,    SUM(row_count)                 as table_row_count
,    SUM(reserved_space_GB)         as table_reserved_space_GB
,    SUM(data_space_GB)             as table_data_space_GB
,    SUM(index_space_GB)            as table_index_space_GB
,    SUM(unused_space_GB)           as table_unused_space_GB
FROM 
    dbo.vTableSizes
GROUP BY 
     database_name
,    schema_name
,    table_name
,    distribution_policy_name
,      distribution_column
,    index_type_desc
ORDER BY
    table_reserved_space_GB desc
;

按分布类型划分的表空间Table space by distribution type

SELECT 
     distribution_policy_name
,    SUM(row_count)                as table_type_row_count
,    SUM(reserved_space_GB)        as table_type_reserved_space_GB
,    SUM(data_space_GB)            as table_type_data_space_GB
,    SUM(index_space_GB)           as table_type_index_space_GB
,    SUM(unused_space_GB)          as table_type_unused_space_GB
FROM dbo.vTableSizes
GROUP BY distribution_policy_name
;

按索引类型划分的表空间Table space by index type

SELECT 
     index_type_desc
,    SUM(row_count)                as table_type_row_count
,    SUM(reserved_space_GB)        as table_type_reserved_space_GB
,    SUM(data_space_GB)            as table_type_data_space_GB
,    SUM(index_space_GB)           as table_type_index_space_GB
,    SUM(unused_space_GB)          as table_type_unused_space_GB
FROM dbo.vTableSizes
GROUP BY index_type_desc
;

分布空间摘要Distribution space summary

SELECT 
    distribution_id
,    SUM(row_count)                as total_node_distribution_row_count
,    SUM(reserved_space_MB)        as total_node_distribution_reserved_space_MB
,    SUM(data_space_MB)            as total_node_distribution_data_space_MB
,    SUM(index_space_MB)           as total_node_distribution_index_space_MB
,    SUM(unused_space_MB)          as total_node_distribution_unused_space_MB
FROM dbo.vTableSizes
GROUP BY     distribution_id
ORDER BY    distribution_id
;

后续步骤Next steps

为数据仓库创建表后,接下来可将数据载入该表。After creating the tables for your data warehouse, the next step is to load data into the table. 有关加载教程,请参阅将数据加载到 SQL 数据仓库For a loading tutorial, see Loading data to SQL Data Warehouse.