使用 IDENTITY 通过 AzureSynapse Analytics 中的专用 SQL 池创建代理键Using IDENTITY to create surrogate keys using dedicated SQL pool in AzureSynapse Analytics

本文介绍如何使用 IDENTITY 属性在专用 SQL 池中创建基于表的代理键的建议和示例。In this article, you'll find recommendations and examples for using the IDENTITY property to create surrogate keys on tables in dedicated SQL pool.

什么是代理键What is a surrogate key

基于表的代理键是一个列,其中包含针对每个行的唯一标识符。A surrogate key on a table is a column with a unique identifier for each row. 此键不是从表数据生成的。The key is not generated from the table data. 数据建模者想要在设计数据仓库模型时在其表上创建代理键。Data modelers like to create surrogate keys on their tables when they design data warehouse models. 可以使用 IDENTITY 属性轻松高效地实现此目标,而不会影响负载性能。You can use the IDENTITY property to achieve this goal simply and effectively without affecting load performance.

备注

在 Azure Synapse Analytics 中,每个发行版中的 IDENTITY 值都会独自增加,而不会与其他发行版中的 IDENTITY 值重叠。In Azure Synapse Analytics, the IDENTITY value increases on its own in each distribution and does not overlap with IDENTITY values in other distributions. 如果用户通过“SET IDENTITY_INSERT ON”显式插入重复值,或为 IDENTITY 重新设定种子,则 Synapse 中的 IDENTITY 值不能保证是唯一的。The IDENTITY value in Synapse is not guaranteed to be unique if the user explicitly inserts a duplicate value with "SET IDENTITY_INSERT ON" or reseeds IDENTITY. 有关详细信息,请参阅 CREATE TABLE (Transact-SQL) IDENTITY (Property)For details, see CREATE TABLE (Transact-SQL) IDENTITY (Property).

创建包含 IDENTITY 列的表Creating a table with an IDENTITY column

IDENTITY 属性设计为能够在专用 SQL 池的所有分布区中横向扩展,而不会影响加载性能。The IDENTITY property is designed to scale out across all the distributions in the dedicated SQL pool without affecting load performance. 因此,IDENTITY 的实现旨在实现这些目标。Therefore, the implementation of IDENTITY is oriented toward achieving these goals.

在首次使用类似以下语句的语法创建表时,可以将表定义为具有 IDENTITY 属性:You can define a table as having the IDENTITY property when you first create the table by using syntax that is similar to the following statement:

CREATE TABLE dbo.T1
(    C1 INT IDENTITY(1,1) NOT NULL
,    C2 INT NULL
)
WITH
(   DISTRIBUTION = HASH(C2)
,   CLUSTERED COLUMNSTORE INDEX
)
;

然后,可以使用 INSERT..SELECT 来填充表。You can then use INSERT..SELECT to populate the table.

本部分的此剩余部分重点介绍实现的细微差别,以帮助用户更全面地了解这些实现。This remainder of this section highlights the nuances of the implementation to help you understand them more fully.

值的分配Allocation of values

由于数据仓库的分布式体系结构,IDENTITY 属性不能保证代理值的分配顺序。The IDENTITY property doesn't guarantee the order in which the surrogate values are allocated due to the distributed architecture of the data warehouse. IDENTITY 属性设计为能够在专用 SQL 池的所有分布区中横向扩展,而不会影响加载性能。The IDENTITY property is designed to scale out across all the distributions in the dedicated SQL pool without affecting load performance.

以下示例对此做了演示:The following example is an illustration:

CREATE TABLE dbo.T1
(    C1 INT IDENTITY(1,1)    NOT NULL
,    C2 VARCHAR(30)                NULL
)
WITH
(   DISTRIBUTION = HASH(C2)
,   CLUSTERED COLUMNSTORE INDEX
)
;

INSERT INTO dbo.T1
VALUES (NULL);

INSERT INTO dbo.T1
VALUES (NULL);

SELECT *
FROM dbo.T1;

DBCC PDW_SHOWSPACEUSED('dbo.T1');

在前面的示例中,两行位于分布 1 中。In the preceding example, two rows landed in distribution 1. 第一行在列 C1 中包含代理值 1,且第二行包含代理值 61。The first row has the surrogate value of 1 in column C1, and the second row has the surrogate value of 61. 这两个值均由 IDENTITY 属性生成。Both of these values were generated by the IDENTITY property. 但是,值的分配不是连续的。However, the allocation of the values is not contiguous. 此行为是设计使然。This behavior is by design.

倾斜的数据Skewed data

数据类型的值范围在各个分布区之间是均匀分配的。The range of values for the data type are spread evenly across the distributions. 如果分布式表受偏斜数据的影响,则可用于数据类型的值范围可能会过早耗尽。If a distributed table suffers from skewed data, then the range of values available to the datatype can be exhausted prematurely. 例如,如果所有数据最终都会处于单个分发中,则表实际上只能访问六十分之一的数据类型值。For example, if all the data ends up in a single distribution, then effectively the table has access to only one-sixtieth of the values of the data type. 出于此原因,IDENTITY 属性仅限用于 INTBIGINT 数据类型。For this reason, the IDENTITY property is limited to INT and BIGINT data types only.

SELECT..INTOSELECT..INTO

在将现有的 IDENTITY 列选入新表时,新列将继承该 IDENTITY 属性,除非下列条件之一为 true:When an existing IDENTITY column is selected into a new table, the new column inherits the IDENTITY property, unless one of the following conditions is true:

  • SELECT 语句包含联接。The SELECT statement contains a join.
  • 使用 UNION 联接多个 SELECT 语句。Multiple SELECT statements are joined by using UNION.
  • IDENTITY 列在 SELECT 列表中多次列出。The IDENTITY column is listed more than one time in the SELECT list.
  • IDENTITY 列是表达式的一部分。The IDENTITY column is part of an expression.

如果其中的任一条件为 true,则创建属性为 NOT NULL 的列,而不继承 IDENTITY 属性。If any one of these conditions is true, the column is created NOT NULL instead of inheriting the IDENTITY property.

CREATE TABLE AS SELECTCREATE TABLE AS SELECT

CREATE TABLE AS SELECT (CTAS) 遵循 SELECT..INTO 中记录的相同 SQL Server 行为。CREATE TABLE AS SELECT (CTAS) follows the same SQL Server behavior that's documented for SELECT..INTO. 但是,不能指定语句的 CREATE TABLE 部分的列定义中的 IDENTITY 属性。However, you can't specify an IDENTITY property in the column definition of the CREATE TABLE part of the statement. 同样,也不能在 CTAS 的 SELECT 部分中使用 IDENTITY 函数。You also can't use the IDENTITY function in the SELECT part of the CTAS. 若要填充表,需要使用 CREATE TABLE 来定义后跟 INSERT..SELECT 的表来进行填充。To populate a table, you need to use CREATE TABLE to define the table followed by INSERT..SELECT to populate it.

将值显式插入到 IDENTITY 列Explicitly inserting values into an IDENTITY column

专用 SQL 池支持 SET IDENTITY_INSERT <your table> ON|OFF 语法。Dedicated SQL pool supports SET IDENTITY_INSERT <your table> ON|OFF syntax. 可以使用此语法显式将值插入到 IDENTITY 列中。You can use this syntax to explicitly insert values into the IDENTITY column.

许多数据建模者喜欢在其维度中为某些行使用预定义的负值。Many data modelers like to use predefined negative values for certain rows in their dimensions. 例如,-1 或“未知成员”行。An example is the -1 or "unknown member" row.

下一个脚本演示如何使用 SET IDENTITY_INSERT 显式添加此行:The next script shows how to explicitly add this row by using SET IDENTITY_INSERT:

SET IDENTITY_INSERT dbo.T1 ON;

INSERT INTO dbo.T1
(   C1
,   C2
)
VALUES (-1,'UNKNOWN')
;

SET IDENTITY_INSERT dbo.T1 OFF;

SELECT     *
FROM    dbo.T1
;

加载数据Loading data

IDENTITY 属性的存在对数据加载代码有一定影响。The presence of the IDENTITY property has some implications to your data-loading code. 本节重点介绍使用 IDENTITY 将数据加载到表中的一些基本模式。This section highlights some basic patterns for loading data into tables by using IDENTITY.

若要使用 IDENTITY 将数据加载到表中并生成代理键,请创建表,然后使用 INSERT..SELECT 或 INSERT..VALUES 执行加载。To load data into a table and generate a surrogate key by using IDENTITY, create the table and then use INSERT..SELECT or INSERT..VALUES to perform the load.

下面的示例重点介绍了基本模式:The following example highlights the basic pattern:

--CREATE TABLE with IDENTITY
CREATE TABLE dbo.T1
(    C1 INT IDENTITY(1,1)
,    C2 VARCHAR(30)
)
WITH
(   DISTRIBUTION = HASH(C2)
,   CLUSTERED COLUMNSTORE INDEX
)
;

--Use INSERT..SELECT to populate the table from an external table
INSERT INTO dbo.T1
(C2)
SELECT     C2
FROM    ext.T1
;

SELECT *
FROM   dbo.T1
;

DBCC PDW_SHOWSPACEUSED('dbo.T1');

备注

在将数据加载到包含 IDENTITY 列的表时,当前无法使用 CREATE TABLE AS SELECTIt's not possible to use CREATE TABLE AS SELECT currently when loading data into a table with an IDENTITY column.

若要详细了解如何加载数据,请参阅为专用 SQL 池设计提取、加载和转换 (ELT)加载最佳做法For more information on loading data, see Designing Extract, Load, and Transform (ELT) for dedicated SQL pool and Loading best practices.

系统视图System views

可以使用 sys.identity_columns 目录视图来标识具有 IDENTITY 属性的列。You can use the sys.identity_columns catalog view to identify a column that has the IDENTITY property.

为了更好地了解数据库架构,本示例演示如何将 sys.identity_column 与其他系统目录视图集成:To help you better understand the database schema, this example shows how to integrate sys.identity_column` with other system catalog views:

SELECT  sm.name
,       tb.name
,       co.name
,       CASE WHEN ic.column_id IS NOT NULL
             THEN 1
        ELSE 0
        END AS is_identity
FROM        sys.schemas AS sm
JOIN        sys.tables  AS tb           ON  sm.schema_id = tb.schema_id
JOIN        sys.columns AS co           ON  tb.object_id = co.object_id
LEFT JOIN   sys.identity_columns AS ic  ON  co.object_id = ic.object_id
                                        AND co.column_id = ic.column_id
WHERE   sm.name = 'dbo'
AND     tb.name = 'T1'
;

限制Limitations

以下情况不能使用 IDENTITY 属性:The IDENTITY property can't be used:

  • 当列数据类型不是 INT 或 BIGINT 时When the column data type is not INT or BIGINT
  • 当列也同样是分发键时When the column is also the distribution key
  • 当表是外部表时When the table is an external table

专用 SQL 池不支持以下相关函数:The following related functions are not supported in dedicated SQL pool:

常见任务Common tasks

本部分提供在使用 IDENTITY 列时可用于执行常见任务的一些示例代码。This section provides some sample code you can use to perform common tasks when you work with IDENTITY columns.

在下列所有任务中,C1 列都是 IDENTITY。Column C1 is the IDENTITY in all the following tasks.

查找表的最高已分配值Find the highest allocated value for a table

可以使用 MAX() 函数来确定为分布式表分配的最高值:Use the MAX() function to determine the highest value allocated for a distributed table:

SELECT MAX(C1)
FROM dbo.T1

查找 IDENTITY 属性的种子和增量Find the seed and increment for the IDENTITY property

目录视图可用于通过使用以下查询来发现表的标识增量和种子配置值:You can use the catalog views to discover the identity increment and seed configuration values for a table by using the following query:

SELECT  sm.name
,       tb.name
,       co.name
,       ic.seed_value
,       ic.increment_value
FROM        sys.schemas AS sm
JOIN        sys.tables  AS tb           ON  sm.schema_id = tb.schema_id
JOIN        sys.columns AS co           ON  tb.object_id = co.object_id
JOIN        sys.identity_columns AS ic  ON  co.object_id = ic.object_id
                                        AND co.column_id = ic.column_id
WHERE   sm.name = 'dbo'
AND     tb.name = 'T1'
;

后续步骤Next steps