Azure SQL 数据仓库中的 CREATE TABLE AS SELECT (CTAS)CREATE TABLE AS SELECT (CTAS) in Azure SQL Data Warehouse

本文介绍如何使用 Azure SQL 数据仓库中的 CREATE TABLE AS SELECT (CTAS) T-SQL 语句来开发解决方案。This article explains the CREATE TABLE AS SELECT (CTAS) T-SQL statement in Azure SQL Data Warehouse for developing solutions. 本文还会提供代码示例。The article also provides code examples.

CREATE TABLE AS SELECTCREATE TABLE AS SELECT

CREATE TABLE AS SELECT (CTAS) 语句是所提供的最重要的 T-SQL 功能之一。The CREATE TABLE AS SELECT (CTAS) statement is one of the most important T-SQL features available. CTAS 是根据 SELECT 语句的输出创建新表的并行化操作。CTAS is a parallel operation that creates a new table based on the output of a SELECT statement. CTAS 是在表中创建和插入数据的最简单快速方法,只需使用一条命令。CTAS is the simplest and fastest way to create and insert data into a table with a single command.

SELECT..INTO 与CTASSELECT...INTO vs. CTAS

CTAS 是 SELECT...INTO 语句的自定义程度更高的版本。CTAS is a more customizable version of the SELECT...INTO statement.

下面是一个简单的 SELECT..INTO 示例:The following is an example of a simple SELECT...INTO:

SELECT *
INTO    [dbo].[FactInternetSales_new]
FROM    [dbo].[FactInternetSales]

SELECT...INTO 不允许在操作过程中更改分布方法或索引类型。SELECT...INTO doesn't allow you to change either the distribution method or the index type as part of the operation. 使用默认分布类型 ROUND_ROBIN 以及默认表结构 CLUSTERED COLUMNSTORE INDEX 创建 [dbo].[FactInternetSales_new]You create [dbo].[FactInternetSales_new] by using the default distribution type of ROUND_ROBIN, and the default table structure of CLUSTERED COLUMNSTORE INDEX.

另一方面,使用 CTAS 可以指定表数据的分布方式以及表结构类型。With CTAS, on the other hand, you can specify both the distribution of the table data as well as the table structure type. 将以上示例转换为 CTAS:To convert the previous example to CTAS:

CREATE TABLE [dbo].[FactInternetSales_new]
WITH
(
    DISTRIBUTION = ROUND_ROBIN
   ,CLUSTERED COLUMNSTORE INDEX
)
AS
SELECT  *
FROM    [dbo].[FactInternetSales]
;

Note

如果只是想要尝试更改 CTAS 操作中的索引并且源表经过哈希分布,请保留相同的分布列和数据类型。If you're only trying to change the index in your CTAS operation, and the source table is hash distributed, maintain the same distribution column and data type. 这会避免操作期间的交叉分布数据移动,从而更加高效。This avoids cross-distribution data movement during the operation, which is more efficient.

使用 CTAS 复制表Use CTAS to copy a table

CTAS 最常见的用途之一就是创建表副本,以便可以更改 DDL。Perhaps one of the most common uses of CTAS is creating a copy of a table in order to change the DDL. 假设你最初将表创建为 ROUND_ROBIN,现在想要将其更改在列中分布的表。Let's say you originally created your table as ROUND_ROBIN, and now want to change it to a table distributed on a column. 可以使用 CTAS 来更改分布列。CTAS is how you would change the distribution column. 还可以使用 CTAS 来更改分区、索引或列类型。You can also use CTAS to change partitioning, indexing, or column types.

假设你默认分布类型 ROUND_ROBIN 创建了此表,且未在 CREATE TABLE 中指定分布列。Let's say you created this table by using the default distribution type of ROUND_ROBIN, not specifying a distribution column in the CREATE TABLE.

CREATE TABLE FactInternetSales
(
    ProductKey int NOT NULL,
    OrderDateKey int NOT NULL,
    DueDateKey int NOT NULL,
    ShipDateKey int NOT NULL,
    CustomerKey int NOT NULL,
    PromotionKey int NOT NULL,
    CurrencyKey int NOT NULL,
    SalesTerritoryKey int NOT NULL,
    SalesOrderNumber nvarchar(20) NOT NULL,
    SalesOrderLineNumber tinyint NOT NULL,
    RevisionNumber tinyint NOT NULL,
    OrderQuantity smallint NOT NULL,
    UnitPrice money NOT NULL,
    ExtendedAmount money NOT NULL,
    UnitPriceDiscountPct float NOT NULL,
    DiscountAmount float NOT NULL,
    ProductStandardCost money NOT NULL,
    TotalProductCost money NOT NULL,
    SalesAmount money NOT NULL,
    TaxAmt money NOT NULL,
    Freight money NOT NULL,
    CarrierTrackingNumber nvarchar(25),
    CustomerPONumber nvarchar(25)
);

现在想要创建此表的新副本并包含 Clustered Columnstore Index,以便可以使用群集列存储表的性能。Now you want to create a new copy of this table, with a Clustered Columnstore Index, so you can take advantage of the performance of Clustered Columnstore tables. 你还想在 ProductKey 上分布此表(因为预期此列会发生联接)并在联接 ProductKey 期间避免数据移动。You also want to distribute this table on ProductKey, because you're anticipating joins on this column and want to avoid data movement during joins on ProductKey. 最后,你还希望在 OrderDateKey 上添加分区,以便通过删除旧分区来快速删除旧数据。Lastly, you also want to add partitioning on OrderDateKey, so you can quickly delete old data by dropping old partitions. 以下是可将旧表复制到新表的 CTAS 语句。Here is the CTAS statement, which copies your old table into a new table.

CREATE TABLE FactInternetSales_new
WITH
(
    CLUSTERED COLUMNSTORE INDEX,
    DISTRIBUTION = HASH(ProductKey),
    PARTITION
    (
        OrderDateKey RANGE RIGHT FOR VALUES
        (
        20000101,20010101,20020101,20030101,20040101,20050101,20060101,20070101,20080101,20090101,
        20100101,20110101,20120101,20130101,20140101,20150101,20160101,20170101,20180101,20190101,
        20200101,20210101,20220101,20230101,20240101,20250101,20260101,20270101,20280101,20290101
        )
    )
)
AS SELECT * FROM FactInternetSales;

最后,可以重命名表以切换到新表,并删除旧表。Finally, you can rename your tables, to swap in your new table and then drop your old table.

RENAME OBJECT FactInternetSales TO FactInternetSales_old;
RENAME OBJECT FactInternetSales_new TO FactInternetSales;

DROP TABLE FactInternetSales_old;

使用 CTAS 解决不支持的功能Use CTAS to work around unsupported features

还可以使用 CTAS 来解决下列多种不支持的功能。You can also use CTAS to work around a number of the unsupported features listed below. 经过证实,此方法往往可以带来帮助,因为代码不但能够兼容,而且通常可以在 SQL 数据仓库中更快速运行。This method can often prove helpful, because not only will your code be compliant, but it will often run faster on SQL Data Warehouse. 这种性能提升是完全并行化设计的结果。This performance is a result of its fully parallelized design. 方案包括:Scenarios include:

  • UPDATE 中的 ANSI JOINANSI JOINS on UPDATEs
  • DELETE 中的 ANSI JOINANSI JOINs on DELETEs
  • MERGE 语句MERGE statement

Tip

尽量考虑“CTAS 优先”。Try to think "CTAS first". 一般情况下,最好是使用 CTAS 来解决问题,即使因此要写入更多数据。Solving a problem by using CTAS is generally a good approach, even if you're writing more data as a result.

替换 Update 语句的 ANSI JoinANSI join replacement for update statements

你可能有一个复杂的更新。You might find that you have a complex update. 该更新使用 ANSI 联接语法来执行 UPDATE 或 DELETE,以将两个以上的表联接在一起。The update joins more than two tables together by using ANSI join syntax to perform the UPDATE or DELETE.

假设必须更新此表:Imagine you had to update this table:

CREATE TABLE [dbo].[AnnualCategorySales]
(    [EnglishProductCategoryName]    NVARCHAR(50)    NOT NULL
,    [CalendarYear]                    SMALLINT        NOT NULL
,    [TotalSalesAmount]                MONEY            NOT NULL
)
WITH
(
    DISTRIBUTION = ROUND_ROBIN
)
;

原始查询看起来可能类似于以下示例:The original query might have looked something like this example:

UPDATE    acs
SET        [TotalSalesAmount] = [fis].[TotalSalesAmount]
FROM    [dbo].[AnnualCategorySales]     AS acs
JOIN    (
        SELECT    [EnglishProductCategoryName]
        ,        [CalendarYear]
        ,        SUM([SalesAmount])                AS [TotalSalesAmount]
        FROM    [dbo].[FactInternetSales]        AS s
        JOIN    [dbo].[DimDate]                    AS d    ON s.[OrderDateKey]                = d.[DateKey]
        JOIN    [dbo].[DimProduct]                AS p    ON s.[ProductKey]                = p.[ProductKey]
        JOIN    [dbo].[DimProductSubCategory]    AS u    ON p.[ProductSubcategoryKey]    = u.[ProductSubcategoryKey]
        JOIN    [dbo].[DimProductCategory]        AS c    ON u.[ProductCategoryKey]        = c.[ProductCategoryKey]
        WHERE     [CalendarYear] = 2004
        GROUP BY
                [EnglishProductCategoryName]
        ,        [CalendarYear]
        ) AS fis
ON    [acs].[EnglishProductCategoryName]    = [fis].[EnglishProductCategoryName]
AND    [acs].[CalendarYear]                = [fis].[CalendarYear]
;

SQL 数据仓库不支持在 UPDATE 语句的 FROM 子句中使用 ANSI Join,因此,只有在修改上述示例之后才能使用它。SQL Data Warehouse doesn't support ANSI joins in the FROM clause of an UPDATE statement, so you can't use the previous example without modifying it.

可以使用 CTAS 和隐式联接的组合来替换上述示例:You can use a combination of a CTAS and an implicit join to replace the previous example:

-- Create an interim table
CREATE TABLE CTAS_acs
WITH (DISTRIBUTION = ROUND_ROBIN)
AS
SELECT    ISNULL(CAST([EnglishProductCategoryName] AS NVARCHAR(50)),0)    AS [EnglishProductCategoryName]
,        ISNULL(CAST([CalendarYear] AS SMALLINT),0)                         AS [CalendarYear]
,        ISNULL(CAST(SUM([SalesAmount]) AS MONEY),0)                        AS [TotalSalesAmount]
FROM    [dbo].[FactInternetSales]        AS s
JOIN    [dbo].[DimDate]                    AS d    ON s.[OrderDateKey]                = d.[DateKey]
JOIN    [dbo].[DimProduct]                AS p    ON s.[ProductKey]                = p.[ProductKey]
JOIN    [dbo].[DimProductSubCategory]    AS u    ON p.[ProductSubcategoryKey]    = u.[ProductSubcategoryKey]
JOIN    [dbo].[DimProductCategory]        AS c    ON u.[ProductCategoryKey]        = c.[ProductCategoryKey]
WHERE     [CalendarYear] = 2004
GROUP BY
        [EnglishProductCategoryName]
,        [CalendarYear]
;

-- Use an implicit join to perform the update
UPDATE  AnnualCategorySales
SET     AnnualCategorySales.TotalSalesAmount = CTAS_ACS.TotalSalesAmount
FROM    CTAS_acs
WHERE   CTAS_acs.[EnglishProductCategoryName] = AnnualCategorySales.[EnglishProductCategoryName]
AND     CTAS_acs.[CalendarYear]               = AnnualCategorySales.[CalendarYear]
;

--Drop the interim table
DROP TABLE CTAS_acs
;

替换 Delete 语句的 ANSI JoinANSI join replacement for delete statements

有时,删除数据的最佳方法是使用 CTAS,特别是对于使用 ANSI Join 语法的 DELETE 语句。Sometimes the best approach for deleting data is to use CTAS, especially for DELETE statements that use ANSI join syntax. 这是因为,SQL 数据仓库不支持在 DELETE 语句的 FROM 子句中使用 ANSI Join。This is because SQL Data Warehouse doesn't support ANSI joins in the FROM clause of a DELETE statement. 无需删除数据,可以选择要保留的数据。Rather than deleting the data, select the data you want to keep.

下面是转换后的 DELETE 语句示例:The following is an example of a converted DELETE statement:

CREATE TABLE dbo.DimProduct_upsert
WITH
(   Distribution=HASH(ProductKey)
,   CLUSTERED INDEX (ProductKey)
)
AS -- Select Data you want to keep
SELECT     p.ProductKey
,          p.EnglishProductName
,          p.Color
FROM       dbo.DimProduct p
RIGHT JOIN dbo.stg_DimProduct s
ON         p.ProductKey = s.ProductKey
;

RENAME OBJECT dbo.DimProduct        TO DimProduct_old;
RENAME OBJECT dbo.DimProduct_upsert TO DimProduct;

替换 Merge 语句Replace merge statements

使用 CTAS 至少可以部分替换 merge 语句。You can replace merge statements, at least in part, by using CTAS. 可以将 INSERTUPDATE 合并成单个语句。You can combine the INSERT and the UPDATE into a single statement. 任何已删除的记录将在 SELECT 语句中受到限制,以便从结果中省略。Any deleted records should be restricted from the SELECT statement to omit from the results.

以下示例适用于 UPSERTThe following example is for an UPSERT:

CREATE TABLE dbo.[DimProduct_upsert]
WITH
(   DISTRIBUTION = HASH([ProductKey])
,   CLUSTERED INDEX ([ProductKey])
)
AS
-- New rows and new versions of rows
SELECT      s.[ProductKey]
,           s.[EnglishProductName]
,           s.[Color]
FROM      dbo.[stg_DimProduct] AS s
UNION ALL  
-- Keep rows that are not being touched
SELECT      p.[ProductKey]
,           p.[EnglishProductName]
,           p.[Color]
FROM      dbo.[DimProduct] AS p
WHERE NOT EXISTS
(   SELECT  *
    FROM    [dbo].[stg_DimProduct] s
    WHERE   s.[ProductKey] = p.[ProductKey]
)
;

RENAME OBJECT dbo.[DimProduct]          TO [DimProduct_old];
RENAME OBJECT dbo.[DimProduct_upsert]  TO [DimProduct];

显式声明数据类型和输出是否可为 nullExplicitly state data type and nullability of output

迁移代码时,可能会遇到这种类型的编码模式:When migrating code, you might find you run across this type of coding pattern:

DECLARE @d decimal(7,2) = 85.455
,       @f float(24)    = 85.455

CREATE TABLE result
(result DECIMAL(7,2) NOT NULL
)
WITH (DISTRIBUTION = ROUND_ROBIN)

INSERT INTO result
SELECT @d*@f
;

你可能认为应该将此代码迁移到 CTAS,这是对的。You might think you should migrate this code to CTAS, and you'd be correct. 但是,这里有一个隐含的问题。However, there's a hidden issue here.

以下代码不会生成相同的结果:The following code doesn't yield the same result:

DECLARE @d decimal(7,2) = 85.455
,       @f float(24)    = 85.455
;

CREATE TABLE ctas_r
WITH (DISTRIBUTION = ROUND_ROBIN)
AS
SELECT @d*@f as result
;

请注意,列“result”沿用表达式的数据类型和可为 null 的值。Notice that the column "result" carries forward the data type and nullability values of the expression. 传递数据类型可能会导致值存在细微的差异。Carrying the data type forward can lead to subtle variances in values if you aren't careful.

尝试运行以下示例:Try this example:

SELECT result,result*@d
from result
;

SELECT result,result*@d
from ctas_r
;

为结果存储的值不相同。The value stored for result is different. 因为结果列中保留的值用于其他表达式,错误变得更加严重。As the persisted value in the result column is used in other expressions, the error becomes even more significant.

CTAS 结果的屏幕截图

这对于数据迁移非常重要。This is important for data migrations. 尽管第二个查询看起来更准确,但仍有一个问题。Even though the second query is arguably more accurate, there's a problem. 与源系统相比,此数据有所不同,会在迁移中造成完整性问题。The data would be different compared to the source system, and that leads to questions of integrity in the migration. 这是“错误”答案其实是正确答案的极少见情况之一!This is one of those rare cases where the "wrong" answer is actually the right one!

这两个结果之间存在差异的原因隐式类型转换。The reason we see a disparity between the two results is due to implicit type casting. 在第一个示例中,表定义了列定义。In the first example, the table defines the column definition. 插入行后,会发生隐式类型转换。When the row is inserted, an implicit type conversion occurs. 在第二个示例中,没有隐式类型转换,因为表达式定义了列的数据类型。In the second example, there is no implicit type conversion as the expression defines the data type of the column.

请注意,第二个示例中的列已定义为可为 Null 的列,而在第一个示例中还没有定义。Notice also that the column in the second example has been defined as a NULLable column, whereas in the first example it has not. 在第一个示例中创建表时,尚未显式定义列可为 null。When the table was created in the first example, column nullability was explicitly defined. 在第二个示例中,它已留给了表达式,默认情况下,这会导致 NULL 定义。In the second example, it was left to the expression, and by default would result in a NULL definition.

若要解决这些问题,必须在 CTAS 语句的 SELECT 部分中明确设置类型转换和可为 null 属性。To resolve these issues, you must explicitly set the type conversion and nullability in the SELECT portion of the CTAS statement. 无法在“CREATE TABLE”中设置这些属性。You can't set these properties in 'CREATE TABLE'. 以下示例演示如何修复代码:The following example demonstrates how to fix the code:

DECLARE @d decimal(7,2) = 85.455
,       @f float(24)    = 85.455

CREATE TABLE ctas_r
WITH (DISTRIBUTION = ROUND_ROBIN)
AS
SELECT ISNULL(CAST(@d*@f AS DECIMAL(7,2)),0) as result

注意以下事项:Note the following:

  • 可以使用 CAST 或 CONVERT。You can use CAST or CONVERT.
  • 使用 ISNULL 而不是 COALESCE 来强制可为 NULL 性。Use ISNULL, not COALESCE, to force NULLability. 请参阅以下注释。See the following note.
  • ISNULL 是最外层的函数。ISNULL is the outermost function.
  • ISNULL 的第二个部分是常量,即 0。The second part of the ISNULL is a constant, 0.

Note

若要正确设置可为 null 属性,必须使用 ISNULL 而不是 COALESCE。For the nullability to be correctly set, it's vital to use ISNULL and not COALESCE. COALESCE 不是确定性的函数,因此表达式的结果始终可为 Null。COALESCE is not a deterministic function, and so the result of the expression will always be NULLable. ISNULL 则不同。ISNULL is different. 它是确定性的。It's deterministic. 因此当 ISNULL 函数的第二个部分是常量或文本时,结果值将是 NOT NULL。Therefore, when the second part of the ISNULL function is a constant or a literal, the resulting value will be NOT NULL.

确保计算的完整性对于表分区切换而言也很重要。Ensuring the integrity of your calculations is also important for table partition switching. 假设已将此表定义为事实表:Imagine you have this table defined as a fact table:

CREATE TABLE [dbo].[Sales]
(
    [date]      INT     NOT NULL
,   [product]   INT     NOT NULL
,   [store]     INT     NOT NULL
,   [quantity]  INT     NOT NULL
,   [price]     MONEY   NOT NULL
,   [amount]    MONEY   NOT NULL
)
WITH
(   DISTRIBUTION = HASH([product])
,   PARTITION   (   [date] RANGE RIGHT FOR VALUES
                    (20000101,20010101,20020101
                    ,20030101,20040101,20050101
                    )
                )
)
;

但是,数量字段是计算的表达式。However, the amount field is a calculated expression. 它不是源数据的一部分。It isn't part of the source data.

若要创建分区数据集,可能需要使用以下代码:To create your partitioned dataset, you might want to use the following code:

CREATE TABLE [dbo].[Sales_in]
WITH
(   DISTRIBUTION = HASH([product])
,   PARTITION   (   [date] RANGE RIGHT FOR VALUES
                    (20000101,20010101
                    )
                )
)
AS
SELECT
    [date]
,   [product]
,   [store]
,   [quantity]
,   [price]
,   [quantity]*[price]  AS [amount]
FROM [stg].[source]
OPTION (LABEL = 'CTAS : Partition IN table : Create')
;

该查询会顺利运行。The query would run perfectly well. 但是,尝试执行分区切换时,会出现问题。The problem comes when you try to do the partition switch. 表定义不匹配。The table definitions don't match. 若要使表定义匹配,请修改 CTAS,以添加一个 ISNULL 函数用于保留列的可为 null 性属性。To make the table definitions match, modify the CTAS to add an ISNULL function to preserve the column's nullability attribute.

CREATE TABLE [dbo].[Sales_in]
WITH
(   DISTRIBUTION = HASH([product])
,   PARTITION   (   [date] RANGE RIGHT FOR VALUES
                    (20000101,20010101
                    )
                )
)
AS
SELECT
    [date]
,   [product]
,   [store]
,   [quantity]
,   [price]   
,   ISNULL(CAST([quantity]*[price] AS MONEY),0) AS [amount]
FROM [stg].[source]
OPTION (LABEL = 'CTAS : Partition IN table : Create');

可以看出,保持类型一致性并维护 CTAS 上的可为 null 属性是工程最佳做法。You can see that type consistency and maintaining nullability properties on a CTAS is an engineering best practice. 这有助于维护计算的完整性,而且还可确保分区切换能够实现。It helps to maintain integrity in your calculations, and also ensures that partition switching is possible.

CTAS 是 SQL 数据仓库中最重要的语句之一。CTAS is one of the most important statements in SQL Data Warehouse. 请确保全面了解该语句。Make sure you thoroughly understand it. 请参阅 CTAS 文档See the CTAS documentation.

后续步骤Next steps

有关更多开发技巧,请参阅开发概述For more development tips, see the development overview.