将 Contoso 零售数据加载到 Synapse SQLLoad Contoso retail data to Synapse SQL

本教程介绍如何使用 PolyBase 和 T-SQL 命令将两个表从 Contoso 零售数据载入 Synapse SQL 数据仓库。In this tutorial, you learn to use PolyBase and T-SQL commands to load two tables from the Contoso retail data into a Synapse SQL data warehouse.

在本教程中,你会:In this tutorial you will:

  1. 配置 PolyBase 以从 Azure Blob 存储加载数据Configure PolyBase to load from Azure blob storage
  2. 将公共数据加载到数据库Load public data into your database
  3. 完成加载后执行优化。Perform optimizations after the load is finished.

准备阶段Before you begin

若要运行本教程,需要一个已包含 Synapse SQL 数据仓库的 Azure 帐户。To run this tutorial, you need an Azure account that already has a Synapse SQL data warehouse. 如果尚未预配数据仓库,请参阅创建数据仓库并设置服务器级防火墙规则If you don't have a data warehouse provisioned, see Create a data warehouse and set server-level firewall rule.

配置数据源Configure the data source

PolyBase 使用 T-SQL 外部对象,定义外部数据的位置和属性。PolyBase uses T-SQL external objects to define the location and attributes of the external data. 外部对象定义存储在 Synapse SQL 数据仓库中。The external object definitions are stored in your Synapse SQL data warehouse. 数据存储在外部。The data is stored externally.

创建凭据Create a credential

如果要加载 Contoso 公共数据,请跳过此步骤Skip this step if you're loading the Contoso public data. 不需要以安全方式访问公共数据,因为它已经可供任何人访问。You don't need secure access to the public data since it's already accessible to anyone.

如果使用本教程作为加载自己数据的模板,请不要跳过此步骤Don't skip this step if you're using this tutorial as a template for loading your own data. 若要通过凭据访问数据,请使用以下脚本创建数据库范围的凭据,To access data through a credential, use the following script to create a database-scoped credential. 并在定义数据源位置时使用该凭据。Then use it when defining the location of the data source.

-- A: Create a master key.
-- Only necessary if one does not already exist.
-- Required to encrypt the credential secret in the next step.

CREATE MASTER KEY;


-- B: Create a database scoped credential
-- IDENTITY: Provide any string, it is not used for authentication to Azure storage.
-- SECRET: Provide your Azure storage account key.


CREATE DATABASE SCOPED CREDENTIAL AzureStorageCredential
WITH
    IDENTITY = 'user',
    SECRET = '<azure_storage_account_key>'
;


-- C: Create an external data source
-- TYPE: HADOOP - PolyBase uses Hadoop APIs to access data in Azure blob storage.
-- LOCATION: Provide Azure storage account name and blob container name.
-- CREDENTIAL: Provide the credential created in the previous step.

CREATE EXTERNAL DATA SOURCE AzureStorage
WITH (
    TYPE = HADOOP,
    LOCATION = 'wasbs://<blob_container_name>@<azure_storage_account_name>.blob.core.chinacloudapi.cn',
    CREDENTIAL = AzureStorageCredential
);

创建外部数据源Create the external data source

使用此 CREATE EXTERNAL DATA SOURCE 命令存储数据的位置和数据类型。Use this CREATE EXTERNAL DATA SOURCE command to store the location of the data, and the data type.

CREATE EXTERNAL DATA SOURCE AzureStorage_west_public
WITH
(  
    TYPE = Hadoop
,   LOCATION = 'wasbs://contosoretaildw-tables@contosoretaildw.blob.core.windows.net/'
);

重要

如果选择公开 Azure Blob 存储容器,请记住,由于是数据所有者,因此在数据离开数据中心时,需要支付数据传出费用。If you choose to make your azure blob storage containers public, remember that as the data owner you will be charged for data egress charges when data leaves the data center.

配置数据格式Configure the data format

数据存储在 Azure Blob 存储中的文本文件内,每个字段以分隔符隔开。The data is stored in text files in Azure blob storage, and each field is separated with a delimiter. 在 SSMS 中运行以下 CREATE EXTERNAL FILE FORMAT 命令,以指定文本文件中数据的格式。In SSMS, run the following CREATE EXTERNAL FILE FORMAT command to specify the format of the data in the text files. Contoso 数据未压缩,以坚线分隔。The Contoso data is uncompressed and pipe delimited.

CREATE EXTERNAL FILE FORMAT TextFileFormat
WITH
(   FORMAT_TYPE = DELIMITEDTEXT
,    FORMAT_OPTIONS    (   FIELD_TERMINATOR = '|'
                    ,    STRING_DELIMITER = ''
                    ,    DATE_FORMAT         = 'yyyy-MM-dd HH:mm:ss.fff'
                    ,    USE_TYPE_DEFAULT = FALSE
                    )
);

创建外部表的架构Create the schema for the external tables

指定数据源和文件格式后,便可以开始创建外部表的架构了。Now that you've specified the data source and file format, you're ready to create the schema for the external tables.

若要创建一个位置用于存储数据库中的 Contoso 数据,请创建架构。To create a place to store the Contoso data in your database, create a schema.

CREATE SCHEMA [asb]
GO

创建外部表Create the external tables

运行以下脚本以创建 DimProduct 和 FactOnlineSales 外部表。Run the following script to create the DimProduct and FactOnlineSales external tables. 在此处,只需定义列名和数据类型,并以 Azure blob 存储文件的格式将其绑定到这些文件的位置。All you're doing here is defining column names and data types, and binding them to the location and format of the Azure blob storage files. 定义存储在数据仓库中,数据仍位于 Azure 存储 Blob 中。The definition is stored in the data warehouse and the data is still in the Azure Storage Blob.

LOCATION 参数是 Azure 存储 Blob 中根文件夹下的文件夹。The LOCATION parameter is the folder under the root folder in the Azure Storage Blob. 每个表位于不同的文件夹中。Each table is in a different folder.

--DimProduct
CREATE EXTERNAL TABLE [asb].DimProduct (
    [ProductKey] [int] NOT NULL,
    [ProductLabel] [nvarchar](255) NULL,
    [ProductName] [nvarchar](500) NULL,
    [ProductDescription] [nvarchar](400) NULL,
    [ProductSubcategoryKey] [int] NULL,
    [Manufacturer] [nvarchar](50) NULL,
    [BrandName] [nvarchar](50) NULL,
    [ClassID] [nvarchar](10) NULL,
    [ClassName] [nvarchar](20) NULL,
    [StyleID] [nvarchar](10) NULL,
    [StyleName] [nvarchar](20) NULL,
    [ColorID] [nvarchar](10) NULL,
    [ColorName] [nvarchar](20) NOT NULL,
    [Size] [nvarchar](50) NULL,
    [SizeRange] [nvarchar](50) NULL,
    [SizeUnitMeasureID] [nvarchar](20) NULL,
    [Weight] [float] NULL,
    [WeightUnitMeasureID] [nvarchar](20) NULL,
    [UnitOfMeasureID] [nvarchar](10) NULL,
    [UnitOfMeasureName] [nvarchar](40) NULL,
    [StockTypeID] [nvarchar](10) NULL,
    [StockTypeName] [nvarchar](40) NULL,
    [UnitCost] [money] NULL,
    [UnitPrice] [money] NULL,
    [AvailableForSaleDate] [datetime] NULL,
    [StopSaleDate] [datetime] NULL,
    [Status] [nvarchar](7) NULL,
    [ImageURL] [nvarchar](150) NULL,
    [ProductURL] [nvarchar](150) NULL,
    [ETLLoadID] [int] NULL,
    [LoadDate] [datetime] NULL,
    [UpdateDate] [datetime] NULL
)
WITH
(
    LOCATION='/DimProduct/'
,   DATA_SOURCE = AzureStorage_west_public
,   FILE_FORMAT = TextFileFormat
,   REJECT_TYPE = VALUE
,   REJECT_VALUE = 0
)
;

--FactOnlineSales
CREATE EXTERNAL TABLE [asb].FactOnlineSales
(
    [OnlineSalesKey] [int]  NOT NULL,
    [DateKey] [datetime] NOT NULL,
    [StoreKey] [int] NOT NULL,
    [ProductKey] [int] NOT NULL,
    [PromotionKey] [int] NOT NULL,
    [CurrencyKey] [int] NOT NULL,
    [CustomerKey] [int] NOT NULL,
    [SalesOrderNumber] [nvarchar](20) NOT NULL,
    [SalesOrderLineNumber] [int] NULL,
    [SalesQuantity] [int] NOT NULL,
    [SalesAmount] [money] NOT NULL,
    [ReturnQuantity] [int] NOT NULL,
    [ReturnAmount] [money] NULL,
    [DiscountQuantity] [int] NULL,
    [DiscountAmount] [money] NULL,
    [TotalCost] [money] NOT NULL,
    [UnitCost] [money] NULL,
    [UnitPrice] [money] NULL,
    [ETLLoadID] [int] NULL,
    [LoadDate] [datetime] NULL,
    [UpdateDate] [datetime] NULL
)
WITH
(
    LOCATION='/FactOnlineSales/'
,   DATA_SOURCE = AzureStorage_west_public
,   FILE_FORMAT = TextFileFormat
,   REJECT_TYPE = VALUE
,   REJECT_VALUE = 0
)
;

加载数据Load the data

可通过其他方式访问外部数据。There are different ways to access external data. 可以直接从外部表查询数据、将数据载入数据仓库中的新表,或者将外部数据添加到现有的数据仓库表。You can query data directly from the external tables, load the data into new tables in the data warehouse, or add external data to existing data warehouse tables.

创建新架构Create a new schema

CTAS 可创建包含数据的新表。CTAS creates a new table that contains data. 首先,请创建 Contoso 数据的架构。First, create a schema for the contoso data.

CREATE SCHEMA [cso]
GO

将数据加载到新表Load the data into new tables

若要将 Azure Blob 存储中的数据载入数据仓库表,请使用 CREATE TABLE AS SELECT (Transact-SQL) 语句。To load data from Azure blob storage into the data warehouse table, use the CREATE TABLE AS SELECT (Transact-SQL) statement. 使用 CTAS 加载会使用已创建的强类型外部表。Loading with CTAS leverages the strongly typed external tables you've created. 若要将数据载入新表,请对每个表使用一个 CTAS 语句。To load the data into new tables, use one CTAS statement per table.

CTAS 会创建新表,并在该表中填充 select 语句的结果。CTAS creates a new table and populates it with the results of a select statement. CTAS 将新表定义为包含与 select 语句结果相同的列和数据类型。CTAS defines the new table to have the same columns and data types as the results of the select statement. 如果选择了外部表中的所有列,新表将是外部表中的列和数据类型的副本。If you select all the columns from an external table, the new table will be a replica of the columns and data types in the external table.

在此示例中,我们以哈希分布表的形式创建维度表和事实表。In this example, we create both the dimension and the fact table as hash distributed tables.

SELECT GETDATE();
GO

CREATE TABLE [cso].[DimProduct]            WITH (DISTRIBUTION = HASH([ProductKey]  ) ) AS SELECT * FROM [asb].[DimProduct]             OPTION (LABEL = 'CTAS : Load [cso].[DimProduct]             ');
CREATE TABLE [cso].[FactOnlineSales]       WITH (DISTRIBUTION = HASH([ProductKey]  ) ) AS SELECT * FROM [asb].[FactOnlineSales]        OPTION (LABEL = 'CTAS : Load [cso].[FactOnlineSales]        ');

跟踪加载进度Track the load progress

可使用动态管理视图 (DMV) 跟踪加载操作的进度。You can track the progress of your load using dynamic management views (DMVs).

-- To see all requests
SELECT * FROM sys.dm_pdw_exec_requests;

-- To see a particular request identified by its label
SELECT * FROM sys.dm_pdw_exec_requests as r
WHERE r.[label] = 'CTAS : Load [cso].[DimProduct]             '
      OR r.[label] = 'CTAS : Load [cso].[FactOnlineSales]        '
;

-- To track bytes and files
SELECT
    r.command,
    s.request_id,
    r.status,
    count(distinct input_name) as nbr_files,
    sum(s.bytes_processed)/1024/1024/1024 as gb_processed
FROM
    sys.dm_pdw_exec_requests r
    inner join sys.dm_pdw_dms_external_work s
        on r.request_id = s.request_id
WHERE
    r.[label] = 'CTAS : Load [cso].[DimProduct]             '
    OR r.[label] = 'CTAS : Load [cso].[FactOnlineSales]        '
GROUP BY
    r.command,
    s.request_id,
    r.status
ORDER BY
    nbr_files desc,
    gb_processed desc;

优化列存储压缩Optimize columnstore compression

默认情况下,Synapse SQL 数据仓库以聚集列存储索引形式存储表。By default, the Synapse SQL data warehouse stores the table as a clustered columnstore index. 加载完成后,某些数据行可能未压缩到列存储中。After a load completes, some of the data rows might not be compressed into the columnstore. 有不同的原因会导致发生此问题:There are different reasons why this can happen. 若要了解详细信息,请参阅管理列存储索引To learn more, see manage columnstore indexes.

若要在加载后优化查询性能和列存储压缩,请重新生成表,以强制列存储索引压缩所有行。To optimize query performance and columnstore compression after a load, rebuild the table to force the columnstore index to compress all the rows.

SELECT GETDATE();
GO

ALTER INDEX ALL ON [cso].[DimProduct]               REBUILD;
ALTER INDEX ALL ON [cso].[FactOnlineSales]          REBUILD;

有关维护列存储索引的详细信息,请参阅管理列存储索引一文。For more information on maintaining columnstore indexes, see the manage columnstore indexes article.

优化统计信息Optimize statistics

最好是在加载之后马上创建单列统计信息。It's best to create single-column statistics immediately after a load. 如果知道某些列不会在查询谓词中使用,可以不创建有关这些列的统计信息。If you know certain columns aren't going to be in query predicates, you can skip creating statistics on those columns. 如果针对每个列创建单列统计信息,则重新生成所有统计信息可能需要花费很长时间。If you create single-column statistics on every column, it might take a long time to rebuild all the statistics.

如果决定针对每个表的每个列创建单列统计信息,可以使用 statistics(统计信息)一文中的存储过程代码示例 prc_sqldw_create_statsIf you decide to create single-column statistics on every column of every table, you can use the stored procedure code sample prc_sqldw_create_stats in the statistics article.

以下示例是创建统计信息的不错起点。The following example is a good starting point for creating statistics. 它会针对维度表中的每个列以及事实表中的每个联接列创建单列统计信息。It creates single-column statistics on each column in the dimension table, and on each joining column in the fact tables. 以后,随时可以将单列或多列统计信息添加到其他事实表列。You can always add single or multi-column statistics to other fact table columns later on.

CREATE STATISTICS [stat_cso_DimProduct_AvailableForSaleDate] ON [cso].[DimProduct]([AvailableForSaleDate]);
CREATE STATISTICS [stat_cso_DimProduct_BrandName] ON [cso].[DimProduct]([BrandName]);
CREATE STATISTICS [stat_cso_DimProduct_ClassID] ON [cso].[DimProduct]([ClassID]);
CREATE STATISTICS [stat_cso_DimProduct_ClassName] ON [cso].[DimProduct]([ClassName]);
CREATE STATISTICS [stat_cso_DimProduct_ColorID] ON [cso].[DimProduct]([ColorID]);
CREATE STATISTICS [stat_cso_DimProduct_ColorName] ON [cso].[DimProduct]([ColorName]);
CREATE STATISTICS [stat_cso_DimProduct_ETLLoadID] ON [cso].[DimProduct]([ETLLoadID]);
CREATE STATISTICS [stat_cso_DimProduct_ImageURL] ON [cso].[DimProduct]([ImageURL]);
CREATE STATISTICS [stat_cso_DimProduct_LoadDate] ON [cso].[DimProduct]([LoadDate]);
CREATE STATISTICS [stat_cso_DimProduct_Manufacturer] ON [cso].[DimProduct]([Manufacturer]);
CREATE STATISTICS [stat_cso_DimProduct_ProductDescription] ON [cso].[DimProduct]([ProductDescription]);
CREATE STATISTICS [stat_cso_DimProduct_ProductKey] ON [cso].[DimProduct]([ProductKey]);
CREATE STATISTICS [stat_cso_DimProduct_ProductLabel] ON [cso].[DimProduct]([ProductLabel]);
CREATE STATISTICS [stat_cso_DimProduct_ProductName] ON [cso].[DimProduct]([ProductName]);
CREATE STATISTICS [stat_cso_DimProduct_ProductSubcategoryKey] ON [cso].[DimProduct]([ProductSubcategoryKey]);
CREATE STATISTICS [stat_cso_DimProduct_ProductURL] ON [cso].[DimProduct]([ProductURL]);
CREATE STATISTICS [stat_cso_DimProduct_Size] ON [cso].[DimProduct]([Size]);
CREATE STATISTICS [stat_cso_DimProduct_SizeRange] ON [cso].[DimProduct]([SizeRange]);
CREATE STATISTICS [stat_cso_DimProduct_SizeUnitMeasureID] ON [cso].[DimProduct]([SizeUnitMeasureID]);
CREATE STATISTICS [stat_cso_DimProduct_Status] ON [cso].[DimProduct]([Status]);
CREATE STATISTICS [stat_cso_DimProduct_StockTypeID] ON [cso].[DimProduct]([StockTypeID]);
CREATE STATISTICS [stat_cso_DimProduct_StockTypeName] ON [cso].[DimProduct]([StockTypeName]);
CREATE STATISTICS [stat_cso_DimProduct_StopSaleDate] ON [cso].[DimProduct]([StopSaleDate]);
CREATE STATISTICS [stat_cso_DimProduct_StyleID] ON [cso].[DimProduct]([StyleID]);
CREATE STATISTICS [stat_cso_DimProduct_StyleName] ON [cso].[DimProduct]([StyleName]);
CREATE STATISTICS [stat_cso_DimProduct_UnitCost] ON [cso].[DimProduct]([UnitCost]);
CREATE STATISTICS [stat_cso_DimProduct_UnitOfMeasureID] ON [cso].[DimProduct]([UnitOfMeasureID]);
CREATE STATISTICS [stat_cso_DimProduct_UnitOfMeasureName] ON [cso].[DimProduct]([UnitOfMeasureName]);
CREATE STATISTICS [stat_cso_DimProduct_UnitPrice] ON [cso].[DimProduct]([UnitPrice]);
CREATE STATISTICS [stat_cso_DimProduct_UpdateDate] ON [cso].[DimProduct]([UpdateDate]);
CREATE STATISTICS [stat_cso_DimProduct_Weight] ON [cso].[DimProduct]([Weight]);
CREATE STATISTICS [stat_cso_DimProduct_WeightUnitMeasureID] ON [cso].[DimProduct]([WeightUnitMeasureID]);
CREATE STATISTICS [stat_cso_FactOnlineSales_CurrencyKey] ON [cso].[FactOnlineSales]([CurrencyKey]);
CREATE STATISTICS [stat_cso_FactOnlineSales_CustomerKey] ON [cso].[FactOnlineSales]([CustomerKey]);
CREATE STATISTICS [stat_cso_FactOnlineSales_DateKey] ON [cso].[FactOnlineSales]([DateKey]);
CREATE STATISTICS [stat_cso_FactOnlineSales_OnlineSalesKey] ON [cso].[FactOnlineSales]([OnlineSalesKey]);
CREATE STATISTICS [stat_cso_FactOnlineSales_ProductKey] ON [cso].[FactOnlineSales]([ProductKey]);
CREATE STATISTICS [stat_cso_FactOnlineSales_PromotionKey] ON [cso].[FactOnlineSales]([PromotionKey]);
CREATE STATISTICS [stat_cso_FactOnlineSales_StoreKey] ON [cso].[FactOnlineSales]([StoreKey]);

大功告成!Achievement unlocked!

已成功将公共数据加载到你的数据仓库。You have successfully loaded public data into your data warehouse. 干得不错!Great job!

现在可以开始查询表以探索数据。You can now start querying the tables to explore your data. 运行以下查询,找出每个品牌的总销售额:Run the following query to find out total sales per brand:

SELECT  SUM(f.[SalesAmount]) AS [sales_by_brand_amount]
,       p.[BrandName]
FROM    [cso].[FactOnlineSales] AS f
JOIN    [cso].[DimProduct]      AS p ON f.[ProductKey] = p.[ProductKey]
GROUP BY p.[BrandName]

后续步骤Next steps

若要加载完整的数据集,请运行 Microsoft SQL Server 示例存储库中的加载完整的 Contoso 零售数据仓库示例。To load the full data set, run the example load the full Contoso retail data warehouse from the Microsoft SQL Server samples repository. 有关更多开发技巧,请参阅数据仓库的设计决策和编码技术For more development tips, see Design decisions and coding techniques for data warehouses.