教程:加载纽约出租车数据集Tutorial: Load the New York Taxicab dataset

本教程使用 PolyBase 从全局 Azure Blob 存储帐户加载纽约出租车数据。This tutorial uses PolyBase to load New York Taxicab data from a global Azure blob storage account. 本教程使用 Azure 门户SQL Server Management Studio (SSMS) 执行以下操作:The tutorial uses the Azure portal and SQL Server Management Studio (SSMS) to:

  • 在 Azure 门户中创建 SQL 池Create a SQL pool in the Azure portal
  • 在 Azure 门户中设置服务器级防火墙规则Set up a server-level firewall rule in the Azure portal
  • 使用 SSMS 连接到数据仓库Connect to the data warehouse with SSMS
  • 创建专用于加载数据的用户Create a user designated for loading data
  • 在 Azure Blob 存储中为数据创建外部表Create external tables for data in Azure blob storage
  • 使用 CTAS T-SQL 语句将数据加载到数据仓库Use the CTAS T-SQL statement to load data into your data warehouse
  • 查看正在加载的数据的进度View the progress of data as it is loading
  • 创建新加载的数据的统计信息Create statistics on the newly loaded data

如果没有 Azure 订阅,请在开始前创建一个试用帐户If you don't have an Azure subscription, create a trial account before you begin.

准备阶段Before you begin

开始本教程之前,请下载并安装最新版 SQL Server Management Studio (SSMS)。Before you begin this tutorial, download and install the newest version of SQL Server Management Studio (SSMS).

登录到 Azure 门户Log in to the Azure portal

登录到 Azure 门户Log in to the Azure portal.

创建空数据库Create a blank database

SQL 池是使用定义的一组计算资源创建的。A SQL pool is created with a defined set of compute resources. 数据库在 Azure 资源组Azure SQL 逻辑服务器中创建。The database is created within an Azure resource group and in an Azure SQL logical server.

按照以下步骤创建空白的数据库。Follow these steps to create a blank database.

  1. 在 Azure 门户的左上角选择“创建资源”。 Select Create a resource in the upper left-hand corner of the Azure portal.

  2. 在“新建”页中选择“数据库”,然后在“新建”页上的“特色”下选择“Azure Synapse Analytics” 。Select Databases from the New page, and select Azure Synapse Analytics under Featured on the New page.

    创建数据仓库

  3. 使用以下信息填写窗体:Fill out the form with the following information:

    设置Setting 建议的值Suggested value 说明Description
    名称*Name* mySampleDataWarehousemySampleDataWarehouse 如需有效的数据库名称,请参阅 Database Identifiers(数据库标识符)。For valid database names, see Database Identifiers.
    订阅Subscription 你的订阅Your subscription 有关订阅的详细信息,请参阅订阅For details about your subscriptions, see Subscriptions.
    资源组Resource group MyResourceGroupmyResourceGroup 有关有效的资源组名称,请参阅 Naming rules and restrictions(命名规则和限制)。For valid resource group names, see Naming rules and restrictions.
    选择源Select source 空白数据库Blank database 指定创建空白数据库。Specifies to create a blank database. 请注意,数据仓库是一种数据库。Note, a data warehouse is one type of database.

    创建数据仓库

  4. 选择“服务器”,为新数据库创建并配置新服务器。 Select Server to create and configure a new server for your new database. 使用以下信息填写“新建服务器”窗体: Fill out the New server form with the following information:

    设置Setting 建议的值Suggested value 说明Description
    服务器名称Server name 任何全局唯一名称Any globally unique name 有关有效的服务器名称,请参阅 Naming rules and restrictions(命名规则和限制)。For valid server names, see Naming rules and restrictions.
    服务器管理员登录名Server admin login 任何有效的名称Any valid name 有关有效的登录名,请参阅 Database Identifiers(数据库标识符)。For valid login names, see Database Identifiers.
    密码Password 任何有效的密码Any valid password 密码必须至少有八个字符,且必须包含以下类别中的三个类别的字符:大写字符、小写字符、数字以及非字母数字字符。Your password must have at least eight characters and must contain characters from three of the following categories: upper case characters, lower case characters, numbers, and non-alphanumeric characters.
    位置Location 任何有效的位置Any valid location 有关区域的信息,请参阅 Azure 区域For information about regions, see Azure Regions.

    创建数据库服务器

  5. 选择“选择” 。Select Select.

  6. 选择“性能级别” ,指定数据仓库是 Gen1 还是 Gen2,以及数据仓库单位的数量。Select Performance level to specify whether the data warehouse is Gen1 or Gen2, and the number of data warehouse units.

  7. 对于本教程,请选择 SQL 池“Gen2” 。For this tutorial, select SQL pool Gen2. 滑块默认设置为“DW1000c” 。The slider is set to DW1000c by default. 请尝试上下移动滑块,以查看其工作原理。Try moving it up and down to see how it works.

    配置性能

  8. 选择“应用”。 Select Apply.

  9. 在预配边栏选项卡中,为空白数据库选择一个排序规则 。In the provisioning blade, select a collation for the blank database. 对于本教程,请使用默认值。For this tutorial, use the default value. 有关排序规则的详细信息,请参阅 Collations(排序规则)For more information about collations, see Collations

  10. 填写窗体后,请选择“创建”以预配数据库 。Now that you have completed the form, select Create to provision the database. 预配需要数分钟。Provisioning takes a few minutes.

  11. 在工具栏上,选择“通知”监视部署过程。 On the toolbar, select Notifications to monitor the deployment process.

    通知

创建服务器级防火墙规则Create a server-level firewall rule

服务器级别的防火墙会阻止外部应用程序和工具连接到服务器或服务器上的任何数据库。A firewall at the server-level that prevents external applications and tools from connecting to the server or any databases on the server. 要启用连接,可以添加防火墙规则,为特定 IP 地址启用连接。To enable connectivity, you can add firewall rules that enable connectivity for specific IP addresses. 按照以下步骤为客户端的 IP 地址创建服务器级防火墙规则Follow these steps to create a server-level firewall rule for your client's IP address.

Note

SQL 数据仓库通过端口 1433 进行通信。SQL Data Warehouse communicates over port 1433. 如果尝试从企业网络内部进行连接,则该网络的防火墙可能不允许经端口 1433 的出站流量。If you are trying to connect from within a corporate network, outbound traffic over port 1433 might not be allowed by your network's firewall. 如果是这样,则无法连接到 Azure SQL 数据库服务器,除非 IT 部门打开了端口 1433。If so, you cannot connect to your Azure SQL Database server unless your IT department opens port 1433.

  1. 部署完成后,在左侧菜单中选择“SQL 数据库” ,然后在“SQL 数据库” 页上选择“mySampleDatabase” 。After the deployment completes, select SQL databases from the left-hand menu and then select mySampleDatabase on the SQL databases page. 此时会打开数据库的概述页,显示完全限定的服务器名称(例如 mynewserver-20181129.database.chinacloudapi.cn),并且会提供进行进一步配置所需的选项。The overview page for your database opens, showing you the fully qualified server name (such as mynewserver-20181129.database.chinacloudapi.cn) and provides options for further configuration.

  2. 在后续的快速入门中,请复制此完全限定的服务器名称,将其用于连接到服务器及其数据库。Copy this fully qualified server name for use to connect to your server and its databases in subsequent quick starts. 然后选择服务器名称,打开服务器设置。Then select on the server name to open server settings.

    查找服务器名称

  3. 选择服务器名称,打开服务器设置。Select the server name to open server settings.

    服务器设置

  4. 选择“显示防火墙设置”。 Select Show firewall settings. 此时会打开 SQL 数据库服务器的“防火墙设置”页。 The Firewall settings page for the SQL Database server opens.

    服务器防火墙规则

  5. 在工具栏上选择“添加客户端 IP”,将当前的 IP 地址添加到新的防火墙规则。 Select Add client IP on the toolbar to add your current IP address to a new firewall rule. 防火墙规则可以针对单个 IP 地址或一系列 IP 地址打开端口 1433。A firewall rule can open port 1433 for a single IP address or a range of IP addresses.

  6. 选择“保存” 。Select Save. 此时会针对当前的 IP 地址创建服务器级防火墙规则,在逻辑服务器上打开 端口 1433。A server-level firewall rule is created for your current IP address opening port 1433 on the logical server.

  7. 选择“确定”,然后关闭“防火墙设置”页。 Select OK and then close the Firewall settings page.

现在,可使用此 IP 地址连接到 SQL Server 及其数据仓库。You can now connect to the SQL server and its data warehouses using this IP address. 可从 SQL Server Management Studio 或另一种所选工具进行连接。The connection works from SQL Server Management Studio or another tool of your choice. 连接时,请使用之前创建的 ServerAdmin 帐户。When you connect, use the ServerAdmin account you created previously.

Important

默认情况下,所有 Azure 服务都允许通过 SQL 数据库防火墙进行访问。By default, access through the SQL Database firewall is enabled for all Azure services. 在此页上选择“关闭”,然后选择“保存”,对所有 Azure 服务禁用防火墙。 Select OFF on this page and then select Save to disable the firewall for all Azure services.

获取完全限定的服务器名称Get the fully qualified server name

请在 Azure 门户中获取 SQL Server 的完全限定的服务器名称。Get the fully qualified server name for your SQL server in the Azure portal. 稍后,在连接到服务器时,将使用该完全限定的名称。Later you will use the fully qualified name when connecting to the server.

  1. 登录到 Azure 门户Log in to the Azure portal.

  2. 在左侧菜单中选择“Azure Synapse Analytics”,然后在“Azure Synapse Analytics”页上选择你的数据库。 Select Azure Synapse Analytics from the left-hand menu, and select your database on the Azure Synapse Analytics page.

  3. 在数据库的“Azure 门户”页的“概要”窗格中,找到并复制“服务器名称”。 In the Essentials pane in the Azure portal page for your database, locate and then copy the Server name. 在此示例中,完全限定的名称为 mynewserver-20181129.database.chinacloudapi.cn。In this example, the fully qualified name is mynewserver-20181129.database.chinacloudapi.cn.

    连接信息

以服务器管理员的身份连接到服务器Connect to the server as server admin

本部分使用 SQL Server Management Studio (SSMS) 来建立与 Azure SQL Server 的连接。This section uses SQL Server Management Studio (SSMS) to establish a connection to your Azure SQL server.

  1. 打开 SQL Server Management Studio。Open SQL Server Management Studio.

  2. 在“连接到服务器”对话框中,输入以下信息: In the Connect to Server dialog box, enter the following information:

    设置Setting 建议的值Suggested value 说明Description
    服务器类型Server type 数据库引擎Database engine 此值是必需的This value is required
    服务器名称Server name 完全限定的服务器名称The fully qualified server name 该名称应类似于:mynewserver-20181129.database.chinacloudapi.cnThe name should be something like this: mynewserver-20181129.database.chinacloudapi.cn.
    身份验证Authentication SQL Server 身份验证SQL Server Authentication SQL 身份验证是本教程中配置的唯一身份验证类型。SQL Authentication is the only authentication type that we have configured in this tutorial.
    登录Login 服务器管理员帐户The server admin account 这是在创建服务器时指定的帐户。This is the account that you specified when you created the server.
    密码Password 服务器管理员帐户的密码The password for your server admin account 这是在创建服务器时指定的密码。This is the password that you specified when you created the server.

    连接到服务器

  3. 选择“连接” 。Select Connect. 此时会在 SSMS 中打开“对象资源管理器”窗口。The Object Explorer window opens in SSMS.

  4. 在“对象资源管理器”中,展开“数据库” 。In Object Explorer, expand Databases. 然后展开“系统数据库” 和“master” ,查看 master 数据库中的对象。Then expand System databases and master to view the objects in the master database. 展开“mySampleDatabase” ,查看新数据库中的对象。Expand mySampleDatabase to view the objects in your new database.

    数据库对象

创建用于加载数据的用户Create a user for loading data

服务器管理员帐户用于执行管理操作,不适合对用户数据运行查询。The server admin account is meant to perform management operations, and is not suited for running queries on user data. 加载数据是一种内存密集型操作。Loading data is a memory-intensive operation. 内存最大值根据配置的数据仓库单位资源类定义。Memory maximums are defined according to the data warehouse units and resource class configured.

最好创建专用于加载数据的登录名和用户。It's best to create a login and user that is dedicated for loading data. 然后,将加载用户添加到启用相应最大内存分配的资源类Then add the loading user to a resource class that enables an appropriate maximum memory allocation.

由于当前是以服务器管理员的身份连接的,因此可以创建登录名和用户。Since you are currently connected as the server admin, you can create logins and users. 使用以下步骤创建名为 LoaderRC20 的登录名和用户。Use these steps to create a login and user called LoaderRC20. 然后将该用户分配到 staticrc20 资源类。Then assign the user to the staticrc20 resource class.

  1. 在 SSMS 中右键单击“master” ,然后在显示的下拉菜单中选择“新建查询” 。In SSMS, right-select master to show a drop-down menu, and choose New Query. 此时将打开一个新的查询窗口。A new query window opens.

    在 Master 中新建查询

  2. 在查询窗口中,输入以下 T-SQL 命令,创建一个名为 LoaderRC20 的登录名和用户,并将“a123STRONGpassword!”替换为自己的密码。In the query window, enter these T-SQL commands to create a login and user named LoaderRC20, substituting your own password for 'a123STRONGpassword!'.

    CREATE LOGIN LoaderRC20 WITH PASSWORD = 'a123STRONGpassword!';
    CREATE USER LoaderRC20 FOR LOGIN LoaderRC20;
    
  3. 选择“执行” 。Select Execute.

  4. 右键单击“mySampleDataWarehouse” ,然后选择“新建查询” 。Right-click mySampleDataWarehouse, and choose New Query. 此时会打开一个新的查询窗口。A new query Window opens.

    针对示例数据仓库的新查询

  5. 输入以下 T-SQL 命令,为 LoaderRC20 登录名创建名为 LoaderRC20 的数据库用户。Enter the following T-SQL commands to create a database user named LoaderRC20 for the LoaderRC20 login. 第二行为新用户授予对新数据仓库的 CONTROL 权限。The second line grants the new user CONTROL permissions on the new data warehouse. 这些权限类似于使用户成为数据库所有者。These permissions are similar to making the user the owner of the database. 第三行将新用户添加为 staticrc20 资源类的成员。The third line adds the new user as a member of the staticrc20 resource class.

    CREATE USER LoaderRC20 FOR LOGIN LoaderRC20;
    GRANT CONTROL ON DATABASE::[mySampleDataWarehouse] to LoaderRC20;
    EXEC sp_addrolemember 'staticrc20', 'LoaderRC20';
    
  6. 选择“执行” 。Select Execute.

以加载用户的身份连接到服务器Connect to the server as the loading user

加载数据的第一步是以 LoaderRC20 的身份登录。The first step toward loading data is to login as LoaderRC20.

  1. 在对象资源管理器中选择“连接” 下拉菜单,然后选择“数据库引擎” 。In Object Explorer, select the Connect drop down menu and select Database Engine. 此时会显示“连接到服务器” 对话框。The Connect to Server dialog box appears.

    使用新登录名连接

  2. 输入完全限定的服务器名称,并输入“LoaderRC20” 作为登录名。Enter the fully qualified server name, and enter LoaderRC20 as the Login. 输入 LoaderRC20 的密码。Enter your password for LoaderRC20.

  3. 选择“连接” 。Select Connect.

  4. 当连接准备就绪时,对象资源管理器中出现两个服务器连接。When your connection is ready, you will see two server connections in Object Explorer. 一个是作为 ServerAdmin 连接,另一个是作为 MedRCLogin 连接。One connection as ServerAdmin and one connection as MedRCLogin.

    连接成功

为示例数据创建外部表Create external tables for the sample data

已准备好开始将数据加载到新的数据仓库。You are ready to begin the process of loading data into your new data warehouse. 本教程说明如何使用外部表从 Azure 存储 Blob 加载纽约市出租车数据。This tutorial shows you how to use external tables to load New York City taxi cab data from an Azure Storage blob. 若要了解如何将数据置于 Azure Blob 存储或直接从源加载数据以供将来参考,请参阅加载概述For future reference, to learn how to get your data to Azure blob storage or to load it directly from your source, see the loading overview.

运行以下 SQL 脚本,并指定要加载的数据的相关信息。Run the following SQL scripts and specify information about the data you wish to load. 此信息包括数据所在的位置、数据内容的格式以及数据的表定义。This information includes where the data is located, the format of the contents of the data, and the table definition for the data.

  1. 在前一节中,已经以 LoaderRC20 的身份登录数据仓库。In the previous section, you logged into your data warehouse as LoaderRC20. 在 SSMS 中,右键单击 LoaderRC20 连接,然后选择“新建查询” 。In SSMS, right-click your LoaderRC20 connection and select New Query. 此时会显示一个新的查询窗口。A new query window appears.

    新的加载查询窗口

  2. 比较查询窗口和之间的图像。Compare your query window to the previous image. 验证新的查询窗口以 LoaderRC20 的身份运行,并对 MySampleDataWarehouse 数据库执行查询。Verify your new query window is running as LoaderRC20 and performing queries on your MySampleDataWarehouse database. 使用此查询窗口执行所有加载步骤。Use this query window to perform all of the loading steps.

  3. 创建 MySampleDataWarehouse 数据库的主密钥。Create a master key for the MySampleDataWarehouse database. 只需要为每个数据库创建一次主密钥。You only need to create a master key once per database.

    CREATE MASTER KEY;
    
  4. 运行以下 CREATE EXTERNAL DATA SOURCE 语句,定义 Azure Blob 的位置。Run the following CREATE EXTERNAL DATA SOURCE statement to define the location of the Azure blob. 这是外部出租车数据的位置。This is the location of the external taxi cab data. 若要运行已追加到查询窗口的命令,请突出显示要运行的命令,然后选择“执行” 。To run a command that you have appended to the query window, highlight the commands you wish to run and select Execute.

    CREATE EXTERNAL DATA SOURCE NYTPublic
    WITH
    (
        TYPE = Hadoop,
        LOCATION = 'wasbs://2013@nytaxiblob.blob.core.windows.net/'
    );
    
  5. 运行以下 CREATE EXTERNAL FILE FORMAT T-SQL 语句,指定外部数据文件的格式设置特征和选项。Run the following CREATE EXTERNAL FILE FORMAT T-SQL statement to specify formatting characteristics and options for the external data file. 此语句指定外部数据存储为文本,且值由管道 ("|") 字符分隔。This statement specifies the external data is stored as text and the values are separated by the pipe ('|') character. 使用 Gzip 压缩外部文件。The external file is compressed with Gzip.

    CREATE EXTERNAL FILE FORMAT uncompressedcsv
    WITH (
        FORMAT_TYPE = DELIMITEDTEXT,
        FORMAT_OPTIONS (
            FIELD_TERMINATOR = ',',
            STRING_DELIMITER = '',
            DATE_FORMAT = '',
            USE_TYPE_DEFAULT = False
        )
    );
    CREATE EXTERNAL FILE FORMAT compressedcsv
    WITH (
        FORMAT_TYPE = DELIMITEDTEXT,
        FORMAT_OPTIONS ( FIELD_TERMINATOR = '|',
            STRING_DELIMITER = '',
        DATE_FORMAT = '',
            USE_TYPE_DEFAULT = False
        ),
        DATA_COMPRESSION = 'org.apache.hadoop.io.compress.GzipCodec'
    );
    
  6. 运行以下 CREATE SCHEMA 语句,创建外部文件格式的架构。Run the following CREATE SCHEMA statement to create a schema for your external file format. 该架构提供组织即将创建的外部表的方法。The schema provides a way to organize the external tables you are about to create.

    CREATE SCHEMA ext;
    
  7. 创建外部表。Create the external tables. 表定义存储在数据仓库中,但表引用数据存储在 Azure Blob 存储中。The table definitions are stored in the data warehouse, but the tables reference data that is stored in Azure blob storage. 运行以下 T-SQL 命令以创建若干外部表,这些表都指向我们之前在外部数据源中定义的 Azure blob。Run the following T-SQL commands to create several external tables that all point to the Azure blob we defined previously in our external data source.

    CREATE EXTERNAL TABLE [ext].[Date]
    (
        [DateID] int NOT NULL,
        [Date] datetime NULL,
        [DateBKey] char(10) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
        [DayOfMonth] varchar(2) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
        [DaySuffix] varchar(4) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
        [DayName] varchar(9) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
        [DayOfWeek] char(1) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
        [DayOfWeekInMonth] varchar(2) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
        [DayOfWeekInYear] varchar(2) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
        [DayOfQuarter] varchar(3) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
        [DayOfYear] varchar(3) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
        [WeekOfMonth] varchar(1) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
        [WeekOfQuarter] varchar(2) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
        [WeekOfYear] varchar(2) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
        [Month] varchar(2) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
        [MonthName] varchar(9) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
        [MonthOfQuarter] varchar(2) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
        [Quarter] char(1) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
        [QuarterName] varchar(9) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
        [Year] char(4) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
        [YearName] char(7) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
        [MonthYear] char(10) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
        [MMYYYY] char(6) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
        [FirstDayOfMonth] date NULL,
        [LastDayOfMonth] date NULL,
        [FirstDayOfQuarter] date NULL,
        [LastDayOfQuarter] date NULL,
        [FirstDayOfYear] date NULL,
        [LastDayOfYear] date NULL,
        [IsHolidayUSA] bit NULL,
        [IsWeekday] bit NULL,
        [HolidayUSA] varchar(50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL
    )
    WITH
    (
        LOCATION = 'Date',
        DATA_SOURCE = NYTPublic,
        FILE_FORMAT = uncompressedcsv,
        REJECT_TYPE = value,
        REJECT_VALUE = 0
    );
    CREATE EXTERNAL TABLE [ext].[Geography]
    (
        [GeographyID] int NOT NULL,
        [ZipCodeBKey] varchar(10) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL,
        [County] varchar(50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
        [City] varchar(50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
        [State] varchar(50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
        [Country] varchar(50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
        [ZipCode] varchar(50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL
    )
    WITH
    (
        LOCATION = 'Geography',
        DATA_SOURCE = NYTPublic,
        FILE_FORMAT = uncompressedcsv,
        REJECT_TYPE = value,
        REJECT_VALUE = 0
    );
    CREATE EXTERNAL TABLE [ext].[HackneyLicense]
    (
        [HackneyLicenseID] int NOT NULL,
        [HackneyLicenseBKey] varchar(50) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL,
        [HackneyLicenseCode] varchar(50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL
    )
    WITH
    (
        LOCATION = 'HackneyLicense',
        DATA_SOURCE = NYTPublic,
        FILE_FORMAT = uncompressedcsv,
        REJECT_TYPE = value,
        REJECT_VALUE = 0
    );
    CREATE EXTERNAL TABLE [ext].[Medallion]
    (
        [MedallionID] int NOT NULL,
        [MedallionBKey] varchar(50) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL,
        [MedallionCode] varchar(50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL
    )
    WITH
    (
        LOCATION = 'Medallion',
        DATA_SOURCE = NYTPublic,
        FILE_FORMAT = uncompressedcsv,
        REJECT_TYPE = value,
        REJECT_VALUE = 0
    )
    ;  
    CREATE EXTERNAL TABLE [ext].[Time]
    (
        [TimeID] int NOT NULL,
        [TimeBKey] varchar(8) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL,
        [HourNumber] tinyint NOT NULL,
        [MinuteNumber] tinyint NOT NULL,
        [SecondNumber] tinyint NOT NULL,
        [TimeInSecond] int NOT NULL,
        [HourlyBucket] varchar(15) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL,
        [DayTimeBucketGroupKey] int NOT NULL,
        [DayTimeBucket] varchar(100) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL
    )
    WITH
    (
        LOCATION = 'Time',
        DATA_SOURCE = NYTPublic,
        FILE_FORMAT = uncompressedcsv,
        REJECT_TYPE = value,
        REJECT_VALUE = 0
    );
    CREATE EXTERNAL TABLE [ext].[Trip]
    (
        [DateID] int NOT NULL,
        [MedallionID] int NOT NULL,
        [HackneyLicenseID] int NOT NULL,
        [PickupTimeID] int NOT NULL,
        [DropoffTimeID] int NOT NULL,
        [PickupGeographyID] int NULL,
        [DropoffGeographyID] int NULL,
        [PickupLatitude] float NULL,
        [PickupLongitude] float NULL,
        [PickupLatLong] varchar(50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
        [DropoffLatitude] float NULL,
        [DropoffLongitude] float NULL,
        [DropoffLatLong] varchar(50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
        [PassengerCount] int NULL,
        [TripDurationSeconds] int NULL,
        [TripDistanceMiles] float NULL,
        [PaymentType] varchar(50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
        [FareAmount] money NULL,
        [SurchargeAmount] money NULL,
        [TaxAmount] money NULL,
        [TipAmount] money NULL,
        [TollsAmount] money NULL,
        [TotalAmount] money NULL
    )
    WITH
    (
        LOCATION = 'Trip2013',
        DATA_SOURCE = NYTPublic,
        FILE_FORMAT = compressedcsv,
        REJECT_TYPE = value,
        REJECT_VALUE = 0
    );
    CREATE EXTERNAL TABLE [ext].[Weather]
    (
        [DateID] int NOT NULL,
        [GeographyID] int NOT NULL,
        [PrecipitationInches] float NOT NULL,
        [AvgTemperatureFahrenheit] float NOT NULL
    )
    WITH
    (
        LOCATION = 'Weather',
        DATA_SOURCE = NYTPublic,
        FILE_FORMAT = uncompressedcsv,
        REJECT_TYPE = value,
        REJECT_VALUE = 0
    )
    ;
    
  8. 在对象资源管理器中,展开“mySampleDataWarehouse”,查看刚刚创建的外部表列表。In Object Explorer, expand mySampleDataWarehouse to see the list of external tables you just created.

    查看外部表

将数据加载到数据仓库Load the data into your data warehouse

本部分使用刚才定义的外部表从 Azure 存储 Blob 加载示例数据。This section uses the external tables you just defined to load the sample data from Azure Storage Blob.

Note

本教程直接将数据加载到最终表。This tutorial loads the data directly into the final table. 在生产环境中,通常使用 CREATE TABLE AS SELECT 将数据加载到临时表。In a production environment, you will usually use CREATE TABLE AS SELECT to load into a staging table. 数据在临时表中时,可以执行任何必要的转换。While data is in the staging table you can perform any necessary transformations. 要将临时表中的数据追加到生产表,可以使用 INSERT...SELECT 语句。To append the data in the staging table to a production table, you can use the INSERT...SELECT statement. 有关详细信息,请参阅将数据插入到生产表For more information, see Inserting data into a production table.

下面的脚本使用 CREATE TABLE AS SELECT (CTAS) T-SQL 语句将数据从 Azure 存储 Blob 加载到数据仓库中的新表。The script uses the CREATE TABLE AS SELECT (CTAS) T-SQL statement to load the data from Azure Storage Blob into new tables in your data warehouse. CTAS 基于 select 语句的结果创建新表。CTAS creates a new table based on the results of a select statement. 新表包含与 select 语句结果相同的列和数据类型。The new table has the same columns and data types as the results of the select statement. 当 select 语句从外部表进行选择时,数据会导入到数据仓库的关系表中。When the select statement selects from an external table, data is imported into a relational table in the data warehouse.

  1. 运行以下脚本,将数据加载到数据仓库中的新表。Run the following script to load the data into new tables in your data warehouse.

    CREATE TABLE [dbo].[Date]
    WITH
    (
        DISTRIBUTION = ROUND_ROBIN,
        CLUSTERED COLUMNSTORE INDEX
    )
    AS SELECT * FROM [ext].[Date]
    OPTION (LABEL = 'CTAS : Load [dbo].[Date]')
    ;
    CREATE TABLE [dbo].[Geography]
    WITH
    (
        DISTRIBUTION = ROUND_ROBIN,
        CLUSTERED COLUMNSTORE INDEX
    )
    AS
    SELECT * FROM [ext].[Geography]
    OPTION (LABEL = 'CTAS : Load [dbo].[Geography]')
    ;
    CREATE TABLE [dbo].[HackneyLicense]
    WITH
    (
        DISTRIBUTION = ROUND_ROBIN,
        CLUSTERED COLUMNSTORE INDEX
    )
    AS SELECT * FROM [ext].[HackneyLicense]
    OPTION (LABEL = 'CTAS : Load [dbo].[HackneyLicense]')
    ;
    CREATE TABLE [dbo].[Medallion]
    WITH
    (
        DISTRIBUTION = ROUND_ROBIN,
        CLUSTERED COLUMNSTORE INDEX
    )
    AS SELECT * FROM [ext].[Medallion]
    OPTION (LABEL = 'CTAS : Load [dbo].[Medallion]')
    ;
    CREATE TABLE [dbo].[Time]
    WITH
    (
        DISTRIBUTION = ROUND_ROBIN,
        CLUSTERED COLUMNSTORE INDEX
    )
    AS SELECT * FROM [ext].[Time]
    OPTION (LABEL = 'CTAS : Load [dbo].[Time]')
    ;
    CREATE TABLE [dbo].[Weather]
    WITH
    (
        DISTRIBUTION = ROUND_ROBIN,
        CLUSTERED COLUMNSTORE INDEX
    )
    AS SELECT * FROM [ext].[Weather]
    OPTION (LABEL = 'CTAS : Load [dbo].[Weather]')
    ;
    CREATE TABLE [dbo].[Trip]
    WITH
    (
        DISTRIBUTION = ROUND_ROBIN,
        CLUSTERED COLUMNSTORE INDEX
    )
    AS SELECT * FROM [ext].[Trip]
    OPTION (LABEL = 'CTAS : Load [dbo].[Trip]')
    ;
    
  2. 在加载数据的同时查看数据。View your data as it loads. 假设要加载几个 GB 的数据,并将其压缩成高性能群集列存储索引。You're loading several GBs of data and compressing it into highly performant clustered columnstore indexes. 运行以下使用动态管理视图 (DMV) 的查询以显示负载的状态。Run the following query that uses a dynamic management views (DMVs) to show the status of the load.

    SELECT
        r.command,
        s.request_id,
        r.status,
        count(distinct input_name) as nbr_files,
        sum(s.bytes_processed)/1024/1024/1024.0 as gb_processed
    FROM
        sys.dm_pdw_exec_requests r
        INNER JOIN sys.dm_pdw_dms_external_work s
        ON r.request_id = s.request_id
    WHERE
        r.[label] = 'CTAS : Load [dbo].[Date]' OR
        r.[label] = 'CTAS : Load [dbo].[Geography]' OR
        r.[label] = 'CTAS : Load [dbo].[HackneyLicense]' OR
        r.[label] = 'CTAS : Load [dbo].[Medallion]' OR
        r.[label] = 'CTAS : Load [dbo].[Time]' OR
        r.[label] = 'CTAS : Load [dbo].[Weather]' OR
        r.[label] = 'CTAS : Load [dbo].[Trip]'
    GROUP BY
        r.command,
        s.request_id,
        r.status
    ORDER BY
        nbr_files desc,
        gb_processed desc;
    
  3. 查看所有系统查询。View all system queries.

    SELECT * FROM sys.dm_pdw_exec_requests;
    
  4. 结果让人欣慰,数据已顺利载入数据仓库。Enjoy seeing your data nicely loaded into your data warehouse.

    查看已加载的表

使用托管标识进行身份验证,以便进行加载(可选)Authenticate using managed identities to load (optional)

使用 PolyBase 进行加载和通过托管标识进行身份验证是最安全的机制,可以让你通过 Azure 存储来利用虚拟网络服务终结点。Loading using PolyBase and authenticating through managed identities is the most secure mechanism and enables you to leverage virtual network service endpoints with Azure Storage.

先决条件Prerequisites

  1. 按照此指南安装 Azure PowerShell。Install Azure PowerShell using this guide.
  2. 如果有常规用途 v1 或 Blob 存储帐户,则必须先按照此指南将该帐户升级到常规用途 v2 帐户。If you have a general-purpose v1 or blob storage account, you must first upgrade to general-purpose v2 using this guide.
  3. 必须在 Azure 存储帐户的“防火墙和虚拟网络”设置菜单下 启用“允许受信任的 Microsoft 服务访问此存储帐户”。You must have Allow trusted Microsoft services to access this storage account turned on under Azure Storage account Firewalls and Virtual networks settings menu. 有关详细信息,请参阅此指南Refer to this guide for more information.

步骤Steps

  1. 在 PowerShell 中,将 SQL Server 注册到 Azure Active Directory (AAD) :In PowerShell, register your SQL server with Azure Active Directory (AAD):

    Connect-AzAccount -Environment AzureChinaCloud
    Select-AzSubscription -SubscriptionId your-subscriptionId
    Set-AzSqlServer -ResourceGroupName your-database-server-resourceGroup -ServerName your-database-servername -AssignIdentity
    
  2. 按照此指南创建常规用途 v2 存储帐户Create a general-purpose v2 Storage Account using this guide.

    Note

    如果有常规用途 v1 或 Blob 存储帐户,则必须先按照此指南将该帐户升级到 v2 帐户。If you have a general-purpose v1 or blob storage account, you must first upgrade to v2 using this guide.

  3. 在存储帐户下导航到“访问控制(标识和访问管理)”,然后选择“添加角色分配”。 Under your storage account, navigate to Access Control (IAM), and select Add role assignment. 向 SQL 数据库服务器分配“存储 Blob 数据参与者” RBAC 角色。Assign Storage Blob Data Contributor RBAC role to your SQL Database server.

    Note

    只有具有“所有者”特权的成员能够执行此步骤。Only members with Owner privilege can perform this step. 若要了解 Azure 资源的各种内置角色,请参阅此指南For various built-in roles for Azure resources, refer to this guide.

通过 Polybase 连接到 Azure 存储帐户:Polybase connectivity to the Azure Storage account:

  1. 使用 IDENTITY = '托管服务标识' 创建数据库范围的凭据:Create your database scoped credential with IDENTITY = 'Managed Service Identity':

    CREATE DATABASE SCOPED CREDENTIAL msi_cred WITH IDENTITY = 'Managed Service Identity';
    

    Note

    • 使用 Azure 存储访问密钥时,不需指定 SECRET,因为此机制在后台使用托管标识There is no need to specify SECRET with Azure Storage access key because this mechanism uses Managed Identity under the covers.
    • 使用 Azure 存储帐户时,IDENTITY 名称应该为 '托管服务标识' ,以便通过 PolyBase 进行连接。IDENTITY name should be 'Managed Service Identity' for PolyBase connectivity to work with Azure Storage account.
  2. 创建外部数据源,使用托管服务标识指定数据库范围的凭据。Create the External Data Source specifying the Database Scoped Credential with the Managed Service Identity.

  3. 使用外部表进行正常查询。Query as normal using external tables.

若要为 Azure Synapse Analytics 设置虚拟网络服务终结点,请参阅以下文档Refer to the following documentation if you'd like to set up virtual network service endpoints for Azure Synapse Analytics.

清理资源Clean up resources

需要为加载到数据仓库中的计算资源和数据付费。You are being charged for compute resources and data that you loaded into your data warehouse. 这些需要单独计费。These are billed separately.

  • 如果想要将数据保留在存储中,可以在不使用数据仓库时暂停计算。If you want to keep the data in storage, you can pause compute when you aren't using the data warehouse. 暂停计算后,仅需为数据存储付费,并且随时都可在准备处理数据时恢复计算。By pausing compute you will only be charge for data storage and you can resume the compute whenever you are ready to work with the data.
  • 如果不想支付将来的费用,则可以删除数据仓库。If you want to remove future charges, you can delete the data warehouse.

请按照下列步骤按需清理资源。Follow these steps to clean up resources as you desire.

  1. 登录到 Azure 门户,选择你的数据仓库。Log in to the Azure portal, select your data warehouse.

    清理资源

  2. 若要暂停计算,请选择“暂停” 按钮。To pause compute, select the Pause button. 暂停数据仓库后,可看到“启动” 按钮。When the data warehouse is paused, you will see a Start button. 若要恢复计算,请选择“启动” 。To resume compute, select Start.

  3. 若要删除数据仓库,以便不再为计算或存储付费,请选择“删除” 。To remove the data warehouse so you won't be charged for compute or storage, select Delete.

  4. 若要删除所创建的 SQL 服务器,请选择上图中的“mynewserver-20181129.database.chinacloudapi.cn”,然后选择“删除” 。To remove the SQL server you created, select mynewserver-20181129.database.chinacloudapi.cn in the previous image, and then select Delete. 请审慎执行此操作,因为删除服务器会删除分配给该服务器的所有数据库。Be careful with this as deleting the server will delete all databases assigned to the server.

  5. 若要删除资源组,请选择“myResourceGroup” ,然后选择“删除资源组” 。To remove the resource group, select myResourceGroup, and then select Delete resource group.

后续步骤Next steps

在本教程中,已学习了如何创建数据仓库以及用于加载数据的用户。In this tutorial, you learned how to create a data warehouse and create a user for loading data. 创建了外部表以定义 Azure 存储 Blob 中存储的数据的结构,然后使用 PolyBase CREATE TABLE AS SELECT 语句将数据加载到数据仓库。You created external tables to define the structure for data stored in Azure Storage Blob, and then used the PolyBase CREATE TABLE AS SELECT statement to load data into your data warehouse.

完成了以下操作:You did these things:

  • 在 Azure 门户中创建数据仓库Created a data warehouse in the Azure portal
  • 在 Azure 门户中设置服务器级防火墙规则Set up a server-level firewall rule in the Azure portal
  • 使用 SSMS 连接到数据仓库Connected to the data warehouse with SSMS
  • 创建了专用于加载数据的用户Created a user designated for loading data
  • 在 Azure 存储 Blob 中为数据创建了外部表Created external tables for data in Azure Storage Blob
  • 使用 CTAS T-SQL 语句将数据加载到数据仓库Used the CTAS T-SQL statement to load data into your data warehouse
  • 查看了正在加载的数据的进度Viewed the progress of data as it is loading
  • 创建了新加载的数据的统计信息Created statistics on the newly loaded data

转到开发概述,了解如何将现有数据库迁移到 Azure Synapse Analytics。Advance to the development overview to learn how to migrate an existing database to Azure Synapse Analytics.