Data loading strategies for Synapse SQL pool

Traditional SMP SQL pools use an Extract, Transform, and Load (ETL) process for loading data. Synapse SQL pool, within Azure Synapse Analytics, has a massively parallel processing (MPP) architecture that takes advantage of the scalability and flexibility of compute and storage resources.

Using an Extract, Load, and Transform (ELT) process leverages MPP and eliminates the resources needed to transform the data before loading.

While SQL pool supports many loading methods, including popular SQL Server options such as bcp and the SqlBulkCopy API, the fastest and most scalable way to load data is through PolyBase external tables and the COPY statement (preview).

With PolyBase and the COPY statement, you can access external data stored in Azure Blob storage via the T-SQL language. For the most flexibility when loading, we recommend using the COPY statement.

Note

The COPY statement is currently in public preview.

What is ELT?

Extract, Load, and Transform (ELT) is a process by which data is extracted from a source system, loaded into a SQL pool, and then transformed.

The basic steps for implementing ELT are:

  1. Extract the source data into text files.
  2. Land the data into Azure Blob storage.
  3. Prepare the data for loading.
  4. Load the data into staging tables with PolyBase or the COPY command.
  5. Transform the data.
  6. Insert the data into production tables.

For a PolyBase loading tutorial, see Use PolyBase to load data from Azure blob storage.

For more information, see the Loading patterns blog.

1. Extract the source data into text files

Getting data out of your source system depends on the storage location. The goal is to move the data into delimited text or CSV files that PolyBase and the COPY statement support.

PolyBase and COPY external file formats

With PolyBase and the COPY statement, you can load data from UTF-8 and UTF-16 encoded delimited text or CSV files. In addition to delimited text and CSV files, they load from Hadoop file formats such as ORC and Parquet. PolyBase and the COPY statement can also load data from Gzip and Snappy compressed files.

Extended ASCII, fixed-width format, and nested formats such as WinZip or XML aren't supported. If you're exporting from SQL Server, you can use the bcp command-line tool to export the data into delimited text files.
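As a sketch of such an export, the bcp invocation below writes a table out in character mode with a pipe field terminator. The server, database, and table names are placeholders, not values from this article:

```shell
# Export dbo.FactSales to a pipe-delimited text file.
# -c  character mode   -t  field terminator
# -S  server           -d  database          -T  Windows authentication
bcp dbo.FactSales out "C:\export\FactSales.txt" -c -t "|" -S myserver -d SalesDB -T
```

Pick a terminator (here `|`) that never appears in the data itself, as the formatting guidance below emphasizes.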

2. Land the data into Azure Blob storage

To land the data in Azure storage, move it to Azure Blob storage. The data should be stored in text files, and PolyBase and the COPY statement can load from that location.

Tools and services you can use to move data to Azure Storage:

  • Azure ExpressRoute service enhances network throughput, performance, and predictability. ExpressRoute is a service that routes your data through a dedicated private connection to Azure. ExpressRoute connections do not route data through the public internet, and they offer more reliability, faster speeds, lower latencies, and higher security than typical connections over the public internet.
  • The AzCopy utility moves data to Azure Storage over the public internet. This works if your data sizes are less than 10 TB. To perform loads on a regular basis with AzCopy, test the network speed to see if it is acceptable.
  • Azure Data Factory (ADF) has a gateway that you can install on your local server. Then you can create a pipeline to move data from your local server up to Azure Storage. To use Data Factory with SQL Analytics, see Loading data for SQL Analytics.
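For the AzCopy route above, a minimal upload might look like the following. The storage account, container, and SAS token are placeholders:

```shell
# Upload all exported text files to a Blob storage container.
# Replace <storageaccount>, <container>, and <SAS-token> with your own values.
azcopy copy "C:\export\*.txt" "https://<storageaccount>.blob.core.windows.net/<container>?<SAS-token>"
```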

3. Prepare the data for loading

You might need to prepare and clean the data in your storage account before loading. Data preparation can be performed while your data is in the source, as you export the data to text files, or after the data is in Azure Storage. It is easiest to work with the data as early in the process as possible.

Define external tables

If you are using PolyBase, you need to define external tables in your SQL pool before loading. External tables are not required by the COPY statement. PolyBase uses external tables to define and access the data in Azure Storage.

An external table is similar to a database view. The external table contains the table schema and points to data that is stored outside the SQL pool.

Defining external tables involves specifying the data source, the format of the text files, and the table definitions. The relevant T-SQL syntax references are CREATE EXTERNAL DATA SOURCE, CREATE EXTERNAL FILE FORMAT, and CREATE EXTERNAL TABLE.
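A minimal sketch of the three external objects for a PolyBase load follows. All names, the storage URL, and the column list are placeholders, and a database-scoped credential (here `BlobCredential`) holding a storage key or SAS is assumed to exist already:

```sql
-- Hypothetical names throughout; adjust the location, credential,
-- and columns to match your own storage account and files.
CREATE EXTERNAL DATA SOURCE AzureBlobSource
WITH (
    TYPE = HADOOP,
    LOCATION = 'wasbs://<container>@<storageaccount>.blob.core.windows.net',
    CREDENTIAL = BlobCredential
);

CREATE EXTERNAL FILE FORMAT PipeDelimitedText
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (FIELD_TERMINATOR = '|')
);

-- The external table's schema must match the layout of the text files.
CREATE EXTERNAL TABLE ext.FactSales (
    SaleId   BIGINT,
    SaleDate DATE,
    Amount   DECIMAL(18, 2)
)
WITH (
    LOCATION = '/factsales/',
    DATA_SOURCE = AzureBlobSource,
    FILE_FORMAT = PipeDelimitedText
);
```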

When loading Parquet, the SQL data type mapping is:

Parquet Data Type    SQL Data Type
tinyint              tinyint
smallint             smallint
int                  int
bigint               bigint
boolean              bit
double               float
float                real
double               money
double               smallmoney
string               nchar
string               nvarchar
string               char
string               varchar
binary               binary
binary               varbinary
timestamp            date
timestamp            smalldatetime
timestamp            datetime2
timestamp            datetime
timestamp            time
date                 date
decimal              decimal

For an example of creating external objects, see the Create external tables step in the loading tutorial.

Format text files

If you are using PolyBase, the external objects you define need to align the rows of the text files with the external table and file format definition. The data in each row of the text file must align with the table definition. To format the text files:

  • If your data is coming from a non-relational source, you need to transform it into rows and columns. Whether the data is from a relational or non-relational source, it must be transformed to align with the column definitions for the table into which you plan to load it.
  • Format data in the text file to align with the columns and data types in the destination table. Misalignment between data types in the external text files and the SQL pool table causes rows to be rejected during the load.
  • Separate fields in the text file with a terminator. Be sure to use a character or a character sequence that isn't found in your source data, and use the terminator you specified with CREATE EXTERNAL FILE FORMAT.

4. Load the data using PolyBase or the COPY statement

It is best practice to load data into a staging table. Staging tables allow you to handle errors without interfering with the production tables. A staging table also gives you the opportunity to use the SQL pool MPP for data transformations before inserting the data into production tables.

The table will need to be pre-created when loading into a staging table with COPY.
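A hedged sketch of both steps, creating the staging table and then loading it with COPY, is shown below. The table name, storage URL, and SAS token are placeholders:

```sql
-- COPY requires the staging table to exist first.
-- ROUND_ROBIN distribution and HEAP are common choices for staging tables.
CREATE TABLE stage.FactSales (
    SaleId   BIGINT,
    SaleDate DATE,
    Amount   DECIMAL(18, 2)
)
WITH (DISTRIBUTION = ROUND_ROBIN, HEAP);

-- Load pipe-delimited text files from Blob storage into the staging table.
COPY INTO stage.FactSales
FROM 'https://<storageaccount>.blob.core.windows.net/<container>/factsales/'
WITH (
    FILE_TYPE = 'CSV',
    FIELDTERMINATOR = '|',
    CREDENTIAL = (IDENTITY = 'Shared Access Signature', SECRET = '<SAS-token>')
);
```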

Options for loading with PolyBase and the COPY statement

To load data with PolyBase, you can use any of these loading options:

  • PolyBase with T-SQL works well when your data is in Azure Blob storage. It gives you the most control over the loading process, but also requires you to define external data objects. The other methods define these objects behind the scenes as you map source tables to destination tables. To orchestrate T-SQL loads, you can use Azure Data Factory or SSIS.
  • PolyBase with SSIS works well when your source data is in SQL Server, either on-premises or in the cloud. SSIS defines the source-to-destination table mappings and also orchestrates the load. If you already have SSIS packages, you can modify the packages to work with the new data warehouse destination.
  • PolyBase and the COPY statement with Azure Data Factory (ADF) is another orchestration option. ADF defines a pipeline and schedules jobs.

Other loading options

In addition to PolyBase and the COPY statement, you can use bcp or the SqlBulkCopy API. bcp loads directly to the database without going through Azure Blob storage, and is intended only for small loads.

Note

The load performance of these options is slower than PolyBase and the COPY statement.

5. Transform the data

While data is in the staging table, perform the transformations that your workload requires. Then move the data into a production table.
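What those transformations look like depends entirely on your workload; a trivial hypothetical example (table and column names invented for illustration) might be:

```sql
-- Example cleanup in the staging table before promotion to production:
-- round amounts to two decimal places and drop rows with no amount.
UPDATE stage.FactSales
SET Amount = ROUND(Amount, 2);

DELETE FROM stage.FactSales
WHERE Amount IS NULL;
```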

6. Insert the data into production tables

The INSERT INTO ... SELECT statement moves the data from the staging table to the permanent table.
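As a sketch, with the same placeholder table names used throughout (a production table `prod.FactSales` is assumed to exist with a matching schema):

```sql
-- Promote transformed rows from the staging table to the production table.
INSERT INTO prod.FactSales (SaleId, SaleDate, Amount)
SELECT SaleId, SaleDate, Amount
FROM stage.FactSales;
```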

As you design an ETL process, try running the process on a small test sample. Try extracting 1000 rows from the table to a file, move it to Azure, and then try loading it into a staging table.

Next steps

For loading guidance, see Guidance for loading data.