针对 Synapse SQL 池的数据加载策略Data loading strategies for Synapse SQL pool

传统的 SMP SQL 池通过提取、转换和加载 (ETL) 过程来加载数据。Traditional SMP SQL pools use an Extract, Transform, and Load (ETL) process for loading data. Azure Synapse Analytics 中的 Synapse SQL 池具有大规模并行处理 (MPP) 体系结构,可以利用计算和存储资源的可伸缩性和灵活性。Synapse SQL pool, within Azure Synapse Analytics, has a massively parallel processing (MPP) architecture that takes advantage of the scalability and flexibility of compute and storage resources.

可以通过提取、加载和转换 (ELT) 过程来利用 MPP,无需在加载数据之前投入资源来转换数据。Using an Extract, Load, and Transform (ELT) process leverages MPP and eliminates the resources needed for data transformation prior to loading.

虽然 SQL 池支持包括常用 SQL Server 选项(例如 bcpSqlBulkCopy API)在内的许多加载方法,但最快且最具可伸缩性的数据加载方法是使用 PolyBase 外部表和 COPY 语句(预览版)。While SQL pool supports many loading methods, including popular SQL Server options such as bcp and the SqlBulkCopy API, the fastest and most scalable way to load data is through PolyBase external tables and the COPY statement (preview).

使用 PolyBase 和 COPY 语句可以通过 T-SQL 语言访问存储在 Azure Blob 存储中的外部数据。With PolyBase and the COPY statement, you can access external data stored in Azure Blob storage via the T-SQL language. 为了在加载时获得最大的灵活性,建议使用 COPY 语句。For the most flexibility when loading, we recommend using the COPY statement.

备注

COPY 语句目前为公共预览版功能。The COPY statement is currently in public preview.

什么是 ELT?What is ELT?

提取、加载和转换 (ELT) 是指将数据从源系统提取并加载到 SQL 池然后再进行转换的过程。Extract, Load, and Transform (ELT) is a process by which data is extracted from a source system, loaded into a SQL pool, and then transformed.

实现 ELT 的基本步骤如下:The basic steps for implementing ELT are:

  1. 将源数据提取到文本文件中。Extract the source data into text files.
  2. 将数据移入 Azure Blob 存储中。Land the data into Azure Blob storage.
  3. 准备要加载的数据。Prepare the data for loading.
  4. 使用 PolyBase 或 COPY 命令将数据载入临时表。Load the data into staging tables with PolyBase or the COPY command.
  5. 转换数据。Transform the data.
  6. 将数据插入生产表。Insert the data into production tables.

有关加载教程,请参阅从 Azure Blob 存储加载数据For a loading tutorial, see loading data from Azure blob storage.

1.将源数据提取到文本文件中1. Extract the source data into text files

从源系统中取出数据的过程取决于存储位置。Getting data out of your source system depends on the storage location. 目标是将数据移入支持的带分隔符的文本文件或 CSV 文件。The goal is to move the data into supported delimited text or CSV files.

支持的文件格式Supported file formats

使用 PolyBase 和 COPY 语句,可以从 UTF-8 和 UTF-16 编码的带分隔符文本文件或 CSV 文件加载数据。With PolyBase and the COPY statement, you can load data from UTF-8 and UTF-16 encoded delimited text or CSV files. 除了带分隔符文本文件或 CSV 文件以外,它还可以从 ORC 和 Parquet 等 Hadoop 文件格式加载数据。In addition to delimited text or CSV files, it loads from the Hadoop file formats such as ORC and Parquet. PolyBase 和 COPY 语句还可以从 Gzip 和 Snappy 压缩文件加载数据。PolyBase and the COPY statement can also load data from Gzip and Snappy compressed files.

不支持扩展的 ASCII、固定宽度格式以及 WinZip 或 XML 等嵌套格式。Extended ASCII, fixed-width format, and nested formats such as WinZip or XML aren't supported. 如果是从 SQL Server 导出,则可使用 bcp 命令行工具将数据导出到带分隔符的文本文件中。If you're exporting from SQL Server, you can use the bcp command-line tool to export the data into delimited text files.

2.将数据移入 Azure Blob 存储中2. Land the data into Azure Blob storage

若要将数据移入 Azure 存储,可以将其移动到 Azure Blob 存储To land the data in Azure storage, you can move it to Azure Blob storage. 不管什么位置,都应将数据存储在文本文件中。In either location, the data should be stored in text files. PolyBase 和 COPY 语句可从任一位置加载数据。PolyBase and the COPY statement can load from either location.

可使用以下工具和服务将数据移到 Azure 存储:Tools and services you can use to move data to Azure Storage:

  • Azure ExpressRoute 服务可以增强网络吞吐量、性能和可预测性。Azure ExpressRoute service enhances network throughput, performance, and predictability. ExpressRoute 是通过专用连接将数据路由到 Azure 的服务。ExpressRoute is a service that routes your data through a dedicated private connection to Azure. ExpressRoute 连接不通过公共 Internet 路由数据。ExpressRoute connections do not route data through the public internet. 与基于公共 Internet 的典型连接相比,这些连接提供更高的可靠性、更快的速度、更低的延迟和更高的安全性。The connections offer more reliability, faster speeds, lower latencies, and higher security than typical connections over the public internet.
  • AZCopy 实用工具可以通过公共 Internet 将数据移到 Azure 存储。AZCopy utility moves data to Azure Storage over the public internet. 如果数据小于 10 TB,则很适合使用此工具。This works if your data sizes are less than 10 TB. 若要使用 AZCopy 定期执行加载操作,请测试网络速度是否在可接受的范围内。To perform loads on a regular basis with AZCopy, test the network speed to see if it is acceptable.
  • Azure 数据工厂 (ADF) 提供一个可以安装在本地服务器上的网关。Azure Data Factory (ADF) has a gateway that you can install on your local server. 然后,你可以创建管道,以便将数据从本地服务器移到 Azure 存储。Then you can create a pipeline to move data from your local server up to Azure Storage. 若要将数据工厂与 SQL 池配合使用,请参阅加载 SQL 池数据To use Data Factory with SQL pool, see Loading data for SQL pool.

3.准备要加载的数据3. Prepare the data for loading

在加载存储帐户中的数据之前,可能需要对其进行准备和清理。You might need to prepare and clean the data in your storage account before loading. 可以在数据仍保留在源中、将数据导出到文本文件时或者在数据进入 Azure 存储之后执行数据准备。Data preparation can be performed while your data is in the source, as you export the data to text files, or after the data is in Azure Storage. 最好是在加载过程的早期阶段处理数据。It is easiest to work with the data as early in the process as possible.

定义表Define the tables

使用 COPY 语句时,必须先在 SQL 池中定义要加载到的表。You must first defined the table(s) you are loading to in your SQL pool when using the COPY statement.

如果使用 PolyBase,则需在加载之前在 SQL 池中定义外部表。If you are using PolyBase, you need to define external tables in your SQL pool before loading. PolyBase 使用外部表来定义和访问 Azure 存储中的数据。PolyBase uses external tables to define and access the data in Azure Storage. 外部表类似于数据库视图。An external table is similar to a database view. 外部表包含表架构,并指向在 SQL 池外部存储的数据。The external table contains the table schema and points to data that is stored outside the SQL pool.

定义外部表涉及到指定数据源、文本文件的格式和表定义。Defining external tables involves specifying the data source, the format of the text files, and the table definitions. 相关的 T-SQL 语法参考文章如下:T-SQL syntax reference articles that you will need are:

加载 Parquet 文件时,使用以下 SQL 数据类型映射:Use the following SQL data type mapping when loading Parquet files:

Parquet 类型Parquet type Parquet 逻辑类型(批注)Parquet logical type (annotation) SQL 数据类型SQL data type
BOOLEANBOOLEAN bitbit
BINARY/BYTE_ARRAYBINARY / BYTE_ARRAY varbinaryvarbinary
DOUBLEDOUBLE floatfloat
FLOATFLOAT realreal
INT32INT32 intint
INT64INT64 bigintbigint
INT96INT96 datetime2datetime2
FIXED_LEN_BYTE_ARRAYFIXED_LEN_BYTE_ARRAY binarybinary
BINARYBINARY UTF8UTF8 nvarcharnvarchar
BINARYBINARY STRINGSTRING nvarcharnvarchar
BINARYBINARY ENUMENUM nvarcharnvarchar
BINARYBINARY UUIDUUID uniqueidentifieruniqueidentifier
BINARYBINARY DECIMALDECIMAL decimaldecimal
BINARYBINARY JSONJSON nvarchar(MAX)nvarchar(MAX)
BINARYBINARY BSONBSON varbinary(max)varbinary(max)
FIXED_LEN_BYTE_ARRAYFIXED_LEN_BYTE_ARRAY DECIMALDECIMAL decimaldecimal
BYTE_ARRAYBYTE_ARRAY INTERVALINTERVAL varchar(max),varchar(max),
INT32INT32 INT(8, true)INT(8, true) smallintsmallint
INT32INT32 INT(16, true)INT(16, true) smallintsmallint
INT32INT32 INT(32, true)INT(32, true) intint
INT32INT32 INT(8, false)INT(8, false) tinyinttinyint
INT32INT32 INT(16, false)INT(16, false) intint
INT32INT32 INT(32, false)INT(32, false) bigintbigint
INT32INT32 DATEDATE datedate
INT32INT32 DECIMALDECIMAL decimaldecimal
INT32INT32 TIME (MILLIS)TIME (MILLIS ) timetime
INT64INT64 INT(64, true)INT(64, true) bigintbigint
INT64INT64 INT(64, false )INT(64, false ) decimal(20,0)decimal(20,0)
INT64INT64 DECIMALDECIMAL decimaldecimal
INT64INT64 TIME (MICROS/NANOS)TIME (MICROS / NANOS) timetime
INT64INT64 TIMESTAMP (MILLIS / MICROS / NANOS)TIMESTAMP (MILLIS / MICROS / NANOS) datetime2datetime2
复杂类型Complex type 列表LIST varchar(max)varchar(max)
复杂类型Complex type MAPMAP varchar(max)varchar(max)

有关创建外部对象的示例,请参阅创建外部表For an example of creating external objects, see Create external tables.

设置文本文件的格式Format text files

使用 PolyBase 时,定义的外部对象需要使文本文件中的行与外部表和文件格式定义相符。If you are using PolyBase, the external objects defined need to align the rows of the text files with the external table and file format definition. 文本文件的每一行中的数据必须与表定义相符。The data in each row of the text file must align with the table definition. 设置文本文件的格式:To format the text files:

  • 如果数据来自非关系源,则需要将其转换为行与列。If your data is coming from a non-relational source, you need to transform it into rows and columns. 不管数据来自关系源还是非关系源,都必须转换数据,使之与数据预期要载入到的表的列定义相符。Whether the data is from a relational or non-relational source, the data must be transformed to align with the column definitions for the table into which you plan to load the data.
  • 设置本文件中数据的格式,使之与目标表中的列和数据类型相符。Format data in the text file to align with the columns and data types in the destination table. 外部文本文件与 SQL 池表中的数据类型不相符会导致系统在加载期间拒绝行。Misalignment between data types in the external text files and the SQL pool table causes rows to be rejected during the load.
  • 使用终止符分隔文本文件中的字段。Separate fields in the text file with a terminator. 请务必使用源数据中不包含的字符或字符序列。Be sure to use a character or a character sequence that isn't found in your source data. 使用通过 CREATE EXTERNAL FILE FORMAT 指定的终止符。Use the terminator you specified with CREATE EXTERNAL FILE FORMAT.

4.使用 PolyBase 或 COPY 语句加载数据4. Load the data using PolyBase or the COPY statement

最佳做法是将数据载入临时表。It is best practice to load data into a staging table. 使用临时表可以处理错误且不干扰生产表。Staging tables allow you to handle errors without interfering with the production tables. 将数据插入生产表之前,还可以通过临时表使用 SQL 池并行处理体系结构进行数据转换。A staging table also gives you the opportunity to use the SQL pool parallel processing architecture for data transformations before inserting the data into production tables.

用于加载的选项Options for loading

若要加载数据,可以使用下列任一加载选项:To load data, you can use any of these loading options:

  • COPY 语句是推荐使用的加载实用工具,因为它可以无缝且灵活地加载数据。The COPY statement is the recommended loading utility as it enables you to seamlessly and flexibly load data. 该语句具有许多 PolyBase 不提供的其他加载功能。The statement has many additional loading capabilities that PolyBase does not provide.
  • 具有 T-SQL 的 PolyBase 要求定义外部数据对象。PolyBase with T-SQL requires you to define external data objects.
  • Azure 数据工厂 (ADF) 的 PolyBase 和 COPY 语句是另一个业务流程工具。PolyBase and COPY statement with Azure Data Factory (ADF) is another orchestration tool. 它定义管道并计划作业。It defines a pipeline and schedules jobs.
  • 源数据位于 SQL Server 时,具有 SSIS 的 PolyBase 运行良好。PolyBase with SSIS works well when your source data is in SQL Server. SSIS 定义源到目标表的映射,同时可协调负载。SSIS defines the source to destination table mappings, and also orchestrates the load. 如果已有 SSIS 包,可将这些包修改为使用新的数据仓库目标。If you already have SSIS packages, you can modify the packages to work with the new data warehouse destination.

其他加载选项Other loading options

除 PolyBase 和 COPY 语句以外,还可以使用 bcpSqlBulkCopy APIIn addition to PolyBase and the COPY statement, you can use bcp or the SqlBulkCopy API. bcp 可将数据直接加载到数据库而无需经过 Azure Blob 存储,但此方法只适用于小规模的加载。bcp loads directly to the database without going through Azure Blob storage, and is intended only for small loads.

备注

这些选项的加载性能低于 PolyBase 和 COPY 语句。The load performance of these options is slower than PolyBase and the COPY statement.

5.转换数据5. Transform the data

在数据已进入临时表时,请执行工作负荷所需的转换。While data is in the staging table, perform transformations that your workload requires. 然后将数据移到生产表。Then move the data into a production table.

6.将数据插入生产表6. Insert the data into production tables

INSERT INTO ...SELECT 语句将数据从临时表移到永久表。The INSERT INTO ... SELECT statement moves the data from the staging table to the permanent table.

设计 ETL 过程时,请尝试针对一个较小的测试示例运行该过程。As you design an ETL process, try running the process on a small test sample. 尝试将表中的 1000 行提取到某个文件,将该文件移到 Azure,然后将其载入临时表。Try extracting 1000 rows from the table to a file, move it to Azure, and then try loading it into a staging table.

后续步骤Next steps

有关加载指南,请参阅加载数据的指南For loading guidance, see Guidance for loading data.