Best practices for loading data using Synapse SQL pool

In this article, you'll learn recommendations and performance optimizations for loading data using SQL pool.

Preparing data in Azure Storage

To minimize latency, colocate your storage layer and your SQL pool.

When exporting data into an ORC file format, you might get Java out-of-memory errors when there are large text columns. To work around this limitation, export only a subset of the columns.
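
As an illustration, the following sketch exports only a subset of columns with CREATE EXTERNAL TABLE AS SELECT; the external data source, ORC file format, and table names are hypothetical placeholders, not objects defined in this article.

   -- Hedged sketch: export a subset of columns to ORC, leaving out large text columns.
   -- MyAzureStorage, MyOrcFormat, and dbo.FactSales are hypothetical names.
   CREATE EXTERNAL TABLE ext.FactSales_Export
   WITH (
       LOCATION = '/export/factsales/',
       DATA_SOURCE = MyAzureStorage,
       FILE_FORMAT = MyOrcFormat
   )
   AS
   SELECT SaleKey, CustomerKey, SaleAmount   -- omit the large text columns
   FROM dbo.FactSales;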

PolyBase can't load rows that have more than 1,000,000 bytes of data. When you put data into the text files in Azure Blob storage or Azure Data Lake Storage, they must have fewer than 1,000,000 bytes of data. This byte limitation applies regardless of the table schema.

All file formats have different performance characteristics. For the fastest load, use compressed delimited text files. The performance difference between UTF-8 and UTF-16 is minimal.
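
For example, a compressed delimited text format might be defined as follows; the format name and delimiter below are assumptions, not values from this article.

   -- Hedged sketch: a gzip-compressed delimited text external file format.
   CREATE EXTERNAL FILE FORMAT GzipDelimitedText
   WITH (
       FORMAT_TYPE = DELIMITEDTEXT,
       FORMAT_OPTIONS (FIELD_TERMINATOR = '|', STRING_DELIMITER = '"'),
       DATA_COMPRESSION = 'org.apache.hadoop.io.compress.GzipCodec'
   );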

Split large compressed files into smaller compressed files.

Running loads with enough compute

For fastest loading speed, run only one load job at a time. If that isn't feasible, run a minimal number of loads concurrently. If you expect a large loading job, consider scaling up your SQL pool before the load.
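
If you do scale up before a load, one way to do it is with ALTER DATABASE run from the master database; the database name and target service objective below are examples only.

   -- Hedged sketch: scale the SQL pool up before a large load (run while connected to master).
   ALTER DATABASE mySampleDataWarehouse MODIFY (SERVICE_OBJECTIVE = 'DW1000c');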

To run loads with appropriate compute resources, create loading users designated for running loads. Assign each loading user to a specific resource class or workload group. To run a load, sign in as one of the loading users, and then run the load. The load runs with the user's resource class.

Note

This method is simpler than trying to change a user's resource class to fit the current resource class need.

Example of creating a loading user

This example creates a loading user for the staticrc20 resource class. The first step is to connect to master and create a login.

   -- Connect to master
   CREATE LOGIN LoaderRC20 WITH PASSWORD = 'a123STRONGpassword!';

Connect to the SQL pool and create a user. The following code assumes you're connected to the database called mySampleDataWarehouse. It shows how to create a user called LoaderRC20 and gives the user control permission on a database. Then, it adds the user as a member of the staticrc20 database role.

   -- Connect to the database
   CREATE USER LoaderRC20 FOR LOGIN LoaderRC20;
   GRANT CONTROL ON DATABASE::[mySampleDataWarehouse] to LoaderRC20;
   EXEC sp_addrolemember 'staticrc20', 'LoaderRC20';

To run a load with resources for the staticrc20 resource class, sign in as LoaderRC20 and run the load.
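
As a sketch of that load step, the following uses the COPY statement while signed in as LoaderRC20; the target table, storage path, and credential are hypothetical placeholders, and COPY is shown as one possible load method.

   -- Hedged sketch: a load run as LoaderRC20, here using the COPY statement.
   -- dbo.FactSales, the storage URL, and the secret are placeholders.
   COPY INTO dbo.FactSales
   FROM 'https://<storage-account>.blob.core.windows.net/datacontainer/factsales/*.csv'
   WITH (
       FILE_TYPE = 'CSV',
       CREDENTIAL = (IDENTITY = 'Storage Account Key', SECRET = '<storage-account-key>'),
       FIELDTERMINATOR = ',',
       FIRSTROW = 2
   );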

Run loads under static rather than dynamic resource classes. Using the static resource classes guarantees the same resources regardless of your data warehouse units. If you use a dynamic resource class, the resources vary according to your service level.

For dynamic classes, a lower service level means you probably need to use a larger resource class for your loading user.

Allowing multiple users to load

There's often a need to have multiple users load data into a SQL pool. Loading with CREATE TABLE AS SELECT (Transact-SQL) requires CONTROL permission on the database. The CONTROL permission gives control access to all schemas.

You might not want all loading users to have control access on all schemas. To limit permissions, use the DENY CONTROL statement.

For example, consider database schemas schema_A for dept A and schema_B for dept B. Let database users user_A and user_B be the users for PolyBase loading in dept A and dept B, respectively. Both users have been granted CONTROL database permissions. The creators of schema_A and schema_B now lock down their schemas using DENY:

   DENY CONTROL ON SCHEMA :: schema_A TO user_B;
   DENY CONTROL ON SCHEMA :: schema_B TO user_A;

User_A and user_B are now locked out from the other dept's schema.

Loading to a staging table

To achieve the fastest loading speed for moving data into a SQL pool table, load data into a staging table. Define the staging table as a heap and use round-robin for the distribution option.

Consider that loading is usually a two-step process in which you first load to a staging table and then insert the data into a production SQL pool table. If the production table uses a hash distribution, the total time to load and insert might be faster if you define the staging table with the hash distribution.

Loading to the staging table takes longer, but the second step of inserting the rows into the production table doesn't incur data movement across the distributions.
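
A minimal sketch of the two-step pattern, assuming a hypothetical external table ext.FactSales and a production table dbo.FactSales:

   -- Step 1 (hedged sketch): load into a heap, round-robin staging table with CTAS.
   CREATE TABLE dbo.Stage_FactSales
   WITH (DISTRIBUTION = ROUND_ROBIN, HEAP)
   AS SELECT * FROM ext.FactSales;

   -- Step 2: insert from the staging table into the production table.
   INSERT INTO dbo.FactSales
   SELECT * FROM dbo.Stage_FactSales;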

Loading to a columnstore index

Columnstore indexes require large amounts of memory to compress data into high-quality rowgroups. For best compression and index efficiency, the columnstore index needs to compress the maximum of 1,048,576 rows into each rowgroup.

When there is memory pressure, the columnstore index might not be able to achieve maximum compression rates. This scenario, in turn, affects query performance. For a deep dive, see Columnstore memory optimizations.

  • To ensure the loading user has enough memory to achieve maximum compression rates, use loading users that are a member of a medium or large resource class.
  • Load enough rows to completely fill new rowgroups. During a bulk load, every 1,048,576 rows get compressed directly into the columnstore as a full rowgroup. Loads with fewer than 102,400 rows send the rows to the deltastore, where rows are held in a b-tree index.

Note

If you load too few rows, they might all route to the deltastore and not get compressed immediately into columnstore format.
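
To check how rows landed after a load, you can query the rowgroup metadata. The following is a hedged sketch for a dedicated SQL pool, with FactSales as a hypothetical table name.

   -- Hedged sketch: inspect rowgroup state and size after a load.
   -- COMPRESSED rowgroups near 1,048,576 rows indicate healthy compression;
   -- OPEN rowgroups hold rows still sitting in the deltastore.
   SELECT tb.[name]              AS table_name,
          rg.[state_desc]        AS rowgroup_state,
          rg.[total_rows]        AS total_rows,
          rg.[trim_reason_desc]  AS trim_reason
   FROM sys.[schemas] sm
   JOIN sys.[tables] tb             ON sm.[schema_id] = tb.[schema_id]
   JOIN sys.[pdw_table_mappings] mp ON tb.[object_id] = mp.[object_id]
   JOIN sys.[pdw_nodes_tables] nt   ON nt.[name] = mp.[physical_name]
   JOIN sys.[dm_pdw_nodes_db_column_store_row_group_physical_stats] rg
        ON rg.[object_id] = nt.[object_id]
       AND rg.[pdw_node_id] = nt.[pdw_node_id]
       AND rg.[distribution_id] = nt.[distribution_id]
   WHERE tb.[name] = 'FactSales';   -- hypothetical table name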

Increase batch size when using SqlBulkCopy API or bcp

Loading with PolyBase will provide the highest throughput with SQL pool. If you cannot use PolyBase to load and must use the SqlBulkCopy API or bcp, you should consider increasing batch size for better throughput.

Tip

A batch size between 100 K and 1 M rows is the recommended baseline for determining optimal batch size capacity.

Handling loading failures

A load using an external table can fail with the error "Query aborted-- the maximum reject threshold was reached while reading from an external source". This message indicates that your external data contains dirty records.

A data record is considered to be dirty if it meets one of the following conditions:

  • The data types and number of columns don't match the column definitions of the external table.
  • The data doesn't conform to the specified external file format.

To fix the dirty records, ensure that your external table and external file format definitions are correct and your external data conforms to these definitions.

If a subset of external data records are dirty, you can choose to reject these records for your queries by using the reject options in CREATE EXTERNAL TABLE (Transact-SQL).
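
For instance, here's a hedged sketch of an external table that tolerates up to 100 dirty rows before the query fails; the data source, file format, location, and columns are placeholders.

   -- Hedged sketch: reject up to 100 dirty rows before failing the load.
   CREATE EXTERNAL TABLE ext.FactSales_Tolerant (
       SaleKey      int,
       CustomerKey  int,
       SaleAmount   decimal(18, 2)
   )
   WITH (
       LOCATION = '/factsales/',
       DATA_SOURCE = MyAzureStorage,
       FILE_FORMAT = GzipDelimitedText,
       REJECT_TYPE = VALUE,
       REJECT_VALUE = 100
   );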

Inserting data into a production table

A one-time load to a small table with an INSERT statement, or even a periodic reload of a lookup, might perform well enough with a statement like INSERT INTO MyLookup VALUES (1, 'Type 1'). However, singleton inserts are not as efficient as performing a bulk load.

If you have thousands or more single inserts throughout the day, batch the inserts so you can bulk load them. Develop your processes to append the single inserts to a file, and then create another process that periodically loads the file.

Creating statistics after the load

To improve query performance, it's important to create statistics on all columns of all tables after the first load, or after substantial changes occur in the data. You can create statistics manually, or you can enable AUTO_CREATE_STATISTICS.
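
If you prefer automatic creation, it can be enabled with a database-level setting; a minimal sketch, reusing the database name from the earlier example:

   -- Hedged sketch: enable automatic single-column statistics creation.
   ALTER DATABASE [mySampleDataWarehouse] SET AUTO_CREATE_STATISTICS ON;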

For a detailed explanation of statistics, see Statistics. The following example shows how to manually create statistics on five columns of the Customer_Speed table.

create statistics [SensorKey] on [Customer_Speed] ([SensorKey]);
create statistics [CustomerKey] on [Customer_Speed] ([CustomerKey]);
create statistics [GeographyKey] on [Customer_Speed] ([GeographyKey]);
create statistics [Speed] on [Customer_Speed] ([Speed]);
create statistics [YearMeasured] on [Customer_Speed] ([YearMeasured]);

Rotate storage keys

It is good security practice to change the access key to your blob storage on a regular basis. You have two storage keys for your blob storage account, which enables you to transition the keys.

To rotate Azure Storage account keys:

For each storage account whose key has changed, issue the ALTER DATABASE SCOPED CREDENTIAL command.

Example:

Original key is created:

CREATE DATABASE SCOPED CREDENTIAL my_credential WITH IDENTITY = 'my_identity', SECRET = 'key1'

Rotate key from key 1 to key 2:

ALTER DATABASE SCOPED CREDENTIAL my_credential WITH IDENTITY = 'my_identity', SECRET = 'key2'

No other changes to underlying external data sources are needed.

Next steps