Best practices for loading data using Synapse SQL pool

In this article, you'll learn recommendations and performance optimizations for loading data using SQL pool.

Preparing data in Azure Storage

To minimize latency, colocate your storage layer and your SQL pool.

When exporting data into an ORC File Format, you might get Java out-of-memory errors when there are large text columns. To work around this limitation, export only a subset of the columns.

All file formats have different performance characteristics. For the fastest load, use compressed delimited text files. The difference between UTF-8 and UTF-16 performance is minimal.

Split large compressed files into smaller compressed files.

Running loads with enough compute

For fastest loading speed, run only one load job at a time. If that isn't feasible, run a minimal number of loads concurrently. If you expect a large loading job, consider scaling up your SQL pool before the load.
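
If you do scale before a large load, a dedicated SQL pool can be scaled with T-SQL by changing its service objective. The following is a minimal sketch run against the master database; the pool name mySampleDataWarehouse and the DW1000c target are placeholders, not recommendations.

   -- Scale the pool up before a large load (run against master; names are placeholders).
   ALTER DATABASE mySampleDataWarehouse
   MODIFY (SERVICE_OBJECTIVE = 'DW1000c');

   -- After the load completes, scale back down, for example:
   -- ALTER DATABASE mySampleDataWarehouse MODIFY (SERVICE_OBJECTIVE = 'DW300c');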

To run loads with appropriate compute resources, create loading users designated for running loads. Classify each loading user to a specific workload group. To run a load, sign in as one of the loading users, and then run the load. The load runs with the user's workload group.

Example of creating a loading user

This example creates a loading user classified to a specific workload group. The first step is to connect to master and create a login.

   -- Connect to master
   CREATE LOGIN loader WITH PASSWORD = 'a123STRONGpassword!';

Connect to the SQL pool and create a user. The following code assumes you're connected to the database called mySampleDataWarehouse. It shows how to create a user called loader and gives the user permissions to create tables and load using the COPY statement. Then it classifies the user to the DataLoads workload group with maximum resources.

   -- Connect to the SQL pool
   CREATE USER loader FOR LOGIN loader;
   GRANT ADMINISTER DATABASE BULK OPERATIONS TO loader;
   GRANT INSERT ON <yourtablename> TO loader;
   GRANT SELECT ON <yourtablename> TO loader;
   GRANT CREATE TABLE TO loader;
   GRANT ALTER ON SCHEMA::dbo TO loader;
   
   CREATE WORKLOAD GROUP DataLoads
   WITH (
       MIN_PERCENTAGE_RESOURCE = 100
       ,CAP_PERCENTAGE_RESOURCE = 100
       ,REQUEST_MIN_RESOURCE_GRANT_PERCENT = 100
   );

   CREATE WORKLOAD CLASSIFIER [wgcELTLogin]
   WITH (
       WORKLOAD_GROUP = 'DataLoads'
       ,MEMBERNAME = 'loader'
   );

To run a load with resources for the loading workload group, sign in as loader and run the load.
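
For illustration, a load run as loader might look like the following COPY statement. This is a minimal sketch; the target table, storage URL, and credential are placeholders you would replace with your own values.

   -- Run while signed in as loader; table, path, and secret are placeholders.
   COPY INTO dbo.FactSales
   FROM 'https://<storageaccount>.blob.core.windows.net/<container>/sales/*.csv'
   WITH (
       FILE_TYPE = 'CSV',
       CREDENTIAL = (IDENTITY = 'Storage Account Key', SECRET = '<storage-account-key>'),
       FIELDTERMINATOR = ',',
       ROWTERMINATOR = '0x0A',
       FIRSTROW = 2
   );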

Allowing multiple users to load (PolyBase)

There's often a need to have multiple users load data into a SQL pool. Loading with CREATE TABLE AS SELECT (Transact-SQL) (PolyBase) requires CONTROL permission on the database. The CONTROL permission gives control access to all schemas.

You might not want all loading users to have control access on all schemas. To limit permissions, use the DENY CONTROL statement.

For example, consider database schemas schema_A for dept A and schema_B for dept B, and let database users user_A and user_B be the users for PolyBase loading in dept A and dept B, respectively. Both have been granted CONTROL database permissions. The creators of schema_A and schema_B now lock down their schemas using DENY:

   DENY CONTROL ON SCHEMA :: schema_A TO user_B;
   DENY CONTROL ON SCHEMA :: schema_B TO user_A;

User_A and user_B are now locked out from the other dept's schema.

Loading to a staging table

To achieve the fastest loading speed for moving data into a SQL pool table, load data into a staging table. Define the staging table as a heap and use round-robin for the distribution option.

Consider that loading is usually a two-step process in which you first load to a staging table and then insert the data into a production SQL pool table. If the production table uses a hash distribution, the total time to load and insert might be faster if you define the staging table with the hash distribution.

Loading to the staging table takes longer, but the second step of inserting the rows into the production table does not incur data movement across the distributions.
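
A minimal sketch of the two-step pattern follows; the staging table dbo.Stage_FactSales, the production table dbo.FactSales, and the column list are illustrative only.

   -- Step 1: create a heap, round-robin staging table (names and columns are illustrative).
   CREATE TABLE dbo.Stage_FactSales
   (
       SaleId      INT            NOT NULL,
       CustomerKey INT            NOT NULL,
       Amount      DECIMAL(18, 2) NOT NULL
   )
   WITH ( HEAP, DISTRIBUTION = ROUND_ROBIN );

   -- ...load into dbo.Stage_FactSales with COPY or PolyBase...

   -- Step 2: insert from the staging table into the production table.
   INSERT INTO dbo.FactSales (SaleId, CustomerKey, Amount)
   SELECT SaleId, CustomerKey, Amount
   FROM dbo.Stage_FactSales;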

Loading to a columnstore index

Columnstore indexes require large amounts of memory to compress data into high-quality rowgroups. For best compression and index efficiency, the columnstore index needs to compress the maximum of 1,048,576 rows into each rowgroup.

When there is memory pressure, the columnstore index might not be able to achieve maximum compression rates. This scenario, in turn, affects query performance. For a deep dive, see Columnstore memory optimizations.

  • To ensure the loading user has enough memory to achieve maximum compression rates, use loading users that are a member of a medium or large resource class (see the sketch after this list).
  • Load enough rows to completely fill new rowgroups. During a bulk load, every 1,048,576 rows get compressed directly into the columnstore as a full rowgroup. Loads with fewer than 102,400 rows send the rows to the deltastore, where rows are held in a b-tree index.
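
As a sketch of the first point, a loading user can be added to one of the built-in dynamic resource classes with sp_addrolemember; largerc is used here only as an example choice.

   -- Add the loading user to the largerc dynamic resource class (example choice).
   EXEC sp_addrolemember 'largerc', 'loader';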

Note

If you load too few rows, they might all route to the deltastore and not get compressed immediately into columnstore format.
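
To see whether loaded rows were compressed or routed to the deltastore, you can inspect rowgroup metadata. The following is a simplified sketch that summarizes rowgroup states across the pool; mapping rowgroups back to a specific logical table requires additional joins through the node table catalog views and is omitted here.

   -- Summarize rowgroup states (e.g., COMPRESSED vs. OPEN deltastore) and row counts.
   SELECT state_desc,
          COUNT(*)        AS rowgroup_count,
          SUM(total_rows) AS total_rows
   FROM sys.dm_pdw_nodes_db_column_store_row_group_physical_stats
   GROUP BY state_desc;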

Increase batch size when using the SqlBulkCopy API or bcp

Loading with the COPY statement will provide the highest throughput with SQL pool. If you cannot use COPY to load and must use the SqlBulkCopy API or bcp, you should consider increasing batch size for better throughput.

Tip

A batch size between 100K and 1M rows is the recommended baseline for determining optimal batch size capacity.

Handling loading failures

A load using an external table can fail with the error "Query aborted-- the maximum reject threshold was reached while reading from an external source". This message indicates that your external data contains dirty records.

A data record is considered to be dirty if it meets one of the following conditions:

  • The data types and number of columns do not match the column definitions of the external table.
  • The data doesn't conform to the specified external file format.

To fix the dirty records, ensure that your external table and external file format definitions are correct and your external data conforms to these definitions.

If a subset of the external data records is dirty, you can choose to reject these records for your queries by using the reject options in CREATE EXTERNAL TABLE (Transact-SQL).
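
A minimal sketch of the reject options is shown below; the external table name, columns, data source, and file format are placeholders.

   -- Tolerate up to 100 dirty rows before the query fails; all names are placeholders.
   CREATE EXTERNAL TABLE ext.FactSales
   (
       SaleId      INT,
       CustomerKey INT,
       Amount      DECIMAL(18, 2)
   )
   WITH (
       LOCATION = '/sales/',
       DATA_SOURCE = MyAzureStorage,
       FILE_FORMAT = MyTextFileFormat,
       REJECT_TYPE = VALUE,
       REJECT_VALUE = 100
   );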

Inserting data into a production table

A one-time load to a small table with an INSERT statement, or even a periodic reload of a look-up table, might perform well enough with a statement like INSERT INTO MyLookup VALUES (1, 'Type 1'). However, singleton inserts are not as efficient as performing a bulk load.

If you have thousands or more single inserts throughout the day, batch the inserts so you can bulk load them. Develop your processes to append the single inserts to a file, and then create another process that periodically loads the file.

Creating statistics after the load

To improve query performance, it's important to create statistics on all columns of all tables after the first load or after substantial changes occur in the data. You can create statistics manually or enable AUTO_CREATE_STATISTICS.
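
For example, automatic statistics creation can be enabled at the database level; in this minimal sketch the database name mySampleDataWarehouse is a placeholder.

   -- Enable automatic creation of single-column statistics (database name is a placeholder).
   ALTER DATABASE mySampleDataWarehouse
   SET AUTO_CREATE_STATISTICS ON;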

For a detailed explanation of statistics, see Statistics. The following example shows how to manually create statistics on five columns of the Customer_Speed table.

   CREATE STATISTICS [SensorKey] ON [Customer_Speed] ([SensorKey]);
   CREATE STATISTICS [CustomerKey] ON [Customer_Speed] ([CustomerKey]);
   CREATE STATISTICS [GeographyKey] ON [Customer_Speed] ([GeographyKey]);
   CREATE STATISTICS [Speed] ON [Customer_Speed] ([Speed]);
   CREATE STATISTICS [YearMeasured] ON [Customer_Speed] ([YearMeasured]);

Rotate storage keys (PolyBase)

It is good security practice to change the access key to your blob storage on a regular basis. You have two storage keys for your blob storage account, which enables you to transition the keys.

To rotate Azure Storage account keys:

For each storage account whose key has changed, issue ALTER DATABASE SCOPED CREDENTIAL.

Example:

Original key is created:

   CREATE DATABASE SCOPED CREDENTIAL my_credential WITH IDENTITY = 'my_identity', SECRET = 'key1'

Rotate key from key 1 to key 2:

   ALTER DATABASE SCOPED CREDENTIAL my_credential WITH IDENTITY = 'my_identity', SECRET = 'key2'

No other changes to underlying external data sources are needed.

Next steps