Build and optimize tables for fast parallel import of data into SQL Server on an Azure VM

This article describes how to build partitioned tables for fast parallel bulk import of data into a SQL Server database. When loading or transferring big data to a SQL database, using partitioned tables and views can speed up both the import itself and subsequent queries.

Create a new database and a set of filegroups

  • Create a new database, if it doesn't exist already.

  • Add filegroups to the database to hold the partitioned physical files.

  • This can be done with CREATE DATABASE for a new database or ALTER DATABASE for an existing one (a sketch of the ALTER DATABASE approach follows this list).

  • Add one or more files (as needed) to each database filegroup.

    Note

    Specify the target filegroup that holds the data for each partition, and the physical database file name(s) where the filegroup data is stored.
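If the database already exists, a minimal sketch of adding a filegroup and a physical file to it with ALTER DATABASE might look like this; the filegroup, file, and path names are placeholders:

    ALTER DATABASE <database_name> ADD FILEGROUP [filegroup_1];

    ALTER DATABASE <database_name>
    ADD FILE ( NAME = 'FileGroup1',
               FILENAME = '<path_to_data_folder>\<file_name_1>.ndf',
               SIZE = 4096KB, FILEGROWTH = 1024KB )
    TO FILEGROUP [filegroup_1];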

The following example creates a new database with three filegroups other than the primary and log groups, each containing one physical file. The database files are created in the default SQL Server data folder, as configured in the SQL Server instance. For more information about the default file locations, see File Locations for Default and Named Instances of SQL Server.

DECLARE @data_path nvarchar(256);
SET @data_path = (SELECT SUBSTRING(physical_name, 1, CHARINDEX(N'master.mdf', LOWER(physical_name)) - 1)
  FROM master.sys.master_files
  WHERE database_id = 1 AND file_id = 1);

EXECUTE ('
    CREATE DATABASE <database_name>
     ON  PRIMARY 
    ( NAME = ''Primary'', FILENAME = ''' + @data_path + '<primary_file_name>.mdf'', SIZE = 4096KB , FILEGROWTH = 1024KB ), 
     FILEGROUP [filegroup_1] 
    ( NAME = ''FileGroup1'', FILENAME = ''' + @data_path + '<file_name_1>.ndf'' , SIZE = 4096KB , FILEGROWTH = 1024KB ), 
     FILEGROUP [filegroup_2] 
    ( NAME = ''FileGroup2'', FILENAME = ''' + @data_path + '<file_name_2>.ndf'' , SIZE = 4096KB , FILEGROWTH = 1024KB ), 
     FILEGROUP [filegroup_3] 
    ( NAME = ''FileGroup3'', FILENAME = ''' + @data_path + '<file_name_3>.ndf'' , SIZE = 102400KB , FILEGROWTH = 10240KB ) 
     LOG ON 
    ( NAME = ''LogFileGroup'', FILENAME = ''' + @data_path + '<log_file_name>.ldf'' , SIZE = 1024KB , FILEGROWTH = 10%)
')

Create a partitioned table

To create partitioned table(s) that match the data schema and map to the database filegroups created in the previous step, you must first create a partition function and a partition scheme. When data is bulk imported into the partitioned table(s), records are distributed among the filegroups according to the partition scheme, as described below.

1. Create a partition function

Create a partition function. This function defines the range of values/boundaries to be included in each individual partition, for example, to limit partitions by month (some_datetime_field) in the year 2013:

    CREATE PARTITION FUNCTION <DatetimeFieldPFN>(<datetime_field>)  
    AS RANGE RIGHT FOR VALUES (
        '20130201', '20130301', '20130401',
        '20130501', '20130601', '20130701', '20130801',
        '20130901', '20131001', '20131101', '20131201' )

2. Create a partition scheme

Create a partition scheme. This scheme maps each partition range in the partition function to a physical filegroup, for example:

    CREATE PARTITION SCHEME <DatetimeFieldPScheme> AS  
    PARTITION <DatetimeFieldPFN> TO (
    <filegroup_1>, <filegroup_2>, <filegroup_3>, <filegroup_4>,
    <filegroup_5>, <filegroup_6>, <filegroup_7>, <filegroup_8>,
    <filegroup_9>, <filegroup_10>, <filegroup_11>, <filegroup_12> )

To verify the ranges in effect in each partition according to the function/scheme, run the following query:

    SELECT psch.name as PartitionScheme,
        prng.value AS PartitionValue,
        prng.boundary_id AS BoundaryID
    FROM sys.partition_functions AS pfun
    INNER JOIN sys.partition_schemes psch ON pfun.function_id = psch.function_id
    INNER JOIN sys.partition_range_values prng ON prng.function_id=pfun.function_id
    WHERE pfun.name = '<DatetimeFieldPFN>'
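Because the partition function above uses RANGE RIGHT, each boundary value belongs to the partition to its right. You can also spot-check which partition a particular value would map to by calling the function through $PARTITION; a minimal sketch, assuming the <DatetimeFieldPFN> function created above:

    -- '20130215' falls between the '20130201' and '20130301' boundaries,
    -- so this returns partition number 2
    SELECT $PARTITION.<DatetimeFieldPFN>('20130215') AS PartitionNumber;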

3. Create a partitioned table

Create partitioned table(s) according to your data schema, and specify the partition scheme and constraint field used to partition the table, for example:

    CREATE TABLE <table_name> ( [include schema definition here] )
    ON <TablePScheme>(<partition_field>)
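As a concrete illustration, here is a minimal sketch of such a table, assuming the <DatetimeFieldPScheme> scheme from the previous step and a hypothetical trip-data schema:

    CREATE TABLE TripData (
        trip_id bigint NOT NULL,
        pickup_datetime datetime NOT NULL,   -- the partitioning column
        fare_amount money NULL
    )
    ON <DatetimeFieldPScheme>(pickup_datetime)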

For more information, see Create Partitioned Tables and Indexes.

Bulk import the data for each individual partition table

  • You may use BCP, BULK INSERT, or other methods such as the SQL Server Migration Wizard. The example provided uses the BCP method; a BULK INSERT alternative is sketched after this list.

  • Alter the database to change the transaction logging scheme to BULK_LOGGED to minimize the overhead of logging, for example:

      ALTER DATABASE <database_name> SET RECOVERY BULK_LOGGED
    
  • To expedite data loading, launch the bulk import operations in parallel. For tips on expediting bulk importing of big data into SQL Server databases, see Load 1TB in less than 1 hour.
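If you prefer to stay in T-SQL rather than shell out to the bcp utility, a BULK INSERT statement per partition file is an alternative; a minimal sketch, assuming comma-separated files named as in the script below (the path and options would need to match your data format):

    BULK INSERT <database_name>..<table_name>
    FROM '<path_to_data_files>\<base_file_name>_1.csv'
    WITH (
        FIRSTROW = 2,           -- skip the header line
        FIELDTERMINATOR = ',',  -- comma-separated input
        ROWTERMINATOR = '\n',
        TABLOCK                 -- allows minimal logging under BULK_LOGGED recovery
    );

After all loads complete, you would typically switch the recovery model back (for example, ALTER DATABASE <database_name> SET RECOVERY FULL) and take a full backup.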

The following PowerShell script is an example of parallel data loading using BCP.

# Set database name, input data directory, and output log directory
# This example loads comma-separated input data files
# The example assumes the partitioned data files are named as <base_file_name>_<partition_number>.csv
# Assumes the input data files include a header line. Loading starts at line number 2.

$dbname = "<database_name>"
$indir  = "<path_to_data_files>"
$logdir = "<path_to_log_directory>"

# Select authentication mode
$sqlauth = 0

# For SQL authentication, set the server and user credentials
$sqlusr = "<user@server>"
$server = "<tcp:serverdns>"
$pass   = "<password>"

# Set number of partitions per table - Should match the number of input data files per table
$numofparts = <number_of_partitions>

# Set table name to be loaded, basename of input data files, input format file, and number of partitions
$tbname = "<table_name>"
$basename = "<base_input_data_filename_no_extension>"
$fmtfile = "<full_path_to_format_file>"

# Create log directory if it does not exist
New-Item -ErrorAction Ignore -ItemType directory -Path $logdir

# BCP example using Windows authentication
$ScriptBlock1 = {
   param($dbname, $tbname, $basename, $fmtfile, $indir, $logdir, $num)
   bcp ($dbname + ".." + $tbname) in ($indir + "\" + $basename + "_" + $num + ".csv") -o ($logdir + "\" + $tbname + "_" + $num + ".txt") -h "TABLOCK" -F 2 -C "RAW" -f ($fmtfile) -T -b 2500 -t "," -r \n
}

# BCP example using SQL authentication
$ScriptBlock2 = {
   param($dbname, $tbname, $basename, $fmtfile, $indir, $logdir, $num, $sqlusr, $server, $pass)
   bcp ($dbname + ".." + $tbname) in ($indir + "\" + $basename + "_" + $num + ".csv") -o ($logdir + "\" + $tbname + "_" + $num + ".txt") -h "TABLOCK" -F 2 -C "RAW" -f ($fmtfile) -U $sqlusr -S $server -P $pass -b 2500 -t "," -r \n
}

# Background processing of all partitions
for ($i=1; $i -le $numofparts; $i++)
{
   Write-Output "Submit loading of partition # $i"
   if ($sqlauth -eq 0) {
      # Use Windows authentication
      Start-Job -ScriptBlock $ScriptBlock1 -Arg ($dbname, $tbname, $basename, $fmtfile, $indir, $logdir, $i)
   } 
   else {
      # Use SQL authentication
      Start-Job -ScriptBlock $ScriptBlock2 -Arg ($dbname, $tbname, $basename, $fmtfile, $indir, $logdir, $i, $sqlusr, $server, $pass)
   }
}

Get-Job

# Optional - Wait till all jobs complete and report date and time
Get-Date
While (Get-Job -State "Running") { Start-Sleep 10 }
Get-Date
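Once the jobs finish, a quick way to confirm how the rows were distributed across partitions is to query sys.partitions; a minimal sketch, assuming the table name used above:

    SELECT p.partition_number, p.rows
    FROM sys.partitions AS p
    WHERE p.object_id = OBJECT_ID('<table_name>')
      AND p.index_id IN (0, 1)  -- heap or clustered index
    ORDER BY p.partition_number;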

Create indexes to optimize joins and query performance

  • If you extract data for modeling from multiple tables, create indexes on the join keys to improve join performance.

  • Create indexes (clustered or non-clustered) that target the same filegroup as each partition, for example:

      CREATE CLUSTERED INDEX <table_idx> ON <table_name>( [include index columns here] )
      ON <TablePScheme>(<partition_field>)
    

    or,

      CREATE INDEX <table_idx> ON <table_name>( [include index columns here] )
      ON <TablePScheme>(<partition_field>)
    

    Note

    You may choose to create the indexes before bulk importing the data. Creating the indexes before the bulk import slows down the data loading.

Advanced Analytics Process and Technology in Action Example

For an end-to-end walkthrough example using the Team Data Science Process with a public dataset, see Team Data Science Process in Action: using SQL Server.