Move data from a SQL Server database to SQL Database with Azure Data Factory

This article shows how to move data from a SQL Server database to Azure SQL Database via Azure Blob Storage using Azure Data Factory (ADF). This method is a supported legacy approach that has the advantage of a replicated staging copy, though we suggest reviewing the data migration page for the latest options.

For a table that summarizes various options for moving data to an Azure SQL Database, see Move data to an Azure SQL Database for Azure Machine Learning.

Introduction: What is ADF and when should it be used to migrate data?

Azure Data Factory is a fully managed, cloud-based data integration service that orchestrates and automates the movement and transformation of data. The key concept in the ADF model is the pipeline. A pipeline is a logical grouping of activities, each of which defines the actions to perform on the data contained in datasets. Linked services define the information needed for Data Factory to connect to the data resources.

With ADF, existing data processing services can be composed into data pipelines that are highly available and managed in the cloud. These data pipelines can be scheduled to ingest, prepare, transform, analyze, and publish data, and ADF manages and orchestrates the complex data and processing dependencies. Solutions can be built and deployed quickly in the cloud, connecting a growing number of on-premises and cloud data sources.

Consider using ADF:

  • when data needs to be continually migrated in a hybrid scenario that accesses both on-premises and cloud resources
  • when the data needs transformations or has business logic added to it when being migrated.

ADF allows for the scheduling and monitoring of jobs using simple JSON scripts that manage the movement of data on a periodic basis. ADF also has other capabilities, such as support for complex operations. For more information on ADF, see the documentation at Azure Data Factory (ADF).

The Scenario

We set up an ADF pipeline that composes two data migration activities. Together they move data on a daily basis between a SQL Server database and Azure SQL Database. The two activities are:

  • Copy data from a SQL Server database to an Azure Blob Storage account
  • Copy data from the Azure Blob Storage account to Azure SQL Database.

Note

The steps shown here have been adapted from the more detailed tutorial provided by the ADF team: Copy data from a SQL Server database to Azure Blob storage. References to the relevant sections of that topic are provided when appropriate.

Prerequisites

This tutorial assumes you have:

  • an Azure subscription
  • an Azure Storage account in which to stage the data
  • access to an Azure SQL Database to hold the migrated data
  • an on-premises SQL Server instance containing the source data

Note

This procedure uses the Azure portal.

Upload the data to your SQL Server instance

We use the NYC Taxi dataset to demonstrate the migration process. The NYC Taxi dataset is available on Azure Blob storage: NYC Taxi Data. The data has two files: the trip_data.csv file, which contains trip details, and the trip_fare.csv file, which contains details of the fare paid for each trip. A sample and description of these files are provided in NYC Taxi Trips Dataset Description.

You can either adapt the procedure provided here to your own set of data or follow the steps as described using the NYC Taxi dataset. To upload the NYC Taxi dataset into your SQL Server database, follow the procedure outlined in Bulk Import Data into SQL Server database; a brief sketch follows.
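
If you don't want to follow that article end-to-end, here is a minimal sketch of the bulk load using the bcp command-line utility; <servername> and <dbname> are placeholders, the target table nyctaxi_data matches the table name used later in this article, and the flags assume a comma-delimited file with a header row:

# Bulk load the trip data into the pre-created nyctaxi_data table.
# -T uses a trusted connection, -c character mode, "-t ," sets the comma delimiter,
# and -F 2 skips the CSV header row.
bcp nyctaxi_data in trip_data.csv -S <servername> -d <dbname> -T -c -t , -F 2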

Create an Azure Data Factory

The instructions for creating a new Azure Data Factory and a resource group in the Azure portal are provided in Create an Azure Data Factory. Name the new ADF instance adfdsp and name the resource group created adfdsprg. Both can also be created from PowerShell, as sketched below.
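
As an alternative to the portal, both resources can be created with the same generation of Azure PowerShell cmdlets used in the rest of this article; this is a sketch, and the region "West US" is an assumption you should replace with one near your data:

# Create the resource group, then the data factory inside it.
New-AzureResourceGroup -Name adfdsprg -Location "West US"
New-AzureDataFactory -ResourceGroupName adfdsprg -Name adfdsp -Location "West US"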

Install and configure Azure Data Factory Integration Runtime

The Integration Runtime is a customer-managed data integration infrastructure used by Azure Data Factory to provide data integration capabilities across different network environments. This runtime was formerly called "Data Management Gateway".

To set it up, follow the instructions for creating a pipeline. A gateway can also be registered from PowerShell, as sketched below.
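
For scripted setups, the module generation used elsewhere in this article exposed a gateway cmdlet; this is a sketch, and the gateway name adfdspgateway is a hypothetical choice:

# Register a gateway through which ADF reaches the on-premises SQL Server.
New-AzureDataFactoryGateway -ResourceGroupName adfdsprg -DataFactoryName adfdsp -Name adfdspgateway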

Create linked services to connect to the data resources

A linked service defines the information needed for Azure Data Factory to connect to a data resource. We have three resources in this scenario for which linked services are needed:

  1. On-premises SQL Server
  2. Azure Blob Storage
  3. Azure SQL Database

The step-by-step procedure for creating linked services is provided in Create linked services. A sketch of the Blob Storage linked service follows.
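
For illustration, here is a sketch of the Azure Blob Storage linked service in this version of ADF; the name adfds matches the linkedServiceName referenced by the Blob table definition later in this article, while the account name and key placeholders are assumptions you must fill in:

{
    "name": "adfds",
    "properties":
    {
        "type": "AzureStorageLinkedService",
        "connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=<accountkey>"
    }
}

Save the definition to a file (for example, C:\temp\storagelinkedservice.json) and register it:

# Register the linked service with the data factory.
New-AzureDataFactoryLinkedService -ResourceGroupName adfdsprg -DataFactoryName adfdsp -File C:\temp\storagelinkedservice.json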

Define and create tables to specify how to access the datasets

Create tables that specify the structure, location, and availability of the datasets with the following script-based procedures. JSON files are used to define the tables. For more information on the structure of these files, see Datasets.

Note

You should execute the Add-AzureAccount cmdlet before executing the New-AzureDataFactoryTable cmdlet to confirm that the right Azure subscription is selected for the command execution. For documentation of this cmdlet, see Add-AzureAccount. A short sketch follows.
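
For example, assuming the subscription name below is a placeholder for your own:

# Sign in, then pin the subscription that the ADF cmdlets should target.
Add-AzureAccount
Select-AzureSubscription -SubscriptionName "<your subscription name>"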

The JSON-based definitions in the tables use the following names:

  • the table name in the SQL Server database is nyctaxi_data
  • the container name in the Azure Blob Storage account is containername

Three table definitions are needed for this ADF pipeline:

  1. SQL on-premises Table
  2. Blob Table
  3. SQL Azure Table

Note

These procedures use Azure PowerShell to define and create the ADF activities. But these tasks can also be accomplished using the Azure portal. For details, see Create datasets.

SQL on-premises Table

The table definition for the SQL Server is specified in the following JSON file:

{
    "name": "OnPremSQLTable",
    "properties":
    {
        "location":
        {
            "type": "OnPremisesSqlServerTableLocation",
            "tableName": "nyctaxi_data",
            "linkedServiceName": "adfonpremsql"
        },
        "availability":
        {
            "frequency": "Day",
            "interval": 1,
            "waitOnExternal":
            {
                "retryInterval": "00:01:00",
                "retryTimeout": "00:10:00",
                "maximumRetry": 3
            }
        }
    }
}

The column names were not included here. You can sub-select on the column names by including them here (for details, see the ADF documentation topic), as sketched below.
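
For example, to sub-select columns, a structure section could be added under properties; the column names below come from the NYC Taxi dataset, and the use of String for every type is a simplifying assumption:

"structure":
[
    { "name": "medallion", "type": "String" },
    { "name": "pickup_datetime", "type": "String" },
    { "name": "trip_distance", "type": "String" }
]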

Copy the JSON definition of the table into a file called onpremtabledef.json and save it to a known location (here assumed to be C:\temp\onpremtabledef.json). Create the table in ADF with the following Azure PowerShell cmdlet:

New-AzureDataFactoryTable -ResourceGroupName adfdsprg -DataFactoryName adfdsp -File C:\temp\onpremtabledef.json

Blob Table

The definition for the table for the output blob location follows (this maps the ingested data from on-premises to Azure blob):

{
    "name": "OutputBlobTable",
    "properties":
    {
        "location":
        {
            "type": "AzureBlobLocation",
            "folderPath": "containername",
            "format":
            {
                "type": "TextFormat",
                "columnDelimiter": "\t"
            },
            "linkedServiceName": "adfds"
        },
        "availability":
        {
            "frequency": "Day",
            "interval": 1
        }
    }
}

Copy the JSON definition of the table into a file called bloboutputtabledef.json and save it to a known location (here assumed to be C:\temp\bloboutputtabledef.json). Create the table in ADF with the following Azure PowerShell cmdlet:

New-AzureDataFactoryTable -ResourceGroupName adfdsprg -DataFactoryName adfdsp -File C:\temp\bloboutputtabledef.json

SQL Azure Table

The definition for the table for the SQL Azure output follows (this schema maps the data coming from the blob):

{
    "name": "OutputSQLAzureTable",
    "properties":
    {
        "structure":
        [
            { "name": "column1", "type": "String"},
            { "name": "column2", "type": "String"}
        ],
        "location":
        {
            "type": "AzureSqlTableLocation",
            "tableName": "your_db_name",
            "linkedServiceName": "adfdssqlazure_linked_servicename"
        },
        "availability":
        {
            "frequency": "Day",
            "interval": 1
        }
    }
}

Copy the JSON definition of the table into a file called AzureSqlTable.json and save it to a known location (here assumed to be C:\temp\AzureSqlTable.json). Create the table in ADF with the following Azure PowerShell cmdlet:

New-AzureDataFactoryTable -ResourceGroupName adfdsprg -DataFactoryName adfdsp -File C:\temp\AzureSqlTable.json
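
To confirm that all three tables were registered, you can list them; this sketch assumes the Get-AzureDataFactoryTable cmdlet from the same module generation:

# List the registered tables; expect OnPremSQLTable, OutputBlobTable, and OutputSQLAzureTable.
Get-AzureDataFactoryTable -ResourceGroupName adfdsprg -DataFactoryName adfdsp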

Define and create the pipeline

Specify the activities that belong to the pipeline and create the pipeline with the following script-based procedures. A JSON file is used to define the pipeline properties.

  • The script assumes that the pipeline name is AMLDSProcessPipeline.
  • Also note that we set the periodicity of the pipeline to be executed on a daily basis and use the default execution time for the job (12 am UTC).

Note

The following procedures use Azure PowerShell to define and create the ADF pipeline. But this task can also be accomplished using the Azure portal. For details, see Create pipeline.

Using the table definitions provided previously, the pipeline definition for the ADF is specified as follows:

{
    "name": "AMLDSProcessPipeline",
    "properties":
    {
        "description" : "This pipeline has two activities: the first one copies data from SQL Server to Azure Blob, and the second one copies from Azure Blob to Azure Database Table",
        "activities":
        [
            {
                "name": "CopyFromSQLtoBlob",
                "description": "Copy data from SQL Server to blob",
                "type": "CopyActivity",
                "inputs": [ {"name": "OnPremSQLTable"} ],
                "outputs": [ {"name": "OutputBlobTable"} ],
                "transformation":
                {
                    "source":
                    {
                        "type": "SqlSource",
                        "sqlReaderQuery": "select * from nyctaxi_data"
                    },
                    "sink":
                    {
                        "type": "BlobSink"
                    }
                },
                "Policy":
                {
                    "concurrency": 3,
                    "executionPriorityOrder": "NewestFirst",
                    "style": "StartOfInterval",
                    "retry": 0,
                    "timeout": "01:00:00"
                }
            },
            {
                "name": "CopyFromBlobtoSQLAzure",
                "description": "Push data to Sql Azure",
                "type": "CopyActivity",
                "inputs": [ {"name": "OutputBlobTable"} ],
                "outputs": [ {"name": "OutputSQLAzureTable"} ],
                "transformation":
                {
                    "source":
                    {
                        "type": "BlobSource"
                    },
                    "sink":
                    {
                        "type": "SqlSink",
                        "WriteBatchTimeout": "00:5:00",
                    }
                },
                "Policy":
                {
                    "concurrency": 3,
                    "executionPriorityOrder": "NewestFirst",
                    "style": "StartOfInterval",
                    "retry": 2,
                    "timeout": "02:00:00"
                }
            }
        ]
    }
}

Copy this JSON definition of the pipeline into a file called pipelinedef.json and save it to a known location (here assumed to be C:\temp\pipelinedef.json). Create the pipeline in ADF with the following Azure PowerShell cmdlet:

New-AzureDataFactoryPipeline -ResourceGroupName adfdsprg -DataFactoryName adfdsp -File C:\temp\pipelinedef.json
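
To verify that the pipeline definition was accepted, you can retrieve it; this sketch assumes the Get-AzureDataFactoryPipeline cmdlet from the same module generation:

# Retrieve the pipeline to confirm it was created.
Get-AzureDataFactoryPipeline -ResourceGroupName adfdsprg -DataFactoryName adfdsp -Name AMLDSProcessPipeline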

Start the Pipeline

The pipeline can now be run using the following command:

Set-AzureDataFactoryPipelineActivePeriod -ResourceGroupName adfdsprg -DataFactoryName adfdsp -StartDateTime startdateZ -EndDateTime enddateZ -Name AMLDSProcessPipeline

The startdate and enddate parameter values need to be replaced with the actual dates between which you want the pipeline to run, as in the example below.
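
For example, to run the pipeline for the month of January 2016 (the dates are purely illustrative):

# Activate the pipeline for a one-month window.
Set-AzureDataFactoryPipelineActivePeriod -ResourceGroupName adfdsprg -DataFactoryName adfdsp -StartDateTime 2016-01-01Z -EndDateTime 2016-01-31Z -Name AMLDSProcessPipeline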

Once the pipeline executes, you should be able to see the data show up in the container selected for the blob, one file per day. You can monitor progress from PowerShell, as sketched below.
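
To monitor the daily slices from PowerShell, this sketch assumes the Get-AzureDataFactorySlice cmdlet from the same module generation; the start time is a placeholder:

# Inspect slice status (for example, Ready or Failed) for the blob dataset.
Get-AzureDataFactorySlice -ResourceGroupName adfdsprg -DataFactoryName adfdsp -TableName OutputBlobTable -StartDateTime 2016-01-01Z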

We have not leveraged the functionality provided by ADF to pipe data incrementally. For more information on how to do this and other capabilities provided by ADF, see the ADF documentation.