Copy data from Amazon Redshift using Azure Data Factory

APPLIES TO: Azure Data Factory and Azure Synapse Analytics (Preview)

This article outlines how to use the Copy Activity in Azure Data Factory to copy data from Amazon Redshift. It builds on the copy activity overview article, which presents a general overview of the copy activity.

Supported capabilities

This Amazon Redshift connector is supported for the following activities:

- Copy activity with supported source/sink matrix
- Lookup activity

You can copy data from Amazon Redshift to any supported sink data store. For a list of data stores that are supported as sources or sinks by the copy activity, see the Supported data stores table.

Specifically, this Amazon Redshift connector supports retrieving data from Redshift using a query or the built-in Redshift UNLOAD support.

Tip

To achieve the best performance when copying large amounts of data from Redshift, consider using the built-in Redshift UNLOAD through Amazon S3. See the Use UNLOAD to copy data from Amazon Redshift section for details.

Prerequisites

Getting started

To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs: the Copy Data tool, the Azure portal, the .NET SDK, the Python SDK, Azure PowerShell, the REST API, or an Azure Resource Manager template.

The following sections provide details about properties that are used to define Data Factory entities specific to the Amazon Redshift connector.

Linked service properties

The following properties are supported for the Amazon Redshift linked service:

| Property | Description | Required |
|:--- |:--- |:--- |
| type | The type property must be set to: AmazonRedshift | Yes |
| server | IP address or host name of the Amazon Redshift server. | Yes |
| port | The number of the TCP port that the Amazon Redshift server uses to listen for client connections. | No, default is 5439 |
| database | Name of the Amazon Redshift database. | Yes |
| username | Name of the user who has access to the database. | Yes |
| password | Password for the user account. Mark this field as a SecureString to store it securely in Data Factory, or reference a secret stored in Azure Key Vault. | Yes |
| connectVia | The Integration Runtime to be used to connect to the data store. You can use Azure Integration Runtime or Self-hosted Integration Runtime (if your data store is located in a private network). If not specified, it uses the default Azure Integration Runtime. | No |

Example:

{
    "name": "AmazonRedshiftLinkedService",
    "properties":
    {
        "type": "AmazonRedshift",
        "typeProperties":
        {
            "server": "<server name>",
            "database": "<database name>",
            "username": "<username>",
            "password": {
                "type": "SecureString",
                "value": "<password>"
            }
        },
        "connectVia": {
            "referenceName": "<name of Integration Runtime>",
            "type": "IntegrationRuntimeReference"
        }
    }
}
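
As the password row above notes, you can also keep the credential out of the Data Factory definition and reference a secret stored in Azure Key Vault. The following is a minimal sketch of that variant, assuming you already have an Azure Key Vault linked service; all names in angle brackets are placeholders.

{
    "name": "AmazonRedshiftLinkedService",
    "properties":
    {
        "type": "AmazonRedshift",
        "typeProperties":
        {
            "server": "<server name>",
            "database": "<database name>",
            "username": "<username>",
            "password": {
                "type": "AzureKeyVaultSecret",
                "store": {
                    "referenceName": "<Azure Key Vault linked service name>",
                    "type": "LinkedServiceReference"
                },
                "secretName": "<secret name>"
            }
        },
        "connectVia": {
            "referenceName": "<name of Integration Runtime>",
            "type": "IntegrationRuntimeReference"
        }
    }
}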

Dataset properties

For a full list of sections and properties available for defining datasets, see the datasets article. This section provides a list of properties supported by the Amazon Redshift dataset.

To copy data from Amazon Redshift, the following properties are supported:

| Property | Description | Required |
|:--- |:--- |:--- |
| type | The type property of the dataset must be set to: AmazonRedshiftTable | Yes |
| schema | Name of the schema. | No (if "query" in the activity source is specified) |
| table | Name of the table. | No (if "query" in the activity source is specified) |
| tableName | Name of the table with schema. This property is supported for backward compatibility. Use schema and table for new workloads. | No (if "query" in the activity source is specified) |

Example

{
    "name": "AmazonRedshiftDataset",
    "properties":
    {
        "type": "AmazonRedshiftTable",
        "typeProperties": {},
        "schema": [],
        "linkedServiceName": {
            "referenceName": "<Amazon Redshift linked service name>",
            "type": "LinkedServiceReference"
        }
    }
}
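
If you would rather bind the dataset to a specific table instead of supplying a query in the activity source, fill in the schema and table type properties described above. A minimal sketch, with placeholder names:

{
    "name": "AmazonRedshiftDataset",
    "properties":
    {
        "type": "AmazonRedshiftTable",
        "typeProperties": {
            "schema": "<schema name>",
            "table": "<table name>"
        },
        "schema": [],
        "linkedServiceName": {
            "referenceName": "<Amazon Redshift linked service name>",
            "type": "LinkedServiceReference"
        }
    }
}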

If you were using a RelationalTable typed dataset, it is still supported as-is, but we recommend that you use the new dataset type going forward.

Copy activity properties

For a full list of sections and properties available for defining activities, see the Pipelines article. This section provides a list of properties supported by the Amazon Redshift source.

Amazon Redshift as source

To copy data from Amazon Redshift, set the source type in the copy activity to AmazonRedshiftSource. The following properties are supported in the copy activity source section:

| Property | Description | Required |
|:--- |:--- |:--- |
| type | The type property of the copy activity source must be set to: AmazonRedshiftSource | Yes |
| query | Use the custom query to read data. For example: select * from MyTable. | No (if "tableName" in the dataset is specified) |
| redshiftUnloadSettings | Property group used when using Amazon Redshift UNLOAD. | No |
| s3LinkedServiceName | Refers to the Amazon S3 account to be used as an interim store, by specifying a linked service name of the "AmazonS3" type. | Yes, if using UNLOAD |
| bucketName | Indicates the S3 bucket in which to store the interim data. If not provided, the Data Factory service generates it automatically. | Yes, if using UNLOAD |
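
When you do not use UNLOAD, a source section only needs the type and a query. A minimal sketch:

"source": {
    "type": "AmazonRedshiftSource",
    "query": "<SQL query>"
}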

Example: Amazon Redshift source in copy activity using UNLOAD

"source": {
    "type": "AmazonRedshiftSource",
    "query": "<SQL query>",
    "redshiftUnloadSettings": {
        "s3LinkedServiceName": {
            "referenceName": "<Amazon S3 linked service>",
            "type": "LinkedServiceReference"
        },
        "bucketName": "bucketForUnload"
    }
}

Learn more about how to use UNLOAD to copy data from Amazon Redshift efficiently in the next section.

Use UNLOAD to copy data from Amazon Redshift

UNLOAD is a mechanism provided by Amazon Redshift that can unload the results of a query to one or more files on Amazon Simple Storage Service (Amazon S3). It is the way Amazon recommends for copying large data sets from Redshift.

Example: copy data from Amazon Redshift to Azure SQL Data Warehouse using UNLOAD, staged copy, and PolyBase

For this sample use case, the copy activity unloads data from Amazon Redshift to Amazon S3 as configured in "redshiftUnloadSettings", then copies the data from Amazon S3 to Azure Blob storage as specified in "stagingSettings", and finally uses PolyBase to load the data into SQL Data Warehouse. All the interim formats are handled properly by the copy activity.

(Diagram: Redshift to SQL DW copy workflow)

"activities":[
    {
        "name": "CopyFromAmazonRedshiftToSQLDW",
        "type": "Copy",
        "inputs": [
            {
                "referenceName": "AmazonRedshiftDataset",
                "type": "DatasetReference"
            }
        ],
        "outputs": [
            {
                "referenceName": "AzureSQLDWDataset",
                "type": "DatasetReference"
            }
        ],
        "typeProperties": {
            "source": {
                "type": "AmazonRedshiftSource",
                "query": "select * from MyTable",
                "redshiftUnloadSettings": {
                    "s3LinkedServiceName": {
                        "referenceName": "AmazonS3LinkedService",
                        "type": "LinkedServiceReference"
                    },
                    "bucketName": "bucketForUnload"
                }
            },
            "sink": {
                "type": "SqlDWSink",
                "allowPolyBase": true
            },
            "enableStaging": true,
            "stagingSettings": {
                "linkedServiceName": "AzureStorageLinkedService",
                "path": "adfstagingcopydata"
            },
            "dataIntegrationUnits": 32
        }
    }
]

Data type mapping for Amazon Redshift

When copying data from Amazon Redshift, the following mappings are used from Amazon Redshift data types to Azure Data Factory interim data types. See Schema and data type mappings to learn about how the copy activity maps the source schema and data type to the sink.

| Amazon Redshift data type | Data Factory interim data type |
|:--- |:--- |
| BIGINT | Int64 |
| BOOLEAN | String |
| CHAR | String |
| DATE | DateTime |
| DECIMAL | Decimal |
| DOUBLE PRECISION | Double |
| INTEGER | Int32 |
| REAL | Single |
| SMALLINT | Int16 |
| TEXT | String |
| TIMESTAMP | DateTime |
| VARCHAR | String |
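
If the default mappings above are not what you want for a particular copy, the copy activity also accepts an explicit column mapping in its translator section (see the Schema and data type mappings article). The snippet below is a sketch only: the column names order_total and OrderTotal are hypothetical, and the TabularTranslator shape follows the general copy activity mapping format rather than anything specific to Redshift.

"translator": {
    "type": "TabularTranslator",
    "mappings": [
        {
            "source": { "name": "order_total", "type": "Decimal" },
            "sink": { "name": "OrderTotal" }
        }
    ]
}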

Lookup activity properties

To learn details about the properties, check the Lookup activity article.

Next steps

For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see Supported data stores.