# Copy data from Amazon Redshift using Azure Data Factory
**APPLIES TO:** Azure Data Factory, Azure Synapse Analytics
This article outlines how to use the Copy activity in Azure Data Factory to copy data from Amazon Redshift. It builds on the copy activity overview article, which presents a general overview of the copy activity.
## Supported capabilities
This Amazon Redshift connector is supported for the following activities:

- Copy activity
- Lookup activity
You can copy data from Amazon Redshift to any supported sink data store. For a list of data stores that are supported as sources or sinks by the copy activity, see the Supported data stores table.
Specifically, this Amazon Redshift connector supports retrieving data from Redshift by using a query or the built-in Redshift UNLOAD support.
> **Tip:** To achieve the best performance when copying large amounts of data from Redshift, consider using the built-in Redshift UNLOAD through Amazon S3. See the Use UNLOAD to copy data from Amazon Redshift section for details.
## Prerequisites
- If you're copying data to an on-premises data store by using a self-hosted integration runtime, grant the integration runtime (using the IP address of the machine) access to the Amazon Redshift cluster. See Authorize access to the cluster for instructions.
- If you're copying data to an Azure data store, see Azure Data Center IP Ranges for the compute IP address and SQL ranges used by the Azure data centers.
## Getting started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
- The Copy Data tool
- The Azure portal
- The .NET SDK
- The Python SDK
- Azure PowerShell
- The REST API
- The Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to the Amazon Redshift connector.
## Linked service properties
The following properties are supported for the Amazon Redshift linked service:
| Property | Description | Required |
|---|---|---|
| type | The type property must be set to: **AmazonRedshift** | Yes |
| server | IP address or host name of the Amazon Redshift server. | Yes |
| port | The number of the TCP port that the Amazon Redshift server uses to listen for client connections. | No, default is 5439 |
| database | Name of the Amazon Redshift database. | Yes |
| username | Name of the user who has access to the database. | Yes |
| password | Password for the user account. Mark this field as a SecureString to store it securely in Data Factory, or reference a secret stored in Azure Key Vault. | Yes |
| connectVia | The integration runtime to be used to connect to the data store. You can use the Azure integration runtime or a self-hosted integration runtime (if your data store is located in a private network). If not specified, the default Azure integration runtime is used. | No |
**Example:**
```json
{
    "name": "AmazonRedshiftLinkedService",
    "properties": {
        "type": "AmazonRedshift",
        "typeProperties": {
            "server": "<server name>",
            "database": "<database name>",
            "username": "<username>",
            "password": {
                "type": "SecureString",
                "value": "<password>"
            }
        },
        "connectVia": {
            "referenceName": "<name of Integration Runtime>",
            "type": "IntegrationRuntimeReference"
        }
    }
}
```
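As the table above notes, the password can instead reference a secret stored in Azure Key Vault. A sketch of that variant, assuming an Azure Key Vault linked service has already been defined (the linked service and secret names are placeholders):

```json
{
    "name": "AmazonRedshiftLinkedService",
    "properties": {
        "type": "AmazonRedshift",
        "typeProperties": {
            "server": "<server name>",
            "database": "<database name>",
            "username": "<username>",
            "password": {
                "type": "AzureKeyVaultSecret",
                "store": {
                    "referenceName": "<Azure Key Vault linked service name>",
                    "type": "LinkedServiceReference"
                },
                "secretName": "<secret name>"
            }
        }
    }
}
```

This keeps the credential out of the Data Factory definition itself; access is governed by the Key Vault linked service.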
## Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section provides a list of properties supported by the Amazon Redshift dataset.
To copy data from Amazon Redshift, the following properties are supported:
| Property | Description | Required |
|---|---|---|
| type | The type property of the dataset must be set to: **AmazonRedshiftTable** | Yes |
| schema | Name of the schema. | No (if "query" in the activity source is specified) |
| table | Name of the table. | No (if "query" in the activity source is specified) |
| tableName | Name of the table with schema. This property is supported for backward compatibility. Use schema and table for new workloads. | No (if "query" in the activity source is specified) |
**Example**
```json
{
    "name": "AmazonRedshiftDataset",
    "properties": {
        "type": "AmazonRedshiftTable",
        "typeProperties": {},
        "schema": [],
        "linkedServiceName": {
            "referenceName": "<Amazon Redshift linked service name>",
            "type": "LinkedServiceReference"
        }
    }
}
```
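The typeProperties above are left empty because the copy activity source supplies a query. When the table is referenced from the dataset instead, a sketch would look like this (schema and table names are placeholders):

```json
{
    "name": "AmazonRedshiftDataset",
    "properties": {
        "type": "AmazonRedshiftTable",
        "typeProperties": {
            "schema": "<schema name>",
            "table": "<table name>"
        },
        "linkedServiceName": {
            "referenceName": "<Amazon Redshift linked service name>",
            "type": "LinkedServiceReference"
        }
    }
}
```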
If you were using a RelationalTable typed dataset, it is still supported as-is, but we suggest that you use the new one going forward.
## Copy activity properties
For a full list of sections and properties available for defining activities, see the Pipelines article. This section provides a list of properties supported by the Amazon Redshift source.
### Amazon Redshift as source
To copy data from Amazon Redshift, set the source type in the copy activity to **AmazonRedshiftSource**. The following properties are supported in the copy activity **source** section:
| Property | Description | Required |
|---|---|---|
| type | The type property of the copy activity source must be set to: **AmazonRedshiftSource** | Yes |
| query | Use the custom query to read data. For example: `select * from MyTable`. | No (if "tableName" in the dataset is specified) |
| redshiftUnloadSettings | Property group when using Amazon Redshift UNLOAD. | No |
| s3LinkedServiceName | Refers to an Amazon S3 bucket to be used as an interim store, by specifying a linked service name of "AmazonS3" type. | Yes, if using UNLOAD |
| bucketName | Indicates the S3 bucket in which to store the interim data. If not provided, the Data Factory service generates it automatically. | Yes, if using UNLOAD |
**Example: Amazon Redshift source in a copy activity using UNLOAD**
```json
"source": {
    "type": "AmazonRedshiftSource",
    "query": "<SQL query>",
    "redshiftUnloadSettings": {
        "s3LinkedServiceName": {
            "referenceName": "<Amazon S3 linked service>",
            "type": "LinkedServiceReference"
        },
        "bucketName": "bucketForUnload"
    }
}
```
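Because redshiftUnloadSettings is optional, a source that does not use UNLOAD simply omits that property group and pulls data through the query alone. A minimal sketch:

```json
"source": {
    "type": "AmazonRedshiftSource",
    "query": "select * from MyTable"
}
```

This path is simpler but routes all data directly through the connector, so it is generally better suited to smaller data sets.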
Learn more about how to use UNLOAD to copy data from Amazon Redshift efficiently in the next section.
## Use UNLOAD to copy data from Amazon Redshift
UNLOAD is a mechanism provided by Amazon Redshift that can unload the results of a query to one or more files on Amazon Simple Storage Service (Amazon S3). It is the way Amazon recommends for copying large data sets from Redshift.
**Example: copy data from Amazon Redshift to Azure Synapse Analytics by using UNLOAD, staged copy, and PolyBase**
For this sample use case, the copy activity unloads data from Amazon Redshift to Amazon S3 as configured in "redshiftUnloadSettings", then copies the data from Amazon S3 to Azure Blob storage as specified in "stagingSettings", and finally uses PolyBase to load the data into Azure Synapse Analytics. All the interim formats are handled properly by the copy activity.
```json
"activities": [
    {
        "name": "CopyFromAmazonRedshiftToSQLDW",
        "type": "Copy",
        "inputs": [
            {
                "referenceName": "AmazonRedshiftDataset",
                "type": "DatasetReference"
            }
        ],
        "outputs": [
            {
                "referenceName": "AzureSQLDWDataset",
                "type": "DatasetReference"
            }
        ],
        "typeProperties": {
            "source": {
                "type": "AmazonRedshiftSource",
                "query": "select * from MyTable",
                "redshiftUnloadSettings": {
                    "s3LinkedServiceName": {
                        "referenceName": "AmazonS3LinkedService",
                        "type": "LinkedServiceReference"
                    },
                    "bucketName": "bucketForUnload"
                }
            },
            "sink": {
                "type": "SqlDWSink",
                "allowPolyBase": true
            },
            "enableStaging": true,
            "stagingSettings": {
                "linkedServiceName": "AzureStorageLinkedService",
                "path": "adfstagingcopydata"
            },
            "dataIntegrationUnits": 32
        }
    }
]
```
## Data type mapping for Amazon Redshift
When copying data from Amazon Redshift, the following mappings are used from Amazon Redshift data types to Azure Data Factory interim data types. See Schema and data type mappings to learn how the copy activity maps the source schema and data type to the sink.
| Amazon Redshift data type | Data Factory interim data type |
|---|---|
| BIGINT | Int64 |
| BOOLEAN | String |
| CHAR | String |
| DATE | DateTime |
| DECIMAL | Decimal |
| DOUBLE PRECISION | Double |
| INTEGER | Int32 |
| REAL | Single |
| SMALLINT | Int16 |
| TEXT | String |
| TIMESTAMP | DateTime |
| VARCHAR | String |
## Lookup activity properties
To learn details about the properties, check the Lookup activity article.
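As a sketch of how this connector fits into a Lookup activity, the same **AmazonRedshiftSource** settings described above can be reused (the activity, dataset, and query names here are illustrative placeholders, not from the source article):

```json
{
    "name": "LookupFromRedshift",
    "type": "Lookup",
    "typeProperties": {
        "source": {
            "type": "AmazonRedshiftSource",
            "query": "select count(*) as cnt from MyTable"
        },
        "dataset": {
            "referenceName": "AmazonRedshiftDataset",
            "type": "DatasetReference"
        }
    }
}
```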
## Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported data stores.