Copy and transform data in Azure Database for PostgreSQL by using Azure Data Factory

APPLIES TO: Azure Data Factory, Azure Synapse Analytics

This article outlines how to use Copy Activity in Azure Data Factory to copy data from and to Azure Database for PostgreSQL, and use Data Flow to transform data in Azure Database for PostgreSQL. To learn about Azure Data Factory, read the introductory article.

This connector is specialized for the Azure Database for PostgreSQL service. To copy data from a generic PostgreSQL database located on-premises or in the cloud, use the PostgreSQL connector.

Supported capabilities

This Azure Database for PostgreSQL connector is supported for the following activities, each of which is covered in a later section of this article:

- Copy activity (as source and as sink)
- Mapping data flow (as source and as sink)
- Lookup activity

Getting started

To perform the Copy activity with a pipeline, you can use any of the tools or SDKs that Data Factory supports, such as the Copy Data tool, the Azure portal, the .NET SDK, the Python SDK, Azure PowerShell, the REST API, or an Azure Resource Manager template.

The following sections offer details about properties that are used to define Data Factory entities specific to the Azure Database for PostgreSQL connector.

Linked service properties

The following properties are supported for the Azure Database for PostgreSQL linked service:

Property | Description | Required
-------- | ----------- | --------
type | The type property must be set to: AzurePostgreSql. | Yes
connectionString | An ODBC connection string to connect to Azure Database for PostgreSQL. You can also put a password in Azure Key Vault and pull the password configuration out of the connection string. See the following samples and Store credentials in Azure Key Vault for more details. | Yes
connectVia | This property represents the integration runtime to be used to connect to the data store. You can use Azure Integration Runtime or a self-hosted integration runtime (if your data store is located in a private network). If not specified, it uses the default Azure Integration Runtime. An example that specifies connectVia appears after the samples below. | No

A typical connection string is Server=<server>.postgres.database.chinacloudapi.cn;Database=<database>;Port=<port>;UID=<username>;Password=<Password>. Here are more properties you can set per your case (a sample that combines them with the connection string appears after the basic sample below):

Property | Description | Options | Required
-------- | ----------- | ------- | --------
EncryptionMethod (EM) | The method the driver uses to encrypt data sent between the driver and the database server. For example, EncryptionMethod=<0/1/6>; | 0 (No Encryption) (Default) / 1 (SSL) / 6 (RequestSSL) | No
ValidateServerCertificate (VSC) | Determines whether the driver validates the certificate that's sent by the database server when SSL encryption is enabled (EncryptionMethod=1). For example, ValidateServerCertificate=<0/1>; | 0 (Disabled) (Default) / 1 (Enabled) | No

Example:

{
    "name": "AzurePostgreSqlLinkedService",
    "properties": {
        "type": "AzurePostgreSql",
        "typeProperties": {
            "connectionString": "Server=<server>.postgres.database.chinacloudapi.cn;Database=<database>;Port=<port>;UID=<username>;Password=<Password>"
        }
    }
}
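
The sample above uses the default encryption setting (0, no encryption). A variant that also sets the EncryptionMethod and ValidateServerCertificate options from the table above might look like the following sketch; the server, database, and credential values are placeholders:

{
    "name": "AzurePostgreSqlLinkedService",
    "properties": {
        "type": "AzurePostgreSql",
        "typeProperties": {
            "connectionString": "Server=<server>.postgres.database.chinacloudapi.cn;Database=<database>;Port=<port>;UID=<username>;Password=<Password>;EncryptionMethod=1;ValidateServerCertificate=1;"
        }
    }
}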

Example:

*Store password in Azure Key Vault*

{
    "name": "AzurePostgreSqlLinkedService",
    "properties": {
        "type": "AzurePostgreSql",
        "typeProperties": {
            "connectionString": "Server=<server>.postgres.database.chinacloudapi.cn;Database=<database>;Port=<port>;UID=<username>;",
            "password": { 
                "type": "AzureKeyVaultSecret", 
                "store": { 
                    "referenceName": "<Azure Key Vault linked service name>", 
                    "type": "LinkedServiceReference" 
                }, 
                "secretName": "<secretName>" 
            }
        }
    }
}
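
If the linked service must connect through a specific integration runtime (for example, a self-hosted integration runtime for a database in a private network), you can add the connectVia property alongside typeProperties. A minimal sketch, with the integration runtime name as a placeholder:

{
    "name": "AzurePostgreSqlLinkedService",
    "properties": {
        "type": "AzurePostgreSql",
        "typeProperties": {
            "connectionString": "Server=<server>.postgres.database.chinacloudapi.cn;Database=<database>;Port=<port>;UID=<username>;Password=<Password>"
        },
        "connectVia": {
            "referenceName": "<integration runtime name>",
            "type": "IntegrationRuntimeReference"
        }
    }
}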

Dataset properties

For a full list of sections and properties available for defining datasets, see Datasets in Azure Data Factory. This section provides a list of properties that Azure Database for PostgreSQL supports in datasets.

To copy data from Azure Database for PostgreSQL, set the type property of the dataset to AzurePostgreSqlTable. The following properties are supported:

Property | Description | Required
-------- | ----------- | --------
type | The type property of the dataset must be set to AzurePostgreSqlTable. | Yes
tableName | Name of the table. | No (if "query" in the activity source is specified)

Example:

{
    "name": "AzurePostgreSqlDataset",
    "properties": {
        "type": "AzurePostgreSqlTable",
        "linkedServiceName": {
            "referenceName": "<AzurePostgreSql linked service name>",
            "type": "LinkedServiceReference"
        },
        "typeProperties": {}
    }
}
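
The sample above leaves typeProperties empty, so the data to read must be supplied as a query in the copy activity source. To bind the dataset to a specific table instead, set the tableName property; in this sketch, "MyTable" is a placeholder:

{
    "name": "AzurePostgreSqlDataset",
    "properties": {
        "type": "AzurePostgreSqlTable",
        "linkedServiceName": {
            "referenceName": "<AzurePostgreSql linked service name>",
            "type": "LinkedServiceReference"
        },
        "typeProperties": {
            "tableName": "MyTable"
        }
    }
}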

Copy activity properties

For a full list of sections and properties available for defining activities, see Pipelines and activities in Azure Data Factory. This section provides a list of properties supported by an Azure Database for PostgreSQL source.

Azure Database for PostgreSQL as source

To copy data from Azure Database for PostgreSQL, set the source type in the copy activity to AzurePostgreSqlSource. The following properties are supported in the copy activity source section:

Property | Description | Required
-------- | ----------- | --------
type | The type property of the copy activity source must be set to AzurePostgreSqlSource. | Yes
query | Use the custom SQL query to read data. For example: SELECT * FROM mytable or SELECT * FROM "MyTable". Note that in PostgreSQL, the entity name is treated as case-insensitive if not quoted. | No (if the tableName property in the dataset is specified)

Example:

"activities":[
    {
        "name": "CopyFromAzurePostgreSql",
        "type": "Copy",
        "inputs": [
            {
                "referenceName": "<AzurePostgreSql input dataset name>",
                "type": "DatasetReference"
            }
        ],
        "outputs": [
            {
                "referenceName": "<output dataset name>",
                "type": "DatasetReference"
            }
        ],
        "typeProperties": {
            "source": {
                "type": "AzurePostgreSqlSource",
                "query": "<custom query e.g. SELECT * FROM mytable>"
            },
            "sink": {
                "type": "<sink type>"
            }
        }
    }
]
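
The query property also accepts Data Factory expressions, so the source query can be built at run time. A sketch of the copy activity source section, assuming a pipeline parameter named windowStart exists and the table has a last_modified column (both are hypothetical):

"source": {
    "type": "AzurePostgreSqlSource",
    "query": "SELECT * FROM mytable WHERE last_modified >= '@{pipeline().parameters.windowStart}'"
}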

Azure Database for PostgreSQL as sink

To copy data to Azure Database for PostgreSQL, the following properties are supported in the copy activity sink section:

Property | Description | Required
-------- | ----------- | --------
type | The type property of the copy activity sink must be set to AzurePostgreSQLSink. | Yes
preCopyScript | Specify a SQL query for the copy activity to execute before you write data into Azure Database for PostgreSQL in each run. You can use this property to clean up the preloaded data. | No
writeBatchSize | Inserts data into the Azure Database for PostgreSQL table when the buffer size reaches writeBatchSize. Allowed value is an integer that represents the number of rows. | No (default is 10,000)
writeBatchTimeout | Wait time for the batch insert operation to complete before it times out. Allowed values are Timespan strings. An example is 00:30:00 (30 minutes). | No (default is 00:00:30)

Example:

"activities":[
    {
        "name": "CopyToAzureDatabaseForPostgreSQL",
        "type": "Copy",
        "inputs": [
            {
                "referenceName": "<input dataset name>",
                "type": "DatasetReference"
            }
        ],
        "outputs": [
            {
                "referenceName": "<Azure PostgreSQL output dataset name>",
                "type": "DatasetReference"
            }
        ],
        "typeProperties": {
            "source": {
                "type": "<source type>"
            },
            "sink": {
                "type": "AzurePostgreSQLSink",
                "preCopyScript": "<custom SQL script>",
                "writeBatchSize": 100000
            }
        }
    }
]
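
The sink above sets only preCopyScript and writeBatchSize. A variant of the sink section that also raises writeBatchTimeout and uses preCopyScript to empty the target table before each run might look like this sketch (the table name is a placeholder):

"sink": {
    "type": "AzurePostgreSQLSink",
    "preCopyScript": "TRUNCATE TABLE mytable",
    "writeBatchSize": 100000,
    "writeBatchTimeout": "00:30:00"
}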

Mapping data flow properties

When transforming data in mapping data flow, you can read from and write to tables in Azure Database for PostgreSQL. For more information, see the source transformation and sink transformation in mapping data flows. You can choose to use an Azure Database for PostgreSQL dataset or an inline dataset as the source and sink type.

Source transformation

The following table lists the properties supported by the Azure Database for PostgreSQL source. You can edit these properties in the Source options tab.

Name | Description | Required | Allowed values | Data flow script property
---- | ----------- | -------- | -------------- | --------------------------
Table | If you select Table as input, data flow fetches all the data from the table specified in the dataset. | No | - | (for inline dataset only) tableName
Query | If you select Query as input, specify a SQL query to fetch data from the source, which overrides any table you specify in the dataset. Using queries is a great way to reduce rows for testing or lookups. The Order By clause is not supported, but you can set a full SELECT FROM statement. You can also use user-defined table functions. select * from udfGetData() is a UDF in SQL that returns a table that you can use in data flow. Query example: select * from mytable where customerId > 1000 and customerId < 2000 or select * from "MyTable". Note that in PostgreSQL, the entity name is treated as case-insensitive if not quoted. | No | String | query
Batch size | Specify a batch size to chunk large data into batches. | No | Integer | batchSize
Isolation Level | Choose one of the following isolation levels: Read Committed, Read Uncommitted (default), Repeatable Read, Serializable, None (ignore isolation level). | No | READ_COMMITTED / READ_UNCOMMITTED / REPEATABLE_READ / SERIALIZABLE / NONE | isolationLevel

Azure Database for PostgreSQL source script example

When you use Azure Database for PostgreSQL as the source type, the associated data flow script is:

source(allowSchemaDrift: true,
    validateSchema: false,
    isolationLevel: 'READ_UNCOMMITTED',
    query: 'select * from mytable',
    format: 'query') ~> AzurePostgreSQLSource
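
The script above sets only query and isolationLevel. A sketch that also sets batchSize from the Source options table (the query and batch size are illustrative values):

source(allowSchemaDrift: true,
    validateSchema: false,
    isolationLevel: 'READ_COMMITTED',
    query: 'select * from mytable where customerId > 1000',
    batchSize: 5000,
    format: 'query') ~> AzurePostgreSQLSource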

Sink transformation

The following table lists the properties supported by the Azure Database for PostgreSQL sink. You can edit these properties in the Sink options tab.

Name | Description | Required | Allowed values | Data flow script property
---- | ----------- | -------- | -------------- | --------------------------
Update method | Specify what operations are allowed on your database destination. The default is to only allow inserts. To update, upsert, or delete rows, an Alter row transformation is required to tag rows for those actions. | Yes | true or false | deletable, insertable, updateable, upsertable
Key columns | For updates, upserts, and deletes, key column(s) must be set to determine which row to alter. The column name that you pick as the key will be used as part of the subsequent update, upsert, or delete. Therefore, you must pick a column that exists in the sink mapping. | No | Array | keys
Skip writing key columns | If you wish to not write the value to the key column, select "Skip writing key columns". | No | true or false | skipKeyWrites
Table action | Determines whether to recreate or remove all rows from the destination table prior to writing. None: No action will be done to the table. Recreate: The table will get dropped and recreated; required if creating a new table dynamically. Truncate: All rows from the target table will get removed. | No | true or false | recreate, truncate
Batch size | Specify how many rows are being written in each batch. Larger batch sizes improve compression and memory optimization, but risk out-of-memory exceptions when caching data. | No | Integer | batchSize
Pre and post SQL scripts | Specify multi-line SQL scripts that will execute before (pre-processing) and after (post-processing) data is written to your sink database. | No | String | preSQLs, postSQLs

Azure Database for PostgreSQL sink script example

When you use Azure Database for PostgreSQL as the sink type, the associated data flow script is:

IncomingStream sink(allowSchemaDrift: true,
    validateSchema: false,
    deletable:false,
    insertable:true,
    updateable:true,
    upsertable:true,
    keys:['keyColumn'],
    format: 'table',
    skipDuplicateMapInputs: true,
    skipDuplicateMapOutputs: true) ~> AzurePostgreSQLSink
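
A variant that truncates the target table before load and runs pre- and post-processing SQL, built from the script properties listed in the Sink options table, might look like the following sketch; the SQL statements, table name, and batch size are placeholders:

IncomingStream sink(allowSchemaDrift: true,
    validateSchema: false,
    insertable:true,
    truncate:true,
    batchSize: 10000,
    preSQLs:['DELETE FROM staging_table'],
    postSQLs:['ANALYZE mytable'],
    format: 'table',
    skipDuplicateMapInputs: true,
    skipDuplicateMapOutputs: true) ~> AzurePostgreSQLSink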

Lookup activity properties

For more information about the properties, see Lookup activity in Azure Data Factory.
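
As a sketch, a Lookup activity can reuse the same AzurePostgreSqlSource source type and an Azure Database for PostgreSQL dataset; the dataset reference, query, and column name are placeholders:

{
    "name": "LookupFromAzurePostgreSql",
    "type": "Lookup",
    "typeProperties": {
        "source": {
            "type": "AzurePostgreSqlSource",
            "query": "SELECT MAX(last_modified) AS watermark FROM mytable"
        },
        "dataset": {
            "referenceName": "<AzurePostgreSql input dataset name>",
            "type": "DatasetReference"
        },
        "firstRowOnly": true
    }
}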

Next steps

For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see Supported data stores.