Copy and transform data in Azure Synapse Analytics (formerly SQL Data Warehouse) by using Azure Data Factory

APPLIES TO: Azure Data Factory, Azure Synapse Analytics (Preview)

This article outlines how to use Copy Activity in Azure Data Factory to copy data from and to Azure Synapse Analytics. To learn about Azure Data Factory, read the introductory article.

Supported capabilities

This Azure Synapse Analytics connector is supported for the following activities:

For Copy Activity, this Azure Synapse Analytics connector supports these functions:

  • Copy data by using SQL authentication and Azure Active Directory (Azure AD) application token authentication with a service principal or managed identities for Azure resources.
  • As a source, retrieve data by using a SQL query or a stored procedure. You can also choose to parallel copy from an Azure Synapse Analytics source; see the Parallel copy from Synapse Analytics section for details.
  • As a sink, load data by using PolyBase, the COPY statement (preview), or bulk insert. We recommend PolyBase or the COPY statement (preview) for better copy performance. The connector also supports automatically creating the destination table, if it doesn't exist, based on the source schema.

Important

If you copy data by using the Azure Data Factory integration runtime, configure a server-level firewall rule so that Azure services can access the logical SQL server. If you copy data by using a self-hosted integration runtime, configure the firewall to allow the appropriate IP range. This range includes the machine's IP that's used to connect to Azure Synapse Analytics.
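
A minimal sketch of creating that server-level rule with T-SQL, assuming you run it against the master database of the logical SQL server (the rule name is illustrative):

-- The special 0.0.0.0 start/end range is the rule that allows access from all Azure services.
EXEC sp_set_firewall_rule N'AllowAllWindowsAzureIps', '0.0.0.0', '0.0.0.0';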

Get started

Tip

To achieve the best performance, use PolyBase to load data into Azure Synapse Analytics. The Use PolyBase to load data into Azure Synapse Analytics section has details. For a walkthrough with a use case, see Load 1 TB into Azure Synapse Analytics under 15 minutes with Azure Data Factory.

To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:

The following sections provide details about properties that define Data Factory entities specific to the Azure Synapse Analytics connector.

Linked service properties

The following properties are supported for an Azure Synapse Analytics linked service:

| Property | Description | Required |
| --- | --- | --- |
| type | The type property must be set to AzureSqlDW. | Yes |
| connectionString | Specify the information needed to connect to the Azure Synapse Analytics instance for the connectionString property. Mark this field as a SecureString to store it securely in Data Factory. You can also put the password or service principal key in Azure Key Vault and, for SQL authentication, pull the password configuration out of the connection string. See the JSON example below the table and the Store credentials in Azure Key Vault article for more details. | Yes |
| servicePrincipalId | Specify the application's client ID. | Yes, when you use Azure AD authentication with a service principal |
| servicePrincipalKey | Specify the application's key. Mark this field as a SecureString to store it securely in Data Factory, or reference a secret stored in Azure Key Vault. | Yes, when you use Azure AD authentication with a service principal |
| tenant | Specify the tenant information (domain name or tenant ID) under which your application resides. You can retrieve it by hovering over the upper-right corner of the Azure portal. | Yes, when you use Azure AD authentication with a service principal |
| azureCloudType | For service principal authentication, specify the type of Azure cloud environment to which your Azure AD application is registered. The allowed value is AzureChina. By default, the data factory's cloud environment is used. | No |
| connectVia | The integration runtime to be used to connect to the data store. You can use the Azure integration runtime or a self-hosted integration runtime (if your data store is located in a private network). If not specified, the default Azure integration runtime is used. | No |

For different authentication types, refer to the following sections on prerequisites and JSON samples, respectively:

Tip

If you hit an error with the error code "UserErrorFailedToConnectToSqlServer" and a message like "The session limit for the database is XXX and has been reached.", add Pooling=false to your connection string and try again.
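
For example, with the SQL authentication connection string used later in this article (server, database, and credential values are placeholders), the setting is appended at the end:

"connectionString": "Server=tcp:<servername>.database.chinacloudapi.cn,1433;Database=<databasename>;User ID=<username>@<servername>;Password=<password>;Trusted_Connection=False;Encrypt=True;Connection Timeout=30;Pooling=false"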

SQL authentication

Linked service example that uses SQL authentication

{
    "name": "AzureSqlDWLinkedService",
    "properties": {
        "type": "AzureSqlDW",
        "typeProperties": {
            "connectionString": "Server=tcp:<servername>.database.chinacloudapi.cn,1433;Database=<databasename>;User ID=<username>@<servername>;Password=<password>;Trusted_Connection=False;Encrypt=True;Connection Timeout=30"
        },
        "connectVia": {
            "referenceName": "<name of Integration Runtime>",
            "type": "IntegrationRuntimeReference"
        }
    }
}

Password in Azure Key Vault:

{
    "name": "AzureSqlDWLinkedService",
    "properties": {
        "type": "AzureSqlDW",
        "typeProperties": {
            "connectionString": "Server=tcp:<servername>.database.chinacloudapi.cn,1433;Database=<databasename>;User ID=<username>@<servername>;Trusted_Connection=False;Encrypt=True;Connection Timeout=30",
            "password": {
                "type": "AzureKeyVaultSecret",
                "store": {
                    "referenceName": "<Azure Key Vault linked service name>",
                    "type": "LinkedServiceReference"
                },
                "secretName": "<secretName>"
            }
        },
        "connectVia": {
            "referenceName": "<name of Integration Runtime>",
            "type": "IntegrationRuntimeReference"
        }
    }
}

Service principal authentication

To use service principal-based Azure AD application token authentication, follow these steps:

  1. Create an Azure Active Directory application from the Azure portal. Make note of the application name and the following values, which define the linked service:

    • Application ID
    • Application key
    • Tenant ID
  2. Provision an Azure Active Directory administrator for your server in the Azure portal, if you haven't already done so. The Azure AD administrator can be an Azure AD user or an Azure AD group. If you grant the group with the managed identity an admin role, skip steps 3 and 4. The administrator will have full access to the database.

  3. Create contained database users for the service principal. Connect to the data warehouse from or to which you want to copy data, by using tools like SSMS, with an Azure AD identity that has at least ALTER ANY USER permission. Run the following T-SQL:

    CREATE USER [your application name] FROM EXTERNAL PROVIDER;
    
  4. Grant the service principal needed permissions as you normally do for SQL users or others. Run the following code, or refer to more options here. If you want to use PolyBase to load the data, learn the required database permission.

    EXEC sp_addrolemember db_owner, [your application name];
    
  5. Configure an Azure Synapse Analytics linked service in Azure Data Factory.

Linked service example that uses service principal authentication

{
    "name": "AzureSqlDWLinkedService",
    "properties": {
        "type": "AzureSqlDW",
        "typeProperties": {
            "connectionString": "Server=tcp:<servername>.database.chinacloudapi.cn,1433;Database=<databasename>;Connection Timeout=30",
            "servicePrincipalId": "<service principal id>",
            "servicePrincipalKey": {
                "type": "SecureString",
                "value": "<service principal key>"
            },
            "tenant": "<tenant info, e.g. microsoft.partner.onmschina.cn>"
        },
        "connectVia": {
            "referenceName": "<name of Integration Runtime>",
            "type": "IntegrationRuntimeReference"
        }
    }
}

Managed identities for Azure resources authentication

A data factory can be associated with a managed identity for Azure resources that represents the specific factory. You can use this managed identity for Azure Synapse Analytics authentication. The designated factory can access and copy data from or to your data warehouse by using this identity.

To use managed identity authentication, follow these steps:

  1. Provision an Azure Active Directory administrator for your server in the Azure portal, if you haven't already done so. The Azure AD administrator can be an Azure AD user or an Azure AD group. If you grant the group with the managed identity an admin role, skip steps 3 and 4. The administrator will have full access to the database.

  2. Create contained database users for the Data Factory managed identity. Connect to the data warehouse from or to which you want to copy data, by using tools like SSMS, with an Azure AD identity that has at least ALTER ANY USER permission. Run the following T-SQL.

    CREATE USER [your Data Factory name] FROM EXTERNAL PROVIDER;
    
  3. Grant the Data Factory managed identity needed permissions as you normally do for SQL users and others. Run the following code, or refer to more options here. If you want to use PolyBase to load the data, learn the required database permission.

    EXEC sp_addrolemember db_owner, [your Data Factory name];
    
  4. Configure an Azure Synapse Analytics linked service in Azure Data Factory.

Example:

{
    "name": "AzureSqlDWLinkedService",
    "properties": {
        "type": "AzureSqlDW",
        "typeProperties": {
            "connectionString": "Server=tcp:<servername>.database.chinacloudapi.cn,1433;Database=<databasename>;Connection Timeout=30"
        },
        "connectVia": {
            "referenceName": "<name of Integration Runtime>",
            "type": "IntegrationRuntimeReference"
        }
    }
}

Dataset properties

For a full list of sections and properties available for defining datasets, see the Datasets article.

The following properties are supported for the Azure Synapse Analytics dataset:

| Property | Description | Required |
| --- | --- | --- |
| type | The type property of the dataset must be set to AzureSqlDWTable. | Yes |
| schema | Name of the schema. | No for source, Yes for sink |
| table | Name of the table/view. | No for source, Yes for sink |
| tableName | Name of the table/view with schema. This property is supported for backward compatibility. For new workloads, use schema and table. | No for source, Yes for sink |

Dataset properties example

{
    "name": "AzureSQLDWDataset",
    "properties":
    {
        "type": "AzureSqlDWTable",
        "linkedServiceName": {
            "referenceName": "<Azure Synapse Analytics linked service name>",
            "type": "LinkedServiceReference"
        },
        "schema": [ < physical schema, optional, retrievable during authoring > ],
        "typeProperties": {
            "schema": "<schema_name>",
            "table": "<table_name>"
        }
    }
}

Copy Activity properties

For a full list of sections and properties available for defining activities, see the Pipelines article. This section provides a list of properties supported by the Azure Synapse Analytics source and sink.

Azure Synapse Analytics as the source

Tip

To load data from Azure Synapse Analytics efficiently by using data partitioning, learn more from Parallel copy from Synapse Analytics.

To copy data from Azure Synapse Analytics, set the type property in the Copy Activity source to SqlDWSource. The following properties are supported in the Copy Activity source section:

| Property | Description | Required |
| --- | --- | --- |
| type | The type property of the Copy Activity source must be set to SqlDWSource. | Yes |
| sqlReaderQuery | Use the custom SQL query to read data. Example: select * from MyTable. | No |
| sqlReaderStoredProcedureName | The name of the stored procedure that reads data from the source table. The last SQL statement must be a SELECT statement in the stored procedure. | No |
| storedProcedureParameters | Parameters for the stored procedure. Allowed values are name or value pairs. The names and casing of parameters must match the names and casing of the stored procedure parameters. | No |
| isolationLevel | Specifies the transaction locking behavior for the SQL source. The allowed values are ReadCommitted, ReadUncommitted, RepeatableRead, Serializable, and Snapshot. If not specified, the database's default isolation level is used. Refer to this doc for more details. | No |
| partitionOptions | Specifies the data partitioning options used to load data from Azure Synapse Analytics. Allowed values are None (default), PhysicalPartitionsOfTable, and DynamicRange. When a partition option is enabled (that is, not None), the degree of parallelism for concurrently loading data from Azure Synapse Analytics is controlled by the parallelCopies setting on the copy activity. | No |
| partitionSettings | Specify the group of settings for data partitioning. Apply when the partition option isn't None. | No |
| Under partitionSettings: | | |
| partitionColumnName | Specify the name of the source column (in integer or date/datetime type) that will be used by range partitioning for parallel copy. If not specified, the index or the primary key of the table is auto-detected and used as the partition column. Apply when the partition option is DynamicRange. If you use a query to retrieve the source data, hook ?AdfDynamicRangePartitionCondition into the WHERE clause. For an example, see the Parallel copy from Synapse Analytics section. | No |
| partitionUpperBound | The maximum value of the partition column for partition range splitting. This value is used to decide the partition stride, not to filter the rows in the table. All rows in the table or query result will be partitioned and copied. If not specified, the copy activity auto-detects the value. Apply when the partition option is DynamicRange. For an example, see the Parallel copy from Synapse Analytics section. | No |
| partitionLowerBound | The minimum value of the partition column for partition range splitting. This value is used to decide the partition stride, not to filter the rows in the table. All rows in the table or query result will be partitioned and copied. If not specified, the copy activity auto-detects the value. Apply when the partition option is DynamicRange. For an example, see the Parallel copy from Synapse Analytics section. | No |

Example: using SQL query

"activities":[
    {
        "name": "CopyFromAzureSQLDW",
        "type": "Copy",
        "inputs": [
            {
                "referenceName": "<Azure Synapse Analytics input dataset name>",
                "type": "DatasetReference"
            }
        ],
        "outputs": [
            {
                "referenceName": "<output dataset name>",
                "type": "DatasetReference"
            }
        ],
        "typeProperties": {
            "source": {
                "type": "SqlDWSource",
                "sqlReaderQuery": "SELECT * FROM MyTable"
            },
            "sink": {
                "type": "<sink type>"
            }
        }
    }
]

Example: using stored procedure

"activities":[
    {
        "name": "CopyFromAzureSQLDW",
        "type": "Copy",
        "inputs": [
            {
                "referenceName": "<Azure Synapse Analytics input dataset name>",
                "type": "DatasetReference"
            }
        ],
        "outputs": [
            {
                "referenceName": "<output dataset name>",
                "type": "DatasetReference"
            }
        ],
        "typeProperties": {
            "source": {
                "type": "SqlDWSource",
                "sqlReaderStoredProcedureName": "CopyTestSrcStoredProcedureWithParameters",
                "storedProcedureParameters": {
                    "stringData": { "value": "str3" },
                    "identifier": { "value": "$$Text.Format('{0:yyyy}', <datetime parameter>)", "type": "Int"}
                }
            },
            "sink": {
                "type": "<sink type>"
            }
        }
    }
]

Sample stored procedure:

CREATE PROCEDURE CopyTestSrcStoredProcedureWithParameters
(
    @stringData varchar(20),
    @identifier int
)
AS
SET NOCOUNT ON;
BEGIN
    select *
    from dbo.UnitTestSrcTable
    where dbo.UnitTestSrcTable.stringData != @stringData
    and dbo.UnitTestSrcTable.identifier != @identifier
END
GO

Azure Synapse Analytics as sink

Azure Data Factory supports three ways to load data into Azure Synapse Analytics.

(Image: Azure Synapse Analytics sink copy options)

The fastest and most scalable way to load data is through PolyBase or the COPY statement (preview).

To copy data to Azure Synapse Analytics, set the sink type in Copy Activity to SqlDWSink. The following properties are supported in the Copy Activity sink section:

| Property | Description | Required |
| --- | --- | --- |
| type | The type property of the Copy Activity sink must be set to SqlDWSink. | Yes |
| allowPolyBase | Indicates whether to use PolyBase to load data into Azure Synapse Analytics. allowCopyCommand and allowPolyBase cannot both be true. See the Use PolyBase to load data into Azure Synapse Analytics section for constraints and details. Allowed values are True and False (default). | No. Apply when using PolyBase. |
| polyBaseSettings | A group of properties that can be specified when the allowPolyBase property is set to true. | No. Apply when using PolyBase. |
| allowCopyCommand | Indicates whether to use the COPY statement (preview) to load data into Azure Synapse Analytics. allowCopyCommand and allowPolyBase cannot both be true. See the Use COPY statement to load data into Azure Synapse Analytics section for constraints and details. Allowed values are True and False (default). | No. Apply when using COPY. |
| copyCommandSettings | A group of properties that can be specified when the allowCopyCommand property is set to TRUE. | No. Apply when using COPY. |
| writeBatchSize | Number of rows to insert into the SQL table per batch. The allowed value is integer (number of rows). By default, Data Factory dynamically determines the appropriate batch size based on the row size. | No. Apply when using bulk insert. |
| writeBatchTimeout | Wait time for the batch insert operation to finish before it times out. The allowed value is timespan. Example: "00:30:00" (30 minutes). | No. Apply when using bulk insert. |
| preCopyScript | Specify a SQL query for Copy Activity to run before writing data into Azure Synapse Analytics in each run. Use this property to clean up the preloaded data. | No |
| tableOption | Specifies whether to automatically create the sink table, if it doesn't exist, based on the source schema. Allowed values are none (default) and autoCreate. | No |
| disableMetricsCollection | Data Factory collects metrics such as Azure Synapse Analytics DWUs for copy performance optimization and recommendations. If you're concerned with this behavior, specify true to turn it off. | No (default is false) |

Azure Synapse Analytics sink example

"sink": {
    "type": "SqlDWSink",
    "allowPolyBase": true,
    "polyBaseSettings":
    {
        "rejectType": "percentage",
        "rejectValue": 10.0,
        "rejectSampleValue": 100,
        "useTypeDefault": true
    }
}
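
For comparison, a minimal sketch of a sink that uses the default bulk insert path instead, with the batch and pre-copy options from the table above (the batch size and cleanup script are illustrative):

"sink": {
    "type": "SqlDWSink",
    "writeBatchSize": 100000,
    "writeBatchTimeout": "00:30:00",
    "preCopyScript": "TRUNCATE TABLE dbo.MyTargetTable"
}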

Parallel copy from Synapse Analytics

The Azure Synapse Analytics connector in the copy activity provides built-in data partitioning to copy data in parallel. You can find data partitioning options on the Source tab of the copy activity.

(Screenshot of partition options)

When you enable partitioned copy, the copy activity runs parallel queries against your Azure Synapse Analytics source to load data by partitions. The parallel degree is controlled by the parallelCopies setting on the copy activity. For example, if you set parallelCopies to four, Data Factory concurrently generates and runs four queries based on your specified partition option and settings, and each query retrieves a portion of data from your Azure Synapse Analytics.

We suggest that you enable parallel copy with data partitioning, especially when you load a large amount of data from your Azure Synapse Analytics. The following are suggested configurations for different scenarios. When copying data into a file-based data store, it's recommended to write to a folder as multiple files (only specify the folder name), in which case the performance is better than writing to a single file.

Scenario: Full load from a large table, with physical partitions.
Suggested settings: Partition option: Physical partitions of table.

During execution, Data Factory automatically detects the physical partitions and copies data by partitions.

To check whether your table has physical partitions, you can refer to this query.

Scenario: Full load from a large table, without physical partitions, but with an integer or datetime column for data partitioning.
Suggested settings: Partition option: Dynamic range partition.
Partition column (optional): Specify the column used to partition data. If not specified, the index or primary key column is used.
Partition upper bound and partition lower bound (optional): Specify if you want to determine the partition stride. This is not for filtering the rows in the table; all rows in the table will be partitioned and copied. If not specified, the copy activity auto-detects the values.

For example, if your partition column "ID" has values ranging from 1 to 100, and you set the lower bound as 20 and the upper bound as 80, with parallel copy as 4, Data Factory retrieves data by 4 partitions: IDs in ranges <=20, [21, 50], [51, 80], and >=81, respectively.

Scenario: Load a large amount of data by using a custom query, without physical partitions, but with an integer or date/datetime column for data partitioning.
Suggested settings: Partition option: Dynamic range partition.
Query: SELECT * FROM <TableName> WHERE ?AdfDynamicRangePartitionCondition AND <your_additional_where_clause>.
Partition column: Specify the column used to partition data.
Partition upper bound and partition lower bound (optional): Specify if you want to determine the partition stride. This is not for filtering the rows in the table; all rows in the query result will be partitioned and copied. If not specified, the copy activity auto-detects the value.

During execution, Data Factory replaces ?AdfDynamicRangePartitionCondition with the actual column name and value ranges for each partition, and sends it to Azure Synapse Analytics.
For example, if your partition column "ID" has values ranging from 1 to 100, and you set the lower bound as 20 and the upper bound as 80, with parallel copy as 4, Data Factory retrieves data by 4 partitions: IDs in ranges <=20, [21, 50], [51, 80], and >=81, respectively.

Here are more sample queries for different scenarios:
1. Query the whole table:
SELECT * FROM <TableName> WHERE ?AdfDynamicRangePartitionCondition
2. Query from a table with column selection and additional where-clause filters:
SELECT <column_list> FROM <TableName> WHERE ?AdfDynamicRangePartitionCondition AND <your_additional_where_clause>
3. Query with subqueries:
SELECT <column_list> FROM (<your_sub_query>) AS T WHERE ?AdfDynamicRangePartitionCondition AND <your_additional_where_clause>
4. Query with partition in subquery:
SELECT <column_list> FROM (SELECT <your_sub_query_column_list> FROM <TableName> WHERE ?AdfDynamicRangePartitionCondition) AS T

Best practices to load data with the partition option:

  1. Choose a distinctive column as the partition column (like a primary key or unique key) to avoid data skew.
  2. If the table has built-in partitions, use the partition option "Physical partitions of table" to get better performance.
  3. If you use the Azure integration runtime to copy data, you can set larger "Data Integration Units (DIU)" (>4) to utilize more computing resources. Check the applicable scenarios there.
  4. "Degree of copy parallelism" controls the partition numbers; setting this number too large can sometimes hurt performance. We recommend setting this number as (DIU or number of self-hosted IR nodes) * (2 to 4); see the sketch after this list for how both knobs appear in JSON.
  5. Note that Azure Synapse Analytics can execute a maximum of 32 queries at a moment; setting "Degree of copy parallelism" too large may cause a Synapse throttling issue.
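
As a minimal sketch (the values are illustrative), both knobs are set in the copy activity's typeProperties; parallelCopies and dataIntegrationUnits are the JSON counterparts of "Degree of copy parallelism" and "Data Integration Units":

"typeProperties": {
    "source": {
        "type": "SqlDWSource",
        "partitionOption": "PhysicalPartitionsOfTable"
    },
    "sink": {
        "type": "<sink type>"
    },
    "parallelCopies": 8,
    "dataIntegrationUnits": 4
}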

Example: full load from a large table with physical partitions

"source": {
    "type": "SqlDWSource",
    "partitionOption": "PhysicalPartitionsOfTable"
}

Example: query with dynamic range partition

"source": {
    "type": "SqlDWSource",
    "query": "SELECT * FROM <TableName> WHERE ?AdfDynamicRangePartitionCondition AND <your_additional_where_clause>",
    "partitionOption": "DynamicRange",
    "partitionSettings": {
        "partitionColumnName": "<partition_column_name>",
        "partitionUpperBound": "<upper_value_of_partition_column (optional) to decide the partition stride, not as data filter>",
        "partitionLowerBound": "<lower_value_of_partition_column (optional) to decide the partition stride, not as data filter>"
    }
}

Sample query to check physical partitions

SELECT DISTINCT s.name AS SchemaName, t.name AS TableName, c.name AS ColumnName, CASE WHEN c.name IS NULL THEN 'no' ELSE 'yes' END AS HasPartition
FROM sys.tables AS t
LEFT JOIN sys.objects AS o ON t.object_id = o.object_id
LEFT JOIN sys.schemas AS s ON o.schema_id = s.schema_id
LEFT JOIN sys.indexes AS i ON t.object_id = i.object_id
LEFT JOIN sys.index_columns AS ic ON ic.partition_ordinal > 0 AND ic.index_id = i.index_id AND ic.object_id = t.object_id
LEFT JOIN sys.columns AS c ON c.object_id = ic.object_id AND c.column_id = ic.column_id
LEFT JOIN sys.types AS y ON c.system_type_id = y.system_type_id
WHERE s.name='[your schema]' AND t.name = '[your table name]'

If the table has physical partitions, you'll see "HasPartition" as "yes".

Use PolyBase to load data into Azure Synapse Analytics

Using PolyBase is an efficient way to load a large amount of data into Azure Synapse Analytics with high throughput. You'll see a large gain in throughput by using PolyBase instead of the default BULKINSERT mechanism.

  • If your source data is in Azure Blob or Azure Data Lake Storage Gen2, and its format is PolyBase compatible, you can use the copy activity to directly invoke PolyBase to let Azure Synapse Analytics pull the data from the source. For details, see Direct copy by using PolyBase.
  • If your source data store and format aren't originally supported by PolyBase, use the Staged copy by using PolyBase feature instead. The staged copy feature also provides you better throughput. It automatically converts the data into a PolyBase-compatible format, stores the data in Azure Blob storage, and then calls PolyBase to load the data into Azure Synapse Analytics.

The following PolyBase settings are supported under polyBaseSettings in the copy activity:

| Property | Description | Required |
| --- | --- | --- |
| rejectValue | Specifies the number or percentage of rows that can be rejected before the query fails. Learn more about PolyBase's reject options in the Arguments section of CREATE EXTERNAL TABLE (Transact-SQL). Allowed values are 0 (default), 1, 2, etc. | No |
| rejectType | Specifies whether the rejectValue option is a literal value or a percentage. Allowed values are Value (default) and Percentage. | No |
| rejectSampleValue | Determines the number of rows to retrieve before PolyBase recalculates the percentage of rejected rows. Allowed values are 1, 2, etc. | Yes, if the rejectType is percentage |
| useTypeDefault | Specifies how to handle missing values in delimited text files when PolyBase retrieves data from the text file. Learn more about this property from the Arguments section in CREATE EXTERNAL FILE FORMAT (Transact-SQL). Allowed values are True and False (default). | No |

Direct copy by using PolyBase

Azure Synapse Analytics PolyBase directly supports Azure Blob and Azure Data Lake Storage Gen2. If your source data meets the criteria described in this section, use PolyBase to copy directly from the source data store to Azure Synapse Analytics. Otherwise, use Staged copy by using PolyBase.

If the requirements aren't met, Azure Data Factory checks the settings and automatically falls back to the BULKINSERT mechanism for the data movement.

  1. The source linked service is with the following types and authentication methods:

    | Supported source data store type | Supported source authentication type |
    | --- | --- |
    | Azure Blob | Account key authentication, managed identity authentication |
    | Azure Data Lake Storage Gen2 | Account key authentication, managed identity authentication |

    Important

    If your Azure Storage is configured with a VNet service endpoint, you must use managed identity authentication; refer to Impact of using VNet Service Endpoints with Azure storage. Learn the required configurations in Data Factory from the Azure Blob - managed identity authentication and Azure Data Lake Storage Gen2 - managed identity authentication sections, respectively.

  2. The source data format is Parquet, ORC, or Delimited text, with the following configurations:

    1. The folder path doesn't contain a wildcard filter.
    2. The file name is empty or points to a single file. If you specify a wildcard file name in the copy activity, it can only be * or *.*.
    3. rowDelimiter is default, \n, \r\n, or \r.
    4. nullValue is left as default or set to empty string (""), and treatEmptyAsNull is left as default or set to true.
    5. encodingName is left as default or set to utf-8.
    6. quoteChar, escapeChar, and skipLineCount aren't specified. PolyBase supports skipping header rows, which can be configured as firstRowAsHeader in ADF.
    7. compression can be no compression, GZip, or Deflate.
  3. If your source is a folder, recursive in the copy activity must be set to true.

  4. wildcardFolderPath, wildcardFilename, modifiedDateTimeStart, modifiedDateTimeEnd, prefix, enablePartitionDiscovery, and additionalColumns are not specified.

Note

If your source is a folder, note that PolyBase retrieves files from the folder and all of its subfolders, and it doesn't retrieve data from files whose file names begin with an underscore (_) or a period (.), as documented here in the LOCATION argument.

"activities":[
    {
        "name": "CopyFromAzureBlobToSQLDataWarehouseViaPolyBase",
        "type": "Copy",
        "inputs": [
            {
                "referenceName": "ParquetDataset",
                "type": "DatasetReference"
            }
        ],
        "outputs": [
            {
                "referenceName": "AzureSQLDWDataset",
                "type": "DatasetReference"
            }
        ],
        "typeProperties": {
            "source": {
                "type": "ParquetSource",
                "storeSettings":{
                    "type": "AzureBlobStorageReadSettings",
                    "recursive": true
                }
            },
            "sink": {
                "type": "SqlDWSink",
                "allowPolyBase": true
            }
        }
    }
]

Staged copy by using PolyBase

When your source data isn't natively compatible with PolyBase, enable data copying via an interim staging Azure Blob or Azure Data Lake Storage Gen2 instance (it can't be Azure Premium Storage). In this case, Azure Data Factory automatically converts the data to meet the data format requirements of PolyBase. Then it invokes PolyBase to load the data into Azure Synapse Analytics. Finally, it cleans up your temporary data from the storage. See Staged copy for details about copying data via a staging.

To use this feature, create an Azure Blob storage linked service or Azure Data Lake Storage Gen2 linked service with account key or managed identity authentication that refers to the Azure storage account as the interim storage.
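
A minimal sketch of such a staging linked service, assuming account key authentication against a Blob storage account (the account name and key are placeholders; the linked service name matches the stagingSettings reference in the pipeline example below):

{
    "name": "MyStagingStorage",
    "properties": {
        "type": "AzureBlobStorage",
        "typeProperties": {
            "connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=<accountkey>;EndpointSuffix=core.chinacloudapi.cn"
        }
    }
}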

Important

If your staging Azure Storage is configured with a VNet service endpoint, you must use managed identity authentication; refer to Impact of using VNet Service Endpoints with Azure storage. Learn the required configurations in Data Factory from Azure Blob - managed identity authentication and Azure Data Lake Storage Gen2 - managed identity authentication.

"activities":[
    {
        "name": "CopyFromSQLServerToSQLDataWarehouseViaPolyBase",
        "type": "Copy",
        "inputs": [
            {
                "referenceName": "SQLServerDataset",
                "type": "DatasetReference"
            }
        ],
        "outputs": [
            {
                "referenceName": "AzureSQLDWDataset",
                "type": "DatasetReference"
            }
        ],
        "typeProperties": {
            "source": {
                "type": "SqlSource",
            },
            "sink": {
                "type": "SqlDWSink",
                "allowPolyBase": true
            },
            "enableStaging": true,
            "stagingSettings": {
                "linkedServiceName": {
                    "referenceName": "MyStagingStorage",
                    "type": "LinkedServiceReference"
                }
            }
        }
    }
]

Best practices for using PolyBase

The following sections provide best practices in addition to those mentioned in Best practices for Azure Synapse Analytics.

Required database permission

To use PolyBase, the user that loads data into Azure Synapse Analytics must have "CONTROL" permission on the target database. One way to achieve that is to add the user as a member of the db_owner role. Learn how to do that in the Azure Synapse Analytics overview.
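
A minimal T-SQL sketch of both options, assuming a loading user named LoaderUser (a placeholder):

-- Option 1: grant CONTROL on the target database directly.
GRANT CONTROL ON DATABASE::[<your database name>] TO LoaderUser;

-- Option 2: add the loading user to the db_owner role.
EXEC sp_addrolemember 'db_owner', 'LoaderUser';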

Row size and data type limits

PolyBase loads are limited to rows smaller than 1 MB. It can't be used to load to VARCHAR(MAX), NVARCHAR(MAX), or VARBINARY(MAX). For more information, see Azure Synapse Analytics service capacity limits.

When your source data has rows greater than 1 MB, you might want to vertically split the source tables into several small ones. Make sure that the largest size of each row doesn't exceed the limit. The smaller tables can then be loaded by using PolyBase and merged together in Azure Synapse Analytics.
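
An illustrative T-SQL sketch of that split (all table and column names are hypothetical): each source-side projection keeps the key plus a subset of columns so every row stays under the limit, and the loaded parts are rejoined in Azure Synapse Analytics with CTAS.

-- On the source: two narrow projections of one wide table.
CREATE VIEW dbo.WideTable_Part1 AS SELECT Id, ColA, ColB FROM dbo.WideTable;
CREATE VIEW dbo.WideTable_Part2 AS SELECT Id, ColC, ColD FROM dbo.WideTable;

-- In Azure Synapse Analytics, after loading each part with PolyBase:
CREATE TABLE dbo.WideTable_Merged
WITH (DISTRIBUTION = HASH(Id))
AS
SELECT p1.Id, p1.ColA, p1.ColB, p2.ColC, p2.ColD
FROM dbo.WideTable_Part1_Loaded AS p1
JOIN dbo.WideTable_Part2_Loaded AS p2 ON p1.Id = p2.Id;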

Alternatively, for data with such wide columns, you can use non-PolyBase to load the data by using ADF, with the "Allow PolyBase" setting turned off.

Azure Synapse Analytics resource class

To achieve the best possible throughput, assign a larger resource class to the user that loads data into Azure Synapse Analytics via PolyBase.
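
Resource classes are assigned through role membership. A sketch, assuming a loading user named LoaderUser and the static xlargerc resource class (run it in the data warehouse):

-- Grant the loading user a larger static resource class.
EXEC sp_addrolemember 'xlargerc', 'LoaderUser';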

PolyBase troubleshooting

Loading to a Decimal column

If your source data is in text format or in other non-PolyBase-compatible stores (using staged copy and PolyBase), and it contains an empty value to be loaded into an Azure Synapse Analytics Decimal column, you may hit the following error:

ErrorCode=FailedDbOperation, ......HadoopSqlException: Error converting data type VARCHAR to DECIMAL.....Detailed Message=Empty string can't be converted to DECIMAL.....

The solution is to unselect the "Use type default" option (as false) in the copy activity sink -> PolyBase settings. "USE_TYPE_DEFAULT" is a PolyBase native configuration, which specifies how to handle missing values in delimited text files when PolyBase retrieves data from the text file.
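
In the copy activity JSON, that corresponds to setting useTypeDefault to false under polyBaseSettings, as in this fragment:

"sink": {
    "type": "SqlDWSink",
    "allowPolyBase": true,
    "polyBaseSettings": {
        "useTypeDefault": false
    }
}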

tableName in Azure Synapse Analytics

The following table gives examples of how to specify the tableName property in the JSON dataset. It shows several combinations of schema and table names.

| DB schema | Table name | tableName JSON property |
| --- | --- | --- |
| dbo | MyTable | MyTable or dbo.MyTable or [dbo].[MyTable] |
| dbo1 | MyTable | dbo1.MyTable or [dbo1].[MyTable] |
| dbo | My.Table | [My.Table] or [dbo].[My.Table] |
| dbo1 | My.Table | [dbo1].[My.Table] |

If you see the following error, the problem might be the value you specified for the tableName property. See the preceding table for the correct way to specify values for the tableName JSON property.

Type=System.Data.SqlClient.SqlException,Message=Invalid object name 'stg.Account_test'.,Source=.Net SqlClient Data Provider

Columns with default values

Currently, the PolyBase feature in Data Factory accepts only the same number of columns as in the target table. An example is a table with four columns where one of them is defined with a default value. The input data still needs to have four columns. A three-column input dataset yields an error similar to the following message:

All columns of the table must be specified in the INSERT BULK statement.

The NULL value is a special form of the default value. If the column is nullable, the input data in the blob for that column might be empty. But it can't be missing from the input dataset. PolyBase inserts NULL for missing values in Azure Synapse Analytics.

Use COPY statement to load data into Azure Synapse Analytics (preview)

The Azure Synapse Analytics COPY statement (preview) directly supports loading data from Azure Blob and Azure Data Lake Storage Gen2. If your source data meets the criteria described in this section, you can choose to use the COPY statement in ADF to load data into Azure Synapse Analytics. Azure Data Factory checks the settings and fails the copy activity run if the criteria aren't met.

Note

Currently, Data Factory only supports copy from the COPY statement compatible sources mentioned below.

Using the COPY statement supports the following configuration:

  1. The source linked service and format are with the following types and authentication methods:

    | Supported source data store type | Supported format | Supported source authentication type |
    | --- | --- | --- |
    | Azure Blob | Delimited text | Account key authentication, shared access signature authentication, service principal authentication, managed identity authentication |
    | Azure Blob | Parquet | Account key authentication, shared access signature authentication |
    | Azure Blob | ORC | Account key authentication, shared access signature authentication |
    | Azure Data Lake Storage Gen2 | Delimited text, Parquet, ORC | Account key authentication, service principal authentication, managed identity authentication |

    Important

    If your Azure Storage is configured with a VNet service endpoint, you must use managed identity authentication; refer to Impact of using VNet Service Endpoints with Azure storage. Learn the required configurations in Data Factory from the Azure Blob - managed identity authentication and Azure Data Lake Storage Gen2 - managed identity authentication sections, respectively.

  2. Format settings are the following:

    1. For Parquet: compression can be no compression, Snappy, or GZip.
    2. For ORC: compression can be no compression, zlib, or Snappy.
    3. For Delimited text:
      1. rowDelimiter is explicitly set as a single character or "\r\n"; the default value is not supported.
      2. nullValue is left as default or set to empty string ("").
      3. encodingName is left as default or set to utf-8 or utf-16.
      4. escapeChar must be the same as quoteChar, and is not empty.
      5. skipLineCount is left as default or set to 0.
      6. compression can be no compression or GZip.
  3. If your source is a folder, recursive in the copy activity must be set to true, and wildcardFilename needs to be *.

  4. wildcardFolderPath, wildcardFilename (other than *), modifiedDateTimeStart, modifiedDateTimeEnd, prefix, enablePartitionDiscovery, and additionalColumns are not specified.

The following COPY statement settings are supported under allowCopyCommand in the copy activity:

| Property | Description | Required |
| --- | --- | --- |
| defaultValues | Specifies the default values for each target column in Azure Synapse Analytics. The default values in the property overwrite the DEFAULT constraint set in the data warehouse, and identity columns cannot have a default value. | No |
| additionalOptions | Additional options that will be passed to the Azure Synapse Analytics COPY statement directly in the "With" clause of the COPY statement. Quote the value as needed to align with the COPY statement requirements. | No |

Example:
"activities":[
    {
        "name": "CopyFromAzureBlobToSQLDataWarehouseViaCOPY",
        "type": "Copy",
        "inputs": [
            {
                "referenceName": "ParquetDataset",
                "type": "DatasetReference"
            }
        ],
        "outputs": [
            {
                "referenceName": "AzureSQLDWDataset",
                "type": "DatasetReference"
            }
        ],
        "typeProperties": {
            "source": {
                "type": "ParquetSource",
                "storeSettings":{
                    "type": "AzureBlobStorageReadSettings",
                    "recursive": true
                }
            },
            "sink": {
                "type": "SqlDWSink",
                "allowCopyCommand": true,
                "copyCommandSettings": {
                    "defaultValues": [
                        {
                            "columnName": "col_string",
                            "defaultValue": "DefaultStringValue"
                        }
                    ],
                    "additionalOptions": {
                        "MAXERRORS": "10000",
                        "DATEFORMAT": "'ymd'"
                    }
                }
            },
            "enableSkipIncompatibleRow": true
        }
    }
]

Lookup activity properties

To learn details about the properties, check Lookup activity.

GetMetadata activity properties

To learn details about the properties, check GetMetadata activity.

Data type mapping for Azure Synapse Analytics

When you copy data from or to Azure Synapse Analytics, the following mappings are used from Azure Synapse Analytics data types to Azure Data Factory interim data types. See schema and data type mappings to learn how Copy Activity maps the source schema and data type to the sink.

Tip

Refer to the Table data types in Azure Synapse Analytics article for the data types supported by Azure Synapse Analytics and the workarounds for unsupported ones.

| Azure Synapse Analytics data type | Data Factory interim data type |
| --- | --- |
| bigint | Int64 |
| binary | Byte[] |
| bit | Boolean |
| char | String, Char[] |
| date | DateTime |
| datetime | DateTime |
| datetime2 | DateTime |
| datetimeoffset | DateTimeOffset |
| decimal | Decimal |
| FILESTREAM attribute (varbinary(max)) | Byte[] |
| float | Double |
| image | Byte[] |
| int | Int32 |
| money | Decimal |
| nchar | String, Char[] |
| numeric | Decimal |
| nvarchar | String, Char[] |
| real | Single |
| rowversion | Byte[] |
| smalldatetime | DateTime |
| smallint | Int16 |
| smallmoney | Decimal |
| time | TimeSpan |
| tinyint | Byte |
| uniqueidentifier | Guid |
| varbinary | Byte[] |
| varchar | String, Char[] |

Next steps

For a list of data stores supported as sources and sinks by Copy Activity in Azure Data Factory, see supported data stores and formats.