Troubleshoot copy activity performance

APPLIES TO: Azure Data Factory, Azure Synapse Analytics

This article outlines how to troubleshoot copy activity performance issues in Azure Data Factory.

After you run a copy activity, you can collect the run result and performance statistics in the copy activity monitoring view. The following is an example.

Monitor copy activity run details

Performance tuning tips

In some scenarios, when you run a copy activity in Data Factory, you'll see "Performance tuning tips" at the top, as shown in the example above. The tips tell you the bottleneck identified by ADF for this particular copy run, along with suggestions on how to boost copy throughput. Try making the recommended change, then run the copy again.

As a reference, the performance tuning tips currently provide suggestions for the following cases:

| Category | Performance tuning tips |
| -------- | ----------------------- |
| Data store specific | Loading data into Azure Synapse Analytics: suggest using PolyBase or the COPY statement if it's not used (see the sketch after this table). |
| | Copying data from/to Azure SQL Database: when DTU is under high utilization, suggest upgrading to a higher tier. |
| | Copying data from/to Azure Cosmos DB: when RU is under high utilization, suggest upgrading to a larger RU. |
| | Copying data from SAP Table: when copying a large amount of data, suggest leveraging the SAP connector's partition option to enable parallel load and increasing the max partition number. |
| | Ingesting data from Amazon Redshift: suggest using UNLOAD if it's not used. |
| Data store throttling | If a number of read/write operations are throttled by the data store during copy, suggest checking and increasing the allowed request rate for the data store, or reducing the concurrent workload. |
| Integration runtime | If you use a self-hosted Integration Runtime (IR) and the copy activity waits long in the queue until the IR has available resources to execute it, suggest scaling out/up your IR. |
| | If you use an Azure Integration Runtime in a non-optimal region that results in slow read/write, suggest configuring an IR in another region. |
| Fault tolerance | If you configure fault tolerance and skipping incompatible rows results in slow performance, suggest ensuring the source and sink data are compatible. |
| Staged copy | If staged copy is configured but not helpful for your source-sink pair, suggest removing it. |
| Resume | When the copy activity is resumed from the last failure point but you happened to change the DIU setting after the original run, note that the new DIU setting doesn't take effect. |
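For example, acting on the first tip means enabling PolyBase (or the COPY statement) and, where needed, staged copy on the copy activity. Below is a minimal sketch of the relevant fragment, assuming a delimited-text source; property names follow the copy activity JSON schema, and MyStagingBlobStorage is a placeholder linked service name:

```json
{
    "name": "CopyToSynapse",
    "type": "Copy",
    "typeProperties": {
        "source": { "type": "DelimitedTextSource" },
        "sink": {
            "type": "SqlDWSink",
            "allowPolyBase": true
        },
        "enableStaging": true,
        "stagingSettings": {
            "linkedServiceName": {
                "referenceName": "MyStagingBlobStorage",
                "type": "LinkedServiceReference"
            }
        }
    }
}
```

Staged copy is only worthwhile when the source can't be read directly by PolyBase; as the "Staged copy" tip above notes, remove it when it doesn't help your source-sink pair.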

Understand copy activity execution details

The execution details and durations at the bottom of the copy activity monitoring view describe the key stages your copy activity goes through (see the example at the beginning of this article), which is especially useful for troubleshooting copy performance. The bottleneck of your copy run is the stage with the longest duration. Refer to the following table for each stage's definition, and learn how to Troubleshoot copy activity on Azure IR and Troubleshoot copy activity on Self-hosted IR with such information.

| Stage | Description |
| ----- | ----------- |
| Queue | The elapsed time until the copy activity actually starts on the integration runtime. |
| Pre-copy script | The elapsed time between the copy activity starting on the IR and finishing executing the pre-copy script in the sink data store. Applies when you configure a pre-copy script for database sinks, e.g. when writing data into Azure SQL Database, to clean up before copying the new data. |
| Transfer | The elapsed time between the end of the previous step and the IR transferring all the data from source to sink. |

Note that the sub-steps under transfer run in parallel, and some operations (e.g. parsing/generating file format) are not shown:

- Time to first byte: the time elapsed between the end of the previous step and the time when the IR receives the first byte from the source data store. Applies to non-file-based sources.
- Listing source: the amount of time spent enumerating source files or data partitions. The latter applies when you configure partition options for database sources, e.g. when copying data from databases like Oracle/SAP HANA/Teradata/Netezza.
- Reading from source: the amount of time spent retrieving data from the source data store.
- Writing to sink: the amount of time spent writing data to the sink data store. Note that some connectors don't have this metric at the moment, including Azure Cognitive Search, Azure Data Explorer, Azure Table storage, Oracle, SQL Server, Common Data Service, Dynamics 365, Dynamics CRM, and Salesforce/Salesforce Service Cloud.
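These stage durations also surface in the copy activity's JSON output, which is useful when monitoring runs programmatically rather than in the UI. A trimmed sketch of what that output can look like follows; the field names follow the copy activity output schema, but the exact set varies by scenario, and the values here are illustrative (durations in seconds):

```json
{
    "dataRead": 1180089300,
    "dataWritten": 1180089300,
    "copyDuration": 181,
    "executionDetails": [
        {
            "status": "Succeeded",
            "usedDataIntegrationUnits": 4,
            "usedParallelCopies": 1,
            "detailedDurations": {
                "queuingDuration": 36,
                "preCopyScriptDuration": 12,
                "transferDuration": 133
            }
        }
    ]
}
```

In this sketch the transfer stage dominates (133 of 181 seconds), so troubleshooting would start from the transfer sub-steps described in the sections below.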

Troubleshoot copy activity on Azure IR

Follow the Performance tuning steps to plan and conduct a performance test for your scenario.

When the copy activity performance doesn't meet your expectations, to troubleshoot a single copy activity running on an Azure Integration Runtime: if performance tuning tips show up in the copy monitoring view, apply the suggestions and try again. Otherwise, understand the copy activity execution details, check which stage has the longest duration, and apply the guidance below to boost copy performance:

  • "Pre-copy script" experienced long duration: it means the pre-copy script running on the sink database takes a long time to finish. Tune the specified pre-copy script logic to enhance the performance. If you need further help improving the script, contact your database team.
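For context, the pre-copy script is a property of the copy sink. A minimal sketch for an Azure SQL Database sink follows; dbo.StagingTable is a placeholder table name:

```json
"sink": {
    "type": "AzureSqlSink",
    "preCopyScript": "TRUNCATE TABLE dbo.StagingTable"
}
```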

  • "Transfer - Time to first byte" experienced long working duration: it means your source query takes a long time to return any data. Check and optimize the query or server. If you need further help, contact your data store team.

  • "Transfer - Listing source" experienced long working duration: it means enumerating source files or source database data partitions is slow.

    • When copying data from a file-based source, if you use a wildcard filter on the folder path or file name (wildcardFolderPath or wildcardFileName), or use a file last modified time filter (modifiedDatetimeStart or modifiedDatetimeEnd), note that such a filter results in the copy activity listing all the files under the specified folder on the client side and then applying the filter. Such file enumeration can become the bottleneck, especially when only a small set of files meets the filter rule.

      • Check whether you can copy files based on a datetime-partitioned file path or name. That approach doesn't put any burden on the listing-source side.

      • Check whether you can use the data store's native filter instead, specifically "prefix" for Amazon S3/Azure Blob/Azure File Storage. Those filters are data store server-side filters and have much better performance (see the sketch after this list item).

      • Consider splitting a single large data set into several smaller data sets, and letting those copy jobs run concurrently, each tackling a portion of the data. You can do this with Lookup/GetMetadata + ForEach + Copy. Refer to the Copy files from multiple containers or Migrate data from Amazon S3 to ADLS Gen2 solution templates as a general example.

    • Check if ADF reports any throttling error on the source, or if your data store is in a high-utilization state. If so, either reduce your workloads on the data store, or try contacting your data store administrator to increase the throttling limit or available resources.

    • Use an Azure IR in the same region as your source data store, or close to it.
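To make the native-filter remedy concrete: on Azure Blob storage, wildcardFolderPath/wildcardFileName are applied on the client side after a full listing, while a prefix is evaluated by the store itself. A minimal copy source fragment using the server-side prefix is sketched below; the path is a placeholder:

```json
"source": {
    "type": "DelimitedTextSource",
    "storeSettings": {
        "type": "AzureBlobStorageReadSettings",
        "prefix": "raw/2024/"
    }
}
```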

  • "Transfer - reading from source" experienced long working duration:

    • Adopt connector-specific data loading best practices if applicable. For example, when copying data from Amazon Redshift, configure it to use Redshift UNLOAD.

    • Check if ADF reports any throttling error on the source, or if your data store is under high utilization. If so, either reduce your workloads on the data store, or try contacting your data store administrator to increase the throttling limit or available resources.

    • Check your copy source and sink pattern:

    • Use an Azure IR in the same region as your source data store, or close to it.

  • "Transfer - writing to sink" experienced long working duration:

    • Adopt connector-specific data loading best practices if applicable. For example, when copying data into Azure Synapse Analytics, use PolyBase or the COPY statement.

    • Check if ADF reports any throttling error on the sink, or if your data store is under high utilization. If so, either reduce your workloads on the data store, or try contacting your data store administrator to increase the throttling limit or available resources.

    • Check your copy source and sink pattern:

      • If your copy pattern supports more than 4 Data Integration Units (DIUs) - refer to this section for details - you can generally try increasing the DIUs to get better performance (see the sketch at the end of this list).

      • Otherwise, gradually tune the parallel copies; note that too many parallel copies may even hurt the performance.

    • Use an Azure IR in the same region as your sink data store, or close to it.
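Both of those knobs are properties of the copy activity itself. A minimal sketch follows, assuming your copy pattern supports scaling; the values are illustrative starting points, not recommendations:

```json
{
    "name": "TunedCopy",
    "type": "Copy",
    "typeProperties": {
        "source": { "type": "ParquetSource" },
        "sink": { "type": "ParquetSink" },
        "dataIntegrationUnits": 16,
        "parallelCopies": 4
    }
}
```

Increase dataIntegrationUnits first where the pattern allows it, then tune parallelCopies gradually as described above.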

Troubleshoot copy activity on Self-hosted IR

Follow the Performance tuning steps to plan and conduct a performance test for your scenario.

When the copy performance doesn't meet your expectations, to troubleshoot a single copy activity running on a self-hosted Integration Runtime: if performance tuning tips show up in the copy monitoring view, apply the suggestions and try again. Otherwise, understand the copy activity execution details, check which stage has the longest duration, and apply the guidance below to boost copy performance:

  • "Queue" experienced long duration: it means the copy activity waits long in the queue until your self-hosted IR has resources to execute it. Check the IR capacity and usage, and scale up or out according to your workload.

  • "Transfer - Time to first byte" experienced long working duration: it means your source query takes a long time to return any data. Check and optimize the query or server. If you need further help, contact your data store team.

  • "Transfer - Listing source" experienced long working duration: it means enumerating source files or source database data partitions is slow.

    • Check if the self-hosted IR machine has low latency connecting to the source data store. If your source is in Azure, you can use this tool to check the latency from the self-hosted IR machine to the Azure region; the lower the better.

    • When copying data from a file-based source, if you use a wildcard filter on the folder path or file name (wildcardFolderPath or wildcardFileName), or use a file last modified time filter (modifiedDatetimeStart or modifiedDatetimeEnd), note that such a filter results in the copy activity listing all the files under the specified folder on the client side and then applying the filter. Such file enumeration can become the bottleneck, especially when only a small set of files meets the filter rule.

      • Check whether you can copy files based on a datetime-partitioned file path or name. That approach doesn't put any burden on the listing-source side.

      • Check whether you can use the data store's native filter instead, specifically "prefix" for Amazon S3/Azure Blob/Azure File Storage. Those filters are data store server-side filters and have much better performance (see the sketch in the Azure IR section above).

      • Consider splitting a single large data set into several smaller data sets, and letting those copy jobs run concurrently, each tackling a portion of the data. You can do this with Lookup/GetMetadata + ForEach + Copy. Refer to the Copy files from multiple containers or Migrate data from Amazon S3 to ADLS Gen2 solution templates as a general example.

    • Check if ADF reports any throttling error on the source, or if your data store is in a high-utilization state. If so, either reduce your workloads on the data store, or try contacting your data store administrator to increase the throttling limit or available resources.

  • "Transfer - reading from source" experienced long working duration:

    • Check if the self-hosted IR machine has low latency connecting to the source data store. If your source is in Azure, you can use this tool to check the latency from the self-hosted IR machine to the Azure region; the lower the better.

    • Check if the self-hosted IR machine has enough inbound bandwidth to read and transfer the data efficiently. If your source data store is in Azure, you can use this tool to check the download speed.

    • Check the self-hosted IR's CPU and memory usage trend in the Azure portal -> your data factory -> overview page. Consider scaling up/out the IR if the CPU usage is high or the available memory is low.

    • Adopt connector-specific data loading best practices if applicable. For example, when copying data from an SAP table, leverage the SAP connector's partition option to enable parallel load, as noted in the performance tuning tips above.

    • Check if ADF reports any throttling error on the source, or if your data store is under high utilization. If so, either reduce your workloads on the data store, or try contacting your data store administrator to increase the throttling limit or available resources.

    • Check your copy source and sink pattern:

  • "Transfer - writing to sink" experienced long working duration:

    • Adopt connector-specific data loading best practices if applicable. For example, when copying data into Azure Synapse Analytics, use PolyBase or the COPY statement.

    • Check if the self-hosted IR machine has low latency connecting to the sink data store. If your sink is in Azure, you can use this tool to check the latency from the self-hosted IR machine to the Azure region; the lower the better.

    • Check if the self-hosted IR machine has enough outbound bandwidth to transfer and write the data efficiently. If your sink data store is in Azure, you can use this tool to check the upload speed.

    • Check the self-hosted IR's CPU and memory usage trend in the Azure portal -> your data factory -> overview page. Consider scaling up/out the IR if the CPU usage is high or the available memory is low.

    • Check if ADF reports any throttling error on the sink, or if your data store is under high utilization. If so, either reduce your workloads on the data store, or try contacting your data store administrator to increase the throttling limit or available resources.

    • Consider gradually tuning the parallel copies; note that too many parallel copies may even hurt the performance.

Connector and IR performance

This section explores some performance troubleshooting guides for particular connector types or integration runtimes.

Activity execution time varies using Azure IR vs Azure VNet IR

Activity execution time varies when the dataset is based on a different integration runtime.

  • Symptoms: Simply toggling the linked service dropdown in the dataset performs the same pipeline activities, but with drastically different run times. When the dataset is based on the Managed Virtual Network Integration Runtime, the run takes more time on average than when it's based on the default Integration Runtime.

  • Cause: Checking the details of the pipeline runs, you can see that the slow pipeline is running on a Managed VNet (Virtual Network) IR while the normal one is running on an Azure IR. By design, a Managed VNet IR takes a longer queue time than an Azure IR because the service doesn't reserve one compute node per data factory, so there is a warm-up before each copy activity starts, and it occurs primarily on the VNet join rather than on the Azure IR.

Low performance when loading data into Azure SQL Database

  • Symptoms: Copying data into Azure SQL Database turns out to be slow.

  • Cause: The root cause of the issue is mostly a bottleneck on the Azure SQL Database side. The following are some possible causes:

    • The Azure SQL Database tier is not high enough.

    • Azure SQL Database DTU usage is close to 100%. You can monitor the performance and consider upgrading the Azure SQL Database tier.

    • Indexes are not set properly. Remove all the indexes before the data load and recreate them after the load completes.

    • WriteBatchSize is not large enough to fit the schema row size. Try enlarging the property to address the issue (see the sketch after this list).

    • A stored procedure is being used instead of bulk insert, which is expected to have worse performance.
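As a reference, WriteBatchSize is configured on the copy sink. A minimal sketch for an Azure SQL Database sink follows; the values are illustrative and should be tuned against your actual schema row size:

```json
"sink": {
    "type": "AzureSqlSink",
    "writeBatchSize": 100000,
    "writeBatchTimeout": "00:30:00"
}
```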

Timeout or slow performance when parsing large Excel file

  • Symptoms:

    • When you create an Excel dataset and import the schema from the connection/store, preview data, or list or refresh worksheets, you may hit a timeout error if the Excel file is large.

    • When you use a copy activity to copy data from a large Excel file (>= 100 MB) into another data store, you may experience slow performance or an out-of-memory (OOM) issue.

  • Cause:

    • For operations like importing the schema, previewing data, and listing worksheets on an Excel dataset, the timeout is 100 s and static. For a large Excel file, these operations may not finish within the timeout value.

    • The ADF copy activity reads the whole Excel file into memory, then locates the specified worksheet and cells to read the data. This behavior is due to the underlying SDK that ADF uses.

  • Resolution:

    • For importing the schema, you can generate a smaller sample file, which is a subset of the original file, and choose "import schema from sample file" instead of "import schema from connection/store".

    • For listing worksheets, you can click "Edit" in the worksheet dropdown and input the sheet name/index instead (see the sketch after this list).

    • To copy a large Excel file (>100 MB) into another store, you can use the Data Flow Excel source, which supports streaming read and performs better.
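For reference, specifying the sheet directly in the Excel dataset definition avoids relying on the interactive worksheet listing. A minimal sketch follows; the linked service, container, and file names are placeholders:

```json
{
    "name": "LargeExcelDataset",
    "properties": {
        "type": "Excel",
        "linkedServiceName": {
            "referenceName": "MyBlobStorageLinkedService",
            "type": "LinkedServiceReference"
        },
        "typeProperties": {
            "location": {
                "type": "AzureBlobStorageLocation",
                "container": "data",
                "fileName": "large-report.xlsx"
            },
            "sheetName": "Sheet1",
            "firstRowAsHeader": true
        }
    }
}
```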

Other references

Here are performance monitoring and tuning references for some of the supported data stores:

Next steps

See the other copy activity articles: