Use Azure Data Factory to migrate data from an on-premises Netezza server to Azure

Applies to: Azure Data Factory, Azure Synapse Analytics (Preview)

Azure Data Factory provides a performant, robust, and cost-effective mechanism to migrate data at scale from an on-premises Netezza server to your Azure storage account or Azure Synapse Analytics (formerly SQL Data Warehouse) database.

This article provides the following information for data engineers and developers:

  • Performance
  • Copy resilience
  • Network security
  • High-level solution architecture
  • Implementation best practices

Performance

Azure Data Factory offers a serverless architecture that allows parallelism at various levels. As a developer, you can build pipelines that fully use both network and database bandwidth to maximize data movement throughput for your environment.

Performance diagram

The preceding diagram can be interpreted as follows:

  • A single copy activity can take advantage of scalable compute resources. When you use Azure Integration Runtime, you can specify up to 256 data integration units (DIUs) for each copy activity in a serverless manner. With a self-hosted integration runtime (self-hosted IR), you can manually scale up the machine or scale out to multiple machines (up to four nodes), and a single copy activity distributes its partitions across all nodes.

  • A single copy activity reads from and writes to the data store by using multiple threads.

  • Azure Data Factory control flow can start multiple copy activities in parallel, for example by using a For Each loop (illustrated in the sketch after this list).
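To make those settings concrete, the following is a minimal sketch of a copy activity that sets dataIntegrationUnits and parallelCopies explicitly; such an activity would typically sit inside a ForEach activity whose batchCount controls how many copies run in parallel. The dataset names are placeholders, and the exact values should come from your own performance testing.

```json
{
    "name": "CopyNetezzaTable",
    "type": "Copy",
    "inputs": [ { "referenceName": "NetezzaSourceDataset", "type": "DatasetReference" } ],
    "outputs": [ { "referenceName": "AzureStorageSinkDataset", "type": "DatasetReference" } ],
    "typeProperties": {
        "source": { "type": "NetezzaSource" },
        "sink": { "type": "ParquetSink" },
        "dataIntegrationUnits": 32,
        "parallelCopies": 8
    }
}
```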

For more information, see Copy activity performance and scalability guide.

Resilience

Within a single copy activity run, Azure Data Factory has a built-in retry mechanism, which enables it to handle a certain level of transient failures in the data stores or in the underlying network.

With the Azure Data Factory copy activity, when you copy data between source and sink data stores, you have two ways to handle incompatible rows. You can either abort and fail the copy activity or continue to copy the rest of the data by skipping the incompatible rows. In addition, to learn the cause of the failure, you can log the incompatible rows in Azure Blob storage, fix the data in the data source, and retry the copy activity.
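As a rough illustration, the fragment below shows copy activity typeProperties that skip incompatible rows and log them to Blob storage; the linked service name and path are placeholders for your own environment.

```json
"typeProperties": {
    "source": { "type": "NetezzaSource" },
    "sink": { "type": "SqlDWSink" },
    "enableSkipIncompatibleRow": true,
    "redirectIncompatibleRowSettings": {
        "linkedServiceName": {
            "referenceName": "AzureBlobStorageLinkedService",
            "type": "LinkedServiceReference"
        },
        "path": "errorlogs/netezza-migration"
    }
}
```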

Network security

By default, Azure Data Factory transfers data from the on-premises Netezza server to an Azure storage account or Azure Synapse Analytics database by using an encrypted connection over Hypertext Transfer Protocol Secure (HTTPS). HTTPS provides data encryption in transit and prevents eavesdropping and man-in-the-middle attacks.

Alternatively, if you don't want data to be transferred over the public internet, you can achieve higher security by transferring data over a private peering link via Azure ExpressRoute.

The next section discusses how to achieve higher security.

Solution architecture

This section discusses two ways to migrate your data.

Migrate data over the public internet

Diagram of migrating data over the public internet

The preceding diagram can be interpreted as follows:

  • In this architecture, you transfer data securely by using HTTPS over the public internet.

  • To achieve this architecture, you need to install the Azure Data Factory integration runtime (self-hosted) on a Windows machine behind a corporate firewall. Make sure that this integration runtime can directly access the Netezza server. To fully use your network and data store bandwidth to copy data, you can manually scale up your machine or scale out to multiple machines.

  • By using this architecture, you can migrate both initial snapshot data and delta data.

Migrate data over a private network

Diagram of migrating data over a private network

The preceding diagram can be interpreted as follows:

  • In this architecture, you migrate data over a private peering link via Azure ExpressRoute, and data never traverses the public internet.

  • To achieve this architecture, you need to install the Azure Data Factory integration runtime (self-hosted) on a Windows virtual machine (VM) within your Azure virtual network. To fully use your network and data store bandwidth to copy data, you can manually scale up your VM or scale out to multiple VMs.

  • By using this architecture, you can migrate both initial snapshot data and delta data.

Implementation best practices

Manage authentication and credentials
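A common approach is to avoid embedding Netezza credentials in the pipeline itself, for example by keeping the connection string in Azure Key Vault and referencing it from the linked service. The sketch below assumes a Key Vault linked service and a self-hosted IR already exist; the names and secret are placeholders.

```json
{
    "name": "NetezzaLinkedService",
    "properties": {
        "type": "Netezza",
        "typeProperties": {
            "connectionString": {
                "type": "AzureKeyVaultSecret",
                "store": {
                    "referenceName": "AzureKeyVaultLinkedService",
                    "type": "LinkedServiceReference"
                },
                "secretName": "NetezzaConnectionString"
            }
        },
        "connectVia": {
            "referenceName": "SelfHostedIR",
            "type": "IntegrationRuntimeReference"
        }
    }
}
```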

Migrate initial snapshot data

For small tables (that is, tables with a volume of less than 100 GB or that can be migrated to Azure within two hours), you can have each copy job load the data for one table. For greater throughput, you can run multiple Azure Data Factory copy jobs to load separate tables concurrently.

Within each copy job, to run parallel queries and copy data by partitions, you can also reach some level of parallelism by using the parallelCopies property setting with either of the following data partition options:

  • To help achieve greater efficiency, we encourage you to start from a data slice. Make sure that the value of the parallelCopies setting is less than the total number of data-slice partitions in your table on the Netezza server.

  • If the volume of each data-slice partition is still large (for example, 10 GB or greater), we encourage you to switch to a dynamic range partition. This option gives you greater flexibility to define the number of partitions and the volume of each partition by partition column and upper and lower bounds (see the sketches after this list).
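As a sketch of the two options, a data-slice partitioned source only needs the partition option set on the Netezza source:

```json
"source": {
    "type": "NetezzaSource",
    "partitionOption": "DataSlice"
}
```

A dynamic range partition additionally takes a partition column and bounds; the table name, column, and bound values below are placeholders:

```json
"source": {
    "type": "NetezzaSource",
    "query": "SELECT * FROM <TableName> WHERE ?AdfRangePartitionColumnName <= ?AdfRangePartitionUpbound AND ?AdfRangePartitionColumnName >= ?AdfRangePartitionLowbound",
    "partitionOption": "DynamicRange",
    "partitionSettings": {
        "partitionColumnName": "<partition_column>",
        "partitionUpperBound": "<upper_value>",
        "partitionLowerBound": "<lower_value>"
    }
}
```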

For larger tables (that is, tables with a volume of 100 GB or greater, or that can't be migrated to Azure within two hours), we recommend that you partition the data by custom query and then have each copy job copy one partition at a time. For better throughput, you can run multiple Azure Data Factory copy jobs concurrently. For each copy job that loads one partition by custom query, you can increase throughput by enabling parallelism via either data slice or dynamic range.

If any copy job fails because of a transient network or data store issue, you can rerun the failed copy job to reload that specific partition from the table. Other copy jobs that load other partitions aren't affected.

When you load data into an Azure Synapse Analytics database, we suggest that you enable PolyBase within the copy job, with Azure Blob storage as the staging store.
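A minimal sketch of that configuration on the copy activity is shown below; the staging linked service name and path are placeholders.

```json
"typeProperties": {
    "source": { "type": "NetezzaSource" },
    "sink": { "type": "SqlDWSink", "allowPolyBase": true },
    "enableStaging": true,
    "stagingSettings": {
        "linkedServiceName": {
            "referenceName": "AzureBlobStorageLinkedService",
            "type": "LinkedServiceReference"
        },
        "path": "staging-container/netezza"
    }
}
```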

Migrate delta data

To identify the new or updated rows in your table, use a timestamp column or an incrementing key within the schema. You can then store the latest value as a high watermark in an external table and use it to filter the delta data the next time you load data.

Each table can use a different watermark column to identify its new or updated rows. We suggest that you create an external control table in which each row represents one table on the Netezza server, with its specific watermark column name and high-watermark value.
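For example, assuming a Lookup activity reads that control table and a ForEach activity passes each row to a copy activity, the Netezza source query can be built dynamically from the current row; the control-table column names used here (TABLE_NAME, WATERMARK_COLUMN, WATERMARK_VALUE) are hypothetical.

```json
"source": {
    "type": "NetezzaSource",
    "query": {
        "value": "SELECT * FROM @{item().TABLE_NAME} WHERE @{item().WATERMARK_COLUMN} > '@{item().WATERMARK_VALUE}'",
        "type": "Expression"
    }
}
```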

Configure a self-hosted integration runtime

If you're migrating data from the Netezza server to Azure, whether the server is on-premises behind your corporate firewall or within a virtual network environment, you need to install a self-hosted IR on a Windows machine or VM; the IR is the engine that moves the data. When you install the self-hosted IR, we recommend the following approach:

  • For each Windows machine or VM, start with a configuration of 32 vCPUs and 128 GB of memory. You can keep monitoring the CPU and memory usage of the IR machine during the data migration to see whether you need to scale the machine up further for better performance or scale it down to save cost.

  • You can also scale out by associating up to four nodes with a single self-hosted IR. A single copy job that runs against a self-hosted IR automatically uses all VM nodes to copy the data in parallel. For high availability, start with four VM nodes to avoid a single point of failure during the data migration.

Limit your partitions

As a best practice, conduct a performance proof of concept (POC) with a representative sample dataset, so that you can determine an appropriate partition size for each copy activity. We suggest that you load each partition to Azure within two hours.

To copy a table, start with a single copy activity on a single self-hosted IR machine. Gradually increase the parallelCopies setting based on the number of data-slice partitions in your table, and check whether the entire table can be loaded to Azure within two hours at the throughput the copy job achieves.

If the table can't be loaded to Azure within two hours, and the capacity of the self-hosted IR node and the data store are not fully used, gradually increase the number of concurrent copy activities until you reach the limit of your network or the bandwidth limit of the data stores.

Keep monitoring the CPU and memory usage on the self-hosted IR machine, and be ready to scale up the machine or scale out to multiple machines when you see that the CPU and memory are fully used.

When you encounter throttling errors, as reported by the Azure Data Factory copy activity, either reduce the concurrency or parallelCopies setting in Azure Data Factory, or consider increasing the bandwidth or I/O operations per second (IOPS) limits of the network and data stores.

Estimate your pricing

Consider the following pipeline, which is constructed to migrate data from the on-premises Netezza server to an Azure Synapse Analytics database:

Pricing pipeline diagram

Let's assume that the following statements are true:

  • The total data volume is 50 terabytes (TB).

  • We're migrating data by using the first solution architecture (the Netezza server is on-premises, behind the firewall).

  • The 50-TB volume is divided into 500 partitions, and each copy activity moves one partition.

  • Each copy activity is configured with one self-hosted IR against four machines and achieves a throughput of 20 megabytes per second (MBps). (Within the copy activity, parallelCopies is set to 4, and each thread that loads data from the table achieves a 5-MBps throughput.)

  • The ForEach concurrency is set to 3, and the aggregate throughput is 60 MBps.

  • In total, it takes 243 hours to complete the migration.
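That duration follows from the assumptions above: 50 TB is roughly 52,428,800 MB, and at an aggregate throughput of 60 MBps the transfer takes about 52,428,800 / 60 ≈ 873,800 seconds, or approximately 243 hours.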

Based on the preceding assumptions, here's the estimated price:

Pricing table

Note

The pricing shown in the preceding table is hypothetical. Your actual pricing depends on the actual throughput in your environment. The price for the Windows machine (with the self-hosted IR installed) is not included.

Additional references

For more information, see the following articles and guides:

Next steps