Azure Data Factory FAQ

APPLIES TO: ✔️ Azure Data Factory ✖️ Azure Synapse Analytics (Preview)

This article provides answers to frequently asked questions about Azure Data Factory.

What is Azure Data Factory?

Data Factory is a fully managed, cloud-based data-integration ETL service that automates the movement and transformation of data. Like a factory that runs equipment to transform raw materials into finished goods, Azure Data Factory orchestrates existing services that collect raw data and transform it into ready-to-use information.

By using Azure Data Factory, you can create data-driven workflows to move data between on-premises and cloud data stores. ADF also supports external compute engines for hand-coded transformations by using compute services such as Azure HDInsight and the SQL Server Integration Services (SSIS) integration runtime.

With Data Factory, you can execute your data processing either on an Azure-based cloud service or in your own self-hosted compute environment, such as SSIS, SQL Server, or Oracle. After you create a pipeline that performs the action you need, you can schedule it to run periodically (hourly, daily, or weekly, for example), run it over a time window, or trigger it from an event. For more information, see Introduction to Azure Data Factory.

Control flows and scale

To support the diverse integration flows and patterns in the modern data warehouse, Data Factory enables flexible data pipeline modeling. This entails full control-flow programming paradigms, which include conditional execution, branching in data pipelines, and the ability to explicitly pass parameters within and across these flows. Control flow also encompasses transforming data through activity dispatch to external execution engines, including data movement at scale via the Copy activity.

Data Factory provides freedom to model any flow style that's required for data integration and that can be dispatched on demand or repeatedly on a schedule. A few common flows that this model enables are:

  • Control flows:
    • Activities can be chained together in a sequence within a pipeline.
    • Activities can be branched within a pipeline.
    • Parameters:
      • Parameters can be defined at the pipeline level, and arguments can be passed while you invoke the pipeline on demand or from a trigger.
      • Activities can consume the arguments that are passed to the pipeline.
    • Custom state passing:
      • Activity outputs, including state, can be consumed by a subsequent activity in the pipeline.
    • Looping containers:
      • The ForEach activity iterates over a specified collection of activities in a loop.
  • Trigger-based flows:
    • Pipelines can be triggered on demand or by wall-clock time.
  • Delta flows:
    • Parameters can be used to define your high-water mark for delta copy while moving dimension or reference tables from a relational store, either on-premises or in the cloud, to load the data into the lake.

For more information, see Tutorial: Control flows.
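
The following is a minimal sketch of this model using the Python SDK (azure-mgmt-datafactory). The factory, resource group, dataset, and parameter names are placeholders, and the datasets are assumed to exist already; exact model names can vary slightly across SDK versions. It shows a pipeline-level parameter plus two Copy activities chained so the second runs only after the first succeeds.

```python
# Hedged sketch: pipeline-level parameter + two chained Copy activities.
# Assumes datasets "RawBlobDataset", "StagedBlobDataset", and "CuratedBlobDataset"
# already exist in the factory.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    PipelineResource, ParameterSpecification, CopyActivity,
    DatasetReference, BlobSource, BlobSink, ActivityDependency,
)

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
rg_name, df_name = "rg-demo", "adf-demo"

stage = CopyActivity(
    name="CopyToStaging",
    inputs=[DatasetReference(reference_name="RawBlobDataset")],
    outputs=[DatasetReference(reference_name="StagedBlobDataset")],
    source=BlobSource(),
    sink=BlobSink(),
)

# Control flow: this activity runs only after CopyToStaging succeeds.
publish = CopyActivity(
    name="PublishCurated",
    inputs=[DatasetReference(reference_name="StagedBlobDataset")],
    outputs=[DatasetReference(reference_name="CuratedBlobDataset")],
    source=BlobSource(),
    sink=BlobSink(),
    depends_on=[ActivityDependency(activity="CopyToStaging",
                                   dependency_conditions=["Succeeded"])],
)

pipeline = PipelineResource(
    parameters={"runDate": ParameterSpecification(type="String")},  # pipeline-level parameter
    activities=[stage, publish],
)
adf_client.pipelines.create_or_update(rg_name, df_name, "ControlFlowDemo", pipeline)
```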

Data transformed at scale with code-free pipelines

The new browser-based tooling experience provides code-free pipeline authoring and deployment with a modern, interactive web-based experience.

For visual data developers and data engineers, the Data Factory web UI is the code-free design environment that you use to build pipelines. It's fully integrated with Visual Studio Online Git and provides integration for CI/CD and iterative development with debugging options.

Rich cross-platform SDKs for advanced users

Data Factory V2 provides a rich set of SDKs that you can use to author, manage, and monitor pipelines from your favorite IDE, including:

  • Python SDK
  • PowerShell CLI
  • C# SDK

Users can also use the documented REST APIs to interface with Data Factory V2.
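
For example, here is a hedged sketch of authenticating with the Python SDK and creating a data factory; the subscription ID, resource group, factory name, and region are placeholders, and recent versions of azure-mgmt-datafactory accept azure-identity credentials. The same operations are exposed through the other SDKs and the REST API.

```python
# Minimal Python SDK sketch: authenticate and create a data factory.
# pip install azure-identity azure-mgmt-datafactory
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import Factory

subscription_id = "<subscription-id>"          # placeholder
rg_name, df_name = "rg-demo", "adf-demo"       # assumed resource group and factory names

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), subscription_id)

# Create (or update) the data factory itself; the resource group must already exist.
df = adf_client.factories.create_or_update(rg_name, df_name, Factory(location="eastus"))
print(df.name, df.provisioning_state)
```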

Iterative development and debugging by using visual tools

Azure Data Factory visual tools enable iterative development and debugging. You can create your pipelines and do test runs by using the Debug capability in the pipeline canvas, without writing a single line of code. You can view the results of your test runs in the Output window of the pipeline canvas. After your test run succeeds, you can add more activities to your pipeline and continue debugging in an iterative manner. You can also cancel test runs while they are in progress.

You are not required to publish your changes to the Data Factory service before selecting Debug. This is helpful in scenarios where you want to make sure that the new additions or changes will work as expected before you update your data factory workflows in development, test, or production environments.

Ability to deploy SSIS packages to Azure

If you want to move your SSIS workloads, you can create a data factory and provision an Azure-SSIS integration runtime. An Azure-SSIS integration runtime is a fully managed cluster of Azure VMs (nodes) dedicated to running your SSIS packages in the cloud. For step-by-step instructions, see the Deploy SSIS packages to Azure tutorial.
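
As a rough illustration (not a substitute for the tutorial), provisioning an Azure-SSIS integration runtime from the Python SDK might look like the sketch below. The model and property names are assumptions based on the azure-mgmt-datafactory package and may differ between SDK versions, and the SSISDB server, user, and password values are placeholders.

```python
# Hedged sketch: provision an Azure-SSIS integration runtime (verify model and
# property names against your azure-mgmt-datafactory version).
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    IntegrationRuntimeResource, ManagedIntegrationRuntime,
    IntegrationRuntimeComputeProperties, IntegrationRuntimeSsisProperties,
    IntegrationRuntimeSsisCatalogInfo, SecureString,
)

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
rg_name, df_name = "rg-demo", "adf-demo"

ssis_ir = ManagedIntegrationRuntime(
    compute_properties=IntegrationRuntimeComputeProperties(
        location="EastUS", node_size="Standard_D4_v3",
        number_of_nodes=2, max_parallel_executions_per_node=2),
    ssis_properties=IntegrationRuntimeSsisProperties(
        catalog_info=IntegrationRuntimeSsisCatalogInfo(
            catalog_server_endpoint="<server>.database.windows.net",  # SSISDB host (placeholder)
            catalog_admin_user_name="<admin-user>",
            catalog_admin_password=SecureString(value="<password>"),
            catalog_pricing_tier="Basic")),
)

adf_client.integration_runtimes.create_or_update(
    rg_name, df_name, "AzureSsisIR", IntegrationRuntimeResource(properties=ssis_ir))
# Starting the runtime provisions the VM nodes ("begin_start" in newer SDK versions).
adf_client.integration_runtimes.start(rg_name, df_name, "AzureSsisIR")
```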

SDKs

If you are an advanced user looking for a programmatic interface, Data Factory provides a rich set of SDKs that you can use to author, manage, or monitor pipelines by using your favorite IDE. Language support includes .NET, PowerShell, Python, and REST.

Monitoring

You can monitor your data factories via PowerShell, the SDKs, or the visual monitoring tools in the browser user interface. You can monitor and manage on-demand, trigger-based, and clock-driven custom flows in an efficient and effective manner. Cancel existing tasks, see failures at a glance, drill down to get detailed error messages, and debug the issues, all from a single pane of glass without context switching or navigating back and forth between screens.
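
For instance, a minimal sketch of the programmatic side with the Python SDK (factory and run names are placeholders) could query recent pipeline runs, drill into activity-level errors, and cancel a run that is still executing:

```python
# Hedged sketch: monitor recent pipeline runs and drill into activity errors.
from datetime import datetime, timedelta
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import RunFilterParameters

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
rg_name, df_name = "rg-demo", "adf-demo"

# Pipeline runs updated in the last 24 hours.
filters = RunFilterParameters(
    last_updated_after=datetime.utcnow() - timedelta(days=1),
    last_updated_before=datetime.utcnow() + timedelta(hours=1))
runs = adf_client.pipeline_runs.query_by_factory(rg_name, df_name, filters)

for run in runs.value:
    print(run.pipeline_name, run.run_id, run.status)
    if run.status == "Failed":
        # Drill down to the activity runs to see the detailed error message.
        acts = adf_client.activity_runs.query_by_pipeline_run(
            rg_name, df_name, run.run_id, filters)
        for act in acts.value:
            print("  ", act.activity_name, act.status, act.error)
    elif run.status == "InProgress":
        # Cancel a run that is still executing.
        adf_client.pipeline_runs.cancel(rg_name, df_name, run.run_id)
```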

New features for SSIS in Data Factory

Since the initial public preview release in 2017, Data Factory has added the following features for SSIS:

  • Support for three more configurations/variants of Azure SQL Database to host the SSIS database (SSISDB) of projects/packages:
    • SQL Database with virtual network service endpoints
    • Managed instance
    • Elastic pool
  • Support for an Azure Resource Manager virtual network in addition to the classic virtual network (which will be deprecated in the future), letting you inject/join your Azure-SSIS integration runtime into a virtual network configured for SQL Database with virtual network service endpoints/MI/on-premises data access. For more information, see also Join an Azure-SSIS integration runtime to a virtual network.
  • Support for Azure Active Directory (Azure AD) authentication and SQL authentication to connect to the SSISDB, allowing Azure AD authentication with your Data Factory managed identity for Azure resources
  • Support for bringing your own on-premises SQL Server license to earn substantial cost savings from the Azure Hybrid Benefit option
  • Support for Enterprise Edition of the Azure-SSIS integration runtime, which lets you use advanced/premium features, a custom setup interface to install additional components/extensions, and a partner ecosystem. For more information, see also Enterprise Edition, Custom Setup, and 3rd Party Extensibility for SSIS in ADF.
  • Deeper integration of SSIS in Data Factory that lets you invoke/trigger first-class Execute SSIS Package activities in Data Factory pipelines and schedule them via SSMS. For more information, see also Modernize and extend your ETL/ELT workflows with SSIS activities in ADF pipelines.

What is the integration runtime?

The integration runtime is the compute infrastructure that Azure Data Factory uses to provide the following data integration capabilities across various network environments:

  • Data movement: For data movement, the integration runtime moves the data between the source and destination data stores, while providing support for built-in connectors, format conversion, column mapping, and performant and scalable data transfer.
  • Dispatch activities: For transformation, the integration runtime provides the capability to dispatch and monitor transformation activities running on a variety of compute services, such as Azure HDInsight, SQL Database, and SQL Server.
  • Execute SSIS packages: The integration runtime natively executes SSIS packages in a managed Azure compute environment.

You can deploy one or many instances of the integration runtime as required to move and transform data. The integration runtime can run on an Azure public network or on a private network (on-premises, Azure Virtual Network, or Amazon Web Services virtual private cloud [VPC]).

For more information, see Integration runtime in Azure Data Factory.
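
As an illustration, registering a self-hosted integration runtime for a private network through the Python SDK might look like the sketch below; all names are placeholders, and on the private network you would then install the runtime software and register it with one of the returned authentication keys.

```python
# Hedged sketch: create a self-hosted integration runtime entry and fetch its
# authentication keys (used when installing the runtime on an on-premises node).
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import IntegrationRuntimeResource, SelfHostedIntegrationRuntime

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
rg_name, df_name = "rg-demo", "adf-demo"

ir = IntegrationRuntimeResource(
    properties=SelfHostedIntegrationRuntime(description="IR for on-premises data stores"))
adf_client.integration_runtimes.create_or_update(rg_name, df_name, "OnPremIR", ir)

keys = adf_client.integration_runtimes.list_auth_keys(rg_name, df_name, "OnPremIR")
print(keys.auth_key1)  # register the on-premises node with this key
```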

What is the limit on the number of integration runtimes?

There is no hard limit on the number of integration runtime instances you can have in a data factory. There is, however, a limit on the number of VM cores that the integration runtime can use per subscription for SSIS package execution. For more information, see Data Factory limits.

What are the top-level concepts of Azure Data Factory?

An Azure subscription can have one or more Azure Data Factory instances (or data factories). Azure Data Factory contains four key components that work together as a platform on which you can compose data-driven workflows with steps to move and transform data.

Pipelines

A data factory can have one or more pipelines. A pipeline is a logical grouping of activities to perform a unit of work. Together, the activities in a pipeline perform a task. For example, a pipeline can contain a group of activities that ingest data from an Azure blob and then run a Hive query on an HDInsight cluster to partition the data. The benefit is that you can use a pipeline to manage the activities as a set instead of having to manage each activity individually. You can chain together the activities in a pipeline to operate them sequentially, or you can operate them independently, in parallel.

Activities

Activities represent a processing step in a pipeline. For example, you can use a Copy activity to copy data from one data store to another data store. Similarly, you can use a Hive activity, which runs a Hive query on an Azure HDInsight cluster, to transform or analyze your data. Data Factory supports three types of activities: data movement activities, data transformation activities, and control activities.

Datasets

Datasets represent data structures within the data stores, which simply point to or reference the data you want to use in your activities as inputs or outputs.
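
For example, a minimal Python SDK sketch of a blob dataset definition, assuming a linked service named AzureStorageLinkedService like the one defined in the next section (the container, folder, and file names are placeholders), looks roughly like this:

```python
# Hedged sketch: a dataset that points at a folder and file in Azure Blob storage.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    DatasetResource, AzureBlobDataset, LinkedServiceReference)

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

blob_dataset = DatasetResource(
    properties=AzureBlobDataset(
        linked_service_name=LinkedServiceReference(reference_name="AzureStorageLinkedService"),
        folder_path="input-container/raw",   # placeholder container/folder
        file_name="sales.csv"))              # placeholder file

adf_client.datasets.create_or_update("rg-demo", "adf-demo", "SalesBlobDataset", blob_dataset)
```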

Linked services

Linked services are much like connection strings, which define the connection information needed for Data Factory to connect to external resources. Think of it this way: a linked service defines the connection to the data source, and a dataset represents the structure of the data. For example, an Azure Storage linked service specifies the connection string to connect to the Azure Storage account. An Azure blob dataset specifies the blob container and the folder that contains the data.

Linked services have two purposes in Data Factory:

  • To represent a data store that includes, but is not limited to, an on-premises SQL Server instance, an Oracle database instance, a file share, or an Azure Blob storage account. For a list of supported data stores, see Copy Activity in Azure Data Factory.
  • To represent a compute resource that can host the execution of an activity. For example, the HDInsight Hive activity runs on an HDInsight Hadoop cluster. For a list of transformation activities and supported compute environments, see Transform data in Azure Data Factory.
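
For instance, a hedged Python SDK sketch of defining an Azure Storage linked service (the connection string and all names are placeholders) might be:

```python
# Hedged sketch: a linked service holding the connection info for an Azure Storage account.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    LinkedServiceResource, AzureStorageLinkedService, SecureString)

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

conn = "DefaultEndpointsProtocol=https;AccountName=<account>;AccountKey=<key>"  # placeholder
storage_ls = LinkedServiceResource(
    properties=AzureStorageLinkedService(connection_string=SecureString(value=conn)))

adf_client.linked_services.create_or_update(
    "rg-demo", "adf-demo", "AzureStorageLinkedService", storage_ls)
```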

Triggers

Triggers represent units of processing that determine when a pipeline execution is kicked off. There are different types of triggers for different types of events.

Pipeline runs

A pipeline run is an instance of a pipeline execution. You usually instantiate a pipeline run by passing arguments to the parameters that are defined in the pipeline. You can pass the arguments manually or within the trigger definition.
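
As a sketch (the pipeline name and parameter are assumptions carried over from the earlier examples), starting a run manually and passing arguments from the Python SDK looks roughly like this:

```python
# Hedged sketch: instantiate a pipeline run and pass arguments for its parameters.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
rg_name, df_name = "rg-demo", "adf-demo"

# Arguments for parameters defined on the pipeline (here the hypothetical "runDate").
run = adf_client.pipelines.create_run(
    rg_name, df_name, "ControlFlowDemo", parameters={"runDate": "2024-01-01"})

status = adf_client.pipeline_runs.get(rg_name, df_name, run.run_id)
print(status.status)   # Queued / InProgress / Succeeded / Failed
```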

Parameters

Parameters are key-value pairs in a read-only configuration. You define parameters in a pipeline, and you pass the arguments for the defined parameters during execution from a run context. The run context is created by a trigger or from a pipeline that you execute manually. Activities within the pipeline consume the parameter values.

A dataset is a strongly typed parameter and an entity that you can reuse or reference. An activity can reference datasets, and it can consume the properties that are defined in the dataset definition.

A linked service is also a strongly typed parameter that contains connection information to either a data store or a compute environment. It's also an entity that you can reuse or reference.

Control flows

Control flows orchestrate pipeline activities. They include chaining activities in a sequence, branching, parameters that you define at the pipeline level, and arguments that you pass as you invoke the pipeline on demand or from a trigger. Control flows also include custom state passing and looping containers (that is, ForEach iterators).

For more information about Data Factory concepts, see the following articles:

What is the pricing model for Data Factory?

For Azure Data Factory pricing details, see Data Factory pricing details.

How can I stay up-to-date with information about Data Factory?

For the most up-to-date information about Azure Data Factory, go to the following sites:

Technical deep dive

How can I schedule a pipeline?

You can use the scheduler trigger or time window trigger to schedule a pipeline. The trigger uses a wall-clock calendar schedule, which can schedule pipelines periodically or in calendar-based recurrent patterns (for example, on Mondays at 6:00 PM and Thursdays at 9:00 PM). For more information, see Pipeline execution and triggers.
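
For example, a hedged Python SDK sketch of a schedule trigger that runs a pipeline once a day (the pipeline name, start time, and parameter are placeholders, and model names may vary slightly across SDK versions) might look like this:

```python
# Hedged sketch: a schedule (wall-clock) trigger that runs a pipeline once a day.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    TriggerResource, ScheduleTrigger, ScheduleTriggerRecurrence,
    TriggerPipelineReference, PipelineReference)

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
rg_name, df_name = "rg-demo", "adf-demo"

recurrence = ScheduleTriggerRecurrence(
    frequency="Day", interval=1,
    start_time="2024-01-01T18:00:00Z", time_zone="UTC")

trigger = ScheduleTrigger(
    recurrence=recurrence,
    pipelines=[TriggerPipelineReference(
        pipeline_reference=PipelineReference(reference_name="ControlFlowDemo"),
        parameters={"runDate": "@trigger().scheduledTime"})])

adf_client.triggers.create_or_update(
    rg_name, df_name, "DailyTrigger", TriggerResource(properties=trigger))
# The trigger only fires after it is started ("begin_start" in newer SDK versions).
adf_client.triggers.start(rg_name, df_name, "DailyTrigger")
```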

Can I pass parameters to a pipeline run?

Yes. Parameters are a first-class, top-level concept in Data Factory. You can define parameters at the pipeline level and pass arguments as you execute the pipeline run on demand or by using a trigger.

Can I define default values for the pipeline parameters?

Yes. You can define default values for the parameters in the pipelines.

Can an activity in a pipeline consume arguments that are passed to a pipeline run?

Yes. Each activity within the pipeline can consume the parameter value that's passed to the pipeline and run with the @parameter construct.

Can an activity output property be consumed in another activity?

Yes. An activity output can be consumed in a subsequent activity with the @activity construct.

How do I gracefully handle null values in an activity output?

You can use the @coalesce construct in the expressions to handle null values gracefully.
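
Putting the three constructs above together, the expression strings below sketch how they typically appear inside activity settings; the activity name "LookupWatermark" and the parameter "windowStart" are hypothetical and used only for illustration.

```python
# Hedged sketch of ADF expression strings as they would appear in activity settings.

# @pipeline().parameters.* reads an argument passed to the pipeline run.
source_query = "SELECT * FROM sales WHERE ModifiedDate > '@{pipeline().parameters.windowStart}'"

# @activity('<name>').output reads the output of an earlier activity in the pipeline.
new_watermark = "@activity('LookupWatermark').output.firstRow.NewWatermarkValue"

# @coalesce(...) falls back to a default when the activity output is null.
safe_watermark = "@coalesce(activity('LookupWatermark').output.firstRow.NewWatermarkValue, '1900-01-01')"
```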

Next steps

For step-by-step instructions to create a data factory, see the following tutorials: