Transform data in Azure Data Factory

APPLIES TO: Azure Data Factory, Azure Synapse Analytics (Preview)

Overview

This article explains the data transformation activities in Azure Data Factory that you can use to transform and process raw data into predictions and insights at scale. A transformation activity executes in a computing environment such as Azure HDInsight. This article provides links to articles with detailed information on each transformation activity.

Data Factory supports the following data transformation activities, which can be added to pipelines either individually or chained with other activities.

External transformations

You can hand-code transformations and manage the external compute environment yourself.

HDInsight Hive activity

The HDInsight Hive activity in a Data Factory pipeline executes Hive queries on your own or on-demand Windows/Linux-based HDInsight cluster. See the Hive activity article for details about this activity.
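As an illustrative sketch, a Hive activity is defined in pipeline JSON with a reference to an HDInsight linked service and the location of the HiveQL script. The linked-service names, script path, and parameter values below are placeholders, not values from this article:

```json
{
    "name": "HiveActivitySample",
    "type": "HDInsightHive",
    "linkedServiceName": {
        "referenceName": "MyHDInsightLinkedService",
        "type": "LinkedServiceReference"
    },
    "typeProperties": {
        "scriptPath": "adfscripts/hivescript.hql",
        "scriptLinkedService": {
            "referenceName": "MyStorageLinkedService",
            "type": "LinkedServiceReference"
        },
        "defines": {
            "param1": "value1"
        }
    }
}
```

The `defines` section passes Hive parameters into the script; the script itself lives in the storage account referenced by `scriptLinkedService`.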

HDInsight Pig activity

The HDInsight Pig activity in a Data Factory pipeline executes Pig queries on your own or on-demand Windows/Linux-based HDInsight cluster. See the Pig activity article for details about this activity.

HDInsight MapReduce activity

The HDInsight MapReduce activity in a Data Factory pipeline executes MapReduce programs on your own or on-demand Windows/Linux-based HDInsight cluster. See the MapReduce activity article for details about this activity.
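A MapReduce activity points at a JAR file in storage and names the main class to run. This is a sketch with hypothetical class and path names (`org.myorg.WordCount`, `scripts/wordcount.jar`), not values from this article:

```json
{
    "name": "MapReduceActivitySample",
    "type": "HDInsightMapReduce",
    "linkedServiceName": {
        "referenceName": "MyHDInsightLinkedService",
        "type": "LinkedServiceReference"
    },
    "typeProperties": {
        "className": "org.myorg.WordCount",
        "jarFilePath": "scripts/wordcount.jar",
        "jarLinkedService": {
            "referenceName": "MyStorageLinkedService",
            "type": "LinkedServiceReference"
        },
        "arguments": ["input/davinci.txt", "output/"]
    }
}
```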

HDInsight Streaming activity

The HDInsight Streaming activity in a Data Factory pipeline executes Hadoop Streaming programs on your own or on-demand Windows/Linux-based HDInsight cluster. See the HDInsight Streaming activity article for details about this activity.
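A Streaming activity names the mapper and reducer executables and the input/output locations. The executables, storage account, and paths below are placeholders for illustration only:

```json
{
    "name": "StreamingActivitySample",
    "type": "HDInsightStreaming",
    "linkedServiceName": {
        "referenceName": "MyHDInsightLinkedService",
        "type": "LinkedServiceReference"
    },
    "typeProperties": {
        "mapper": "cat.exe",
        "reducer": "wc.exe",
        "input": "wasbs://<container>@<account>.blob.core.windows.net/example/davinci.txt",
        "output": "wasbs://<container>@<account>.blob.core.windows.net/example/wc",
        "filePaths": ["example/apps/cat.exe", "example/apps/wc.exe"],
        "fileLinkedService": {
            "referenceName": "MyStorageLinkedService",
            "type": "LinkedServiceReference"
        }
    }
}
```

`filePaths` lists the mapper/reducer binaries to stage onto the cluster from the storage account referenced by `fileLinkedService`.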

HDInsight Spark activity

The HDInsight Spark activity in a Data Factory pipeline executes Spark programs on your own HDInsight cluster. For details, see Invoke Spark programs from Azure Data Factory.
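A Spark activity identifies the root folder of the Spark job and the entry-point file within it. The folder and file names here (`adfspark`, `main.py`) are hypothetical placeholders:

```json
{
    "name": "SparkActivitySample",
    "type": "HDInsightSpark",
    "linkedServiceName": {
        "referenceName": "MyHDInsightLinkedService",
        "type": "LinkedServiceReference"
    },
    "typeProperties": {
        "rootPath": "adfspark",
        "entryFilePath": "main.py",
        "sparkJobLinkedService": {
            "referenceName": "MyStorageLinkedService",
            "type": "LinkedServiceReference"
        }
    }
}
```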

Stored procedure activity

You can use the SQL Server Stored Procedure activity in a Data Factory pipeline to invoke a stored procedure in one of the following data stores: Azure SQL Database, Azure Synapse Analytics (formerly SQL Data Warehouse), or a SQL Server database in your enterprise or on an Azure VM. See the Stored procedure activity article for details.
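A stored procedure activity references a linked service for the database and names the procedure and its parameters. The procedure name and parameter below are hypothetical:

```json
{
    "name": "SprocActivitySample",
    "type": "SqlServerStoredProcedure",
    "linkedServiceName": {
        "referenceName": "MyAzureSqlDatabaseLinkedService",
        "type": "LinkedServiceReference"
    },
    "typeProperties": {
        "storedProcedureName": "usp_sample",
        "storedProcedureParameters": {
            "identifier": { "value": "1", "type": "Int" }
        }
    }
}
```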

Custom activity

If you need to transform data in a way that isn't supported by Data Factory, you can create a custom activity with your own data-processing logic and use that activity in the pipeline. You can configure the custom .NET activity to run using either an Azure Batch service or an Azure HDInsight cluster. See the Use custom activities article for details.
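A custom activity running on Azure Batch specifies the command to execute and where its binaries live. The executable name, folder path, and linked-service names are placeholders for illustration:

```json
{
    "name": "CustomActivitySample",
    "type": "Custom",
    "linkedServiceName": {
        "referenceName": "MyAzureBatchLinkedService",
        "type": "LinkedServiceReference"
    },
    "typeProperties": {
        "command": "MyCustomApp.exe arg1",
        "resourceLinkedService": {
            "referenceName": "MyStorageLinkedService",
            "type": "LinkedServiceReference"
        },
        "folderPath": "customapp/"
    }
}
```

Data Factory stages the contents of `folderPath` from the storage account onto the Batch node and then runs `command` there.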

You can also create a custom activity to run R scripts on an HDInsight cluster with R installed. See Run R Script using Azure Data Factory.

Compute environments

You create a linked service for the compute environment and then reference that linked service when defining a transformation activity. Data Factory supports two types of compute environments:

  • On-demand: The computing environment is fully managed by Data Factory. It's automatically created by the Data Factory service before a job is submitted to process data, and removed when the job completes. You can configure and control granular settings of the on-demand compute environment for job execution, cluster management, and bootstrapping actions.
  • Bring your own: You register your own computing environment (for example, an HDInsight cluster) as a linked service in Data Factory. You manage the computing environment, and the Data Factory service uses it to execute the activities.
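For the on-demand case, the cluster's size, type, and lifetime are declared in the linked-service definition itself. This is a sketch of an on-demand HDInsight linked service; every `<...>` value is a placeholder you must supply, and the storage linked-service name is hypothetical:

```json
{
    "name": "MyOnDemandHDInsightLinkedService",
    "properties": {
        "type": "HDInsightOnDemand",
        "typeProperties": {
            "clusterType": "hadoop",
            "clusterSize": 4,
            "timeToLive": "00:15:00",
            "hostSubscriptionId": "<subscription-id>",
            "clusterResourceGroup": "<resource-group>",
            "tenant": "<tenant-id>",
            "servicePrincipalId": "<service-principal-id>",
            "servicePrincipalKey": { "type": "SecureString", "value": "<key>" },
            "linkedServiceName": {
                "referenceName": "MyStorageLinkedService",
                "type": "LinkedServiceReference"
            }
        }
    }
}
```

`timeToLive` controls how long the cluster is kept alive after a job finishes before Data Factory deletes it, which lets consecutive activities reuse the same cluster.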

See the Compute linked services article to learn about the compute services supported by Data Factory.

Next steps

See the following tutorial for an example of using a transformation activity: Tutorial: Transform data using Spark.