Introduction to Azure Data Factory

In the world of big data, raw, unorganized data is often stored in relational, non-relational, and other storage systems. However, on its own, raw data doesn't have the proper context or meaning to provide meaningful insights to analysts, data scientists, or business decision makers.

Big data requires a service that can orchestrate and operationalize processes to refine these enormous stores of raw data into actionable business insights. Azure Data Factory is a managed cloud service that's built for these complex hybrid extract-transform-load (ETL), extract-load-transform (ELT), and data integration projects.

For example, imagine a gaming company that collects petabytes of game logs that are produced by games in the cloud. The company wants to analyze these logs to gain insights into customer preferences, demographics, and usage behavior. It also wants to identify up-sell and cross-sell opportunities, develop compelling new features, drive business growth, and provide a better experience to its customers.

To analyze these logs, the company needs to use reference data such as customer information, game information, and marketing campaign information that is in an on-premises data store. The company wants to combine this data from the on-premises data store with additional log data that it has in a cloud data store.

To extract insights, the company wants to process the joined data by using a Spark cluster in the cloud (Azure HDInsight), and publish the transformed data into a cloud data warehouse such as Azure SQL Data Warehouse to easily build a report on top of it. The company wants to automate this workflow, and monitor and manage it on a daily schedule. It also wants to execute the workflow when files land in a blob store container.

Azure Data Factory is the platform that solves such data scenarios. It is a cloud-based data integration service that allows you to create data-driven workflows in the cloud for orchestrating and automating data movement and data transformation. Using Azure Data Factory, you can create and schedule data-driven workflows (called pipelines) that can ingest data from disparate data stores. It can process and transform the data by using compute services such as Azure HDInsight Hadoop, Spark, and Azure Machine Learning.

Additionally, you can publish output data to data stores such as Azure SQL Data Warehouse for business intelligence (BI) applications to consume. Ultimately, through Azure Data Factory, raw data can be organized into meaningful data stores and data lakes for better business decisions.
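As a concrete starting point, the following is a minimal sketch of creating a data factory instance programmatically. It assumes the azure-mgmt-datafactory and azure-identity Python packages; the subscription ID, resource group, factory name, and region are placeholders rather than values from this article.

```python
# Minimal sketch: create a Data Factory instance with the Python SDK.
# Assumes the azure-mgmt-datafactory and azure-identity packages; all names
# and IDs below are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import Factory

subscription_id = "<subscription-id>"
resource_group = "<resource-group>"
factory_name = "mydatafactory"

# Authenticate and create a management client for Data Factory.
adf_client = DataFactoryManagementClient(DefaultAzureCredential(), subscription_id)

# Create (or update) the data factory in a supported region.
factory = adf_client.factories.create_or_update(
    resource_group, factory_name, Factory(location="eastus")
)
print(factory.provisioning_state)
```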

Top-level view of Data Factory

How does it work?

The pipelines (data-driven workflows) in Azure Data Factory typically perform the following four steps:

The four steps of a data-driven workflow

Connect and collect

Enterprises have data of various types that are located in disparate sources on-premises and in the cloud (structured, unstructured, and semi-structured), all arriving at different intervals and speeds.

The first step in building an information production system is to connect to all the required sources of data and processing, such as software-as-a-service (SaaS) services, databases, file shares, and FTP web services. The next step is to move the data as needed to a centralized location for subsequent processing.

Without Data Factory, enterprises must build custom data movement components or write custom services to integrate these data sources and processing. It's expensive and hard to integrate and maintain such systems. In addition, they often lack the enterprise-grade monitoring, alerting, and controls that a fully managed service can offer.

With Data Factory, you can use the Copy Activity in a data pipeline to move data from both on-premises and cloud source data stores to a centralized data store in the cloud for further analysis. For example, you can collect data in Azure Blob storage and transform it later by using an Azure HDInsight Hadoop cluster.
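A rough sketch of such a pipeline with the azure-mgmt-datafactory Python SDK is shown below; "RawLogs" and "StagedLogs" are placeholder names for datasets that would be defined separately (datasets and linked services are described under Top-level concepts).

```python
# Sketch: a pipeline with a single Copy Activity that moves blob data to a
# centralized store. Assumes the azure-mgmt-datafactory models; dataset names
# are placeholders defined elsewhere.
from azure.mgmt.datafactory.models import (
    PipelineResource, CopyActivity, DatasetReference, BlobSource, BlobSink,
)

copy_activity = CopyActivity(
    name="CopyLogsToStaging",
    inputs=[DatasetReference(reference_name="RawLogs")],
    outputs=[DatasetReference(reference_name="StagedLogs")],
    source=BlobSource(),  # how to read from the input dataset
    sink=BlobSink(),      # how to write to the output dataset
)

pipeline = PipelineResource(activities=[copy_activity])
# adf_client.pipelines.create_or_update(resource_group, factory_name,
#                                       "CopyRawLogs", pipeline)
```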

Transform and enrich

After data is present in a centralized data store in the cloud, process or transform the collected data by using compute services such as HDInsight Hadoop and Spark. You want to reliably produce transformed data on a maintainable and controlled schedule to feed production environments with trusted data.
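As a hedged illustration, a transformation step might be expressed as an HDInsight Hive activity in a pipeline, as sketched below; the cluster and storage linked service names and the Hive script path are placeholders, and the exact model fields can vary between SDK versions.

```python
# Hedged sketch: a Hive transformation activity that runs a script on an
# HDInsight cluster. Linked service names and the script path are placeholders.
from azure.mgmt.datafactory.models import (
    PipelineResource, HDInsightHiveActivity, LinkedServiceReference,
)

hive_activity = HDInsightHiveActivity(
    name="PartitionGameLogs",
    # Compute linked service: the HDInsight cluster that runs the query.
    linked_service_name=LinkedServiceReference(
        type="LinkedServiceReference", reference_name="MyHDInsightCluster"),
    # Storage linked service and path where the Hive script lives.
    script_path="scripts/partition_logs.hql",
    script_linked_service=LinkedServiceReference(
        type="LinkedServiceReference", reference_name="MyBlobStorage"),
)

pipeline = PipelineResource(activities=[hive_activity])
```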

Publish

After the raw data has been refined into a business-ready consumable form, load the data into Azure Data Warehouse, Azure SQL Database, Azure CosmosDB, or whichever analytics engine your business users can point to from their business intelligence tools.

Monitor

After you have successfully built and deployed your data integration pipeline, providing business value from refined data, monitor the scheduled activities and pipelines for success and failure rates. Azure Data Factory has built-in support for pipeline monitoring via Azure Monitor, API, PowerShell, Azure Monitor logs, and health panels on the Azure portal.
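For example, a sketch of checking a run's status programmatically (assuming the azure-mgmt-datafactory SDK and a run ID obtained when the pipeline was started; all names below are placeholders) might look like this:

```python
# Sketch: check the status of a pipeline run and list its activity runs.
# Assumes the azure-mgmt-datafactory and azure-identity packages; the
# subscription, resource group, factory name, and run ID are placeholders.
from datetime import datetime, timedelta
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import RunFilterParameters

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
resource_group, factory_name = "<resource-group>", "mydatafactory"
run_id = "<pipeline-run-id>"  # returned when the pipeline run was created

# Status of the pipeline run, e.g. "InProgress", "Succeeded", or "Failed".
pipeline_run = adf_client.pipeline_runs.get(resource_group, factory_name, run_id)
print(pipeline_run.status)

# Activity runs that belong to this pipeline run.
filters = RunFilterParameters(
    last_updated_after=datetime.utcnow() - timedelta(days=1),
    last_updated_before=datetime.utcnow() + timedelta(days=1),
)
activity_runs = adf_client.activity_runs.query_by_pipeline_run(
    resource_group, factory_name, run_id, filters)
for activity_run in activity_runs.value:
    print(activity_run.activity_name, activity_run.status)
```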

Top-level concepts

An Azure subscription might have one or more Azure Data Factory instances (or data factories). Azure Data Factory is composed of four key components. These components work together to provide the platform on which you can compose data-driven workflows with steps to move and transform data.

Pipeline

A data factory might have one or more pipelines. A pipeline is a logical grouping of activities that performs a unit of work. Together, the activities in a pipeline perform a task. For example, a pipeline can contain a group of activities that ingests data from an Azure blob, and then runs a Hive query on an HDInsight cluster to partition the data.

The benefit is that the pipeline allows you to manage the activities as a set instead of managing each one individually. The activities in a pipeline can be chained together to operate sequentially, or they can operate independently in parallel.
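For illustration, the sketch below chains two copy activities so that the second runs only after the first succeeds; omitting the dependency would let them run independently in parallel. It assumes the azure-mgmt-datafactory models, and all activity and dataset names are placeholders.

```python
# Sketch: two activities in one pipeline, chained with a dependency.
# Dataset and activity names are placeholders.
from azure.mgmt.datafactory.models import (
    PipelineResource, CopyActivity, DatasetReference, BlobSource, BlobSink,
    ActivityDependency,
)

ingest = CopyActivity(
    name="IngestLogs",
    inputs=[DatasetReference(reference_name="RawLogs")],
    outputs=[DatasetReference(reference_name="StagedLogs")],
    source=BlobSource(), sink=BlobSink(),
)

publish = CopyActivity(
    name="PublishLogs",
    inputs=[DatasetReference(reference_name="StagedLogs")],
    outputs=[DatasetReference(reference_name="CuratedLogs")],
    source=BlobSource(), sink=BlobSink(),
    # Runs only after IngestLogs succeeds; without depends_on the two
    # activities would be independent and could run in parallel.
    depends_on=[ActivityDependency(activity="IngestLogs",
                                   dependency_conditions=["Succeeded"])],
)

pipeline = PipelineResource(activities=[ingest, publish])
```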

Activity

Activities represent a processing step in a pipeline. For example, you might use a copy activity to copy data from one data store to another data store. Similarly, you might use a Hive activity, which runs a Hive query on an Azure HDInsight cluster, to transform or analyze your data. Data Factory supports three types of activities: data movement activities, data transformation activities, and control activities.

Datasets

Datasets represent data structures within the data stores, which simply point to or reference the data you want to use in your activities as inputs or outputs.
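A minimal sketch of an Azure blob dataset definition follows, assuming the azure-mgmt-datafactory models; the linked service name, folder path, and file name are placeholders.

```python
# Sketch: a blob dataset that points at a folder and file in Azure Blob storage.
# The referenced linked service ("MyBlobStorage") is a placeholder defined elsewhere.
from azure.mgmt.datafactory.models import (
    DatasetResource, AzureBlobDataset, LinkedServiceReference,
)

raw_logs = DatasetResource(
    properties=AzureBlobDataset(
        linked_service_name=LinkedServiceReference(
            type="LinkedServiceReference", reference_name="MyBlobStorage"),
        folder_path="gamelogs/raw",
        file_name="events.csv",
    )
)
# adf_client.datasets.create_or_update(resource_group, factory_name,
#                                      "RawLogs", raw_logs)
```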

Linked services

Linked services are much like connection strings, which define the connection information that's needed for Data Factory to connect to external resources. Think of it this way: a linked service defines the connection to the data source, and a dataset represents the structure of the data. For example, an Azure Storage linked service specifies a connection string to connect to the Azure Storage account, while an Azure blob dataset specifies the blob container and the folder that contains the data.
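A minimal sketch of defining such an Azure Storage linked service with the Python SDK might look like the following; the connection string value is a placeholder secret.

```python
# Sketch: an Azure Storage linked service that holds the connection information
# used by datasets and activities. The connection string below is a placeholder.
from azure.mgmt.datafactory.models import (
    LinkedServiceResource, AzureStorageLinkedService, SecureString,
)

my_blob_storage = LinkedServiceResource(
    properties=AzureStorageLinkedService(
        connection_string=SecureString(
            value="DefaultEndpointsProtocol=https;AccountName=<account>;AccountKey=<key>"
        )
    )
)
# adf_client.linked_services.create_or_update(resource_group, factory_name,
#                                             "MyBlobStorage", my_blob_storage)
```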

Linked services are used for two purposes in Data Factory:

  • To represent a data store that includes, but isn't limited to, an on-premises SQL Server database, Oracle database, file share, or Azure blob storage account. For a list of supported data stores, see the copy activity article.

  • To represent a compute resource that can host the execution of an activity. For example, the HDInsightHive activity runs on an HDInsight Hadoop cluster. For a list of transformation activities and supported compute environments, see the transform data article.

Triggers

Triggers represent the unit of processing that determines when a pipeline execution needs to be kicked off. There are different types of triggers for different types of events.
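As an illustration, a daily schedule trigger that starts a pipeline might be sketched as follows; it assumes the azure-mgmt-datafactory models, and the pipeline name, start time, and parameter values are placeholders.

```python
# Sketch: a schedule trigger that runs a pipeline once a day.
# Pipeline name, start time, and parameters are placeholders.
from azure.mgmt.datafactory.models import (
    TriggerResource, ScheduleTrigger, ScheduleTriggerRecurrence,
    TriggerPipelineReference, PipelineReference,
)

daily_trigger = TriggerResource(
    properties=ScheduleTrigger(
        recurrence=ScheduleTriggerRecurrence(
            frequency="Day", interval=1,
            start_time="2024-01-01T00:00:00", time_zone="UTC",
        ),
        pipelines=[TriggerPipelineReference(
            pipeline_reference=PipelineReference(reference_name="CopyRawLogs"),
            parameters={"inputFolder": "gamelogs/raw"},
        )],
    )
)
# adf_client.triggers.create_or_update(resource_group, factory_name,
#                                      "DailyTrigger", daily_trigger)
```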

Pipeline runs

A pipeline run is an instance of a pipeline execution. Pipeline runs are typically instantiated by passing arguments to the parameters that are defined in the pipeline. The arguments can be passed manually or within the trigger definition.
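A minimal sketch of starting a pipeline run on demand and passing arguments for its parameters, assuming the azure-mgmt-datafactory SDK; the names and parameter values are placeholders.

```python
# Sketch: start a pipeline run manually and pass arguments for its parameters.
# Subscription, resource group, factory, pipeline, and parameter names are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

run_response = adf_client.pipelines.create_run(
    "<resource-group>", "mydatafactory", "CopyRawLogs",
    parameters={"inputFolder": "gamelogs/2024-01-01"},
)
print(run_response.run_id)  # keep this ID to monitor the run later
```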

Parameters

Parameters are key-value pairs of read-only configuration. Parameters are defined in the pipeline. The arguments for the defined parameters are passed during execution from the run context that was created by a trigger or from a pipeline that was executed manually. Activities within the pipeline consume the parameter values.
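For illustration, the sketch below declares a pipeline parameter and consumes it in an activity through a Data Factory expression. It assumes the referenced dataset declares a matching folderPath dataset parameter, and all names are placeholders.

```python
# Sketch: a pipeline parameter consumed by an activity via an expression.
# Assumes the "RawLogs" dataset declares a "folderPath" dataset parameter.
from azure.mgmt.datafactory.models import (
    PipelineResource, ParameterSpecification, CopyActivity, DatasetReference,
    BlobSource, BlobSink,
)

pipeline = PipelineResource(
    # Read-only key-value definitions; arguments arrive at run time.
    parameters={"inputFolder": ParameterSpecification(type="String")},
    activities=[CopyActivity(
        name="CopyFromParameterizedFolder",
        inputs=[DatasetReference(
            reference_name="RawLogs",
            # Pass the pipeline parameter down to the dataset.
            parameters={"folderPath": "@pipeline().parameters.inputFolder"},
        )],
        outputs=[DatasetReference(reference_name="StagedLogs")],
        source=BlobSource(), sink=BlobSink(),
    )],
)
```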

A dataset is a strongly typed parameter and a reusable/referenceable entity. An activity can reference datasets and can consume the properties that are defined in the dataset definition.

A linked service is also a strongly typed parameter that contains the connection information to either a data store or a compute environment. It is also a reusable/referenceable entity.

Control flow

Control flow is an orchestration of pipeline activities that includes chaining activities in a sequence, branching, defining parameters at the pipeline level, and passing arguments while invoking the pipeline on demand or from a trigger. It also includes custom-state passing and looping containers, that is, For-each iterators.
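A hedged sketch of a looping container follows: a ForEach activity that iterates over an array parameter and runs a copy step per item. Model names are from azure-mgmt-datafactory, and the dataset and parameter names are placeholders.

```python
# Sketch: control flow with a ForEach container iterating over a pipeline parameter.
# Dataset and parameter names are placeholders.
from azure.mgmt.datafactory.models import (
    PipelineResource, ParameterSpecification, ForEachActivity, Expression,
    CopyActivity, DatasetReference, BlobSource, BlobSink,
)

copy_one_file = CopyActivity(
    name="CopyOneFile",
    inputs=[DatasetReference(reference_name="RawLogs")],
    outputs=[DatasetReference(reference_name="StagedLogs")],
    source=BlobSource(), sink=BlobSink(),
)

for_each = ForEachActivity(
    name="ForEachFile",
    items=Expression(value="@pipeline().parameters.files"),
    activities=[copy_one_file],
)

pipeline = PipelineResource(
    parameters={"files": ParameterSpecification(type="Array")},
    activities=[for_each],
)
```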

For more information about Data Factory concepts, see the following articles:

Supported regions

For a list of Azure regions in which Data Factory is currently available, select the regions that interest you on the following page, and then expand Analytics to locate Data Factory: Products available by region. However, a data factory can access data stores and compute services in other Azure regions to move data between data stores or to process data by using compute services.

Azure Data Factory itself does not store any data. It lets you create data-driven workflows to orchestrate the movement of data between supported data stores and the processing of data by using compute services in other regions or in an on-premises environment. It also allows you to monitor and manage workflows by using both programmatic and UI mechanisms.

Although Data Factory is available only in certain regions, the service that powers the data movement in Data Factory is available globally in several regions. If a data store is behind a firewall, a Self-hosted Integration Runtime that's installed in your on-premises environment moves the data instead.

For example, assume that your compute environment, such as an Azure HDInsight cluster, runs in the China North 2 region. You can create and use an Azure Data Factory instance in China East 2 and use it to schedule jobs on your compute environment in China North 2. It takes a few milliseconds for Data Factory to trigger the job on your compute environment, but the time for running the job on your compute environment does not change.

Accessibility

The Data Factory user experience in the Azure portal is accessible.

Next steps

Get started with creating a Data Factory pipeline by using one of the following tools or SDKs: