Extract, transform, and load (ETL) at scale

Extract, transform, and load (ETL) is the process by which data is acquired from various sources. The data is collected in a standard location, cleaned, and processed. Ultimately, the data is loaded into a datastore from which it can be queried. Legacy ETL processes import data, clean it in place, and then store it in a relational data engine. With Azure HDInsight, a wide variety of Apache Hadoop environment components support ETL at scale.

The use of HDInsight in the ETL process is summarized by this pipeline:

HDInsight ETL at scale overview

The following sections explore each of the ETL phases and their associated components.

Orchestration

Orchestration spans all phases of the ETL pipeline. ETL jobs in HDInsight often involve several different products working in conjunction with each other. For example:

  • You might use Apache Hive to clean a portion of the data, and Apache Pig to clean another portion. Orchestration is needed to run the appropriate job at the appropriate time.

Apache Oozie

Apache Oozie is a workflow coordination system that manages Hadoop jobs. Oozie runs within an HDInsight cluster and is integrated with the Hadoop stack. Oozie supports Hadoop jobs for Apache Hadoop MapReduce, Pig, Hive, and Sqoop. You can also use Oozie to schedule jobs that are specific to a system, such as Java programs or shell scripts.
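As an illustration, a workflow that's already deployed to cluster storage can be started through Oozie's REST API. The following Python sketch is a minimal example under a few assumptions: it runs on the cluster head node, Oozie listens on its default port (11000), and the user name and workflow path are placeholders. Real jobs usually need additional properties, such as the name node and resource manager addresses.

```python
import requests

OOZIE_URL = "http://headnodehost:11000/oozie"  # assumption: default Oozie endpoint on the active head node

# Minimal job configuration; production workflows typically require more properties
# (nameNode, resourceManager, queueName, and any parameters the workflow expects).
job_config = """<configuration>
  <property>
    <name>user.name</name>
    <value>admin</value>
  </property>
  <property>
    <name>oozie.wf.application.path</name>
    <value>wasb:///example/workflows/etl</value>
  </property>
</configuration>"""

# Submit and start the workflow in one call (action=start).
resp = requests.post(
    f"{OOZIE_URL}/v1/jobs?action=start",
    data=job_config,
    headers={"Content-Type": "application/xml;charset=UTF-8"},
)
resp.raise_for_status()
print("Submitted Oozie job:", resp.json()["id"])
```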

For more information, see Use Apache Oozie with Apache Hadoop to define and run a workflow on HDInsight. See also Operationalize the Data Pipeline.

Azure Data Factory

Azure Data Factory provides orchestration capabilities in the form of platform as a service (PaaS). Azure Data Factory is a cloud-based data integration service. It allows you to create data-driven workflows for orchestrating and automating data movement and data transformation.

Use Azure Data Factory to:

  1. Create and schedule data-driven workflows. These pipelines ingest data from disparate data stores.
  2. Process and transform the data by using compute services such as HDInsight or Hadoop. You can also use Spark, Azure Data Lake Analytics, Azure Batch, or Azure Machine Learning for this step; see the sketch after this list.
  3. Publish output data to data stores, such as Azure SQL Data Warehouse, for BI applications to consume.
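To illustrate step 2, the sketch below defines a Data Factory pipeline whose single activity runs a Hive script on an HDInsight cluster, using the azure-mgmt-datafactory Python SDK. It assumes that the factory, an HDInsight linked service, and a storage linked service already exist, and that the names shown are placeholders; model and parameter names can vary between SDK versions.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    HDInsightHiveActivity,
    LinkedServiceReference,
    PipelineResource,
)

# Assumptions: subscription, resource group, factory, and linked services already exist.
SUBSCRIPTION_ID = "<subscription-id>"
RESOURCE_GROUP = "etl-rg"
FACTORY_NAME = "etl-factory"

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# A single activity that runs a Hive script stored in blob storage on an HDInsight cluster.
hive_activity = HDInsightHiveActivity(
    name="CleanRawData",
    linked_service_name=LinkedServiceReference(reference_name="HDInsightLinkedService"),
    script_path="scripts/clean-raw-data.hql",
    script_linked_service=LinkedServiceReference(reference_name="StorageLinkedService"),
)

pipeline = PipelineResource(activities=[hive_activity])
adf_client.pipelines.create_or_update(RESOURCE_GROUP, FACTORY_NAME, "EtlHivePipeline", pipeline)

# Trigger a one-off run of the pipeline.
run = adf_client.pipelines.create_run(RESOURCE_GROUP, FACTORY_NAME, "EtlHivePipeline")
print("Pipeline run ID:", run.run_id)
```

In practice, the same pipeline is usually attached to a schedule or tumbling-window trigger rather than started ad hoc.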

For more information on Azure Data Factory, see the documentation.

Ingest file storage and result storage

Source data files are typically loaded into a location on Azure Storage or Azure Data Lake Storage. The files are usually in a flat format, like CSV, but they can be in any format.

Azure Storage

Azure Storage has specific scalability targets. See Scalability and performance targets for Blob storage for more information. For most analytic nodes, Azure Storage scales best when dealing with many smaller files. As long as you're within your account limits, Azure Storage guarantees the same performance, no matter how large the files are. You can store terabytes of data and still get consistent performance, whether you're using a subset or all of the data.

Azure Storage has several types of blobs. An append blob is a great option for storing web logs or sensor data.

Multiple blobs can be distributed across many servers to scale out access to them. But a single blob is only served by a single server. Although blobs can be logically grouped in blob containers, there are no partitioning implications from this grouping.

Azure Storage has a WebHDFS API layer for the blob storage. All HDInsight services can access files in Azure Blob storage for data cleaning and data processing, similar to how those services would use Hadoop Distributed File System (HDFS).

Data is typically ingested into Azure Storage through PowerShell, the Azure Storage SDK, or AzCopy.
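For example, a source file can be uploaded with the azure-storage-blob Python SDK. This is a minimal sketch; the connection string, container name, and blob path are placeholders for values from your own environment.

```python
from azure.storage.blob import BlobServiceClient

# Assumptions: the storage account and container already exist, and the
# connection string is supplied from a secret store or environment variable.
CONNECTION_STRING = "<storage-account-connection-string>"
CONTAINER = "rawdata"

service = BlobServiceClient.from_connection_string(CONNECTION_STRING)
blob = service.get_blob_client(container=CONTAINER, blob="2024/01/sensor-readings.csv")

# Upload a local flat file; overwrite any existing blob with the same name.
with open("sensor-readings.csv", "rb") as source_file:
    blob.upload_blob(source_file, overwrite=True)

print("Uploaded", blob.url)
```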

Azure Data Lake Storage

Azure Data Lake Storage is a managed, hyperscale repository for analytics data. It's compatible with HDFS and uses a similar design paradigm. Data Lake Storage offers unlimited scalability for total capacity and for the size of individual files. It's a good choice when working with large files, because they can be stored across multiple nodes. Partitioning data in Data Lake Storage is done behind the scenes. You get massive throughput to run analytic jobs with thousands of concurrent executors that efficiently read and write hundreds of terabytes of data.

Data is usually ingested into Data Lake Storage through Azure Data Factory. You can also use Data Lake Storage SDKs, the AdlCopy service, Apache DistCp, or Apache Sqoop. The service you choose depends on where the data is. If it's in an existing Hadoop cluster, you might use Apache DistCp, the AdlCopy service, or Azure Data Factory. For data in Azure Blob storage, you might use the Azure Data Lake Storage .NET SDK, Azure PowerShell, or Azure Data Factory.
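As one SDK-based option, the sketch below uploads a local file with the azure-datalake-store Python package for Data Lake Storage Gen1. It assumes an Azure AD service principal that has access to the account; the tenant, client, account, and path values are placeholders.

```python
from azure.datalake.store import core, lib, multithread

# Assumptions: a Data Lake Storage Gen1 account and a service principal with
# write access already exist; all identifiers below are placeholders.
token = lib.auth(
    tenant_id="<tenant-id>",
    client_id="<client-id>",
    client_secret="<client-secret>",
)

adls = core.AzureDLFileSystem(token, store_name="<datalake-account-name>")

# Upload a local file using parallel, chunked transfers.
multithread.ADLUploader(
    adls,
    lpath="sensor-readings.csv",
    rpath="/raw/2024/01/sensor-readings.csv",
    overwrite=True,
)

print(adls.ls("/raw/2024/01"))
```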

Data Lake Storage is optimized for event ingestion through Azure Event Hubs or Apache Storm.

Considerations for both storage options

For uploading datasets in the terabyte range, network latency can be a major problem. This is particularly true if the data is coming from an on-premises location. In such cases, you can use these options:

  • Azure ExpressRoute: Create private connections between Azure datacenters and your on-premises infrastructure. These connections provide a reliable option for transferring large amounts of data. For more information, see the Azure ExpressRoute documentation.

  • Data upload from hard disk drives: You can use the Azure Import/Export service to ship hard disk drives with your data to an Azure datacenter. Your data is first uploaded to Azure Blob storage. You can then use Azure Data Factory or the AdlCopy tool to copy data from Azure Blob storage to Data Lake Storage.

Azure SQL Data Warehouse

Azure SQL Data Warehouse is an appropriate choice to store prepared results. You can use Azure HDInsight to prepare those results for loading into SQL Data Warehouse.

Azure SQL Data Warehouse is a relational database store optimized for analytic workloads. It scales based on partitioned tables. Tables can be partitioned across multiple nodes. The nodes are selected at the time of creation. They can scale after the fact, but that's an active process that might require data movement. For more information, see Manage compute in SQL Data Warehouse.
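As an illustration, prepared results are typically landed in distributed tables. The sketch below uses pyodbc to create a hash-distributed table with a clustered columnstore index; the server, database, credentials, and table schema are placeholders, not part of the original article.

```python
import pyodbc

# Assumptions: a SQL Data Warehouse (dedicated SQL pool) already exists and the
# connection values below come from your own environment.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=<server-name>.database.windows.net;"
    "DATABASE=<warehouse-name>;"
    "UID=<user>;PWD=<password>"
)
conn.autocommit = True  # run DDL outside an explicit transaction

create_table = """
CREATE TABLE dbo.FactSensorReadings
(
    DeviceId     INT        NOT NULL,
    ReadingTime  DATETIME2  NOT NULL,
    Reading      FLOAT      NOT NULL
)
WITH
(
    DISTRIBUTION = HASH(DeviceId),   -- spread rows across compute nodes by device
    CLUSTERED COLUMNSTORE INDEX      -- columnstore suits large analytic scans
);
"""

cursor = conn.cursor()
cursor.execute(create_table)
cursor.close()
conn.close()
```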

Apache HBase

Apache HBase is a key/value store available in Azure HDInsight. It's an open-source, NoSQL database that's built on Hadoop and modeled after Google BigTable. HBase provides performant random access and strong consistency for large amounts of unstructured and semi-structured data.

Because HBase is a schemaless database, you don't need to define columns and data types before you use them. Data is stored in the rows of a table, and is grouped by column family.
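For example, the sketch below writes and reads a row through the happybase Python client over HBase's Thrift gateway. The host name, table name, column family, and row key are placeholders, and it assumes the Thrift service is enabled on the cluster.

```python
import happybase

# Assumptions: the HBase Thrift server is reachable at this host, and a table
# named 'sensor_events' with a column family 'reading' already exists.
connection = happybase.Connection("hbase-thrift-host", port=9090)
table = connection.table("sensor_events")

# Row keys and values are byte strings; columns are created on first write,
# so no schema change is needed to add a new column.
table.put(b"device42|2024-01-15T10:00:00Z", {
    b"reading:temperature": b"21.7",
    b"reading:humidity": b"0.43",
})

row = table.row(b"device42|2024-01-15T10:00:00Z")
print(row[b"reading:temperature"])

connection.close()
```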

The open-source code scales linearly to handle petabytes of data on thousands of nodes. HBase relies on data redundancy, batch processing, and other features that are provided by distributed applications in the Hadoop environment.

HBase is a good destination for sensor and log data for future analysis.

HBase scalability depends on the number of nodes in the HDInsight cluster.

Azure SQL databases

Azure offers three PaaS relational databases:

  • Azure SQL Database is the PaaS implementation of Microsoft SQL Server.
  • Azure Database for MySQL is a managed MySQL database service.
  • Azure Database for PostgreSQL is a managed PostgreSQL database service.

Add more CPU and memory to scale up these products. You can also choose to use premium disks with the products for better I/O performance.

Azure Analysis Services

Azure Analysis Services is an analytical data engine used in decision support and business analytics. It provides the analytical data for business reports and client applications such as Power BI. The analytical data also works with Excel, SQL Server Reporting Services reports, and other data visualization tools.

Scale analysis cubes by changing tiers for each individual cube. For more information, see Azure Analysis Services pricing.

Extract and load

After the data exists in Azure, you can use many services to extract and load it into other products. HDInsight supports Sqoop and Flume.

Apache Sqoop

Apache Sqoop is a tool designed for efficiently transferring data between structured, semi-structured, and unstructured data sources.

Sqoop uses MapReduce to import and export the data, which provides parallel operation and fault tolerance.
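As an example, cleaned, delimited output in cluster storage can be exported into an Azure SQL database table. The sketch below shells out to the sqoop client from the cluster head node; the JDBC connection string, table name, and export directory are placeholders.

```python
import subprocess

# Assumptions: run from an HDInsight head node where the sqoop client is installed,
# the destination table already exists, and the placeholders below are filled in.
jdbc_url = (
    "jdbc:sqlserver://<server-name>.database.windows.net:1433;"
    "database=<database-name>;user=<user>;password=<password>"
)

subprocess.run(
    [
        "sqoop", "export",
        "--connect", jdbc_url,
        "--table", "CleanedReadings",        # destination SQL table
        "--export-dir", "/example/output",   # cleaned, delimited files in cluster storage
        "--fields-terminated-by", ",",
        "-m", "4",                           # number of parallel map tasks
    ],
    check=True,
)
```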

Apache Flume

Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. Its flexible architecture is based on streaming data flows. Flume is robust and fault-tolerant, with tunable reliability mechanisms. It has many failover and recovery mechanisms. Flume uses a simple, extensible data model that allows for online analytic applications.

Apache Flume can't be used with Azure HDInsight. But an on-premises Hadoop installation can use Flume to send data to either Azure Blob storage or Azure Data Lake Storage. For more information, see Using Apache Flume with HDInsight.

Transform

After data exists in the chosen location, you need to clean it, combine it, or prepare it for a specific usage pattern. Hive, Pig, and Spark SQL are all good choices for that kind of work. They're all supported on HDInsight.
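For instance, a Spark SQL job on the cluster can clean raw CSV files from the attached storage account and write the result back as Parquet for downstream loading. This is a minimal sketch; the wasbs:// containers, storage account, and column names are placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("clean-sensor-readings").getOrCreate()

# Assumption: raw CSV files live in an Azure Storage container attached to the cluster.
raw_path = "wasbs://rawdata@<storage-account>.blob.core.windows.net/sensor/*.csv"
clean_path = "wasbs://curated@<storage-account>.blob.core.windows.net/sensor/"

raw = spark.read.csv(raw_path, header=True, inferSchema=True)
raw.createOrReplaceTempView("raw_readings")

# Basic cleanup with Spark SQL: drop malformed rows and de-duplicate readings.
cleaned = spark.sql("""
    SELECT DISTINCT device_id, reading_time, CAST(temperature AS DOUBLE) AS temperature
    FROM raw_readings
    WHERE device_id IS NOT NULL AND temperature IS NOT NULL
""")

cleaned.write.mode("overwrite").parquet(clean_path)
```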

Next steps