Using Azure Data Lake Storage Gen2 for big data requirements

There are four key stages in big data processing:

  • Ingesting large amounts of data into a data store, in real time or in batches
  • Processing the data
  • Downloading the data
  • Visualizing the data

This article highlights the options and tools for each processing phase.

For a complete list of Azure services that you can use with Azure Data Lake Storage Gen2, see Integrate Azure Data Lake Storage with Azure services.

Ingest the data into Data Lake Storage Gen2

This section highlights the different sources of data and the different ways in which that data can be ingested into a Data Lake Storage Gen2 account.


Ad hoc data

This represents smaller data sets that are used for prototyping a big data application. There are different ways of ingesting ad hoc data, depending on the source of the data.

Here's a list of tools that you can use to ingest ad hoc data, by data source.

  • Local computer: Azure PowerShell, Storage Explorer, or the AzCopy tool
  • Azure Storage Blob: Azure Data Factory, the AzCopy tool, or DistCp running on an HDInsight cluster
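As a minimal sketch of the AzCopy route, the following helper assembles the AzCopy v10 `azcopy copy` command line for uploading a local directory to a Data Lake Storage Gen2 endpoint (the account name, container, and SAS token are placeholders, not values from this article):

```python
def build_azcopy_upload(local_dir, account, container, sas_token):
    """Builds an AzCopy v10 command that uploads local_dir recursively
    to the given Data Lake Storage Gen2 (dfs endpoint) container."""
    dest = f"https://{account}.dfs.core.windows.net/{container}?{sas_token}"
    return ["azcopy", "copy", local_dir, dest, "--recursive"]

# Placeholder names; in practice, run with subprocess.run(cmd, check=True).
cmd = build_azcopy_upload("./adhoc-data", "mystorageacct", "raw", "sv=placeholder")
```

Keeping the command as a list of arguments avoids shell-quoting issues when the SAS token contains special characters.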

Streamed data

This represents data that can be generated by various sources such as applications, devices, and sensors. This data can be ingested into Data Lake Storage Gen2 by a variety of tools. These tools usually capture and process the data on an event-by-event basis in real time, and then write the events in batches into Data Lake Storage Gen2 so that they can be further processed.
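The capture-then-batch pattern described above can be sketched as follows. This is an illustration only, not a specific streaming service's API; the `sink` here is any writable object standing in for a file in Data Lake Storage Gen2:

```python
import io
import json
import time

class BatchingWriter:
    """Buffers individual events and flushes them to a sink in batches,
    mimicking how streaming ingestion tools land event data in a store."""

    def __init__(self, sink, max_events=1000, max_age_s=60.0):
        self.sink = sink              # any object with a write(str) method
        self.max_events = max_events
        self.max_age_s = max_age_s
        self.buffer = []
        self.first_event_time = None

    def capture(self, event):
        """Captures one event; flushes when the batch is full or too old."""
        if self.first_event_time is None:
            self.first_event_time = time.monotonic()
        self.buffer.append(event)
        too_full = len(self.buffer) >= self.max_events
        too_old = time.monotonic() - self.first_event_time >= self.max_age_s
        if too_full or too_old:
            self.flush()

    def flush(self):
        """Writes the buffered events as one batch, one JSON line each."""
        if not self.buffer:
            return
        self.sink.write("\n".join(json.dumps(e) for e in self.buffer) + "\n")
        self.buffer = []
        self.first_event_time = None

# Demo: a StringIO stands in for a file in Data Lake Storage Gen2.
sink = io.StringIO()
writer = BatchingWriter(sink, max_events=2)
writer.capture({"sensor": "a", "value": 1})
writer.capture({"sensor": "a", "value": 2})   # second event fills the batch
```

Batching by both count and age keeps writes large enough to be efficient while bounding how stale an event can get before it lands in storage.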

Relational data

You can also source data from relational databases. Over time, relational databases collect huge amounts of data that can provide key insights if processed through a big data pipeline. You can use the following tools to move such data into Data Lake Storage Gen2.

Here's a list of tools that you can use to ingest relational data.

  • Azure Data Factory: Copy Activity in Azure Data Factory
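As an illustration of staging relational data for ingestion, the sketch below exports a table to a CSV file of the kind that a copy tool can then move into Data Lake Storage Gen2. An in-memory sqlite3 database stands in for the real source, and the table and column names are made up:

```python
import csv
import os
import sqlite3
import tempfile

def export_table_to_csv(conn, table, out_path):
    """Dumps one table to a CSV file with a header row; returns row count."""
    cur = conn.execute(f"SELECT * FROM {table}")
    headers = [d[0] for d in cur.description]
    rows = cur.fetchall()
    with open(out_path, "w", newline="") as f:
        w = csv.writer(f)
        w.writerow(headers)
        w.writerows(rows)
    return len(rows)

# Demo with a stand-in source database and a temporary output file.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 9.5), (2, 20.0)])
out_path = os.path.join(tempfile.mkdtemp(), "orders.csv")
n = export_table_to_csv(conn, "orders", out_path)
```

In practice, the export would typically be scheduled and incremental (for example, filtered by a modified-date column) rather than a full table dump.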

Web server log data (upload using custom applications)

This type of dataset is specifically called out because analysis of web server log data is a common use case for big data applications, and it requires large volumes of log files to be uploaded to Data Lake Storage Gen2. You can use any of the following tools to write your own scripts or applications to upload such data.

Here's a list of tools that you can use to ingest web server log data.

  • Azure Data Factory: Copy Activity in Azure Data Factory
  • Azure PowerShell: Azure PowerShell

For uploading web server log data, and also for uploading other kinds of data (for example, social sentiment data), writing your own custom scripts or applications is a good approach because it gives you the flexibility to include the data-uploading component as part of your larger big data application. In some cases, this code may take the form of a script or a simple command-line utility. In other cases, the code may be used to integrate big data processing into a business application or solution.
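A custom upload script of the kind described above often amounts to walking a log directory and handing each file to an uploader. The sketch below keeps the uploader abstract so it can wrap whichever tool you choose; the file names and the stand-in uploader are illustrative only:

```python
import tempfile
from pathlib import Path

def upload_logs(log_dir, upload, pattern="*.log"):
    """Finds web server log files under log_dir and passes each one to the
    supplied upload callable; returns the number of files handled."""
    count = 0
    for path in sorted(Path(log_dir).rglob(pattern)):
        upload(path)   # e.g. shell out to AzCopy, or call a storage SDK
        count += 1
    return count

# Demo with a temporary directory and an uploader that just collects paths.
tmp = tempfile.mkdtemp()
Path(tmp, "access-2024-01-01.log").write_text("GET / 200\n")
Path(tmp, "access-2024-01-02.log").write_text("GET /x 404\n")
uploaded = []
n = upload_logs(tmp, uploaded.append)
```

Injecting the uploader as a callable keeps the directory-walking logic testable without any storage account, which is useful while prototyping.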

Data associated with Azure HDInsight clusters

Most HDInsight cluster types (Hadoop, HBase, Storm) support Data Lake Storage Gen2 as a data storage repository. HDInsight clusters access data from Azure Storage Blobs (WASB). For better performance, you can copy the data from WASB into a Data Lake Storage Gen2 account associated with the cluster. You can use the following tools to copy the data.

Here's a list of tools that you can use to ingest data associated with HDInsight clusters.

  • Apache DistCp: Use DistCp to copy data between Azure Storage Blobs and Azure Data Lake Storage Gen2
  • AzCopy tool: Transfer data with AzCopy
  • Azure Data Factory: Copy data to or from Azure Data Lake Storage Gen2 by using Azure Data Factory

Data stored in on-premises or IaaS Hadoop clusters

Large amounts of data may be stored in existing Hadoop clusters, locally on machines using HDFS. The Hadoop clusters may be in an on-premises deployment or within an IaaS cluster on Azure. There may be requirements to copy such data to Azure Data Lake Storage Gen2, either as a one-off operation or on a recurring basis. There are various options that you can use to achieve this. The following is a list of alternatives and the associated trade-offs.

  • Approach: Use Azure Data Factory (ADF) to copy data directly from Hadoop clusters to Azure Data Lake Storage Gen2.
    Details: ADF supports HDFS as a data source.
    Advantages: ADF provides out-of-the-box support for HDFS, and first-class end-to-end management and monitoring.
    Considerations: Requires Data Management Gateway to be deployed on-premises or in the IaaS cluster.

  • Approach: Use DistCp to copy data from Hadoop to Azure Storage, and then copy the data from Azure Storage to Data Lake Storage Gen2 using an appropriate mechanism.
    Details: You can copy data from Azure Storage to Data Lake Storage Gen2 using the tools listed in the preceding sections.
    Advantages: You can use open-source tools.
    Considerations: Multi-step process that involves multiple technologies.
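The DistCp approach above is typically a single `hadoop distcp` invocation per hop. As a sketch of the first hop (Hadoop to Azure Storage over the `wasbs://` scheme), the helper below assembles the command line; the HDFS path, account, and container names are placeholders:

```python
def build_distcp_command(hdfs_src, account, container, dest_path):
    """Builds a DistCp command that copies from HDFS to an Azure Storage
    Blob container over the wasbs:// scheme."""
    dest = f"wasbs://{container}@{account}.blob.core.windows.net/{dest_path}"
    return ["hadoop", "distcp", hdfs_src, dest]

# Placeholder cluster and storage names.
cmd = build_distcp_command("hdfs://namenode:8020/data/clicks",
                           "mystorageacct", "staging", "clicks")
```

Because DistCp runs as a MapReduce job, the copy is parallelized across the cluster, which is what makes it practical for large HDFS datasets.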

Really large datasets

For uploading datasets that range into several terabytes, the methods described above can sometimes be slow and costly. In such cases, you can use Azure ExpressRoute.

Azure ExpressRoute lets you create private connections between Azure data centers and infrastructure on your premises. This provides a reliable option for transferring large amounts of data. To learn more, see the Azure ExpressRoute documentation.

Process the data

Once the data is available in Data Lake Storage Gen2, you can run analysis on that data using the supported big data applications.

Analyze data in Data Lake Storage Gen2

Here's a list of tools that you can use to run data analysis jobs on data that is stored in Data Lake Storage Gen2.

  • Azure HDInsight: Use Azure Data Lake Storage Gen2 with Azure HDInsight clusters

Download the data

You might also want to download or move data from Azure Data Lake Storage Gen2 for scenarios such as:

  • Move data to other repositories to interface with your existing data processing pipelines. For example, you might want to move data from Data Lake Storage Gen2 to Azure SQL Database or to an on-premises SQL Server instance.

  • Download data to your local computer for processing in IDE environments while building application prototypes.

Egress data from Data Lake Storage Gen2

Here's a list of tools that you can use to download data from Data Lake Storage Gen2.

  • Azure Data Factory: Copy Activity in Azure Data Factory
  • Apache DistCp: Use DistCp to copy data between Azure Storage Blobs and Azure Data Lake Storage Gen2
  • Azure Storage Explorer: Use Azure Storage Explorer to manage directories, files, and ACLs in Azure Data Lake Storage Gen2
  • AzCopy tool: Transfer data with AzCopy and Blob storage