What is Apache Spark in Azure HDInsight

Apache Spark is a parallel processing framework that supports in-memory processing to boost the performance of big-data analytic applications. Apache Spark in Azure HDInsight is the Microsoft implementation of Apache Spark in the cloud. HDInsight makes it easier to create and configure a Spark cluster in Azure. Spark clusters in HDInsight are compatible with Azure Blob storage and Azure Data Lake Storage, so you can use HDInsight Spark clusters to process your data stored in Azure. For the components and the versioning information, see Apache Hadoop components and versions in Azure HDInsight.


What is Apache Spark?

Spark provides primitives for in-memory cluster computing. A Spark job can load and cache data into memory and query it repeatedly. In-memory computing is much faster than disk-based applications, such as Hadoop, which shares data through the Hadoop Distributed File System (HDFS). Spark also integrates into the Scala programming language to let you manipulate distributed datasets like local collections. There's no need to structure everything as map and reduce operations.
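The performance difference comes down to where intermediate data lives between queries. The following plain-Python sketch (a conceptual illustration only, not the Spark API; the data and queries are invented) contrasts re-reading the source for every query with loading it once and querying the in-memory copy repeatedly, which is the pattern that caching a Spark dataset enables:

```python
# Conceptual illustration only -- plain Python, not the Spark API.
# A disk-based engine re-reads the source for every query; an in-memory
# engine loads once, caches, and runs each query against the cached copy.

def read_source():
    """Stand-in for an expensive read from HDFS or Blob storage."""
    return [{"user": "a", "clicks": 3},
            {"user": "b", "clicks": 7},
            {"user": "a", "clicks": 2}]

# Disk-based style: every query pays the load cost again.
total_clicks = sum(row["clicks"] for row in read_source())
distinct_users = len({row["user"] for row in read_source()})

# In-memory style (the effect of caching in Spark): load once, query many times.
cached = read_source()
total_clicks_cached = sum(row["clicks"] for row in cached)
distinct_users_cached = len({row["user"] for row in cached})

print(total_clicks, distinct_users)  # 12 2
```

Both styles compute the same answers; the in-memory style simply avoids repeating the load for every query.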

Traditional MapReduce vs. Spark

Spark clusters in HDInsight offer a fully managed Spark service. Benefits of creating a Spark cluster in HDInsight are listed here.

Ease of creation: You can create a new Spark cluster in HDInsight in minutes using the Azure portal, Azure PowerShell, or the HDInsight .NET SDK. See Get started with Apache Spark cluster in HDInsight.
Ease of use: Spark clusters in HDInsight include Jupyter Notebooks and Apache Zeppelin Notebooks. You can use these notebooks for interactive data processing and visualization. See Use Apache Zeppelin notebooks with Apache Spark and Load data and run queries on an Apache Spark cluster.
REST APIs: Spark clusters in HDInsight include Apache Livy, a REST-API-based Spark job server for remotely submitting and monitoring jobs. See Use Apache Spark REST API to submit remote jobs to an HDInsight Spark cluster.
Support for Azure Storage: Spark clusters in HDInsight can use Azure Data Lake Storage Gen2 as either primary storage or additional storage. For more information on Data Lake Storage Gen2, see Azure Data Lake Storage Gen2.
Integration with Azure services: Spark clusters in HDInsight come with a connector to Azure Event Hubs. You can build streaming applications using Event Hubs, in addition to Apache Kafka, which is already available as part of Spark.
Integration with third-party IDEs: HDInsight provides several IDE plugins that are useful for creating applications and submitting them to an HDInsight Spark cluster. For more information, see Use Azure Toolkit for IntelliJ IDEA, Use Spark & Hive Tools for VSCode, and Use Azure Toolkit for Eclipse.
Concurrent queries: Spark clusters in HDInsight support concurrent queries. This capability enables multiple queries from one user, or multiple queries from various users and applications, to share the same cluster resources.
Caching on SSDs: You can choose to cache data either in memory or in SSDs attached to the cluster nodes. Caching in memory provides the best query performance but can be expensive. Caching in SSDs is a great option for improving query performance without the need to create a cluster sized to fit the entire dataset in memory. See Improve performance of Apache Spark workloads using Azure HDInsight IO Cache.
Integration with BI tools: Spark clusters in HDInsight provide connectors for BI tools such as Power BI for data analytics.
Pre-loaded Anaconda libraries: Spark clusters in HDInsight come with Anaconda libraries pre-installed. Anaconda provides close to 200 libraries for machine learning, data analysis, visualization, and more.
Adaptability: HDInsight allows you to change the number of cluster nodes dynamically with the Autoscale feature. See Automatically scale Azure HDInsight clusters. Also, Spark clusters can be dropped with no loss of data, since all the data is stored in Azure Blob storage or Azure Data Lake Storage Gen2.
SLA: Spark clusters in HDInsight come with 24/7 support and an SLA of 99.9% uptime.
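The Livy entry above has a concrete wire format: jobs are submitted by POSTing JSON to the cluster's Livy batches endpoint. The sketch below only constructs such a request without sending it; the cluster name, JAR path, and main class are placeholder assumptions, and real requests also need authentication headers:

```python
import json
from urllib import request

# Placeholder values -- substitute your own cluster name, JAR, and class.
livy_url = "https://mysparkcluster.azurehdinsight.net/livy/batches"

# Livy batch-submission payload: the application JAR, its main class, and args.
payload = {
    "file": "wasbs:///example/jars/my-spark-app.jar",  # hypothetical JAR path
    "className": "com.example.MySparkApp",             # hypothetical main class
    "args": ["10"],
}

req = request.Request(
    livy_url,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
# req is ready to send with request.urlopen(req) once auth is added.
print(req.get_method(), req.full_url)
```

Livy responds with a batch ID that you can poll to monitor the job's progress.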

Apache Spark clusters in HDInsight include the following components that are available on the clusters by default.

HDInsight Spark clusters include an ODBC driver for connectivity from BI tools such as Microsoft Power BI.

Spark cluster architecture

The architecture of Spark in HDInsight

The components of Spark are easy to understand once you know how Spark runs on HDInsight clusters.

Spark applications run as independent sets of processes on a cluster, coordinated by the SparkContext object in your main program (called the driver program).

The SparkContext can connect to several types of cluster managers, which allocate resources across applications. These cluster managers include Apache Mesos, Apache Hadoop YARN, and the Spark standalone cluster manager. In HDInsight, Spark runs using the YARN cluster manager. Once connected, Spark acquires executors on worker nodes in the cluster, which are processes that run computations and store data for your application. Next, it sends your application code (defined by JAR or Python files passed to SparkContext) to the executors. Finally, the SparkContext sends tasks to the executors to run.

The SparkContext runs the user's main function and executes the various parallel operations on the worker nodes. Then, the SparkContext collects the results of the operations. The worker nodes read data from and write data to the Hadoop Distributed File System (HDFS). The worker nodes also cache transformed data in memory as Resilient Distributed Datasets (RDDs).

The SparkContext connects to the Spark master and is responsible for converting an application to a directed acyclic graph (DAG) of individual tasks. The tasks are executed within executor processes on the worker nodes. Each application gets its own executor processes, which stay up for the duration of the whole application and run tasks in multiple threads.
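The driver/executor split described above can be mimicked with nothing but the standard library. In this loose analogy (not Spark's actual scheduler), a coordinating "driver" partitions the work into tasks, a pool of "executor" threads runs them in parallel, and the driver collects and combines the results:

```python
from concurrent.futures import ThreadPoolExecutor

# Loose analogy to the architecture above, not Spark's scheduler:
# the "driver" splits the data into partitions and submits one task per
# partition; "executor" threads run the tasks; the driver combines results.

def task(partition):
    """A unit of work sent to an executor: sum one data partition."""
    return sum(partition)

data = list(range(100))
partitions = [data[i:i + 25] for i in range(0, 100, 25)]  # 4 tasks

with ThreadPoolExecutor(max_workers=4) as pool:
    partial_sums = list(pool.map(task, partitions))

result = sum(partial_sums)  # the driver combines the partial results
print(result)               # 4950
```

The shape is the same as a distributed job: independent tasks over partitions, executed in parallel, with results gathered back at the coordinator.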

Spark in HDInsight use cases

Spark clusters in HDInsight enable the following key scenarios:

Interactive data analysis and BI

Apache Spark in HDInsight stores data in Azure Blob storage or Azure Data Lake Storage Gen2. Business experts and key decision makers can analyze and build reports over that data, and use Microsoft Power BI to build interactive reports from the analyzed data. Analysts can start from unstructured or semi-structured data in cluster storage, define a schema for the data using notebooks, and then build data models using Microsoft Power BI. Spark clusters in HDInsight also support a number of third-party BI tools, such as Tableau, making it easier for data analysts, business experts, and key decision makers.

Spark Machine Learning

Apache Spark comes with MLlib, a machine learning library built on top of Spark that you can use from a Spark cluster in HDInsight. Spark clusters in HDInsight also include Anaconda, a Python distribution with a variety of packages for machine learning. And with built-in support for Jupyter and Zeppelin notebooks, you have an environment for creating machine learning applications.

Spark streaming and real-time data analysis

Spark clusters in HDInsight offer rich support for building real-time analytics solutions. Spark already has connectors to ingest data from many sources such as Kafka, Flume, Twitter, ZeroMQ, or TCP sockets. Spark in HDInsight adds first-class support for ingesting data from Azure Event Hubs, the most widely used queuing service on Azure. Having complete support for Event Hubs makes Spark clusters in HDInsight an ideal platform for building real-time analytics pipelines.
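Spark's classic streaming model processes a live feed as a series of small batches. This stdlib sketch only illustrates that micro-batch idea; the event source and the fixed batch size are invented stand-ins (Spark Streaming batches by time interval, and a real feed would come from Event Hubs or Kafka):

```python
from itertools import islice

def event_source():
    """Stand-in for a live feed such as Event Hubs or Kafka."""
    yield from [("sensor-a", 1), ("sensor-b", 4), ("sensor-a", 2),
                ("sensor-b", 1), ("sensor-a", 5), ("sensor-b", 3)]

def micro_batches(events, batch_size):
    """Group the stream into fixed-size micro-batches; Spark Streaming
    groups by time interval instead, but the shape is the same."""
    it = iter(events)
    while batch := list(islice(it, batch_size)):
        yield batch

# Aggregate each micro-batch as it "arrives".
batch_totals = [sum(value for _, value in batch)
                for batch in micro_batches(event_source(), 2)]
print(batch_totals)  # [5, 3, 8]
```

Each batch is aggregated independently as it arrives, which is what lets a streaming job emit continuously updated results over an unbounded feed.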

Next steps

In this overview, you've gained a basic understanding of Apache Spark in Azure HDInsight. Use the following articles to learn more about Apache Spark in HDInsight, create an HDInsight Spark cluster, and run some sample Spark queries: