Migrate Azure HDInsight 3.6 Apache Storm to HDInsight 4.0 Apache Spark

This document describes how to migrate Apache Storm workloads on HDInsight 3.6 to HDInsight 4.0. HDInsight 4.0 doesn't support the Apache Storm cluster type, so you will need to migrate to another streaming data platform. Two suitable options are Apache Spark Streaming and Spark Structured Streaming. This document describes the differences between these platforms and recommends a workflow for migrating Apache Storm workloads.

Storm migration paths in HDInsight

If you want to migrate from Apache Storm on HDInsight 3.6, you have multiple options:

  • Spark Streaming on HDInsight 4.0
  • Spark Structured Streaming on HDInsight 4.0
  • Azure Stream Analytics

This document provides guidance for migrating from Apache Storm to Spark Streaming and Spark Structured Streaming.

HDInsight Storm migration path

Comparison between Apache Storm, Spark Streaming, and Spark Structured Streaming

Apache Storm can provide different levels of guaranteed message processing. For example, a basic Storm application guarantees at-least-once processing, while Trident guarantees exactly-once processing. Spark Streaming and Spark Structured Streaming guarantee that any input event is processed exactly once, even if a node failure occurs. Storm has a model that processes each individual event, and you can also use the micro-batch model with Trident. Spark Streaming and Spark Structured Streaming provide a micro-batch processing model.

| | Storm | Spark Streaming | Spark Structured Streaming |
| --- | --- | --- | --- |
| Event processing guarantee | At least once; exactly once (Trident) | Exactly once | Exactly once |
| Processing model | Real-time; micro-batch (Trident) | Micro-batch | Micro-batch |
| Event time support | Yes | No | Yes |
| Languages | Java, etc. | Scala, Java, Python | Python, R, Scala, Java, SQL |

Spark Streaming vs. Spark Structured Streaming

Spark Structured Streaming is replacing Spark Streaming (DStreams). Structured Streaming will continue to receive enhancements and maintenance, while DStreams will be in maintenance mode only. Structured Streaming does not have as many features as DStreams for the sources and sinks that it supports out of the box, so evaluate your requirements to choose the appropriate Spark stream processing option.

Streaming (single-event) processing vs. micro-batch processing

Storm provides a model that processes each individual event: all incoming records are processed as soon as they arrive. Spark Streaming applications must wait a fraction of a second to collect each micro-batch of events before sending that batch on for processing. In contrast, an event-driven application processes each event immediately. Spark Streaming latency is typically under a few seconds. The benefits of the micro-batch approach are more efficient data processing and simpler aggregate calculations.
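To make the contrast concrete, here is a minimal plain-Python sketch (not the Storm or Spark API) that feeds the same timestamped events through a per-event path and a micro-batch path; the events, values, and the half-second interval are invented for illustration.

```python
from collections import defaultdict

# Hypothetical (timestamp_seconds, value) events arriving on a stream.
events = [(0.1, "a"), (0.3, "b"), (0.6, "c"), (1.2, "d"), (1.4, "e")]

def process_per_event(events):
    """Storm-style: each record is handled individually, as soon as it arrives."""
    return [[value] for _, value in events]

def process_micro_batches(events, interval=0.5):
    """Spark Streaming-style: buffer records into fixed time intervals,
    then hand each batch off for processing as a unit."""
    batches = defaultdict(list)
    for timestamp, value in events:
        batches[int(timestamp // interval)].append(value)
    return [batches[key] for key in sorted(batches)]

print(process_per_event(events))      # [['a'], ['b'], ['c'], ['d'], ['e']]
print(process_micro_batches(events))  # [['a', 'b'], ['c'], ['d', 'e']]
```

The per-event path minimizes latency; the micro-batch path trades a fraction of a second of buffering for cheaper aggregation over each batch.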

streaming and micro-batch processing

Storm architecture and components

Storm topologies are composed of multiple components arranged in a directed acyclic graph (DAG). Data flows between the components in the graph. Each component consumes one or more data streams and can optionally emit one or more streams.

| Component | Description |
| --- | --- |
| Spout | Brings data into a topology. Spouts emit one or more streams into the topology. |
| Bolt | Consumes streams emitted from spouts or other bolts. Bolts can optionally emit streams into the topology. Bolts are also responsible for writing data to external services or storage, such as HDFS, Kafka, or HBase. |
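The spout-to-bolt dataflow in the table above can be sketched in plain Python (an illustration of the topology model, not the Storm Java API); the sentence source and the word-count bolts are hypothetical.

```python
def sentence_spout():
    """Spout: brings data into the topology as a stream of tuples."""
    for sentence in ["storm processes streams", "spark processes streams"]:
        yield sentence

def split_bolt(sentences):
    """Bolt: consumes the spout's stream and emits one word per tuple."""
    for sentence in sentences:
        yield from sentence.split()

def count_bolt(words):
    """Terminal bolt: aggregates; in a real topology it would write the
    counts to external storage such as HDFS, Kafka, or HBase."""
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    return counts

# Wire the components into a small DAG: spout -> split bolt -> count bolt.
result = count_bolt(split_bolt(sentence_spout()))
print(result)  # {'storm': 1, 'processes': 2, 'streams': 2, 'spark': 1}
```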

interaction of Storm components

Storm consists of the following three daemons, which keep the Storm cluster functioning.

| Daemon | Description |
| --- | --- |
| Nimbus | Similar to Hadoop JobTracker, it's responsible for distributing code around the cluster, assigning tasks to machines, and monitoring for failures. |
| Zookeeper | Used for cluster coordination. |
| Supervisor | Listens for work assigned to its machine, and starts and stops worker processes based on directives from Nimbus. Each worker process executes a subset of a topology. The user's application logic (spouts and bolts) runs here. |

nimbus, zookeeper, and supervisor daemons

Spark Streaming architecture and components

The following steps summarize how components work together in Spark Streaming (DStreams) and Spark Structured Streaming:

  • When Spark Streaming is launched, the driver launches tasks in the executors.
  • The executor receives a stream from a streaming data source.
  • When the executor receives data streams, it splits the stream into blocks and keeps them in memory.
  • Blocks of data are replicated to other executors.
  • The processed data is then stored in the target data store.

spark streaming path to output

Spark Streaming (DStream) workflow

As each batch interval elapses, a new RDD is produced that contains all the data from that interval. The continuous sets of RDDs are collected into a DStream. For example, if the batch interval is one second long, your DStream emits a batch every second containing one RDD with all the data ingested during that second. When processing the DStream, a given event, such as a temperature reading, appears in one of these batches. A Spark Streaming application processes the batches that contain the events and ultimately acts on the data stored in each RDD.
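Continuing the one-second example, here is a plain-Python sketch (not the DStreams API) of how a stream becomes one RDD-like list per interval, and how a particular reading ends up inside exactly one batch; the timestamps and sensor values are invented.

```python
# Hypothetical (timestamp_seconds, reading) pairs from a sensor stream.
readings = [(0.2, 21.0), (0.9, 21.5), (1.3, 35.2), (2.7, 22.1)]

BATCH_INTERVAL = 1.0  # seconds

def to_dstream(readings, interval):
    """Collect the stream into consecutive per-interval batches (RDD stand-ins)."""
    last = int(max(t for t, _ in readings) // interval)
    batches = [[] for _ in range(last + 1)]
    for timestamp, value in readings:
        batches[int(timestamp // interval)].append(value)
    return batches

dstream = to_dstream(readings, BATCH_INTERVAL)
print(dstream)  # [[21.0, 21.5], [35.2], [22.1]]

# The high-temperature reading (35.2) appears in exactly one batch:
# the RDD for the interval [1s, 2s).
hot_batch = next(i for i, batch in enumerate(dstream) if 35.2 in batch)
print(hot_batch)  # 1
```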

spark streaming processing batches

For details on the different transformations available with Spark Streaming, see Transformations on DStreams.

Spark Structured Streaming

Spark Structured Streaming represents a stream of data as a table that is unbounded in depth. The table continues to grow as new data arrives. This input table is continuously processed by a long-running query, and the results are sent to an output table.

In Structured Streaming, data arrives at the system and is immediately ingested into an input table. You write queries (using the DataFrame and Dataset APIs) that perform operations against this input table.

The query output yields a results table, which contains the results of your query. You can draw data from the results table for an external data store, such as a relational database.

The timing of when data is processed from the input table is controlled by the trigger interval. By default, the trigger interval is zero, so Structured Streaming tries to process the data as soon as it arrives. In practice, this means that as soon as Structured Streaming is done processing the run of the previous query, it starts another processing run against any newly received data. You can configure the trigger to run at an interval, so that the streaming data is processed in time-based batches.
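The unbounded input table and trigger model can be illustrated with a small plain-Python sketch (not the Structured Streaming API); the arriving rows and the counting query are invented for illustration.

```python
input_table = []  # the unbounded input table: it only ever grows

def run_query(table):
    """The long-running query: here, a running count grouped by key."""
    results = {}
    for key in table:
        results[key] = results.get(key, 0) + 1
    return results

# Each trigger fires, appends newly arrived rows to the input table, and
# recomputes the results table over everything received so far.
arrivals_per_trigger = [["sensor-a"], ["sensor-b", "sensor-a"], []]
results_history = []
for new_rows in arrivals_per_trigger:
    input_table.extend(new_rows)
    results_history.append(run_query(input_table))

print(results_history[-1])  # {'sensor-a': 2, 'sensor-b': 1}
```

A trigger with no new rows (the third one above) simply re-emits the same results, which mirrors why a zero trigger interval lets the engine process data as soon as it arrives.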

processing of data in structured streaming

programming model for structured streaming

General migration flow

The recommended migration flow from Storm to Spark assumes the following initial architecture:

  • Kafka is used as the streaming data source

  • Kafka and Storm are deployed on the same virtual network

  • The data processed by Storm is written to a data sink, such as Azure Storage or Azure Data Lake Storage Gen2.

    diagram of presumed current environment

To migrate your application from Storm to one of the Spark streaming APIs, do the following:

  1. Deploy a new cluster. Deploy a new HDInsight 4.0 Spark cluster in the same virtual network, deploy your Spark Streaming or Spark Structured Streaming application on it, and test it thoroughly.

    new spark deployment in HDInsight

  2. Stop consuming on the old Storm cluster. In the existing Storm application, stop consuming data from the streaming data source and wait for the data to finish writing to the target sink.

    stop consuming on current cluster

  3. Start consuming on the new Spark cluster. Start streaming data from the newly deployed HDInsight 4.0 Spark cluster. At this point, the process is taken over by consuming from the latest Kafka offset.

    start consuming on new cluster

  4. Remove the old cluster as needed. Once the switch is complete and working properly, remove the old HDInsight 3.6 Storm cluster as needed.

    remove old HDInsight clusters as needed
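The handover in steps 2 and 3 hinges on Kafka offsets: the new Spark consumer starts at the latest offset, so it only reads data produced after the Storm cluster stopped. A minimal plain-Python sketch of that offset arithmetic (not the Kafka client API; the partition log and offsets are hypothetical):

```python
# A hypothetical Kafka partition log: messages at offsets 0..4.
log = ["m0", "m1", "m2", "m3", "m4"]

def consume_from(log, start_offset):
    """Messages a consumer starting at start_offset would read."""
    return log[start_offset:]

# Step 2: the old Storm consumer stops; everything up to offset 4 is
# already written to the sink. "Latest" is the next offset to be produced.
latest_offset = len(log)

# Step 3: new records arrive after the Spark consumer starts at "latest",
# so it reads only data the Storm cluster never saw -- no duplicates.
log.extend(["m5", "m6"])
print(consume_from(log, latest_offset))  # ['m5', 'm6']
```

Because step 2 drains in-flight data to the sink before step 3 starts, picking up at the latest offset avoids both gaps and double processing across the switch.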

Next steps

For more information about Storm, Spark Streaming, and Spark Structured Streaming, see the following documents: