What is Apache Kafka in Azure HDInsight

Apache Kafka is an open-source distributed streaming platform for building real-time streaming data pipelines and applications. Kafka also provides message broker functionality similar to a message queue: you can publish and subscribe to named data streams.

Kafka on HDInsight has the following specific characteristics:

  • It is a managed service with a simplified configuration process. The result is a configuration that is tested and supported by Microsoft.

  • Microsoft provides a 99.9% Service Level Agreement (SLA) on Kafka uptime. For more information, see the SLA information for HDInsight document.

  • It uses Azure Managed Disks as the backing store for Kafka. Managed Disks can provide up to 16 TB of storage per Kafka broker. For information on configuring managed disks with Kafka on HDInsight, see Increase scalability of Apache Kafka on HDInsight.

    For more information on managed disks, see Azure Managed Disks.

  • Kafka was designed with a single-dimensional view of a rack. Azure separates a rack into two dimensions: Update Domains (UD) and Fault Domains (FD). Microsoft provides tools that rebalance Kafka partitions and replicas across UDs and FDs.

    For more information, see High availability with Apache Kafka on HDInsight.

  • HDInsight allows you to change the number of worker nodes (which host the Kafka brokers) after cluster creation. Scaling can be performed from the Azure portal, Azure PowerShell, and other Azure management interfaces. For Kafka, you should rebalance partition replicas after scaling operations, so that Kafka can take advantage of the new number of worker nodes.

    For more information, see High availability with Apache Kafka on HDInsight.
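
The idea of spreading replicas across fault domains can be illustrated with a minimal sketch. This is not the Microsoft rebalancing tool; the function name `assign_replicas`, the broker IDs, and the fault-domain labels are hypothetical. It interleaves brokers from different fault domains so that, when the domains are equally sized and the replication factor does not exceed the number of domains, each partition's replicas land in distinct domains.

```python
def assign_replicas(broker_domains, num_partitions, replication_factor):
    """Return {partition: [broker_id, ...]} spreading replicas across fault domains.

    broker_domains maps a (hypothetical) broker ID to its fault-domain label.
    """
    # Group brokers by fault domain.
    by_domain = {}
    for broker_id, domain in sorted(broker_domains.items()):
        by_domain.setdefault(domain, []).append(broker_id)
    groups = [by_domain[d] for d in sorted(by_domain)]

    # Interleave the groups so neighbouring brokers sit in different domains.
    interleaved = []
    for i in range(max(len(g) for g in groups)):
        for group in groups:
            if i < len(group):
                interleaved.append(group[i])

    # Each partition takes a sliding window of consecutive interleaved brokers.
    n = len(interleaved)
    return {
        p: [interleaved[(p + r) % n] for r in range(replication_factor)]
        for p in range(num_partitions)
    }


# Six hypothetical brokers, two per fault domain.
broker_domains = {0: "FD0", 1: "FD0", 2: "FD1", 3: "FD1", 4: "FD2", 5: "FD2"}
plan = assign_replicas(broker_domains, num_partitions=6, replication_factor=3)
for partition, replicas in plan.items():
    print(partition, replicas, [broker_domains[b] for b in replicas])
```

After a scaling operation, recomputing such a plan with the new broker list is what lets the added nodes receive replicas. With unevenly sized domains this simple scheme no longer guarantees distinct domains per partition.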

Apache Kafka on HDInsight architecture

The following diagram shows a typical Kafka configuration that uses consumer groups, partitioning, and replication to offer parallel reading of events with fault tolerance:

Kafka cluster configuration diagram

Apache ZooKeeper manages the state of the Kafka cluster. ZooKeeper is built for concurrent, resilient, and low-latency transactions.

Kafka stores records (data) in topics. Records are produced by producers and consumed by consumers. Producers send records to Kafka brokers. Each worker node in your HDInsight cluster is a Kafka broker.

Topics partition records across brokers. When consuming records, you can use up to one consumer per partition within a consumer group to process the data in parallel; any additional consumers in the group sit idle.
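
As a rough illustration of the one-consumer-per-partition limit, the sketch below distributes partitions over the members of a consumer group in round-robin fashion. It is a simplification and not Kafka's actual assignor; `assign_partitions` and the consumer names are hypothetical.

```python
def assign_partitions(partitions, consumers):
    """Round-robin the partitions over the consumers of one (hypothetical) group."""
    assignment = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(partitions):
        # Every partition goes to exactly one consumer in the group.
        assignment[consumers[index % len(consumers)]].append(partition)
    return assignment


partitions = [0, 1, 2, 3]

# Fewer consumers than partitions: each consumer reads several partitions.
small_group = assign_partitions(partitions, ["c0", "c1"])

# More consumers than partitions: the extra consumers receive nothing.
large_group = assign_partitions(partitions, ["c0", "c1", "c2", "c3", "c4", "c5"])

print(small_group)
print(large_group)
```

In this sketch, parallelism tops out at the number of partitions, which is why the partition count is usually chosen with the expected number of consumers in mind.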

Replication duplicates partitions across nodes, protecting against node (broker) outages. A partition denoted with an (L) in the diagram is the leader for the given partition. Producer traffic is routed to the leader of each partition, using the state managed by ZooKeeper.
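
How replication protects against a broker outage can be sketched with a toy model. Here the leader of each partition is simply the first replica in its list that is still alive. Real Kafka chooses leaders from the in-sync replica set through its controller, so treat this as a simplification; `elect_leaders` and the broker IDs are hypothetical.

```python
def elect_leaders(replica_assignment, live_brokers):
    """Pick, for each partition, the first replica hosted on a live broker.

    Returns None for a partition whose replicas are all offline.
    """
    return {
        partition: next((b for b in replicas if b in live_brokers), None)
        for partition, replicas in replica_assignment.items()
    }


# Three partitions, each replicated on three hypothetical brokers.
replica_assignment = {0: [0, 1, 2], 1: [1, 2, 0], 2: [2, 0, 1]}

before = elect_leaders(replica_assignment, live_brokers={0, 1, 2})
after = elect_leaders(replica_assignment, live_brokers={0, 2})  # broker 1 is down

print("leaders before outage:", before)
print("leaders after outage: ", after)
```

In the toy model, only the partition that was led by the failed broker changes leader; producers writing to it would be redirected to the new leader, while the other partitions are unaffected.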

Why use Apache Kafka on HDInsight?

The following are common tasks and patterns that can be performed using Kafka on HDInsight:

| Use | Description |
| --- | --- |
| Replication of Apache Kafka data | Kafka provides the MirrorMaker utility, which replicates data between Kafka clusters. For information on using MirrorMaker, see Replicate Apache Kafka topics with Apache Kafka on HDInsight. |
| Publish-subscribe messaging pattern | Kafka provides a Producer API for publishing records to a Kafka topic. The Consumer API is used when subscribing to a topic. For more information, see Start with Apache Kafka on HDInsight. |
| Stream processing | Kafka is often used with Apache Storm or Spark for real-time stream processing. Kafka 0.10.0.0 (HDInsight version 3.5 and 3.6) introduced a streaming API that allows you to build streaming solutions without requiring Storm or Spark. For more information, see Start with Apache Kafka on HDInsight. |
| Horizontal scale | Kafka partitions streams across the nodes in the HDInsight cluster. Consumer processes can be associated with individual partitions to provide load balancing when consuming records. For more information, see Start with Apache Kafka on HDInsight. |
| In-order delivery | Within each partition, records are stored in the stream in the order that they were received. By associating one consumer process per partition, you can guarantee that records are processed in order. For more information, see Start with Apache Kafka on HDInsight. |
| Messaging | Since it supports the publish-subscribe message pattern, Kafka is often used as a message broker. |
| Activity tracking | Since Kafka provides in-order logging of records, it can be used to track and re-create activities, such as user actions on a website or within an application. |
| Aggregation | Using stream processing, you can aggregate information from different streams to combine and centralize the information into operational data. |
| Transformation | Using stream processing, you can combine and enrich data from multiple input topics into one or more output topics. |
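
The in-order delivery and activity-tracking patterns above can be sketched without a running cluster. The example below models a topic as a list of in-memory partition logs and routes each record by a hash of its key, so all records for one key land in the same partition and keep their order. The CRC32 hash, the partition count, and the names `partition_for`, `publish`, and `replay` are assumptions made for illustration; Kafka's default partitioner uses a different hash.

```python
import zlib
from collections import Counter

NUM_PARTITIONS = 3  # hypothetical partition count for the topic


def partition_for(key, num_partitions):
    """Map a record key to a partition (CRC32 stands in for Kafka's own hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def publish(logs, key, value):
    """Append a record to the partition log selected by its key."""
    partition = partition_for(key, len(logs))
    logs[partition].append((key, value))
    return partition


def replay(logs, key):
    """Re-create the activity for one key by reading its partition in order."""
    partition = partition_for(key, len(logs))
    return [value for k, value in logs[partition] if k == key]


logs = [[] for _ in range(NUM_PARTITIONS)]
events = [
    ("alice", "login"),
    ("bob", "login"),
    ("alice", "view"),
    ("alice", "logout"),
    ("bob", "view"),
]
for user, action in events:
    publish(logs, user, action)

# Simple aggregation across all partitions: number of events per user.
events_per_user = Counter(key for log in logs for key, _ in log)

print(replay(logs, "alice"))
print(dict(events_per_user))
```

Ordering holds only within a partition, so records that must be processed in sequence need to share a key in this scheme.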

Next steps

Use the following links to learn how to use Apache Kafka on HDInsight: