Apache Spark streaming (DStream) example with Apache Kafka on HDInsight

Learn how to use Apache Spark to stream data into or out of Apache Kafka on HDInsight using DStreams. This example uses a Jupyter Notebook that runs on the Spark cluster.


The steps in this document create an Azure resource group that contains both a Spark on HDInsight and a Kafka on HDInsight cluster. These clusters are both located within an Azure Virtual Network, which allows the Spark cluster to communicate directly with the Kafka cluster.

When you are done with the steps in this document, remember to delete the clusters to avoid excess charges.


This example uses DStreams, which is an older Spark streaming technology. For an example that uses newer Spark streaming features, see the Spark Structured Streaming with Apache Kafka document.

Create the clusters

Apache Kafka on HDInsight does not provide access to the Kafka brokers over the public internet. Anything that talks to Kafka must be in the same Azure virtual network as the nodes in the Kafka cluster. For this example, both the Kafka and Spark clusters are located in an Azure virtual network. The following diagram shows how communication flows between the clusters:

Diagram of Spark and Kafka clusters in an Azure virtual network


Though Kafka itself is limited to communication within the virtual network, other services on the cluster, such as SSH and Ambari, can be accessed over the internet. For more information on the public ports available with HDInsight, see Ports and URIs used by HDInsight.

While you can create an Azure virtual network, Kafka, and Spark clusters manually, it's easier to use an Azure Resource Manager template. Use the following steps to deploy an Azure virtual network, Kafka, and Spark clusters to your Azure subscription.

  1. Use the following button to sign in to Azure and open the template in the Azure portal.

    Deploy to Azure

    The Azure Resource Manager template is located at https://hditutorialdata.blob.core.chinacloudapi.cn/armtemplates/create-linux-based-kafka-spark-cluster-in-vnet-v4.1.json.


    To guarantee availability of Kafka on HDInsight, your cluster must contain at least three worker nodes. This template creates a Kafka cluster that contains three worker nodes.

    This template creates an HDInsight 3.6 cluster for both Kafka and Spark.

  2. Use the following information to populate the entries on the Custom deployment section:

    Property                   Value
    Resource group             Create a group or select an existing one.
    Location                   Select a location geographically close to you.
    Base Cluster Name          This value is used as the base name for the Spark and Kafka clusters. For example, entering hdistreaming creates a Spark cluster named spark-hdistreaming and a Kafka cluster named kafka-hdistreaming.
    Cluster Login User Name    The admin user name for the Spark and Kafka clusters.
    Cluster Login Password     The admin user password for the Spark and Kafka clusters.
    SSH User Name              The SSH user to create for the Spark and Kafka clusters.
    SSH Password               The password for the SSH user for the Spark and Kafka clusters.

    HDInsight custom deployment parameters

  3. Read the Terms and Conditions, and then select I agree to the terms and conditions stated above.

  4. Finally, select Purchase. It takes about 20 minutes to create the clusters.

Once the resources have been created, a summary page appears.

Resource group summary for the VNet and clusters


Notice that the names of the HDInsight clusters are spark-BASENAME and kafka-BASENAME, where BASENAME is the name you provided to the template. You use these names in later steps when connecting to the clusters.
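This naming convention can be expressed as a small helper. The function below is purely illustrative and is not part of the template itself:

```python
def cluster_names(base_name):
    """Derive the HDInsight cluster names the template creates from the base name."""
    return {
        "spark": "spark-" + base_name,
        "kafka": "kafka-" + base_name,
    }

# For the base name used as the example earlier in this document:
print(cluster_names("hdistreaming"))
# → {'spark': 'spark-hdistreaming', 'kafka': 'kafka-hdistreaming'}
```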

Use the notebooks

The code for the example described in this document is available at https://github.com/Azure-Samples/hdinsight-spark-scala-kafka.
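As a rough orientation before opening the notebook, the DStream read path looks like the following. This is a hedged PySpark sketch, not the repository's Scala code: the topic name and broker host names are placeholders (look up your real broker FQDNs in Ambari), and it assumes a Spark 2.x cluster where the spark-streaming-kafka-0-8 integration is available.

```python
def kafka_params(brokers):
    """Build the Kafka parameter map for a direct DStream from a broker host list."""
    return {"metadata.broker.list": ",".join(brokers)}

if __name__ == "__main__":
    # These imports require a Spark installation; they are kept inside the
    # main guard so the helper above can be used on its own.
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils

    sc = SparkContext(appName="KafkaDStreamExample")
    ssc = StreamingContext(sc, batchDuration=10)  # 10-second micro-batches

    # Placeholder broker addresses -- replace with your cluster's values.
    brokers = ["wn0-kafka.example.net:9092", "wn1-kafka.example.net:9092"]
    # "testtopic" is a hypothetical topic name for this sketch.
    stream = KafkaUtils.createDirectStream(ssc, ["testtopic"], kafka_params(brokers))

    # Each record arrives as a (key, value) pair; print sample values per batch.
    stream.map(lambda kv: kv[1]).pprint()

    ssc.start()
    ssc.awaitTermination()
```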

Delete the cluster


Billing for HDInsight clusters is prorated per minute, whether you use them or not. Be sure to delete your cluster after you finish using it. See how to delete an HDInsight cluster.

Since the steps in this document create both clusters in the same Azure resource group, you can delete the resource group in the Azure portal. Deleting the group removes all resources created by following this document, including the Azure Virtual Network and the storage account used by the clusters.

Next steps

In this example, you learned how to use Spark to read from and write to Kafka. Use the following links to discover other ways to work with Kafka: