Tutorial: Create a Scala Maven application for Apache Spark in HDInsight using IntelliJ

In this tutorial, you learn how to create an Apache Spark application written in Scala using Apache Maven with IntelliJ IDEA. The article uses Apache Maven as the build system and starts with an existing Maven archetype for Scala provided by IntelliJ IDEA. Creating a Scala application in IntelliJ IDEA involves the following steps:

  • Use Maven as the build system.
  • Update the Project Object Model (POM) file to resolve Spark module dependencies.
  • Write your application in Scala.
  • Generate a jar file that can be submitted to HDInsight Spark clusters.
  • Run the application on a Spark cluster using Livy.

Note

HDInsight also provides an IntelliJ IDEA plugin tool to ease the process of creating and submitting applications to an HDInsight Spark cluster on Linux. For more information, see Use HDInsight Tools Plugin for IntelliJ IDEA to create and submit Apache Spark applications.

In this tutorial, you learn how to:

  • Use IntelliJ to develop a Scala Maven application

If you don't have an Azure subscription, create a trial account before you begin.

Prerequisites

Install the Scala plugin for IntelliJ IDEA

Perform the following steps to install the Scala plugin:

  1. Open IntelliJ IDEA.

  2. On the welcome screen, navigate to Configure > Plugins to open the Plugins window.

    Enable the Scala plugin

  3. Select Install for the Scala plugin that is featured in the new window.

    Install the Scala plugin

  4. After the plugin installs successfully, you must restart the IDE.

Use IntelliJ to create the application

  1. Start IntelliJ IDEA, and select Create New Project to open the New Project window.

  2. Select Azure Spark/HDInsight from the left pane.

  3. Select Spark Project (Scala) from the main window.

  4. From the Build tool drop-down list, select one of the following:

    • Maven for Scala project-creation wizard support.
    • SBT for managing the dependencies and building for the Scala project.

    The New Project dialog box

  5. Select Next.

  6. In the New Project window, provide the following information:

    Property          Description
    Project name      Enter a name.
    Project location  Enter the desired location to save your project.
    Project SDK       This field is blank on your first use of IDEA. Select New... and navigate to your JDK.
    Spark Version     The creation wizard integrates the proper version for the Spark SDK and Scala SDK. If the Spark cluster version is earlier than 2.0, select Spark 1.x. Otherwise, select Spark 2.x. This example uses Spark 2.3.0 (Scala 2.11.8).

    Select the Spark SDK

  7. Select Finish.

Create a standalone Scala project

  1. Start IntelliJ IDEA, and select Create New Project to open the New Project window.

  2. Select Maven from the left pane.

  3. Specify a Project SDK. If blank, select New... and navigate to the Java installation directory.

  4. Select the Create from archetype checkbox.

  5. From the list of archetypes, select org.scala-tools.archetypes:scala-archetype-simple. This archetype creates the right directory structure and downloads the required default dependencies to write a Scala program.

    Create a Maven project

  6. Select Next.

  7. Provide relevant values for GroupId, ArtifactId, and Version. The following values are used in this tutorial (an equivalent command-line invocation is sketched after this list):

    • GroupId: com.microsoft.spark.example
    • ArtifactId: SparkSimpleApp
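
    The same project can be generated without the IDE. A minimal command-line sketch using Maven's archetype:generate goal (the Version value 1.0-SNAPSHOT is an assumption; use whatever you entered in the wizard):

      mvn archetype:generate \
        -DarchetypeGroupId=org.scala-tools.archetypes \
        -DarchetypeArtifactId=scala-archetype-simple \
        -DgroupId=com.microsoft.spark.example \
        -DartifactId=SparkSimpleApp \
        -Dversion=1.0-SNAPSHOT \
        -DinteractiveMode=false
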
  8. Select Next.

  9. Verify the settings, and then select Next.

  10. Verify the project name and location, and then select Finish. The project will take a few minutes to import.

  11. Once the project has imported, from the left pane navigate to SparkSimpleApp > src > test > scala > com > microsoft > spark > example. Right-click MySpec, and then select Delete.... You do not need this file for the application. Select OK in the dialog box.

  12. In the subsequent steps, you update pom.xml to define the dependencies for the Spark Scala application. For those dependencies to be downloaded and resolved automatically, you must configure Maven accordingly.

  13. From the File menu, select Settings to open the Settings window.

  14. In the Settings window, navigate to Build, Execution, Deployment > Build Tools > Maven > Importing.

  15. Select the Import Maven projects automatically checkbox.

  16. Select Apply, and then select OK. You will then be returned to the project window.

    Configure Maven for automatic downloads

  17. From the left pane, navigate to src > main > scala > com.microsoft.spark.example, and then double-click App to open App.scala.

  18. Replace the existing sample code with the following code, and then save the changes. This code reads the data from HVAC.csv (available on all HDInsight Spark clusters), retrieves the rows that have only one digit in the seventh column, and writes the output to /HVACout under the default storage container for the cluster.

    package com.microsoft.spark.example
    
    import org.apache.spark.SparkConf
    import org.apache.spark.SparkContext
    
    /**
      * Test IO to wasb
      */
    object WasbIOTest {
      def main (arg: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("WASBIOTest")
        val sc = new SparkContext(conf)
    
        // Read the sample CSV from the cluster's default storage account (wasb://)
        val rdd = sc.textFile("wasb:///HdiSamples/HdiSamples/SensorSampleData/hvac/HVAC.csv")
    
        // Find the rows which have only one digit in the seventh column of the CSV
        // (split index 6 is zero-based)
        val rdd1 = rdd.filter(s => s.split(",")(6).length() == 1)
    
        // Write the filtered rows to /HVACout in the default storage container
        rdd1.saveAsTextFile("wasb:///HVACout")
      }
    }
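
    After the application runs on the cluster, saveAsTextFile leaves a directory of part files. A quick sanity check — a sketch, run from a spark-shell session on the cluster, where sc already exists — is:

      // Print the first few rows the job wrote to the output directory
      sc.textFile("wasb:///HVACout").take(10).foreach(println)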
    
  19. In the left pane, double-click pom.xml.

  20. <project>\<properties> 中添加以下段:Within <project>\<properties> add the following segments:

      <scala.version>2.11.8</scala.version>
      <scala.compat.version>2.11.8</scala.compat.version>
      <scala.binary.version>2.11</scala.binary.version>
    
  21. <project>\<dependencies> 中添加以下段:Within <project>\<dependencies> add the following segments:

       <dependency>
         <groupId>org.apache.spark</groupId>
         <artifactId>spark-core_${scala.binary.version}</artifactId>
         <version>2.3.0</version>
       </dependency>
    

    Save the changes to pom.xml. With automatic import enabled, Maven resolves spark-core_2.11 (the ${scala.binary.version} property expands to 2.11) as soon as the file is saved.
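
    The next step builds the jar through IntelliJ IDEA's artifact mechanism and then strips it down to just the compile output. A command-line alternative — a sketch, assuming the archetype's default build works in your environment — produces an equivalent thin jar directly:

      # Package the compiled classes, skipping the archetype's sample tests
      # (MySpec was deleted earlier in this tutorial)
      mvn clean package -DskipTests
      # The jar is written under target/, for example target/SparkSimpleApp-1.0.jar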

  22. Create the .jar file. IntelliJ IDEA enables creating a JAR as an artifact of the project. Perform the following steps.

    1. From the File menu, select Project Structure....

    2. In the Project Structure window, navigate to Artifacts > the plus symbol + > JAR > From modules with dependencies....

      Create JAR

    3. In the Create JAR from Modules window, select the folder icon in the Main Class text box.

    4. In the Select Main Class window, select the class that appears by default, and then select OK.

      Create JAR

    5. In the Create JAR from Modules window, ensure the extract to the target JAR option is selected, and then select OK. This setting creates a single JAR with all dependencies.

      Create JAR

    6. The Output Layout tab lists all the jars that are included as part of the Maven project. You can select and delete the ones on which the Scala application has no direct dependency. For the application you are creating here, you can remove all but the last one (SparkSimpleApp compile output). Select the jars to delete, and then select the minus symbol -.

      Create JAR

      Ensure the Include in project build checkbox is selected, which ensures that the jar is created every time the project is built or updated. Select Apply, and then select OK.

    7. To create the jar, navigate to Build > Build Artifacts > Build. The project compiles in about 30 seconds. The output jar is created under \out\artifacts.

      Create JAR
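
      To confirm what ended up in the jar — a sketch; the exact path under \out\artifacts depends on the artifact name IDEA generated — list its entries with the JDK's jar tool:

        # List the jar's contents; expect the compiled com/microsoft/spark/example classes
        jar tf out/artifacts/SparkSimpleApp_jar/SparkSimpleApp.jar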

Run the application on the Apache Spark cluster

To run the application on the cluster, upload the application jar to the cluster's default storage, and then submit it. One approach, which the next article covers in detail, is the Apache Livy REST API.
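
A minimal sketch of a Livy batch submission (CLUSTERNAME, the admin password, and the wasb:/// jar path are placeholders; the jar must first be uploaded to the cluster's default storage):

  # Submit the jar as a Livy batch job; -k skips TLS certificate validation
  curl -k --user "admin:PASSWORD" \
    -H "Content-Type: application/json" \
    -X POST \
    -d '{ "file":"wasb:///example/jars/SparkSimpleApp.jar", "className":"com.microsoft.spark.example.WasbIOTest" }' \
    "https://CLUSTERNAME.azurehdinsight.net/livy/batches"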

Next step

In this article, you learned how to create an Apache Spark Scala application. Advance to the next article to learn how to run this application on an HDInsight Spark cluster using Livy.