Debug Apache Spark applications on an HDInsight cluster with Azure Toolkit for IntelliJ through SSH

This article provides step-by-step guidance on how to use HDInsight Tools in Azure Toolkit for IntelliJ to debug applications remotely on an HDInsight cluster. To debug your project, you can also view the Debug HDInsight Spark applications with Azure Toolkit for IntelliJ video.

Prerequisites

Create a Spark Scala application

  1. Start IntelliJ IDEA, and select Create New Project to open the New Project window.

  2. Select Apache Spark/HDInsight from the left pane.

  3. Select Spark Project with Samples (Scala) from the main window.

  4. From the Build tool drop-down list, select one of the following:

    • Maven, for Scala project-creation wizard support.
    • SBT, for managing the dependencies and building the Scala project.

    IntelliJ Create New Project options

  5. Select Next.

  6. In the next New Project window, provide the following information:

    Property            Description
    Project name        Enter a name. This walkthrough uses myApp.
    Project location    Enter the desired location to save your project.
    Project SDK         If this field is blank, select New... and navigate to your JDK.
    Spark Version       The creation wizard integrates the proper versions of the Spark SDK and Scala SDK. If the Spark cluster version is earlier than 2.0, select Spark 1.x. Otherwise, select Spark 2.x. This example uses Spark 2.3.0 (Scala 2.11.8).

    IntelliJ New Project options, selecting the Spark version

  7. Select Finish. It may take a few minutes before the project becomes available. Watch the bottom right-hand corner for progress.

  8. Expand your project, and navigate to src > main > scala > sample. Double-click SparkCore_WasbIOTest.
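For reference, the sample is a small Spark Core job that reads from and writes to the cluster's default storage. The following is only a minimal sketch of that shape; the WASB paths and the header filter are illustrative assumptions, not the toolkit's exact sample code.

```scala
package sample

import org.apache.spark.{SparkConf, SparkContext}

// A minimal sketch of a WASB read/write job in the shape of the toolkit's
// sample; the paths and the filter below are illustrative assumptions.
object SparkCore_WasbIOTest {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("SparkCore_WasbIOTest")
    val sc = new SparkContext(conf)
    try {
      // Read a CSV that ships with the HDInsight sample data.
      val rdd = sc.textFile("wasb:///HdiSamples/HdiSamples/SensorSampleData/hvac/HVAC.csv")
      // Drop the header row and count what remains.
      val rows = rdd.filter(!_.startsWith("Date"))
      println(s"Row count: ${rows.count()}")
      // Write the result back to the default container.
      rows.saveAsTextFile("wasb:///HVACOut")
    } finally {
      sc.stop()
    }
  }
}
```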

Perform local run

  1. In the SparkCore_WasbIOTest script, right-click the script editor, and then select Run 'SparkCore_WasbIOTest' to perform a local run.

  2. After the local run completes, you can see the output file saved under data > default in the current Project Explorer.

    Local run result

  3. The tools set the default local run configuration automatically when you perform a local run or local debug. Open the [Spark on HDInsight] XXX configuration in the upper-right corner; [Spark on HDInsight] XXX was already created under Apache Spark on HDInsight. Switch to the Locally Run tab.

    Local run configuration

    • Environment variables: If you already set the system environment variable HADOOP_HOME to C:\WinUtils, the tool detects it automatically; you don't need to add the variable manually.
    • WinUtils.exe Location: If you haven't set the environment variable, you can locate winutils.exe by clicking its button.
    • Choose either of the two options; neither is needed on macOS or Linux.
  4. You can also set the configuration manually before performing a local run or local debug. In the preceding screenshot, select the plus sign (+). Then select the Apache Spark on HDInsight option. Enter the Name and Main class name, save the configuration, and then click the local run button. A code-based alternative to the WinUtils setup is sketched below.
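If you'd rather handle the WinUtils requirement in code than through the configuration UI, a common workaround is to set the hadoop.home.dir system property before the SparkContext is created. This is only a sketch, assuming winutils.exe lives under C:\WinUtils\bin to match the HADOOP_HOME example above.

```scala
object WinUtilsSetup {
  // Windows-only workaround, equivalent to setting HADOOP_HOME: point Hadoop
  // at the folder whose bin\ subfolder contains winutils.exe, before the
  // SparkContext is created. The C:\WinUtils path is an assumption that
  // matches the example above.
  def configureWinUtils(): Unit = {
    if (System.getProperty("os.name").toLowerCase.contains("win")) {
      System.setProperty("hadoop.home.dir", "C:\\WinUtils")
    }
  }
}
```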

Perform local debugging

  1. Open the SparkCore_WasbIOTest script, and set breakpoints.
  2. Right-click the script editor, and then select Debug '[Spark on HDInsight] XXX' to perform local debugging.
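When a run happens locally like this, Spark needs a local master. The toolkit's run configuration takes care of that; the following sketch shows the manual equivalent you would write if running the job outside the toolkit (the helper name here is hypothetical).

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LocalRunSetup {
  // Outside the toolkit, a local run or debug session needs an explicit
  // master. local[*] uses all cores on the machine; this is a sketch of the
  // manual equivalent, not something the toolkit requires you to add.
  def createLocalContext(): SparkContext = {
    val conf = new SparkConf()
      .setAppName("SparkCore_WasbIOTest")
      .setMaster("local[*]")
    new SparkContext(conf)
  }
}
```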

Perform remote run

  1. Navigate to Run > Edit Configurations.... From this menu, you can create or edit the configurations for remote debugging.

  2. In the Run/Debug Configurations dialog box, select the plus sign (+). Then select the Apache Spark on HDInsight option.

    Add new configuration

  3. Switch to the Remotely Run in Cluster tab. Enter the Name, Spark cluster, and Main class name. Then click Advanced configuration (Remote Debugging). The tools support debugging with executors. The numExecutors value defaults to 5; it's best not to set it higher than 3 (see the sketch after these steps).

    Run debug configuration

  4. In the Advanced Configuration (Remote Debugging) section, select Enable Spark remote debug. Enter the SSH username, and then enter a password or use a private key file. These credentials are required for remote debugging but not for a remote run.

    Enable Spark remote debug

  5. The configuration is now saved with the name you provided. To view the configuration details, select the configuration name. To make changes, select Edit Configurations.

  6. After you complete the configuration settings, you can run the project against the remote cluster or perform remote debugging.

    Remote run button

  7. Click the Disconnect button so that the submission logs no longer appear in the left panel. However, the job is still running on the back end.

    Remote run result
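For context on the numExecutors limit from step 3: on YARN it corresponds to the standard spark.executor.instances property (the same one that --num-executors maps to). The sketch below shows an illustrative programmatic equivalent of capping the count at 3; it is not what the toolkit itself does.

```scala
import org.apache.spark.SparkConf

object ExecutorCapExample {
  // On YARN, spark.executor.instances is the standard property behind the
  // numExecutors setting. Capping it at 3 in code is an illustrative
  // equivalent of the toolkit's configuration field.
  val conf: SparkConf = new SparkConf()
    .setAppName("SparkCore_WasbIOTest")
    .set("spark.executor.instances", "3")
}
```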

Perform remote debugging

  1. Set breakpoints, and then click the Remote debug icon. The difference from remote submission is that the SSH username and password need to be configured.

    Select the Debug icon

  2. When the program execution reaches the breakpoint, you see a Driver tab and two Executor tabs in the Debugger pane. Select the Resume Program icon to continue running the code, which then reaches the next breakpoint. You need to switch to the correct Executor tab to find the target executor to debug. You can view the execution logs on the corresponding Console tab.

    Debug tab
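Under the hood, attaching a debugger to the driver and executors relies on the JVM's standard JDWP agent, which the toolkit wires up and tunnels over SSH for you. The sketch below shows the manual equivalent; the port (5005) and suspend mode are illustrative assumptions.

```scala
import org.apache.spark.SparkConf

object RemoteDebugOptions {
  // The standard JDWP agent string that JVM remote debugging is built on.
  // The port (5005) and suspend=n are illustrative assumptions; the toolkit
  // normally injects and tunnels equivalent options for you over SSH.
  val debugOpts = "-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5005"

  val conf: SparkConf = new SparkConf()
    .set("spark.driver.extraJavaOptions", debugOpts)
    .set("spark.executor.extraJavaOptions", debugOpts)
}
```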

Scenario 3: Perform remote debugging and bug fixing

  1. Set two breakpoints, and then select the Debug icon to start the remote debugging process.

  2. The code stops at the first breakpoint, and the parameter and variable information are shown in the Variables pane.

  3. Select the Resume Program icon to continue. The code stops at the second breakpoint. The exception is caught as expected.

    Error thrown

  4. Select the Resume Program icon again. The HDInsight Spark Submission window displays a "job run failed" error.

    Error submission

  5. To dynamically update the variable value by using the IntelliJ debugging capability, select Debug again. The Variables pane appears again.

  6. Right-click the target on the Debug tab, and then select Set Value. Next, enter a new value for the variable. Then press Enter to save the value.

    Set value

  7. Select the Resume Program icon to continue running the program. This time, no exception is caught. You can see that the project runs successfully without any exceptions.

    Debugging finished without exception
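The failing code itself isn't reproduced in this article, but the pattern this scenario exercises is a variable whose value triggers an exception unless you change it with Set Value before resuming. The following is a hypothetical reduction, not the article's actual sample.

```scala
// A hypothetical reduction of the scenario, not the article's actual sample:
// the zero divisor throws an ArithmeticException at the second breakpoint
// unless you use Set Value in the debugger to make it nonzero first.
object BugFixDemo {
  def main(args: Array[String]): Unit = {
    val records = Seq(10, 20, 30)           // first breakpoint: inspect inputs
    var divisor = 0                         // the bug: should be nonzero
    val results = records.map(_ / divisor)  // second breakpoint: throws here
    println(results.mkString(", "))
  }
}
```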

Next steps

Scenarios

Create and run applications

Tools and extensions

Manage resources