Use Azure Toolkit for IntelliJ to debug Apache Spark applications remotely in HDInsight through VPN

We recommend debugging Apache Spark applications remotely through SSH. For instructions, see Remotely debug Apache Spark applications on an HDInsight cluster with Azure Toolkit for IntelliJ through SSH.

This article provides step-by-step guidance on how to use the HDInsight Tools in Azure Toolkit for IntelliJ to submit a Spark job on an HDInsight Spark cluster, and then debug it remotely from your desktop computer. To complete these tasks, you must perform the following high-level steps:

  1. Create a site-to-site or point-to-site Azure virtual network. The steps in this document assume that you use a site-to-site network.
  2. Create a Spark cluster in HDInsight that is part of the site-to-site virtual network.
  3. Verify the connectivity between the cluster head node and your desktop.
  4. Create a Scala application in IntelliJ IDEA, and then configure it for remote debugging.
  5. Run and debug the application.

Prerequisites

Step 1: Create an Azure virtual network

Follow the instructions from the following links to create an Azure virtual network, and then verify the connectivity between your desktop computer and the virtual network:

Step 2: Create an HDInsight Spark cluster

We recommend that you also create an Apache Spark cluster in Azure HDInsight that is part of the Azure virtual network that you created. Use the information available in Create Linux-based clusters in HDInsight. As part of optional configuration, select the Azure virtual network that you created in the previous step.

Step 3: Verify the connectivity between the cluster head node and your desktop

  1. Get the IP address of the head node. Open the Ambari UI for the cluster. From the cluster blade, select Dashboard.

    Select Dashboard in Ambari

  2. From the Ambari UI, select Hosts.

    Select Hosts in Ambari

  3. You see a list of head nodes, worker nodes, and ZooKeeper nodes. The head nodes have an hn* prefix. Select the first head node.

    Find the head node in Ambari

  4. From the Summary pane at the bottom of the page that opens, copy the IP Address and the Hostname of the head node.

    Find the IP address in Ambari

  5. Add the IP address and the hostname of the head node to the hosts file on the computer where you want to run and remotely debug the Spark job. This enables you to communicate with the head node by using either the IP address or the hostname.

    a. Open Notepad with elevated permissions. From the File menu, select Open, and then find the location of the hosts file. On a Windows computer, the location is C:\Windows\System32\Drivers\etc\hosts.

    b. Add the following information to the hosts file:

        # For headnode0
        192.xxx.xx.xx hn0-nitinp
        192.xxx.xx.xx hn0-nitinp.lhwwghjkpqejawpqbwcdyp3.gx.internal.chinacloudapp.cn
    
        # For headnode1
        192.xxx.xx.xx hn1-nitinp
        192.xxx.xx.xx hn1-nitinp.lhwwghjkpqejawpqbwcdyp3.gx.internal.chinacloudapp.cn
    
  6. From the computer that you connected to the Azure virtual network that is used by the HDInsight cluster, verify that you can ping the head nodes by using both the IP address and the hostname.

  7. Use SSH to connect to the cluster head node by following the instructions in Connect to an HDInsight cluster using SSH. From the cluster head node, ping the IP address of the desktop computer. Test the connectivity to both IP addresses assigned to the computer:

    • One for the network connection
    • One for the Azure virtual network
  8. Repeat the steps for the other head node.

Step 4: Create an Apache Spark Scala application by using HDInsight Tools in Azure Toolkit for IntelliJ and configure it for remote debugging

  1. Open IntelliJ IDEA and create a new project. In the New Project dialog box, do the following:

    Select the new project template in IntelliJ IDEA

    a. Select HDInsight > Spark on HDInsight (Scala).

    b. Select Next.

  2. In the next New Project dialog box, do the following, and then select Finish:

    • Enter a project name and location.

    • In the Project SDK drop-down list, select Java 1.8 for a Spark 2.x cluster, or select Java 1.7 for a Spark 1.x cluster.

    • In the Spark version drop-down list, the Scala project creation wizard integrates the proper version for the Spark SDK and the Scala SDK. If the Spark cluster version is earlier than 2.0, select Spark 1.x. Otherwise, select Spark 2.x. This example uses Spark 2.0.2 (Scala 2.11.8).

    Select the project SDK and the Spark version

  3. The Spark project automatically creates an artifact for you. To view the artifact, do the following:

    a. From the File menu, select Project Structure.

    b. In the Project Structure dialog box, select Artifacts to view the default artifact that is created. You can also create your own artifact by selecting the plus sign (+).

    Create a JAR

  4. Add libraries to your project. To add a library, do the following:

    a. Right-click the project name in the project tree, and then select Open Module Settings.

    b. In the Project Structure dialog box, select Libraries, select the (+) symbol, and then select From Maven.

    Add libraries

    c. In the Download Library from Maven Repository dialog box, search for and add the following libraries (a build.sbt equivalent is sketched after this list):

    • org.scalatest:scalatest_2.10:2.2.1
    • org.apache.hadoop:hadoop-azure:2.7.1
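
    If you manage the project's dependencies with sbt rather than through the IDE dialog, a minimal build.sbt fragment using the same coordinates might look like the following sketch:

    libraryDependencies ++= Seq(
        "org.scalatest" % "scalatest_2.10" % "2.2.1",
        "org.apache.hadoop" % "hadoop-azure" % "2.7.1"
    )
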
  5. Copy yarn-site.xml and core-site.xml from the cluster head node and add them to your project. You can use Cygwin to run the following scp commands to copy the files from the cluster head nodes:

    scp <ssh user name>@<headnode IP address or host name>:/etc/hadoop/conf/core-site.xml .
    scp <ssh user name>@<headnode IP address or host name>:/etc/hadoop/conf/yarn-site.xml .
    

    Because we already added the cluster head node IP addresses and hostnames to the hosts file on the desktop, we can use the scp commands in the following manner:

    scp sshuser@hn0-nitinp:/etc/hadoop/conf/core-site.xml .
    scp sshuser@hn0-nitinp:/etc/hadoop/conf/yarn-site.xml .
    

    To add these files to your project, copy them under the /src folder in your project tree, for example <your project directory>\src.

  6. Update the core-site.xml file to make the following changes:

    a. Replace the encrypted key. The core-site.xml file includes the encrypted key to the storage account associated with the cluster. In the core-site.xml file that you added to the project, replace the encrypted key with the actual storage key associated with the default storage account. For more information, see Manage your storage access keys.

    <property>
            <name>fs.azure.account.key.hdistoragecentral.blob.core.chinacloudapi.cn</name>
            <value>access-key-associated-with-the-account</value>
    </property>
    

    b.b. core-site.xml 中删除以下条目:Remove the following entries from core-site.xml:

    <property>
            <name>fs.azure.account.keyprovider.hdistoragecentral.blob.core.chinacloudapi.cn</name>
            <value>org.apache.hadoop.fs.azure.ShellDecryptionKeyProvider</value>
    </property>
    
    <property>
            <name>fs.azure.shellkeyprovider.script</name>
            <value>/usr/lib/python2.7/dist-packages/hdinsight_common/decrypt.sh</value>
    </property>
    
    <property>
            <name>net.topology.script.file.name</name>
            <value>/etc/hadoop/conf/topology_script.py</value>
    </property>
    

    c. Save the file.

  7. Add the main class for your application. From the Project Explorer, right-click src, point to New, and then select Scala class.

    Select the main class

  8. In the Create New Scala Class dialog box, provide a name, select Object in the Kind box, and then select OK.

    Create a new Scala class

  9. In the MyClusterAppMain.scala file, paste the following code. This code creates the Spark context and calls the executeJob method on the SparkSample object.

    import org.apache.spark.{SparkConf, SparkContext}
    
    object SparkSampleMain {
        def main (arg: Array[String]): Unit = {
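        // validateOutputSpecs is disabled below so the job can overwrite existing output (for example, wasb:///HVACOut from an earlier run)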
        val conf = new SparkConf().setAppName("SparkSample")
                                    .set("spark.hadoop.validateOutputSpecs", "false")
        val sc = new SparkContext(conf)
    
        SparkSample.executeJob(sc,
                            "wasb:///HdiSamples/HdiSamples/SensorSampleData/hvac/HVAC.csv",
                            "wasb:///HVACOut")
        }
    }
    
  10. Repeat steps 8 and 9 to add a new Scala object called SparkSample. Add the following code to this object. This code reads the data from HVAC.csv (available in all HDInsight Spark clusters). It retrieves the rows that have only one digit in the seventh column of the CSV file, and then writes the output to /HVACOut under the default storage container for the cluster.

    import org.apache.spark.SparkContext
    
    object SparkSample {
        def executeJob (sc: SparkContext, input: String, output: String): Unit = {
        val rdd = sc.textFile(input)
    
        //find the rows which have only one digit in the 7th column in the CSV
        val rdd1 =  rdd.filter(s => s.split(",")(6).length() == 1)
    
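        //count the pairs formed by crossing the first five rows with the full RDD; this action forces the input to be read and gives the job measurable work to observe while debugging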
        val s = sc.parallelize(rdd.take(5)).cartesian(rdd).count()
        println(s)
    
        rdd1.saveAsTextFile(output)
        //rdd1.collect().foreach(println)
         }
    }
    
  11. Repeat steps 8 and 9 to add a new class called RemoteClusterDebugging. This class implements the Spark test framework that is used to debug the applications. Add the following code to the RemoteClusterDebugging class:

        import org.apache.spark.{SparkConf, SparkContext}
        import org.scalatest.FunSuite
    
        class RemoteClusterDebugging extends FunSuite {
    
         test("Remote run") {
           val conf = new SparkConf().setAppName("SparkSample")
                                     .setMaster("yarn-client")
                                     .set("spark.yarn.am.extraJavaOptions", "-Dhdp.version=2.4")
                                     .set("spark.yarn.jar", "wasb:///hdp/apps/2.4.2.0-258/spark-assembly-1.6.1.2.4.2.0-258-hadoop2.7.1.2.4.2.0-258.jar")
                                     .setJars(Seq("""C:\workspace\IdeaProjects\MyClusterApp\out\artifacts\MyClusterApp_DefaultArtifact\default_artifact.jar"""))
                                     .set("spark.hadoop.validateOutputSpecs", "false")
           val sc = new SparkContext(conf)
    
           SparkSample.executeJob(sc,
             "wasb:///HdiSamples/HdiSamples/SensorSampleData/hvac/HVAC.csv",
             "wasb:///HVACOut")
         }
        }
    

    There are a couple of important things to note:

    • For .set("spark.yarn.jar", "wasb:///hdp/apps/2.4.2.0-258/spark-assembly-1.6.1.2.4.2.0-258-hadoop2.7.1.2.4.2.0-258.jar"), make sure the Spark assembly JAR is available on the cluster storage at the specified path.
    • For setJars, specify the location where the artifact JAR is created. Typically, it is <Your IntelliJ project directory>\out\<project name>_DefaultArtifact\default_artifact.jar.
  12. In the RemoteClusterDebugging class, right-click the test keyword, and then select Create RemoteClusterDebugging Configuration.

    Create a remote configuration

  13. In the Create RemoteClusterDebugging Configuration dialog box, provide a name for the configuration, and then select Test kind as the Test name. Leave all the other values as the default settings. Select Apply, and then select OK.

    Add configuration details

  14. You should now see a Remote run configuration drop-down list in the menu bar.

    The Remote run drop-down list

Step 5: Run the application in debug mode

  1. In your IntelliJ IDEA project, open SparkSample.scala and create a breakpoint next to val rdd1. In the Create Breakpoint for pop-up menu, select line in function executeJob.

    Add a breakpoint

  2. To run the application, select the Debug Run button next to the Remote run configuration drop-down list.

    Select the Debug Run button

  3. When the program execution reaches the breakpoint, you see a Debugger tab in the bottom pane.

    View the Debugger tab

  4. To add a watch, select the (+) icon.

    Select the (+) icon

    In this example, execution stopped before the variable rdd1 was created. By using this watch, we can see the first five rows of the variable rdd. Select Enter.

    Run the program in debug mode

    As the previous image shows, at runtime you might be querying terabytes of data, and you can debug how your application progresses. For example, in the output shown in the previous image, you can see that the first row of the output is a header. Based on this output, you can modify your application code to skip the header row, if necessary.
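
    A minimal sketch of one way to do that in executeJob, assuming the first line of HVAC.csv is the header:

    //hedged sketch: drop the header row, then keep rows with a single-digit value in the seventh column
    val header = rdd.first()
    val rdd1 = rdd.filter(s => s != header && s.split(",")(6).length() == 1)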

  5. You can now select the Resume Program icon to proceed with your application run.

    Select Resume Program

  6. If the application finishes successfully, you should see output like the following:

    Console output

Next steps

Scenarios

Create and run applications

Tools and extensions

Manage resources