Use external packages with Jupyter notebooks in Apache Spark clusters on HDInsight

Learn how to configure a Jupyter Notebook in an Apache Spark cluster on HDInsight to use external, community-contributed Apache Maven packages that aren't included out of the box in the cluster.

You can search the Maven repository for the complete list of packages that are available. You can also get a list of available packages from other sources. For example, a complete list of community-contributed packages is available at Spark Packages.

In this article, you'll learn how to use the spark-csv package with a Jupyter Notebook.

Prerequisites

Use external packages with Jupyter notebooks

  1. Navigate to https://CLUSTERNAME.azurehdinsight.cn/jupyter, where CLUSTERNAME is the name of your Spark cluster.

  2. Create a new notebook. Select New, and then select Spark.

    Create a new Spark Jupyter Notebook

  3. A new notebook is created and opened with the name Untitled.pynb. Select the notebook name at the top, and enter a friendly name.

    Provide a name for the notebook

  4. You'll use the %%configure magic to configure the notebook to use an external package. In notebooks that use external packages, make sure you call the %%configure magic in the first code cell. This ensures that the kernel is configured to use the package before the session starts.

    Important

    If you forget to configure the kernel in the first cell, you can use %%configure with the -f parameter, but that will restart the session and all progress will be lost (see the example after the table below).

    Use the command that matches your HDInsight version:

    For HDInsight 3.5 and HDInsight 3.6:

     %%configure
     { "conf": {"spark.jars.packages": "com.databricks:spark-csv_2.11:1.5.0" }}

    For HDInsight 3.3 and HDInsight 3.4:

     %%configure
     { "packages":["com.databricks:spark-csv_2.10:1.4.0"] }
  5. These %%configure snippets expect the Maven coordinates for the external package to be available in the Maven Central Repository. Here, com.databricks:spark-csv_2.11:1.5.0 is the Maven coordinate for the spark-csv package. Here's how you construct the coordinates for a package.

    a. Locate the package in the Maven Repository. For this article, we use spark-csv.

    b. From the repository, gather the values for GroupId, ArtifactId, and Version. Make sure that the values you gather match your cluster. In this case, we're using a Scala 2.11 and Spark 1.5.0 package, but you may need to select different versions for the Scala or Spark version in your cluster. You can find out the Scala version on your cluster by running scala.util.Properties.versionString on the Spark Jupyter kernel or on Spark submit, and the Spark version by running sc.version in a Jupyter Notebook (see the sketch after sub-step c below).

    Use external packages with Jupyter Notebook

    c. Concatenate the three values, separated by a colon (:).

     com.databricks:spark-csv_2.11:1.5.0
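
    As mentioned in sub-step b, you can confirm both versions from a notebook cell before settling on a coordinate. A minimal sketch; the version strings in the comments are examples only:

     // Scala version of the Spark Jupyter kernel, for example "version 2.11.8"
     scala.util.Properties.versionString

     // Spark version of the cluster, for example "1.6.3"
     sc.version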
    
  6. Run the code cell with the %%configure magic. This configures the underlying Livy session to use the package you provided. In subsequent cells in the notebook, you can now use the package, as shown below.

     // Read the sample HVAC CSV from the cluster's default storage by using the spark-csv data source
     val df = spark.read.format("com.databricks.spark.csv").
     option("header", "true").
     option("inferSchema", "true").
     load("wasb:///HdiSamples/HdiSamples/SensorSampleData/hvac/HVAC.csv")
    

    For HDInsight 3.4 and earlier, use the following snippet instead.

     // On HDInsight 3.4 and earlier, read through sqlContext instead of the spark session
     val df = sqlContext.read.format("com.databricks.spark.csv").
     option("header", "true").
     option("inferSchema", "true").
     load("wasb:///HdiSamples/HdiSamples/SensorSampleData/hvac/HVAC.csv")
    
  7. You can then run snippets like the following to view the data in the dataframe that you created in the previous step.

     df.show()
    
     df.select("Time").count()
    

See also

Scenarios

Create and run applications

Tools and extensions

Manage resources