Kernels for Jupyter Notebook on Apache Spark clusters in Azure HDInsight

HDInsight Spark clusters provide kernels that you can use with the Jupyter Notebook on Apache Spark for testing your applications. A kernel is a program that runs and interprets your code. The three kernels are:

  • PySpark - for applications written in Python2
  • PySpark3 - for applications written in Python3
  • Spark - for applications written in Scala

In this article, you learn how to use these kernels and the benefits of using them.

Prerequisites

Create a Jupyter Notebook on Spark HDInsight

  1. From the Azure portal, select your Spark cluster. See List and show clusters for the instructions. The Overview view opens.

  2. From the Overview view, in the Cluster dashboards box, select Jupyter Notebook. If prompted, enter the admin credentials for the cluster.

    Jupyter Notebook on Spark

    Note

    You may also reach the Jupyter Notebook on the Spark cluster by opening the following URL in your browser. Replace CLUSTERNAME with the name of your cluster:

    https://CLUSTERNAME.azurehdinsight.cn/jupyter

  3. Select New, and then select either Pyspark, PySpark3, or Spark to create a notebook. Use the Spark kernel for Scala applications, the PySpark kernel for Python2 applications, and the PySpark3 kernel for Python3 applications.

    Kernels for Jupyter Notebook on Spark

  4. A notebook opens with the kernel you selected.

Benefits of using the kernels

Here are a few benefits of using the new kernels with the Jupyter Notebook on Spark HDInsight clusters.

  • Preset contexts. With the PySpark, PySpark3, or Spark kernel, you don't need to set the Spark or Hive contexts explicitly before you start working with your applications; they're available by default. These contexts are:

    • sc - for Spark context

    • sqlContext - for Hive context

      So, you don't have to run statements like the following to set the contexts:

      from pyspark import SparkContext
      from pyspark.sql import HiveContext
      sc = SparkContext('yarn-client')
      sqlContext = HiveContext(sc)
      

      Instead, you can directly use the preset contexts in your application.
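
      For example, a cell like the following uses the preset contexts directly (a minimal sketch; hivesampletable is the sample Hive table that the %%sql example later in this article also queries):

      # sc and sqlContext are already created by the kernel.
      rdd = sc.parallelize(range(10))
      print(rdd.sum())

      df = sqlContext.sql("SELECT * FROM hivesampletable LIMIT 10")
      df.show()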

  • Cell magics. The PySpark kernel provides some predefined "magics", which are special commands that you can call with %% (for example, %%MAGIC <args>). The magic command must be the first word in a code cell and can be followed by multiple lines of content. Adding anything before the magic, even comments, causes an error. For more information on magics, see here.

    The following list describes the different magics available through the kernels.

    • help
      Example: %%help
      Description: Generates a table of all the available magics with examples and descriptions.

    • info
      Example: %%info
      Description: Outputs session information for the current Livy endpoint.

    • configure
      Example:
        %%configure -f
        {"executorMemory": "1000M", "executorCores": 4}
      Description: Configures the parameters for creating a session. The force flag (-f) is mandatory if a session has already been created; it ensures that the session is dropped and recreated. See Livy's POST /sessions Request Body for a list of valid parameters. Parameters must be passed in as a JSON string and must be on the line after the magic, as shown in the example.

    • sql
      Example:
        %%sql -o <variable name>
        SHOW TABLES
      Description: Executes a Hive query against the sqlContext. If the -o parameter is passed, the result of the query is persisted in the %%local Python context as a Pandas dataframe.

    • local
      Example:
        %%local
        a=1
      Description: All the code in later lines is executed locally. The code must be valid Python2 code no matter which kernel you're using. So, even if you selected the PySpark3 or Spark kernel while creating the notebook, any cell that uses the %%local magic must contain only valid Python2 code.

    • logs
      Example: %%logs
      Description: Outputs the logs for the current Livy session.

    • delete
      Example: %%delete -f -s <session number>
      Description: Deletes a specific session of the current Livy endpoint. You can't delete the session that was started for the kernel itself.

    • cleanup
      Example: %%cleanup -f
      Description: Deletes all the sessions for the current Livy endpoint, including this notebook's session. The force flag -f is mandatory.
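
    For example, a %%configure cell might look like the following. The memory and core values shown here are illustrative only, and the force flag is needed only when a session has already been created:

    %%configure -f
    {"executorMemory": "2G", "executorCores": 2}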

    Note

    In addition to the magics added by the PySpark kernel, you can also use the built-in IPython magics, including %%sh. You can use the %%sh magic to run scripts and blocks of code on the cluster headnode.
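
    For example, a cell like the following (a minimal sketch) uses %%sh to run shell commands on the cluster headnode:

    %%sh
    # These commands run on the cluster headnode, not inside the Spark session.
    hostname
    ls /var/lib/jupyter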

  • Auto visualization. The PySpark kernel automatically visualizes the output of Hive and SQL queries. You can choose between several different types of visualizations, including Table, Pie, Line, Area, and Bar.

Parameters supported with the %%sql magic

The %%sql magic supports different parameters that you can use to control the kind of output that you receive when you run queries. The following list describes the parameters.

  • -o <VARIABLE NAME>
    Use this parameter to persist the result of the query, in the %%local Python context, as a Pandas dataframe. The name of the dataframe variable is the variable name you specify.

  • -q
    Use this parameter to turn off visualizations for the cell. If you don't want to autovisualize the content of a cell and just want to capture it as a dataframe, use -q -o <VARIABLE>. If you want to turn off visualizations without capturing the results (for example, to run a SQL query like a CREATE TABLE statement), use -q without specifying a -o argument.

  • -m <METHOD>
    METHOD is either take or sample (the default is take). If the method is take, the kernel picks elements from the top of the result data set, up to the number of rows specified by MAXROWS (described later in this list). If the method is sample, the kernel randomly samples elements of the data set according to the -r parameter, described next in this list.

  • -r <FRACTION>
    FRACTION is a floating-point number between 0.0 and 1.0. If the sample method for the SQL query is sample, the kernel randomly samples the specified fraction of the elements of the result set for you. For example, if you run a SQL query with the arguments -m sample -r 0.01, then 1% of the result rows are randomly sampled.

  • -n <MAXROWS>
    MAXROWS is an integer value. The kernel limits the number of output rows to MAXROWS. If MAXROWS is a negative number such as -1, the number of rows in the result set is not limited.

Example:

%%sql -q -m sample -r 0.1 -n 500 -o query2
SELECT * FROM hivesampletable

The above statement does the following actions:

  • Selects all records from hivesampletable.
  • Because we use -q, it turns off autovisualization.
  • Because we use -m sample -r 0.1 -n 500, it randomly samples 10% of the rows in hivesampletable and limits the size of the result set to 500 rows.
  • Finally, because we used -o query2, it also saves the output into a dataframe called query2, which you can then work with locally, as shown in the sketch below.
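
For example, after the cell above has run, a %%local cell can work with the captured dataframe by using ordinary Pandas operations (a minimal sketch):

%%local
# query2 is available here as a Pandas dataframe.
print(len(query2))      # number of sampled rows (at most 500)
print(query2.head())    # preview the first few rows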

Considerations while using the new kernels

Whichever kernel you use, leaving notebooks running consumes cluster resources. With these kernels, because the contexts are preset, simply exiting the notebooks doesn't kill the context, so the cluster resources remain in use. A good practice is to use the Close and Halt option from the notebook's File menu when you're finished with a notebook. Closing the notebook this way kills the context and then exits the notebook.

Where are the notebooks stored?

If your cluster uses Azure Storage as the default storage account, Jupyter Notebooks are saved to the storage account under the /HdiNotebooks folder. Notebooks, text files, and folders that you create from within Jupyter are accessible from the storage account. For example, if you use Jupyter to create a folder myfolder and a notebook myfolder/mynotebook.ipynb, you can access that notebook at /HdiNotebooks/myfolder/mynotebook.ipynb within the storage account. The reverse is also true: if you upload a notebook directly to your storage account at /HdiNotebooks/mynotebook1.ipynb, the notebook is visible from Jupyter as well. Notebooks remain in the storage account even after the cluster is deleted.

Note

HDInsight clusters with Azure Data Lake Storage as the default storage do not store notebooks in the associated storage.

The way notebooks are saved to the storage account is compatible with Apache Hadoop HDFS. If you SSH into the cluster, you can use the file management commands:

hdfs dfs -ls /HdiNotebooks                               # List everything at the root directory - everything in this directory is visible to Jupyter from the home page
hdfs dfs -copyToLocal /HdiNotebooks                    # Download the contents of the HdiNotebooks folder
hdfs dfs -copyFromLocal example.ipynb /HdiNotebooks   # Upload a notebook example.ipynb to the root folder so it's visible from Jupyter

Whether the cluster uses Azure Storage or Azure Data Lake Storage as the default storage account, the notebooks are also saved on the cluster headnode at /var/lib/jupyter.

Supported browser

Jupyter Notebooks on Spark HDInsight clusters are supported only on Google Chrome.

Feedback

The new kernels are still evolving and will mature over time, so the APIs could change as they mature. We would appreciate any feedback you have while using these new kernels; it's useful in shaping the final release of the kernels. You can leave your comments/feedback under the Feedback section at the bottom of this article.

Next steps