Install Jupyter notebook on your computer and connect to Apache Spark on HDInsight

In this article, you learn how to install Jupyter notebook with the custom PySpark (for Python) and Apache Spark (for Scala) kernels with Spark magic. You then connect the notebook to an HDInsight cluster.

Installing Jupyter and connecting to Apache Spark on HDInsight involves four key steps:

  • Configure the Spark cluster.
  • Install Jupyter notebook.
  • Install the PySpark and Spark kernels with Spark magic.
  • Configure Spark magic to access the Spark cluster on HDInsight.

For more information about custom kernels and Spark magic, see Kernels available for Jupyter notebooks with Apache Spark Linux clusters on HDInsight.

Prerequisites

Install Jupyter notebook on your computer

Install Python before you install Jupyter notebooks. The Anaconda distribution installs both Python and Jupyter Notebook.

Download the Anaconda installer for your platform and run the setup. While running the setup wizard, make sure you select the option to add Anaconda to your PATH variable. See also Installing Jupyter using Anaconda.
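
Once the installer has finished, it can help to confirm that the commands used later in this article are reachable on PATH. The following is a minimal sketch rather than part of the official setup: the helper name `missing_tools` is hypothetical, and it relies only on the standard-library `shutil.which`.

```python
import shutil


def missing_tools(tools):
    """Return the names in `tools` that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    # "python" and "jupyter" are the commands this article runs later.
    missing = missing_tools(["python", "jupyter"])
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("python and jupyter are both on PATH")
```

If either name is reported as missing, rerunning the Anaconda setup with the PATH option selected is the likely fix.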

Install Spark magic

  1. Enter one of the commands below to install Spark magic. See also the sparkmagic documentation.

    Cluster version    Install command
    v3.6 and v3.5      pip install sparkmagic==0.13.1
    v3.4               pip install sparkmagic==0.2.3

  2. Ensure ipywidgets is properly installed by running the following command:

    jupyter nbextension enable --py --sys-prefix widgetsnbextension
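
If you script the installation for more than one cluster, the version table in step 1 can be captured as a small lookup. This is an illustrative sketch only: the dictionary `SPARKMAGIC_PINS` and the function `install_command` are hypothetical names, and the version pins are copied from the table in step 1.

```python
# Version pins copied from the table in step 1; the dictionary is illustrative.
SPARKMAGIC_PINS = {
    "3.6": "0.13.1",
    "3.5": "0.13.1",
    "3.4": "0.2.3",
}


def install_command(cluster_version):
    """Return the pip command for an HDInsight cluster version such as 'v3.6'."""
    key = cluster_version.lower().lstrip("v")
    if key not in SPARKMAGIC_PINS:
        raise ValueError("No sparkmagic pin listed for cluster version " + cluster_version)
    return "pip install sparkmagic==" + SPARKMAGIC_PINS[key]


if __name__ == "__main__":
    print(install_command("v3.6"))  # prints: pip install sparkmagic==0.13.1
```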
    

Install PySpark and Spark kernels

  1. Identify where sparkmagic is installed by entering the following command:

    pip show sparkmagic
    

    Then change your working directory to the path shown in the Location field of the command output.

  2. From your new working directory, enter one or more of the commands below to install the kernels you want:

    Kernel      Command
    Spark       jupyter-kernelspec install sparkmagic/kernels/sparkkernel
    SparkR      jupyter-kernelspec install sparkmagic/kernels/sparkrkernel
    PySpark     jupyter-kernelspec install sparkmagic/kernels/pysparkkernel
    PySpark3    jupyter-kernelspec install sparkmagic/kernels/pyspark3kernel

  3. Optional. Enter the command below to enable the server extension:

    jupyter serverextension enable --py sparkmagic
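
Steps 1 and 2 above can also be combined so that the kernel directory is resolved without changing the working directory by hand. The sketch below assumes that `pip show` prints a `Location:` line, which is how step 1 identifies the install path; the function names `parse_location`, `sparkmagic_location`, and `kernel_dir` are hypothetical.

```python
import os
import subprocess
import sys


def parse_location(pip_show_output):
    """Return the value of the 'Location:' line, or None if it is absent."""
    for line in pip_show_output.splitlines():
        if line.startswith("Location:"):
            # Split only on the first colon so Windows paths (C:\...) survive.
            return line.split(":", 1)[1].strip()
    return None


def sparkmagic_location():
    """Run `pip show sparkmagic` with the current interpreter and parse it."""
    result = subprocess.run(
        [sys.executable, "-m", "pip", "show", "sparkmagic"],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    return parse_location(result.stdout)


def kernel_dir(location, kernel_name):
    """Build the path that step 2 passes to jupyter-kernelspec install."""
    return os.path.join(location, "sparkmagic", "kernels", kernel_name)


if __name__ == "__main__":
    location = sparkmagic_location()
    if location is None:
        print("sparkmagic does not appear to be installed")
    else:
        print("jupyter-kernelspec install", kernel_dir(location, "pysparkkernel"))
```

Printing the command instead of running it keeps the sketch side-effect free; you can copy the printed line into your shell.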
    

Configure Spark magic to connect to HDInsight Spark cluster

In this section, you configure the Spark magic that you installed earlier to connect to an Apache Spark cluster.

  1. Start the Python shell with the following command:

    python
    
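  2. The Jupyter configuration information is typically stored in the user's home directory. Enter the following commands to identify the home directory and create a folder called .sparkmagic. The full path is printed. If the folder already exists, os.makedirs raises an error, which means the folder is already in place.

    import os
    # Join the path portably instead of hard-coding a Windows separator.
    path = os.path.join(os.path.expanduser('~'), ".sparkmagic")
    # Create the .sparkmagic folder in the home directory.
    os.makedirs(path)
    # Show the full path so you know where to put config.json.
    print(path)
    exit()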
    
  3. Within the folder .sparkmagic, create a file called config.json and add the following JSON snippet inside it.

    {
      "kernel_python_credentials" : {
        "username": "{USERNAME}",
        "base64_password": "{BASE64ENCODEDPASSWORD}",
        "url": "https://{CLUSTERDNSNAME}.azurehdinsight.cn/livy"
      },
    
      "kernel_scala_credentials" : {
        "username": "{USERNAME}",
        "base64_password": "{BASE64ENCODEDPASSWORD}",
        "url": "https://{CLUSTERDNSNAME}.azurehdinsight.cn/livy"
      },
    
      "custom_headers" : {
        "X-Requested-By": "livy"
      },
    
      "heartbeat_refresh_seconds": 5,
      "livy_server_heartbeat_timeout_seconds": 60,
      "heartbeat_retry_seconds": 1
    }
    
  4. Make the following edits to the file:

    Template value                                 New value
    {USERNAME}                                     Cluster login; the default is admin.
    {CLUSTERDNSNAME}                               Cluster name.
    {BASE64ENCODEDPASSWORD}                        The base64 encoding of your actual password. You can generate it at https://www.url-encode-decode.com/base64-encode-decode/.
    "livy_server_heartbeat_timeout_seconds": 60    Keep if using sparkmagic 0.12.7 (clusters v3.5 and v3.6). If using sparkmagic 0.2.3 (clusters v3.4), replace with "should_heartbeat": true.

    You can see a full example file at sample config.json.

    Tip

    Heartbeats are sent to ensure that sessions are not leaked. When a computer goes to sleep or is shut down, the heartbeat is not sent, and the session is cleaned up. For clusters v3.4, if you wish to disable this behavior, set the Livy config livy.server.interactive.heartbeat.timeout to 0 from the Ambari UI. For clusters v3.5, if you do not set the 3.5 configuration above, the session will not be deleted.

  5. Start Jupyter. Use the following command from the command prompt.

    jupyter notebook
    
  6. Verify that you can use the Spark magic available with the kernels. Complete the following steps.

    a. Create a new notebook. From the right-hand corner, select New. You should see the default kernel Python 2 or Python 3 and the kernels you installed. The actual values may vary depending on your installation choices. Select PySpark.

    [Image: Kernels in Jupyter notebook]

    Important

    After selecting New, review your shell for any errors. If you see the error TypeError: __init__() got an unexpected keyword argument 'io_loop', you may be experiencing a known issue with certain versions of Tornado. If so, stop the kernel and then downgrade your Tornado installation with the following command: pip install tornado==4.5.3.

    b. Run the following code snippet.

    %%sql
    SELECT * FROM hivesampletable LIMIT 5
    

    If you can successfully retrieve the output, your connection to the HDInsight cluster works.

    If you want to update the notebook configuration to connect to a different cluster, update config.json with the new set of values, as described in steps 3 and 4 above.
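
Steps 2 through 4 above can also be done in one pass from Python, which lets you base64-encode the password locally instead of pasting it into a website. The following is a minimal sketch under stated assumptions: `build_config` and `write_config` are hypothetical helper names, the keys and values mirror the JSON template in step 3, and the `legacy_heartbeat` flag applies the sparkmagic 0.2.3 substitution described in step 4.

```python
import base64
import json
import os


def build_config(username, password, cluster_dns_name, legacy_heartbeat=False):
    """Build the config.json content from step 3 with the edits from step 4."""
    credentials = {
        "username": username,
        # base64-encode the plain password, as step 4 requires.
        "base64_password": base64.b64encode(password.encode("utf-8")).decode("ascii"),
        "url": "https://{}.azurehdinsight.cn/livy".format(cluster_dns_name),
    }
    config = {
        "kernel_python_credentials": dict(credentials),
        "kernel_scala_credentials": dict(credentials),
        "custom_headers": {"X-Requested-By": "livy"},
        "heartbeat_refresh_seconds": 5,
        "heartbeat_retry_seconds": 1,
    }
    if legacy_heartbeat:
        # sparkmagic 0.2.3 (clusters v3.4) uses this key instead, per step 4.
        config["should_heartbeat"] = True
    else:
        config["livy_server_heartbeat_timeout_seconds"] = 60
    return config


def write_config(config, folder=None):
    """Write config.json into ~/.sparkmagic (or `folder`) and return its path."""
    folder = folder or os.path.join(os.path.expanduser("~"), ".sparkmagic")
    if not os.path.isdir(folder):
        os.makedirs(folder)
    path = os.path.join(folder, "config.json")
    with open(path, "w") as handle:
        json.dump(config, handle, indent=2)
    return path
```

For example, `write_config(build_config("admin", password, "mycluster"))` writes the file for a cluster named mycluster, where `mycluster` is a placeholder. Reading `password` with the standard-library `getpass.getpass()` keeps it out of your shell history. To switch to a different cluster, call the same two functions with the new values.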

Why should I install Jupyter on my computer?

Reasons to install Jupyter on your computer and then connect it to an Apache Spark cluster on HDInsight:

  • You can create your notebooks locally, test your application against a running cluster, and then upload the notebooks to the cluster. To upload the notebooks to the cluster, you can either upload them using the Jupyter notebook that is running on the cluster, or save them to the /HdiNotebooks folder in the storage account associated with the cluster. For more information on how notebooks are stored on the cluster, see Where are Jupyter notebooks stored?
  • With the notebooks available locally, you can connect to different Spark clusters based on your application requirement.
  • You can use GitHub to implement a source control system and have version control for the notebooks. You can also have a collaborative environment where multiple users can work with the same notebook.
  • You can work with notebooks locally without even having a cluster up. You only need a cluster to test your notebooks against, not to manually manage your notebooks or a development environment.
  • It may be easier to configure your own local development environment than it is to configure the Jupyter installation on the cluster. You can take advantage of all the software you have installed locally without configuring one or more remote clusters.

Warning

With Jupyter installed on your local computer, multiple users can run the same notebook on the same Spark cluster at the same time. In such a situation, multiple Livy sessions are created. If you run into an issue and want to debug it, tracking which Livy session belongs to which user is a complex task.
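
One way to start untangling sessions is to group them by owner. The sketch below is illustrative only: it assumes the input is the parsed JSON of a Livy session listing with a `sessions` array whose entries carry `id` and `owner` fields, and the sample data is hypothetical. If all users sign in with the same cluster login, the owner field may not tell them apart.

```python
from collections import defaultdict


def sessions_by_owner(sessions_response):
    """Group Livy session ids by the 'owner' field of each session."""
    grouped = defaultdict(list)
    for session in sessions_response.get("sessions", []):
        # Sessions without an owner are collected under a placeholder key.
        owner = session.get("owner") or "<unknown>"
        grouped[owner].append(session.get("id"))
    return dict(grouped)


if __name__ == "__main__":
    # Hypothetical sample shaped like a session listing; not real cluster data.
    sample = {
        "from": 0,
        "total": 3,
        "sessions": [
            {"id": 0, "owner": "alice", "kind": "pyspark", "state": "idle"},
            {"id": 1, "owner": "bob", "kind": "spark", "state": "busy"},
            {"id": 2, "owner": "alice", "kind": "pyspark", "state": "idle"},
        ],
    }
    print(sessions_by_owner(sample))  # prints: {'alice': [0, 2], 'bob': [1]}
```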

Next steps