Install Jupyter notebook on your computer and connect to Apache Spark on HDInsight

In this article, you learn how to install the Jupyter notebook with the custom PySpark (for Python) and Apache Spark (for Scala) kernels with Spark magic, and how to connect the notebook to an HDInsight cluster. There are a number of reasons to install Jupyter on your local computer, and there are some challenges as well. For more on this, see the section Why should I install Jupyter on my computer? at the end of this article.

There are four key steps involved in installing Jupyter and connecting to Apache Spark on HDInsight.

  • Configure a Spark cluster.
  • Install the Jupyter notebook.
  • Install the PySpark and Spark kernels with Spark magic.
  • Configure Spark magic to access the Spark cluster on HDInsight.

For more information about the custom kernels and the Spark magic available for Jupyter notebooks with an HDInsight cluster, see Kernels available for Jupyter notebooks with Apache Spark Linux clusters on HDInsight.

Important

The steps in this article work only up to Spark version 2.1.0.

Prerequisites

The prerequisites listed here are not for installing Jupyter. They are for connecting the Jupyter notebook to an HDInsight cluster after the notebook is installed.

Install Jupyter notebook on your computer

You must install Python before you can install Jupyter notebooks. Both Python and Jupyter are available as part of the Anaconda distribution. When you install Anaconda, you install a distribution of Python. After Anaconda is installed, you add the Jupyter installation by running the appropriate commands.

  1. Download the Anaconda installer for your platform and run the setup. While running the setup wizard, make sure you select the option to add Anaconda to your PATH variable.

  2. Run the following command to install Jupyter.

     conda install jupyter
    

    For more information on installing Jupyter, see Installing Jupyter using Anaconda.

Install the kernels and Spark magic

For instructions on how to install Spark magic and the PySpark and Spark kernels, follow the installation instructions in the sparkmagic documentation on GitHub. The first step in the Spark magic documentation asks you to install Spark magic. Replace that first step with one of the following commands, depending on the version of the HDInsight cluster you will connect to (a short sketch that automates this choice follows the list below). After that, follow the remaining steps in the Spark magic documentation. If you want to install the different kernels, you must perform Step 3 in the Spark magic installation instructions section.

  • For clusters v3.5 and v3.6, install sparkmagic 0.11.2 by executing pip install sparkmagic==0.11.2.

  • For clusters v3.4, install sparkmagic 0.2.3 by executing pip install sparkmagic==0.2.3.
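
The following minimal Python sketch automates the choice above. It simply shells out to pip with the sparkmagic version that matches your HDInsight cluster version; the version mapping comes from the list above, and the helper name install_sparkmagic is hypothetical.

    import subprocess
    import sys

    # HDInsight cluster version -> sparkmagic version (from the list above).
    SPARKMAGIC_BY_CLUSTER = {"3.4": "0.2.3", "3.5": "0.11.2", "3.6": "0.11.2"}

    def install_sparkmagic(cluster_version):
        # Equivalent to running: pip install sparkmagic==<version>
        version = SPARKMAGIC_BY_CLUSTER[cluster_version]
        subprocess.check_call([sys.executable, "-m", "pip", "install", "sparkmagic==" + version])

    install_sparkmagic("3.6")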

Configure Spark magic to connect to HDInsight Spark cluster

In this section, you configure the Spark magic that you installed earlier to connect to an Apache Spark cluster that you must have already created in Azure HDInsight.

  1. Start the Python shell with the following command:

    python
    
  2. The Jupyter configuration information is typically stored in the user's home directory. Enter the following commands to identify the home directory and create a folder there called .sparkmagic. The full path is printed.

    import os
    # Build the path to the .sparkmagic folder under your home directory.
    path = os.path.join(os.path.expanduser('~'), ".sparkmagic")
    os.makedirs(path)
    print(path)
    exit()
    
  3. Within the folder .sparkmagic, create a file called config.json and add the following JSON snippet inside it.

    {
      "kernel_python_credentials" : {
        "username": "{USERNAME}",
        "base64_password": "{BASE64ENCODEDPASSWORD}",
        "url": "https://{CLUSTERDNSNAME}.azurehdinsight.cn/livy"
      },
    
      "kernel_scala_credentials" : {
        "username": "{USERNAME}",
        "base64_password": "{BASE64ENCODEDPASSWORD}",
        "url": "https://{CLUSTERDNSNAME}.azurehdinsight.cn/livy"
      },
    
      "heartbeat_refresh_seconds": 5,
      "livy_server_heartbeat_timeout_seconds": 60,
      "heartbeat_retry_seconds": 1
    }
    
  4. Make the following edits to the file:

    • {USERNAME}: Cluster login; the default is admin.
    • {CLUSTERDNSNAME}: Cluster name.
    • {BASE64ENCODEDPASSWORD}: A base64-encoded version of your actual password. You can generate a base64 password at https://www.url-encode-decode.com/base64-encode-decode/, or see the sketch after this list.
    • "livy_server_heartbeat_timeout_seconds": 60: Keep this setting if you use sparkmagic 0.11.2 (clusters v3.5 and v3.6). If you use sparkmagic 0.2.3 (clusters v3.4), replace it with "should_heartbeat": true.

    You can see a full example file at sample config.json.

    Tip

    Heartbeats are sent to ensure that sessions are not leaked. When a computer goes to sleep or is shut down, the heartbeat is not sent, resulting in the session being cleaned up. For clusters v3.4, if you wish to disable this behavior, you can set the Livy config livy.server.interactive.heartbeat.timeout to 0 from the Ambari UI. For clusters v3.5, if you do not set the 3.5 configuration above, the session will not be deleted.

  5. Start Jupyter. Use the following command from the command prompt.

     jupyter notebook
    
  6. Verify that you can use the Spark magic available with the kernels. Perform the following steps.

    a. Create a new notebook. From the right-hand corner, select New. You should see the default kernel Python 2 or Python 3, and the kernels you installed. The actual values may vary depending on your installation choices. Select PySpark.

    Kernels in Jupyter notebook

    Important

    After selecting New, review your shell for any errors. If you see the error TypeError: __init__() got an unexpected keyword argument 'io_loop', you may be experiencing a known issue with certain versions of Tornado. If so, stop the kernel and then downgrade your Tornado installation with the following command: pip install tornado==4.5.3.

    b. Run the following code snippet.

    %%sql
    SELECT * FROM hivesampletable LIMIT 5
    

    If you can successfully retrieve the output, your connection to the HDInsight cluster is verified.

    If you want to update the notebook configuration to connect to a different cluster, update config.json with the new set of values, as shown in Step 3 above; a sketch of this update follows.
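
    For example, the following minimal Python sketch rewrites the existing config.json for a different cluster. The field names and URL pattern come from the snippet in Step 3; the cluster name, login, and password used here are placeholders.

    import base64, json, os

    # Placeholder values for the cluster you want to switch to; replace with your own.
    cluster_dns_name = "mynewcluster"
    username = "admin"
    password = "NewPassword123"

    config_path = os.path.join(os.path.expanduser("~"), ".sparkmagic", "config.json")
    with open(config_path) as f:
        config = json.load(f)

    encoded = base64.b64encode(password.encode("utf-8")).decode("ascii")
    for key in ("kernel_python_credentials", "kernel_scala_credentials"):
        config[key]["username"] = username
        config[key]["base64_password"] = encoded
        config[key]["url"] = "https://" + cluster_dns_name + ".azurehdinsight.cn/livy"

    with open(config_path, "w") as f:
        json.dump(config, f, indent=2)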

Why should I install Jupyter on my computer?

There are a number of reasons why you might want to install Jupyter on your computer and then connect it to an Apache Spark cluster on HDInsight.

  • Even though Jupyter notebooks are already available on the Spark cluster in Azure HDInsight, installing Jupyter on your computer gives you the option to create your notebooks locally, test your application against a running cluster, and then upload the notebooks to the cluster. To upload the notebooks to the cluster, you can either upload them using the Jupyter notebook that is running on the cluster, or save them to the /HdiNotebooks folder in the storage account associated with the cluster. For more information on how notebooks are stored on the cluster, see Where are Jupyter notebooks stored?
  • With the notebooks available locally, you can connect to different Spark clusters based on your application requirements.
  • You can use GitHub to implement a source control system and have version control for the notebooks. You can also have a collaborative environment where multiple users can work with the same notebook.
  • You can work with notebooks locally without even having a cluster up. You only need a cluster to test your notebooks against, not to manually manage your notebooks or a development environment.
  • It may be easier to configure your own local development environment than to configure the Jupyter installation on the cluster. You can take advantage of all the software you have installed locally without configuring one or more remote clusters.

Warning

With Jupyter installed on your local computer, multiple users can run the same notebook on the same Spark cluster at the same time. In such a situation, multiple Livy sessions are created. If you run into an issue and want to debug it, tracking which Livy session belongs to which user can be a complex task.

See also

Scenarios

Create and run applications

Tools and extensions

Manage resources