Safely manage Python environment on Azure HDInsight using Script Action

HDInsight has two built-in Python installations in the Spark cluster: Anaconda Python 2.7 and Python 3.5. Customers may need to customize the Python environment, for example by installing external Python packages or a different Python version. This article shows the best practice of safely managing Python environments for Apache Spark clusters on HDInsight.

Prerequisites

An Apache Spark cluster on HDInsight. For instructions, see Create Apache Spark clusters in Azure HDInsight. If you don't already have a Spark cluster on HDInsight, you can run script actions during cluster creation. Visit the documentation on how to use custom script actions.

Support for open-source software used on HDInsight clusters

The Microsoft Azure HDInsight service uses an environment of open-source technologies formed around Apache Hadoop. Microsoft Azure provides a general level of support for open-source technologies.

There are two types of open-source components that are available in the HDInsight service:

Component | Description
Built-in | These components are pre-installed on HDInsight clusters and provide core functionality of the cluster. For example, Apache Hadoop YARN Resource Manager, the Apache Hive query language (HiveQL), and the Mahout library belong to this category. A full list of cluster components is available in What's new in the Apache Hadoop cluster versions provided by HDInsight.
Custom | You, as a user of the cluster, can install or use in your workload any component available in the community or created by you.

Important

Components provided with the HDInsight cluster are fully supported. Microsoft Support helps to isolate and resolve issues related to these components.

Custom components receive commercially reasonable support to help you further troubleshoot the issue. Microsoft Support may be able to resolve the issue, or they may ask you to engage available channels for the open-source technologies where deep expertise for that technology is found. For example, there are many community sites that can be used, such as the Microsoft Q&A question page for HDInsight and https://stackoverflow.com. Apache projects also have project sites on https://apache.org.

Understand the default Python installation

HDInsight Spark clusters are created with an Anaconda installation. There are two Python installations in the cluster, Anaconda Python 2.7 and Python 3.5. The table below shows the default Python settings for Spark, Livy, and Jupyter.

Setting | Python 2.7 | Python 3.5
Path | /usr/bin/anaconda/bin | /usr/bin/anaconda/envs/py35/bin
Spark version | Default set to 2.7 | N/A
Livy version | Default set to 2.7 | N/A
Jupyter | PySpark kernel | PySpark3 kernel
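
To confirm these defaults on your own cluster, you can SSH to a head node and query the two interpreters directly. This is a minimal, optional check that assumes only the paths listed in the table above:

    # Built-in Anaconda Python 2.7 (default for Spark and Livy)
    /usr/bin/anaconda/bin/python --version
    # Built-in Python 3.5 (used by the PySpark3 Jupyter kernel)
    /usr/bin/anaconda/envs/py35/bin/python --version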

Safely install external Python packages

The HDInsight cluster depends on the built-in Python environments, both Python 2.7 and Python 3.5. Directly installing custom packages in those default built-in environments may cause unexpected library version changes and can break the cluster. To safely install custom external Python packages for your Spark applications, follow the steps below.

  1. Create a Python virtual environment using conda. A virtual environment provides an isolated space for your projects without breaking others. When creating the Python virtual environment, you can specify the Python version that you want to use. You still need to create a virtual environment even if you want to use Python 2.7 or 3.5. This requirement is to make sure the cluster's default environment doesn't get broken. Run script actions on your cluster for all nodes with the script below to create a Python virtual environment.

    • --prefix specifies a path where the conda virtual environment lives. Several configs need to be changed further based on the path specified here. In this example, we use py35new, because the cluster already has an existing virtual environment called py35.
    • python= specifies the Python version for the virtual environment. In this example, we use version 3.5, the same version as the one built into the cluster. You can also use other Python versions to create the virtual environment.
    • anaconda specifies the package_spec as anaconda to install Anaconda packages in the virtual environment.
    sudo /usr/bin/anaconda/bin/conda create --prefix /usr/bin/anaconda/envs/py35new python=3.5 anaconda --yes 
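
    After the script action finishes, you can check that the new environment exists. This is an optional verification from an SSH session on a head node; py35new matches the prefix used above, so adjust it if you chose a different prefix:

    # List all conda environments; py35new should appear next to the built-in root and py35 environments
    /usr/bin/anaconda/bin/conda info -e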
    
  2. Install external Python packages in the created virtual environment if needed. Run script actions on your cluster for all nodes with the script below to install external Python packages. You need sudo privileges here to write files to the virtual environment folder.

    Search the package index for the complete list of packages that are available. You can also get a list of available packages from other sources. For example, you can install packages made available through conda-forge.

    Use the command below if you would like to install a library with its latest version:

    • Use conda channel:

      • seaborn is the name of the package that you would like to install.
      • -n py35new specifies the name of the virtual environment that was just created. Make sure to change the name to match your virtual environment.
      sudo /usr/bin/anaconda/bin/conda install seaborn -n py35new --yes
      
    • Or use the PyPI repo; change seaborn and py35new correspondingly:

      sudo /usr/bin/anaconda/envs/py35new/bin/pip install seaborn
      

    Use the command below if you would like to install a library with a specific version:

    • Use conda channel:

      • numpy=1.16.1 is the name and version of the package that you would like to install.
      • -n py35new specifies the name of the virtual environment that was just created. Make sure to change the name to match your virtual environment.
      sudo /usr/bin/anaconda/bin/conda install numpy=1.16.1 -n py35new --yes
      
    • Or use the PyPI repo; change numpy==1.16.1 and py35new correspondingly:

      sudo /usr/bin/anaconda/envs/py35new/bin/pip install numpy==1.16.1
      

    If you don't know the virtual environment name, you can SSH to the head node of the cluster and run /usr/bin/anaconda/bin/conda info -e to show all virtual environments.
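
    To confirm that a package landed in the new environment rather than in a built-in one, you can import it with that environment's own interpreter. This is a minimal, optional check that reuses the example packages above; adjust the prefix and package names to your own setup:

    # Import the example packages with the py35new interpreter and print their versions
    /usr/bin/anaconda/envs/py35new/bin/python -c "import seaborn, numpy; print(seaborn.__version__, numpy.__version__)"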

  3. Change Spark and Livy configs and point them to the created virtual environment.

    1. Open the Ambari UI, go to the Spark2 page, Configs tab.

      Change Spark and Livy configs through Ambari

    2. Expand Advanced livy2-env, and add the statements below at the bottom. If you installed the virtual environment with a different prefix, change the path correspondingly.

      export PYSPARK_PYTHON=/usr/bin/anaconda/envs/py35new/bin/python
      export PYSPARK_DRIVER_PYTHON=/usr/bin/anaconda/envs/py35new/bin/python
      

      Change Livy config through Ambari

    3. Expand Advanced spark2-env, and replace the existing export PYSPARK_PYTHON statement at the bottom. If you installed the virtual environment with a different prefix, change the path correspondingly.

      export PYSPARK_PYTHON=${PYSPARK_PYTHON:-/usr/bin/anaconda/envs/py35new/bin/python}
      

      Change Spark config through Ambari

    4. Save the changes and restart the affected services. A restart of the Spark2 service is required for these changes to take effect. The Ambari UI prompts with a required-restart reminder; click Restart to restart all affected services.

      Change Spark config through Ambari
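
      After the services come back up, you can spot-check which interpreter Spark now uses by starting a PySpark shell from an SSH session on a head node. This is an optional sanity check; it assumes the pyspark client is on the head node's PATH:

      # Start an interactive PySpark shell, then print the driver's Python version
      pyspark
      >>> import sys
      >>> print(sys.version)   # should report the interpreter under /usr/bin/anaconda/envs/py35new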

  4. If you would like to use the newly created virtual environment in Jupyter, change the Jupyter configs and restart Jupyter. Run script actions on all head nodes with the statement below to point Jupyter to the newly created virtual environment. Make sure to modify the path to the prefix you specified for your virtual environment. After running this script action, restart the Jupyter service through the Ambari UI to make this change available.

    sudo sed -i '/python3_executable_path/c\ \"python3_executable_path\" : \"/usr/bin/anaconda/envs/py35new/bin/python3\"' /home/spark/.sparkmagic/config.json
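
    If you want to confirm that the script action updated the sparkmagic configuration, a quick optional check on a head node (using the same config file path as the command above) is:

    # The python3_executable_path entry should now point at the new virtual environment
    grep python3_executable_path /home/spark/.sparkmagic/config.json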
    

    You can double-check the Python environment in Jupyter Notebook by running the code below:
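
    For example, running the following in a notebook cell that uses the PySpark kernel should report the interpreter from the new virtual environment (a minimal sketch):

    import sys
    print(sys.version)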

    Check the Python version in Jupyter Notebook

Known issue

There's a known bug for Anaconda versions 4.7.11, 4.7.12, and 4.8.0. If you see your script actions stop responding at "Collecting package metadata (repodata.json): ...working..." and fail with "Python script has been killed due to timeout after waiting 3600 secs", you can download this script and run it as script actions on all nodes to fix the issue.

To check your Anaconda version, you can SSH to the cluster head node and run /usr/bin/anaconda/bin/conda --version.

Next steps