Safely manage Python environments on Azure HDInsight using script actions

HDInsight has two built-in Python installations in Spark clusters: Anaconda Python 2.7 and Python 3.5. In some cases, customers need to customize the Python environment, such as installing external Python packages or a different Python version. This article shows best practices for safely managing Python environments for an Apache Spark cluster on HDInsight.

Prerequisites

Support for open-source software used on HDInsight clusters

The Azure HDInsight service uses an ecosystem of open-source technologies formed around Apache Hadoop. Microsoft Azure provides a general level of support for open-source technologies. The HDInsight service provides an additional level of support for the built-in components.

There are two types of open-source components that are available in the HDInsight service:

  • Built-in components - These components are pre-installed on HDInsight clusters and provide core functionality of the cluster. For example, Apache Hadoop YARN ResourceManager, the Apache Hive query language (HiveQL), and the Mahout library belong to this category. A full list of cluster components is available in What's new in the Apache Hadoop cluster versions provided by HDInsight.
  • Custom components - You, as a user of the cluster, can install or use in your workload any component available in the community or created by you.

Important

Components provided with the HDInsight cluster are fully supported. Microsoft Support helps to isolate and resolve issues related to these components.

Custom components receive commercially reasonable support to help you troubleshoot the issue further. Microsoft Support may be able to resolve the issue, or they may ask you to engage the available channels for the open-source technologies where deep expertise for that technology is found. There are many community sites that can be used, such as the MSDN forum for HDInsight and https://stackoverflow.com. Apache projects also have project sites on https://apache.org, for example: Hadoop.

Understand the default Python installation

HDInsight Spark clusters are created with an Anaconda installation. There are two Python installations in the cluster, Anaconda Python 2.7 and Python 3.5. The table below shows the default Python settings for Spark, Livy, and Jupyter.

          Python 2.7                 Python 3.5
Path      /usr/bin/anaconda/bin      /usr/bin/anaconda/envs/py35/bin
Spark     Default set to 2.7         N/A
Livy      Default set to 2.7         N/A
Jupyter   PySpark kernel             PySpark3 kernel
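
For example, a quick way to see which installation a given kernel uses is to print the interpreter details from a notebook cell. The PySpark kernel should report a 2.7 interpreter under /usr/bin/anaconda/bin, and the PySpark3 kernel a 3.5 interpreter under /usr/bin/anaconda/envs/py35/bin:

    # Run in a Jupyter notebook cell to see which Python the kernel uses
    import sys
    print(sys.version)      # 2.7.x for the PySpark kernel, 3.5.x for PySpark3
    print(sys.executable)   # path under /usr/bin/anaconda/...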

Safely install external Python packages

HDInsight clusters depend on the built-in Python environments, both Python 2.7 and Python 3.5. Directly installing custom packages in those default built-in environments may cause unexpected library version changes and further break the cluster. To safely install custom external Python packages for your Spark applications, follow the steps below.

  1. Create a Python virtual environment using conda. A virtual environment provides an isolated space for your projects without breaking others. When creating the Python virtual environment, you can specify the Python version that you want to use. Note that you still need to create a virtual environment even if you want to use Python 2.7 or 3.5. This ensures that the cluster's default environment doesn't get broken. Run script actions on your cluster for all nodes with the script below to create a Python virtual environment.

    • --prefix specifies a path where the conda virtual environment lives. Several configurations need to be changed further based on the path specified here. In this example, we use py35new, because the cluster already has an existing virtual environment called py35.
    • python= specifies the Python version for the virtual environment. In this example, we use version 3.5, the same version as the one built into the cluster. You can also use other Python versions to create the virtual environment.
    • anaconda specifies the package_spec as anaconda to install Anaconda packages in the virtual environment.
    sudo /usr/bin/anaconda/bin/conda create --prefix /usr/bin/anaconda/envs/py35new python=3.5 anaconda --yes 
    
  2. Install external Python packages in the created virtual environment if needed. Run script actions on your cluster for all nodes with the script below to install external Python packages. You need sudo privileges here in order to write files to the virtual environment folder.

    You can search the package index for the complete list of packages that are available. You can also get a list of available packages from other sources. For example, you can install packages made available through conda-forge.

    • seaborn is the package name that you would like to install.
    • -n py35new specifies the name of the virtual environment that was just created. Make sure to change the name correspondingly based on your virtual environment creation.
    sudo /usr/bin/anaconda/bin/conda install seaborn -n py35new --yes
    

    If you don't know the virtual environment name, you can SSH to the head node of the cluster and run /usr/bin/anaconda/bin/conda info -e to show all virtual environments.

  3. Change the Spark and Livy configs and point to the created virtual environment.

    1. Open the Ambari UI, and go to the Spark2 page, Configs tab.

      Change Spark and Livy configs through Ambari

    2. Expand Advanced livy2-env, and add the statements below at the bottom. If you installed the virtual environment with a different prefix, change the path correspondingly.

      export PYSPARK_PYTHON=/usr/bin/anaconda/envs/py35new/bin/python
      export PYSPARK_DRIVER_PYTHON=/usr/bin/anaconda/envs/py35new/bin/python
      

      Change Livy config through Ambari

    3. Expand Advanced spark2-env, and replace the existing export PYSPARK_PYTHON statement at the bottom. If you installed the virtual environment with a different prefix, change the path correspondingly.

      export PYSPARK_PYTHON=${PYSPARK_PYTHON:-/usr/bin/anaconda/envs/py35new/bin/python}
      

      Change Spark config through Ambari

    4. Save the changes and restart the affected services. These changes need a restart of the Spark2 service. The Ambari UI will prompt a required restart reminder; click Restart to restart all affected services.

      Change Spark config through Ambari

  4. If you would like to use the newly created virtual environment in Jupyter, you need to change the Jupyter configs and restart Jupyter. Run script actions on all header nodes with the statement below to point Jupyter to the newly created virtual environment. Make sure to modify the path to the prefix you specified for your virtual environment. After running this script action, restart the Jupyter service through the Ambari UI to make this change available.

    sudo sed -i '/python3_executable_path/c\ \"python3_executable_path\" : \"/usr/bin/anaconda/envs/py35new/bin/python3\"' /home/spark/.sparkmagic/config.json
    

    You can double-check the Python environment in a Jupyter Notebook by running a snippet like the one shown after these steps:

    Check the Python version in a Jupyter Notebook
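
    For example, a minimal check (the exact cell in the screenshot may differ) is to print the interpreter details and import the package installed in step 2; if the kernel runs from the new virtual environment, the executable path should point at the py35new prefix:

    # Confirm the kernel runs from the new virtual environment (py35new)
    import sys
    print(sys.version)
    print(sys.executable)   # should point to /usr/bin/anaconda/envs/py35new/bin/python

    # seaborn was installed into py35new in step 2, so the import should succeed
    import seaborn
    print(seaborn.__version__)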

Known issue

There is a known bug for Anaconda versions 4.7.11, 4.7.12, and 4.8.0. If you see your script actions hanging at "Collecting package metadata (repodata.json): ...working..." and failing with "Python script has been killed due to timeout after waiting 3600 secs", you can download this script and run it as a script action on all nodes to fix the issue.

To check your Anaconda version, you can SSH to the cluster header node and run /usr/bin/anaconda/bin/conda --version.

See also

Scenarios

Create and run applications

Tools and extensions

Manage resources