Use Microsoft Cognitive Toolkit deep learning model with Azure HDInsight Spark cluster

In this article, you do the following steps:

  1. Run a custom script to install Microsoft Cognitive Toolkit on an Azure HDInsight Spark cluster.

  2. Upload a Jupyter notebook to the Apache Spark cluster to see how to apply a trained Microsoft Cognitive Toolkit deep learning model to files in an Azure Blob storage account using the Spark Python API (PySpark).

Prerequisites

An Azure HDInsight Spark cluster. The steps in this article assume that you have already provisioned one.

How does this solution flow?

This solution is divided between this article and a Jupyter notebook that you upload as part of this tutorial. In this article, you complete the following steps:

  • Run a script action on an HDInsight Spark cluster to install Microsoft Cognitive Toolkit and Python packages.
  • Upload the Jupyter notebook that runs the solution to the HDInsight Spark cluster.

The remaining steps, listed below, are covered in the Jupyter notebook.

  • Load sample images into a Spark Resilient Distributed Dataset (RDD)
    • Load modules and define presets
    • Download the dataset locally on the Spark cluster
    • Convert the dataset into an RDD
  • Score the images using a trained Cognitive Toolkit model (a minimal PySpark sketch of this scoring flow follows the list)
    • Download the trained Cognitive Toolkit model to the Spark cluster
    • Define functions to be used by worker nodes
    • Score the images on worker nodes
    • Evaluate model accuracy
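
As a rough illustration of that flow, the sketch below loads pre-processed images into an RDD, scores them on the worker nodes with a trained CNTK model, and evaluates accuracy on the driver. It is not the notebook itself: the SparkContext (sc), the sample_images records, the model path, and the score_partition helper are all assumptions made for illustration; the actual notebook, CNTK_model_scoring_on_Spark_walkthrough.ipynb, handles downloading and preprocessing the dataset in detail.

    # Minimal PySpark sketch (assumptions: an existing SparkContext named sc,
    # sample_images as a list of (label, pixel_array) records whose arrays already
    # match the model's input shape, and the trained CNTK model already present
    # at MODEL_PATH on every node).
    import numpy as np
    import cntk as C

    MODEL_PATH = "/tmp/cntk_model.dnn"  # assumed local path on each node

    # 1. Load the sample images into a Spark RDD.
    image_rdd = sc.parallelize(sample_images)

    # 2. Function executed on worker nodes: load the model once per partition,
    #    then score every image in that partition.
    def score_partition(records):
        model = C.load_model(MODEL_PATH)
        for label, pixels in records:
            probs = model.eval([np.asarray(pixels, dtype=np.float32)])
            yield label, int(np.argmax(probs))

    # 3. Score on the workers and evaluate accuracy on the driver.
    scored = image_rdd.mapPartitions(score_partition).cache()
    correct = scored.filter(lambda pair: pair[0] == pair[1]).count()
    print("Accuracy: {:.3f}".format(correct / float(scored.count())))

Loading the model inside mapPartitions keeps the CNTK model object off the driver and amortizes the model-loading cost across each partition rather than paying it once per image.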

Install Microsoft Cognitive Toolkit

You can install Microsoft Cognitive Toolkit on a Spark cluster by using a script action. A script action uses custom scripts to install components on the cluster that are not available by default. You can run the custom script from the Azure portal, by using the HDInsight .NET SDK, or by using Azure PowerShell. You can also use the script to install the toolkit either as part of cluster creation or after the cluster is up and running.

In this article, we use the portal to install the toolkit after the cluster has been created. For other ways to run the custom script, see Customize HDInsight clusters using Script Action.

Using the Azure portal

For instructions on how to use the Azure portal to run a script action, see Customize HDInsight clusters using Script Action. Make sure you provide the following inputs to install Microsoft Cognitive Toolkit.

  • Provide a value for the script action name.

  • For Bash script URI, enter https://raw.githubusercontent.com/Azure-Samples/hdinsight-pyspark-cntk-integration/master/cntk-install.sh.

  • Make sure you run the script only on the head and worker nodes, and clear all the other checkboxes.

  • Click Create.

Upload the Jupyter notebook to Azure HDInsight Spark cluster

To use the Microsoft Cognitive Toolkit with the Azure HDInsight Spark cluster, you must upload the Jupyter notebook CNTK_model_scoring_on_Spark_walkthrough.ipynb to the cluster. The notebook is available on GitHub at https://github.com/Azure-Samples/hdinsight-pyspark-cntk-integration.

  1. Clone the GitHub repository https://github.com/Azure-Samples/hdinsight-pyspark-cntk-integration. For instructions to clone, see Cloning a repository.

  2. From the Azure portal, open the Spark cluster blade that you already provisioned, click Cluster Dashboard, and then click Jupyter notebook.

    You can also launch the Jupyter notebook by going to the URL https://<clustername>.azurehdinsight.cn/jupyter/. Replace <clustername> with the name of your HDInsight cluster.

  3. From the Jupyter notebook, click Upload in the top-right corner, and then navigate to the location where you cloned the GitHub repository.


  4. Click Upload again.

  5. After the notebook is uploaded, click the name of the notebook, and then follow the instructions in the notebook itself on how to load the data set and work through the tutorial.

See also

  • Scenarios
  • Create and run applications
  • Tools and extensions
  • Manage resources