Use Microsoft Cognitive Toolkit deep learning model with Azure HDInsight Spark cluster

In this article, you do the following steps.

  1. Run a custom script to install Microsoft Cognitive Toolkit on an Azure HDInsight Spark cluster.

  2. Upload a Jupyter notebook to the Apache Spark cluster to see how to apply a trained Microsoft Cognitive Toolkit deep learning model to files in an Azure Blob Storage account using the Spark Python API (PySpark).

Prerequisites

How does this solution flow?

This solution is divided between this article and a Jupyter notebook that you upload as part of this article. In this article, you complete the following steps:

  • Run a script action on an HDInsight Spark cluster to install Microsoft Cognitive Toolkit and Python packages.
  • Upload the Jupyter notebook that runs the solution to the HDInsight Spark cluster.

The remaining steps are covered in the Jupyter notebook:

  • Load sample images into a Spark Resilient Distributed Dataset (RDD).
    • Load modules and define presets.
    • Download the dataset locally on the Spark cluster.
    • Convert the dataset into an RDD.
  • Score the images using a trained Cognitive Toolkit model.
    • Download the trained Cognitive Toolkit model to the Spark cluster.
    • Define functions to be used by worker nodes.
    • Score the images on worker nodes.
    • Evaluate model accuracy.
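
The notebook steps above follow a common Spark pattern: load the model once per partition on each worker, score every image in that partition, and then compare predictions with the true labels. The following is a minimal sketch of that pattern in plain Python, not the notebook's actual code. `load_model` and its toy scoring rule are hypothetical stand-ins for loading the trained Cognitive Toolkit model, and the (label, pixels) records are made-up sample data.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

Record = Tuple[int, List[float]]  # (true label, flattened image pixels)


def load_model() -> Callable[[List[float]], int]:
    """Hypothetical stand-in for loading the trained model on a worker.

    The toy rule predicts class 1 when the mean pixel value exceeds 0.5.
    """
    def predict(pixels: List[float]) -> int:
        return 1 if sum(pixels) / len(pixels) > 0.5 else 0
    return predict


def score_partition(records: Iterable[Record]) -> Iterator[Tuple[int, int]]:
    """Score one partition, loading the model once rather than per image."""
    model = load_model()
    for label, pixels in records:
        yield label, model(pixels)


def accuracy(scored: Iterable[Tuple[int, int]]) -> float:
    """Return the fraction of (label, prediction) pairs that agree."""
    pairs = list(scored)
    if not pairs:
        return 0.0
    return sum(1 for label, pred in pairs if label == pred) / len(pairs)


# Made-up sample data standing in for one RDD partition.
sample = [(1, [0.9, 0.8]), (0, [0.1, 0.2]), (1, [0.3, 0.4]), (0, [0.0, 0.1])]
scored = list(score_partition(sample))
print(accuracy(scored))
```

On a cluster, a function shaped like `score_partition` would typically be passed to `RDD.mapPartitions`, so that the model is loaded once per partition instead of once per image.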

Install Microsoft Cognitive Toolkit

You can install Microsoft Cognitive Toolkit on a Spark cluster using a script action. A script action uses custom scripts to install components on the cluster that are not available by default. You can run the custom script from the Azure portal, by using the HDInsight .NET SDK, or by using Azure PowerShell. You can run the script to install the toolkit either as part of cluster creation or after the cluster is up and running.

In this article, we use the portal to install the toolkit after the cluster has been created. For other ways to run the custom script, see Customize HDInsight clusters using Script Action.

Using the Azure portal

For instructions on how to use the Azure portal to run a script action, see Customize HDInsight clusters using Script Action. To install Microsoft Cognitive Toolkit, use the following values for your script action:

Property          Value
Script type       Custom
Name              Install MCT
Bash script URI   https://raw.githubusercontent.com/Azure-Samples/hdinsight-pyspark-cntk-integration/master/cntk-install.sh
Node type(s)      Head, Worker
Parameters        None
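
If you script the deployment instead of using the portal, the same values can be gathered in one structure and checked before submission. The sketch below only builds and validates a plain dictionary; it does not call any Azure API. The key names (`name`, `uri`, `roles`, `parameters`) and the role strings (`headnode`, `workernode`) are assumptions, so confirm them against the tool you actually use.

```python
from urllib.parse import urlparse

# Script action values from this article.
# The key names and role strings are assumptions, not a confirmed schema.
script_action = {
    "name": "Install MCT",
    "uri": (
        "https://raw.githubusercontent.com/Azure-Samples/"
        "hdinsight-pyspark-cntk-integration/master/cntk-install.sh"
    ),
    "roles": ["headnode", "workernode"],
    "parameters": "",  # the script takes no parameters
}


def validate_script_action(action: dict) -> list:
    """Return a list of problems; an empty list means the checks passed."""
    problems = []
    parsed = urlparse(action.get("uri", ""))
    if parsed.scheme != "https" or not parsed.path.endswith(".sh"):
        problems.append("uri should be an HTTPS link to a Bash script")
    if not action.get("name"):
        problems.append("name is required")
    if not action.get("roles"):
        problems.append("at least one node type is required")
    return problems


print(validate_script_action(script_action))
```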

Upload the Jupyter notebook to Azure HDInsight Spark cluster

To use the Microsoft Cognitive Toolkit with the Azure HDInsight Spark cluster, you must upload the Jupyter notebook CNTK_model_scoring_on_Spark_walkthrough.ipynb to the cluster. This notebook is available on GitHub at https://github.com/Azure-Samples/hdinsight-pyspark-cntk-integration.

  1. Download and unzip https://github.com/Azure-Samples/hdinsight-pyspark-cntk-integration.

  2. From a web browser, navigate to https://CLUSTERNAME.azurehdinsight.cn/jupyter, where CLUSTERNAME is the name of your cluster.

  3. From the Jupyter notebook, select Upload in the top-right corner, navigate to the download location, and select the file CNTK_model_scoring_on_Spark_walkthrough.ipynb.

    (Screenshot: Upload Jupyter notebook to Azure HDInsight Spark cluster)

  4. Select Upload again.

  5. After the notebook is uploaded, select the name of the notebook and then follow the instructions in the notebook itself to load the dataset and complete the walkthrough.
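
Steps 1 and 2 can be partly scripted. In this minimal sketch, `jupyter_url` fills in the cluster name in the address from step 2, and `find_notebook` locates the notebook inside an already-downloaded zip archive of the repository. Both helper names are hypothetical, and the in-memory archive with its `repo-master/` folder is a made-up stand-in for the real download.

```python
import io
import zipfile
from typing import Optional

NOTEBOOK = "CNTK_model_scoring_on_Spark_walkthrough.ipynb"


def jupyter_url(cluster_name: str) -> str:
    """Build the Jupyter address for a cluster, using the pattern from step 2."""
    return f"https://{cluster_name}.azurehdinsight.cn/jupyter"


def find_notebook(archive: zipfile.ZipFile, filename: str = NOTEBOOK) -> Optional[str]:
    """Return the archive path whose base name matches filename, or None."""
    for member in archive.namelist():
        if member.rsplit("/", 1)[-1] == filename:
            return member
    return None


# Build a tiny in-memory zip as a stand-in for the downloaded repository.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("repo-master/README.md", "placeholder")
    zf.writestr("repo-master/" + NOTEBOOK, "{}")

with zipfile.ZipFile(buffer) as zf:
    print(find_notebook(zf))

print(jupyter_url("mycluster"))
```

With a real download, you would open the zip file from disk instead of the in-memory buffer and extract the returned member before uploading it through the Jupyter page.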
