Install published application - H2O Sparkling Water

This article describes how to install and run the H2O Sparkling Water published Apache Hadoop application on Azure HDInsight. For an overview of the HDInsight application platform and a list of available Independent Software Vendor (ISV) published applications, see Install third-party Hadoop applications. For instructions on installing your own application, see Install custom HDInsight applications.

About H2O Sparkling Water

H2O Sparkling Water is an open source, fully distributed in-memory machine learning platform with linear scalability. H2O Sparkling Water lets you combine the fast, scalable machine learning algorithms of H2O with the capabilities of Apache Spark. With Sparkling Water, users can drive computation from Scala, R, and Python using the H2O Flow UI.

H2O Sparkling Water provides:

  • Easy-to-use WebUI and familiar interfaces: Set up and get started quickly using either H2O's intuitive web-based Flow GUI or programming environments such as R, Python, Java, Scala, JSON, and the H2O APIs.
  • Data-agnostic support for all common database and file types: Easily explore and model Big Data from within Microsoft Excel, R Studio, Tableau, and more. Connect to data from HDFS, S3, SQL, and NoSQL data sources.
  • Massively scalable Big Data munging and analysis: H2O Big Joins can perform 7x faster than R data.table operations, and scale linearly to 10 billion x 10 billion row joins.
  • Real-time data scoring: Rapidly deploy models to production using plain-old Java objects (POJO), model-optimized Java objects (MOJO), or the H2O REST API.

Prerequisites

To install this app on a new or existing HDInsight cluster, the cluster must have the following configuration:

  • Cluster tier(s): Standard or Premium
  • Cluster type: Spark
  • Cluster version(s): 3.5 or 3.6
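The prerequisites above can be expressed as a small check. This is only an illustrative sketch: the function name `unmet_prerequisites` and its three string parameters are hypothetical and not part of any HDInsight API; the allowed values are taken from the list above.

```python
# Hypothetical helper (not an HDInsight API): compares a cluster description
# against the prerequisites listed in this article.
SUPPORTED_TIERS = {"Standard", "Premium"}
SUPPORTED_TYPE = "Spark"
SUPPORTED_VERSIONS = {"3.5", "3.6"}


def unmet_prerequisites(tier, cluster_type, version):
    """Return a list of human-readable problems; an empty list means all checks pass."""
    problems = []
    if tier not in SUPPORTED_TIERS:
        problems.append(f"tier {tier!r} is not Standard or Premium")
    if cluster_type != SUPPORTED_TYPE:
        problems.append(f"cluster type {cluster_type!r} is not Spark")
    if version not in SUPPORTED_VERSIONS:
        problems.append(f"version {version!r} is not 3.5 or 3.6")
    return problems


# Example: a Standard Spark 3.6 cluster satisfies every prerequisite.
print(unmet_prerequisites("Standard", "Spark", "3.6"))
```

Returning a list of problems rather than a single boolean makes it easy to report every mismatch at once.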

Install the H2O Sparkling Water published application

For step-by-step instructions on installing this and other available ISV applications, read Install third-party Apache Hadoop applications.

Launch H2O Sparkling Water

  1. After installation, you can start using H2O Sparkling Water (h2o-sparklingwater) from your cluster in the Azure portal by opening Jupyter Notebooks (https://<ClusterName>.azurehdinsight.cn/jupyter). You can also get to Jupyter by selecting Cluster dashboard from your cluster pane in the portal, then selecting Jupyter Notebook. You are prompted to enter your credentials. Enter the cluster's Hadoop credentials as specified on cluster creation.

  2. In Jupyter, you see three folders: H2O-PySparkling-Examples, PySpark Examples, and Scala Examples. Select the H2O-PySparkling-Examples folder.

    Jupyter Notebook home page

  3. The first step when creating a new notebook is to configure the Spark environment. This information is included in the Sentiment_analysis_with_Sparkling_Water example. When configuring the Spark environment, be sure to use the correct jar, and specify the IP address provided by the output of the first cell.

    Jupyter Notebook home page

  4. Start the H2O Cluster.

    Start the cluster

  5. After the H2O Cluster is up and running, open H2O Flow by going to https://<ClusterName>-h2o.apps.azurehdinsight.cn:443.

    Note

    If you are unable to open H2O Flow, try clearing your browser cache. If you are still unable to reach it, your cluster probably does not have enough resources. Try increasing the number of Worker nodes under the Scale cluster option in your cluster pane.

    H2O Flow dashboard

  6. Select the Million_Songs.flow example from the menu on the right. When prompted with a warning, click Load Notebook. This demo is designed to run in a few minutes using real data. The goal is to predict from the data whether the song was released before or after 2004 using binary classification.

    Select Million_Songs.flow

  7. Find the path containing milsongs-cls-train.csv.gz, and replace the entire path with https://h2o-public-test-data.s3.amazonaws.com/bigdata/laptop/milsongs/milsongs-cls-train.csv.gz.

  8. Find the path containing milsongs-cls-test.csv.gz, and replace the entire path with https://h2o-public-test-data.s3.amazonaws.com/bigdata/laptop/milsongs/milsongs-cls-test.csv.gz.

  9. To execute all statements within the notebook cells, select the Run All button on the toolbar.

    Run All

  10. After a few minutes, you should see output similar to the following.

    Output
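Steps 1 and 5 above both rely on URLs derived from the cluster name. As a convenience, the two addresses can be built from the patterns given in those steps. This is a minimal sketch: the function name `cluster_urls` and the cluster name `mycluster` are hypothetical, and the URL patterns are copied from the steps rather than looked up from any service.

```python
# URL patterns as given in steps 1 and 5 of this article.
JUPYTER_URL = "https://{cluster}.azurehdinsight.cn/jupyter"
H2O_FLOW_URL = "https://{cluster}-h2o.apps.azurehdinsight.cn:443"


def cluster_urls(cluster_name):
    """Build the Jupyter and H2O Flow URLs for a cluster name (hypothetical helper)."""
    name = cluster_name.strip()
    if not name:
        raise ValueError("cluster name must not be empty")
    return {
        "jupyter": JUPYTER_URL.format(cluster=name),
        "h2o_flow": H2O_FLOW_URL.format(cluster=name),
    }


# "mycluster" is a placeholder; substitute your own cluster name.
for label, url in cluster_urls("mycluster").items():
    print(label, url)
```

The H2O Flow address only responds once the H2O Cluster from step 4 is running.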

That's it! You've harnessed artificial intelligence in Spark within a matter of minutes. You can now explore more examples in H2O Flow that demonstrate different types of machine learning algorithms.

Next steps
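The path substitution in steps 7 and 8 of the walkthrough follows one rule: keep the file name and swap everything before it for the public S3 location. The sketch below illustrates that rule only. The function name `public_dataset_url` and the sample input paths are hypothetical; the base URL and the two file names come from the steps themselves.

```python
# Base location and file names as given in steps 7 and 8 of this article.
PUBLIC_BASE = "https://h2o-public-test-data.s3.amazonaws.com/bigdata/laptop/milsongs/"
DATASET_FILES = ("milsongs-cls-train.csv.gz", "milsongs-cls-test.csv.gz")


def public_dataset_url(original_path):
    """Return the public URL for a Million Songs dataset path (hypothetical helper).

    Paths that do not end in one of the two known file names are returned unchanged.
    """
    filename = original_path.rstrip("/").rsplit("/", 1)[-1]
    if filename in DATASET_FILES:
        return PUBLIC_BASE + filename
    return original_path


# The input paths below are made-up examples of what a flow cell might contain.
print(public_dataset_url("/some/local/dir/milsongs-cls-train.csv.gz"))
print(public_dataset_url("/some/local/dir/milsongs-cls-test.csv.gz"))
```

Leaving unrecognized paths untouched avoids rewriting cells that do not refer to this dataset.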