Quickstart: Create an Apache Spark cluster in HDInsight using a Resource Manager template

Learn how to create an Apache Spark cluster in Azure HDInsight, and how to run Spark SQL queries against Apache Hive tables. Apache Spark enables fast data analytics and cluster computing using in-memory processing. For information on Spark on HDInsight, see Overview: Apache Spark on Azure HDInsight.

In this quickstart, you use a Resource Manager template to create an HDInsight Spark cluster. The cluster uses Azure Storage blobs as the cluster storage. For more information on using Data Lake Storage Gen2, see Quickstart: Set up clusters in HDInsight.

Important

Billing for HDInsight clusters is prorated per minute, whether you are using them or not. Be sure to delete your cluster after you have finished using it. For more information, see the Clean up resources section of this article.

If you don't have an Azure subscription, create a trial account before you begin.

Create an HDInsight Spark cluster

Create an HDInsight Spark cluster using an Azure Resource Manager template. The template can be found on GitHub.

  1. Select the following link to open the template in the Azure portal in a new browser tab:

    Deploy to Azure

  2. Enter the following values:

    Property | Value
    Subscription | Select the Azure subscription used for creating this cluster. The subscription used for this quickstart is <Azure subscription name>.
    Resource group | Create a resource group or select an existing one. The resource group is used to manage Azure resources for your project. The new resource group name used for this quickstart is myspark20180403rg.
    Location | Select a location for the resource group. The template uses this location for creating the cluster, as well as for the default cluster storage. The location used for this quickstart is China East 2.
    ClusterName | Enter a name for the HDInsight cluster that you want to create. The new cluster name used for this quickstart is myspark20180403.
    Cluster login name and password | The default login name is admin. Choose a password for the cluster login. The login name used for this quickstart is admin.
    SSH user name and password | Choose a password for the SSH user. The SSH user name used for this quickstart is sshuser.

    Create an HDInsight Spark cluster using an Azure Resource Manager template

  3. Select I agree to the terms and conditions stated above, select Pin to dashboard, and then select Purchase. You can see a new tile titled Template deployment. It takes about 20 minutes to create the cluster. The cluster must be created before you can proceed to the next section.

If you run into an issue with creating HDInsight clusters, it could be that you do not have the right permissions to do so. For more information, see Access control requirements.
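The same deployment can also be scripted instead of driven through the portal. As a minimal sketch, a Resource Manager parameters file might look like the following. The parameter names here (clusterName, clusterLoginUserName, and so on) are illustrative assumptions; check the actual template on GitHub for the exact names it defines:

```json
{
  "$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentParameters.json#",
  "contentVersion": "1.0.0.0",
  "parameters": {
    "clusterName": { "value": "myspark20180403" },
    "clusterLoginUserName": { "value": "admin" },
    "clusterLoginPassword": { "value": "<cluster login password>" },
    "sshUserName": { "value": "sshuser" },
    "sshPassword": { "value": "<SSH password>" }
  }
}
```

A file like this could then be passed to `az deployment group create` together with the template's URL via the `--template-uri` and `--parameters` options, targeting the resource group created above.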

Install IntelliJ/Eclipse for Spark applications

Use the Azure Toolkit for IntelliJ/Eclipse plug-in to develop Spark applications written in Scala, and then submit them to an Azure HDInsight Spark cluster directly from the IntelliJ/Eclipse integrated development environment (IDE). For more information, see Use IntelliJ to author/submit Spark applications and Use Eclipse to author/submit Spark applications.

Install VSCode for PySpark/Hive applications

Learn how to use the Azure HDInsight Tools for Visual Studio Code (VSCode) to create and submit Hive batch jobs, interactive Hive queries, PySpark batch jobs, and PySpark interactive scripts. The Azure HDInsight Tools can be installed on the platforms supported by VSCode, including Windows, Linux, and macOS. For more information, see Use VSCode to author/submit PySpark applications.

Create a Jupyter notebook

Jupyter Notebook is an interactive notebook environment that supports various programming languages. A notebook allows you to interact with your data, combine code with markdown text, and perform simple visualizations.

  1. Open the Azure portal.

  2. Select HDInsight clusters, and then select the cluster you created.

    Open the HDInsight cluster in the Azure portal

  3. From the portal, in the Cluster dashboards section, click Jupyter Notebook. If prompted, enter the cluster login credentials for the cluster.

    Open a Jupyter notebook to run an interactive Spark SQL query

  4. Select New > PySpark to create a notebook.

    Create a Jupyter notebook to run an interactive Spark SQL query

    A new notebook is created and opened with the name Untitled (Untitled.ipynb).

Run Spark SQL statements

SQL (Structured Query Language) is the most common and widely used language for querying and transforming data. Spark SQL functions as an extension to Apache Spark for processing structured data, using the familiar SQL syntax.

  1. Verify that the kernel is ready. The kernel is ready when you see a hollow circle next to the kernel name in the notebook. A solid circle denotes that the kernel is busy.

    Hive query in HDInsight Spark

    When you start the notebook for the first time, the kernel performs some tasks in the background. Wait for the kernel to be ready.

  2. Paste the following code in an empty cell, and then press SHIFT + ENTER to run the code. The command lists the Hive tables on the cluster:

    %%sql
    SHOW TABLES
    

    When you use a Jupyter Notebook with your HDInsight Spark cluster, you get a preset spark session that you can use to run Hive queries using Spark SQL. %%sql tells Jupyter Notebook to use the preset spark session to run the Hive query. The listed tables include hivesampletable, a Hive table that comes with all HDInsight clusters by default. The first time you submit a query, Jupyter creates a Spark application for the notebook. It takes about 30 seconds to complete. Once the Spark application is ready, the query is executed in about a second and produces the results. The output looks like:

    Hive query in HDInsight Spark

    Every time you run a query in Jupyter, your web browser window title shows a (Busy) status along with the notebook title. You also see a solid circle next to the PySpark text in the top-right corner.
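    The preset session is also exposed to PySpark cells as the variable spark, so the same query can be run through the DataFrame API instead of the %%sql magic. The following is a minimal sketch of a notebook cell, assuming it runs in a PySpark notebook on the cluster where spark is predefined:

    ```python
    # Run the same Hive query through the preset Spark session
    # (`spark` is predefined in a PySpark notebook on the cluster).
    tables = spark.sql("SHOW TABLES")
    tables.show()

    # The result of spark.sql is a DataFrame, so it can also be
    # inspected programmatically, e.g. collected into a Python list:
    names = [row.tableName for row in tables.collect()]
    print(names)
    ```

    This pattern is useful when you want to post-process query results in Python rather than just display them.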

  3. Run another query to see the data in hivesampletable:

    %%sql
    SELECT * FROM hivesampletable LIMIT 10
    

    The screen refreshes to show the query output.

    Hive query output in HDInsight Spark
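    You can refine the query in the same way as any other SQL statement. As a hedged example, a cell like the following would aggregate the sample table; the column name devicemake used here is an assumption based on the default hivesampletable schema, so adjust it if your table differs:

    ```sql
    %%sql
    SELECT devicemake, COUNT(*) AS cnt
    FROM hivesampletable
    GROUP BY devicemake
    LIMIT 10
    ```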

  4. From the File menu on the notebook, select Close and Halt. Shutting down the notebook releases the cluster resources, including the Spark application.

Clean up resources

HDInsight saves your data and Jupyter notebooks in Azure Storage or Azure Data Lake Store, so you can safely delete a cluster when it is not in use. You are also charged for an HDInsight cluster, even when it is not in use. Since the charges for the cluster are many times more than the charges for storage, it makes economic sense to delete clusters when they are not in use. If you plan to work on the tutorial listed in Next steps immediately, you might want to keep the cluster.

Switch back to the Azure portal, and select Delete.

Delete an HDInsight cluster

You can also select the resource group name to open the resource group page, and then select Delete resource group. By deleting the resource group, you delete both the HDInsight Spark cluster and the default storage account.

Next steps

In this quickstart, you learned how to create an HDInsight Spark cluster and run a basic Spark SQL query. Advance to the next tutorial to learn how to use an HDInsight Spark cluster to run interactive queries on sample data.