Quickstart: Create Apache Spark cluster in Azure HDInsight using ARM template

In this quickstart, you use an Azure Resource Manager template (ARM template) to create an Apache Spark cluster in Azure HDInsight. You then create a Jupyter Notebook file, and use it to run Spark SQL queries against Apache Hive tables. Azure HDInsight is a managed, full-spectrum, open-source analytics service for enterprises. The Apache Spark framework for HDInsight enables fast data analytics and cluster computing using in-memory processing. Jupyter Notebook lets you interact with your data, combine code with markdown text, and do simple visualizations.

If you're using multiple clusters together, you'll want to create a virtual network, and if you're using a Spark cluster you'll also want to use the Hive Warehouse Connector. For more information, see Plan a virtual network for Azure HDInsight and Integrate Apache Spark and Apache Hive with the Hive Warehouse Connector.

An ARM template is a JavaScript Object Notation (JSON) file that defines the infrastructure and configuration for your project. The template uses declarative syntax, which lets you state what you intend to deploy without having to write the sequence of programming commands to create it.

If your environment meets the prerequisites and you're familiar with using ARM templates, select the Deploy to Azure button. The template will open in the Azure portal.

Deploy to Azure

Prerequisites

If you don't have an Azure subscription, create a trial account before you begin.

Review the template

The template used in this quickstart is from Azure Quickstart Templates.

{
  "$schema": "https://schema.management.azure.com/schemas/2019-04-01/deploymentTemplate.json#",
  "contentVersion": "1.0.0.0",
  "parameters": {
    "clusterName": {
      "type": "string",
      "metadata": {
        "description": "The name of the HDInsight cluster to create."
      }
    },
    "clusterLoginUserName": {
      "type": "string",
      "defaultValue": "admin",
      "metadata": {
        "description": "These credentials can be used to submit jobs to the cluster and to log into cluster dashboards."
      }
    },
    "clusterLoginPassword": {
      "type": "securestring",
      "metadata": {
        "description": "The password must be at least 10 characters in length and must contain at least one digit, one non-alphanumeric character, and one upper or lower case letter."
      }
    },
    "sshUserName": {
      "type": "string",
      "defaultValue": "sshuser",
      "metadata": {
        "description": "These credentials can be used to remotely access the cluster."
      }
    },
    "sshPassword": {
      "type": "securestring",
      "metadata": {
        "description": "The password must be at least 10 characters in length and must contain at least one digit, one non-alphanumeric character, and one upper or lower case letter."
      }
    },
    "location": {
      "type": "string",
      "defaultValue": "[resourceGroup().location]",
      "metadata": {
        "description": "Location for all resources."
      }
    }
  },
  "variables": {
    "defaultStorageAccount": {
      "name": "[uniqueString(resourceGroup().id)]",
      "type": "Standard_LRS"
    }
  },
  "resources": [
    {
      "type": "Microsoft.Storage/storageAccounts",
      "name": "[variables('defaultStorageAccount').name]",
      "location": "[parameters('location')]",
      "apiVersion": "2016-01-01",
      "sku": {
        "name": "[variables('defaultStorageAccount').type]"
      },
      "kind": "Storage",
      "properties": {}
    },
    {
      "type": "Microsoft.HDInsight/clusters",
      "name": "[parameters('clusterName')]",
      "location": "[parameters('location')]",
      "apiVersion": "2018-06-01-preview",
      "dependsOn": [
        "[concat('Microsoft.Storage/storageAccounts/',variables('defaultStorageAccount').name)]"
      ],
      "tags": {},
      "properties": {
        "clusterVersion": "3.6",
        "osType": "Linux",
        "tier": "Standard",
        "clusterDefinition": {
          "kind": "spark",
          "configurations": {
            "gateway": {
              "restAuthCredential.isEnabled": true,
              "restAuthCredential.username": "[parameters('clusterLoginUserName')]",
              "restAuthCredential.password": "[parameters('clusterLoginPassword')]"
            }
          }
        },
        "storageProfile": {
          "storageaccounts": [
            {
              "name": "[replace(replace(reference(resourceId('Microsoft.Storage/storageAccounts', variables('defaultStorageAccount').name), '2016-01-01').primaryEndpoints.blob,'https://',''),'/','')]",
              "isDefault": true,
              "container": "[parameters('clusterName')]",
              "key": "[listKeys(resourceId('Microsoft.Storage/storageAccounts', variables('defaultStorageAccount').name), '2016-01-01').keys[0].value]"
            }
          ]
        },
        "computeProfile": {
          "roles": [
            {
              "name": "headnode",
              "targetInstanceCount": 2,
              "hardwareProfile": {
                "vmSize": "Standard_D12_v2"
              },
              "osProfile": {
                "linuxOperatingSystemProfile": {
                  "username": "[parameters('sshUserName')]",
                  "password": "[parameters('sshPassword')]"
                }
              },
              "virtualNetworkProfile": null,
              "scriptActions": []
            },
            {
              "name": "workernode",
              "targetInstanceCount": 2,
              "hardwareProfile": {
                "vmSize": "Standard_D13_v2"
              },
              "osProfile": {
                "linuxOperatingSystemProfile": {
                  "username": "[parameters('sshUserName')]",
                  "password": "[parameters('sshPassword')]"
                }
              },
              "virtualNetworkProfile": null,
              "scriptActions": []
            }
          ]
        }
      }
    }
  ],
  "outputs": {
    "storage": {
      "type": "object",
      "value": "[reference(resourceId('Microsoft.Storage/storageAccounts', variables('defaultStorageAccount').name))]"
    },
    "cluster": {
      "type": "object",
      "value": "[reference(resourceId('Microsoft.HDInsight/clusters',parameters('clusterName')))]"
    }
  }
}

Two Azure resources are defined in the template: an Azure Storage account (Microsoft.Storage/storageAccounts), which serves as the cluster's default storage, and an HDInsight Spark cluster (Microsoft.HDInsight/clusters).

Deploy the template

  1. Select the Deploy to Azure button below to sign in to Azure and open the ARM template.

    Deploy to Azure

  2. Enter or select the following values:

    Property | Description
    Subscription | From the drop-down list, select the Azure subscription that's used for the cluster.
    Resource group | From the drop-down list, select your existing resource group, or select Create new.
    Location | The value autopopulates with the location used for the resource group.
    Cluster Name | Enter a globally unique name. For this template, use only lowercase letters and numbers.
    Cluster Login User Name | Provide the username; the default is admin.
    Cluster Login Password | Provide a password. The password must be at least 10 characters in length and must contain at least one digit, one uppercase letter, one lowercase letter, and one non-alphanumeric character (except the characters ' " `).
    Ssh User Name | Provide the username; the default is sshuser.
    Ssh Password | Provide the password.

    Create Spark cluster in HDInsight using Azure Resource Manager template

  3. Review the TERMS AND CONDITIONS. Then select I agree to the terms and conditions stated above, and then select Purchase. You'll receive a notification that your deployment is in progress. It takes about 20 minutes to create the cluster. A programmatic alternative to the portal deployment is sketched below.
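
If you want to deploy the same template without going through the portal, the Azure SDK for Python can submit it. The following is a minimal sketch, not part of the official quickstart: it assumes the azure-identity and azure-mgmt-resource packages are installed, that the template above is saved locally as azuredeploy.json, and that the subscription ID, resource group name, and parameter values (shown as placeholders) are supplied by you.

import json

from azure.identity import DefaultAzureCredential
from azure.mgmt.resource import ResourceManagementClient

SUBSCRIPTION_ID = "<subscription-id>"       # placeholder: your subscription ID
RESOURCE_GROUP = "<resource-group-name>"    # placeholder: an existing resource group

# Authenticate and create a Resource Manager client.
client = ResourceManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# Load the ARM template shown in "Review the template", saved locally.
with open("azuredeploy.json") as f:
    template = json.load(f)

# Only the parameters without defaults need values; the rest use the template defaults.
parameters = {
    "clusterName": {"value": "<globally-unique-cluster-name>"},
    "clusterLoginPassword": {"value": "<cluster-login-password>"},
    "sshPassword": {"value": "<ssh-password>"},
}

# "Incremental" mode leaves other resources in the resource group untouched.
poller = client.deployments.begin_create_or_update(
    RESOURCE_GROUP,
    "hdinsight-spark-quickstart",
    {"properties": {"mode": "Incremental", "template": template, "parameters": parameters}},
)
print(poller.result().properties.provisioning_state)  # "Succeeded" when the cluster is ready

As in the portal flow, provisioning typically takes around 20 minutes.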

If you run into an issue with creating HDInsight clusters, it could be that you don't have the right permissions to do so. For more information, see Access control requirements.

Review deployed resources

Once the cluster is created, you'll receive a Deployment succeeded notification with a Go to resource link. Your Resource group page will list your new HDInsight cluster and the default storage associated with the cluster. Each cluster has an Azure Storage account dependency, referred to as the default storage account. The HDInsight cluster and its default storage account must be colocated in the same Azure region. Deleting the cluster doesn't delete the storage account.
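
As an optional check, you can confirm the deployed resources programmatically. This sketch reuses the same assumptions as the deployment sketch above (azure-identity, azure-mgmt-resource, placeholder subscription and resource group names); both the HDInsight cluster and its default storage account should appear in the listing.

from azure.identity import DefaultAzureCredential
from azure.mgmt.resource import ResourceManagementClient

client = ResourceManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Expect one Microsoft.HDInsight/clusters resource and one Microsoft.Storage/storageAccounts resource.
for resource in client.resources.list_by_resource_group("<resource-group-name>"):
    print(resource.type, resource.name)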

Create a Jupyter Notebook file

Jupyter Notebook is an interactive notebook environment that supports various programming languages. You can use a Jupyter Notebook file to interact with your data, combine code with markdown text, and perform simple visualizations.

  1. Open the Azure portal.

  2. Select HDInsight clusters, and then select the cluster you created.

    Open HDInsight cluster in the Azure portal

  3. From the portal, in the Cluster dashboards section, select Jupyter Notebook. If prompted, enter the cluster login credentials for the cluster.

    Open Jupyter Notebook to run interactive Spark SQL query

  4. Select New > PySpark to create a notebook.

    Create a Jupyter Notebook to run interactive Spark SQL query

    A new notebook is created and opened with the name Untitled (Untitled.ipynb).

Run Apache Spark SQL statements

SQL (Structured Query Language) is the most common and widely used language for querying and transforming data. Spark SQL functions as an extension to Apache Spark for processing structured data, using the familiar SQL syntax.

  1. Verify the kernel is ready. The kernel is ready when you see a hollow circle next to the kernel name in the notebook. A solid circle denotes that the kernel is busy.

    Hive query in HDInsight Spark

    When you start the notebook for the first time, the kernel performs some tasks in the background. Wait for the kernel to be ready.

  2. Paste the following code in an empty cell, and then press SHIFT + ENTER to run the code. The command lists the Hive tables on the cluster:

    %%sql
    SHOW TABLES
    

    When you use a Jupyter Notebook file with your HDInsight cluster, you get a preset spark session that you can use to run Hive queries using Spark SQL. %%sql tells Jupyter Notebook to use the preset spark session to run the Hive query. The tables listed include hivesampletable, a sample Hive table that comes with all HDInsight clusters by default. The first time you submit a query, Jupyter creates a Spark application for the notebook. It takes about 30 seconds to complete. Once the Spark application is ready, the query is executed in about a second and produces the results. The same queries can also be run from a plain Python cell with spark.sql(); see the sketch after these steps. The output looks like:

    Hive query in HDInsight Spark

    Every time you run a query in Jupyter, your web browser window title shows a (Busy) status along with the notebook title. You also see a solid circle next to the PySpark text in the top-right corner.

  3. Run another query to see the data in hivesampletable.

    %%sql
    SELECT * FROM hivesampletable LIMIT 10
    

    The screen refreshes to show the query output.

    Hive query output in HDInsight

  4. From the File menu on the notebook, select Close and Halt. Shutting down the notebook releases the cluster resources, including the Spark application.
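
The %%sql magic above runs against the preset spark session. If you prefer ordinary Python cells in the same PySpark notebook, the same queries can be issued through spark.sql(); this is a minimal sketch that relies only on the preset spark session described in step 2.

# Run the same queries from a plain Python cell in the PySpark notebook.
# The preset `spark` session is already available, so no imports are needed.
tables = spark.sql("SHOW TABLES")
tables.show()

rows = spark.sql("SELECT * FROM hivesampletable LIMIT 10")
rows.show(truncate=False)  # print full column values instead of truncating them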

Clean up resources

After you complete the quickstart, you may want to delete the cluster. With HDInsight, your data is stored in Azure Storage, so you can safely delete a cluster when it isn't in use. You're also charged for an HDInsight cluster, even when it isn't in use. Since the charges for the cluster are many times more than the charges for storage, it makes economic sense to delete clusters when they aren't in use.

From the Azure portal, navigate to your cluster, and select Delete.

Azure portal delete an HDInsight cluster

You can also select the resource group name to open the resource group page, and then select Delete resource group. By deleting the resource group, you delete both the HDInsight cluster and the default storage account.
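
The Delete resource group step can also be done programmatically. A minimal sketch, under the same assumptions as the earlier sketches (azure-identity, azure-mgmt-resource, placeholder subscription and resource group names): deleting the resource group removes the HDInsight cluster and the default storage account together.

from azure.identity import DefaultAzureCredential
from azure.mgmt.resource import ResourceManagementClient

client = ResourceManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Deleting the resource group removes every resource it contains.
poller = client.resource_groups.begin_delete("<resource-group-name>")
poller.wait()  # block until deletion completes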

Next steps

In this quickstart, you learned how to create an Apache Spark cluster in HDInsight and run a basic Spark SQL query. Advance to the next tutorial to learn how to use an HDInsight cluster to run interactive queries on sample data.