Quickstart: Create an Apache Spark cluster in Azure HDInsight using PowerShell

In this quickstart, you use Azure PowerShell to create an Apache Spark cluster in Azure HDInsight. You then create a Jupyter notebook and use it to run Spark SQL queries against Apache Hive tables. Azure HDInsight is a managed, full-spectrum, open-source analytics service for enterprises. The Apache Spark framework for Azure HDInsight enables fast data analytics and cluster computing using in-memory processing. Jupyter notebooks let you interact with your data, combine code with Markdown text, and do simple visualizations.

Overview: Apache Spark on Azure HDInsight | Apache Spark | Apache Hive | Jupyter Notebook

Prerequisites

  • An Azure subscription.
  • The Azure PowerShell Az module (see the note below).

Create an Apache Spark cluster in HDInsight

Important

Billing for HDInsight clusters is prorated per minute, whether you are using them or not. Be sure to delete your cluster after you have finished using it. For more information, see the Clean up resources section of this article.

Creating an HDInsight cluster involves creating the following Azure objects and resources:

  • An Azure resource group. An Azure resource group is a container for Azure resources.
  • An Azure storage account or Azure Data Lake Storage. Each HDInsight cluster requires a dependent data store. In this quickstart, you create a storage account.
  • An HDInsight cluster of a particular cluster type. In this quickstart, you create a Spark 2.3 cluster.

You use a PowerShell script to create these resources.

Note

This article has been updated to use the new Azure PowerShell Az module. You can still use the AzureRM module, which will continue to receive bug fixes until at least December 2020. To learn more about the new Az module and AzureRM compatibility, see Introducing the new Azure PowerShell Az module. For Az module installation instructions, see Install Azure PowerShell.
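
If you have not installed the Az module or signed in yet, a minimal sketch looks like the following. Because this quickstart targets Azure China (chinacloudapi.cn endpoints), the AzureChinaCloud environment is used; the subscription name is a placeholder you replace with your own.

# Install the Az module for the current user (skip if it is already installed).
Install-Module -Name Az -Scope CurrentUser -Repository PSGallery

# Sign in to the Azure China cloud; omit -Environment for the global Azure cloud.
Connect-AzAccount -Environment AzureChinaCloud

# Optionally select the subscription in which to create the cluster.
Set-AzContext -Subscription "<your-subscription-name>"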

When you run the PowerShell script, you are prompted to enter the following values:

  • Azure resource group name: Provide a unique name for the resource group.
  • Location: Specify the Azure region, for example 'China East'.
  • Default storage account name: Provide a unique name for the storage account.
  • Cluster name: Provide a unique name for the HDInsight Spark cluster.
  • Cluster login credentials: You use this account to connect to the cluster dashboard later in the quickstart.
  • SSH user credentials: SSH clients can use these credentials to open a remote command-line session with the cluster.

  1. Open a terminal window and follow the instructions to connect to Azure.

  2. Copy and paste the following PowerShell script into the terminal window.

    ### Create a Spark 2.3 cluster in Azure HDInsight
    
    # Default cluster size (# of worker nodes), version, and type
    $clusterSizeInNodes = "1"
    $clusterVersion = "3.6"
    $clusterType = "Spark"
    
    # Create the resource group
    $resourceGroupName = Read-Host -Prompt "Enter the resource group name"
    $location = Read-Host -Prompt "Enter the Azure region to create resources in, such as 'China East'"
    $defaultStorageAccountName = Read-Host -Prompt "Enter the default storage account name"
    New-AzResourceGroup -Name $resourceGroupName -Location $location
    
    # Create an Azure storage account and container
    # Note: Storage account kind BlobStorage can only be used as secondary storage for HDInsight clusters.
    New-AzStorageAccount `
        -ResourceGroupName $resourceGroupName `
        -Name $defaultStorageAccountName `
        -Location $location `
        -SkuName Standard_LRS `
        -Kind StorageV2 `
        -EnableHttpsTrafficOnly 1
    
    $defaultStorageAccountKey = (Get-AzStorageAccountKey `
                                    -ResourceGroupName $resourceGroupName `
                                    -Name $defaultStorageAccountName)[0].Value
    $defaultStorageContext = New-AzStorageContext `
                                    -StorageAccountName $defaultStorageAccountName `
                                    -StorageAccountKey $defaultStorageAccountKey
    
    # Create a Spark 2.3 cluster
    $clusterName = Read-Host -Prompt "Enter the name of the HDInsight cluster"
    # Cluster login is used to secure HTTPS services hosted on the cluster
    $httpCredential = Get-Credential -Message "Enter Cluster login credentials" -UserName "admin"
    # SSH user is used to remotely connect to the cluster using SSH clients
    $sshCredentials = Get-Credential -Message "Enter SSH user credentials" -UserName "sshuser"
    
    # Set the storage container name to the cluster name
    $defaultBlobContainerName = $clusterName
    
    # Create a blob container. This holds the default data store for the cluster.
    New-AzStorageContainer `
        -Name $defaultBlobContainerName `
        -Context $defaultStorageContext
    
    $sparkConfig = New-Object "System.Collections.Generic.Dictionary``2[System.String,System.String]"
    $sparkConfig.Add("spark", "2.3")
    
    # Create the cluster in HDInsight
    New-AzHDInsightCluster `
        -ResourceGroupName $resourceGroupName `
        -ClusterName $clusterName `
        -Location $location `
        -ClusterSizeInNodes $clusterSizeInNodes `
        -ClusterType $clusterType `
        -OSType "Linux" `
        -Version $clusterVersion `
        -ComponentVersion $sparkConfig `
        -HttpCredential $httpCredential `
        -DefaultStorageAccountName "$defaultStorageAccountName.blob.core.chinacloudapi.cn" `
        -DefaultStorageAccountKey $defaultStorageAccountKey `
        -DefaultStorageContainer $defaultBlobContainerName `
        -SshCredential $sshCredentials
    
    Get-AzHDInsightCluster `
        -ResourceGroupName $resourceGroupName `
        -ClusterName $clusterName
    

    It takes about 20 minutes to create the cluster. The cluster must be created before you can proceed to the next section.
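
    It can help to check the provisioning state while you wait. The following is a minimal sketch that assumes the variables from the script above are still set in your session; Get-AzHDInsightCluster reports a ClusterState of Running once the cluster is ready.

    # Print the provisioning state of the cluster; "Running" means it is ready to use.
    (Get-AzHDInsightCluster `
        -ResourceGroupName $resourceGroupName `
        -ClusterName $clusterName).ClusterState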

If you run into an issue with creating HDInsight clusters, it might be because you do not have the required permissions. For more information, see Access control requirements.
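
If you suspect a permissions problem, one way to inspect the role assignments for the signed-in account is sketched below. It assumes the Az.Resources module is installed, that you are signed in, and that the account is a user account (so its Id is a sign-in name).

# List role assignments for the account that is currently signed in.
Get-AzRoleAssignment -SignInName (Get-AzContext).Account.Id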

Create a Jupyter notebook

Jupyter Notebook is an interactive notebook environment that supports various programming languages. The notebook lets you interact with your data, combine code with Markdown text, and perform simple visualizations.

  1. In the Azure portal, search for and select HDInsight clusters.

    Open HDInsight clusters in the Azure portal

  2. From the list, select the cluster you created.

    Open HDInsight clusters in the Azure portal

  3. On the cluster Overview page, select Cluster dashboards, and then select Jupyter Notebook. If prompted, enter the cluster login credentials. (A direct-URL alternative is sketched after these steps.)

    Open Jupyter Notebook to run interactive Spark SQL queries

  4. Select New > PySpark to create a notebook.

    Create a Jupyter Notebook to run interactive Spark SQL queries

    A new notebook is created and opened with the name Untitled (Untitled.ipynb).
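
If you prefer to skip the portal navigation, the notebook server can typically also be reached directly at the /jupyter path under the cluster's HTTPS endpoint. The sketch below assumes the variables from the creation script are still in scope and that the HttpEndpoint property on the cluster object holds that hostname.

# Build the Jupyter URL from the cluster's HTTPS endpoint.
$cluster = Get-AzHDInsightCluster -ResourceGroupName $resourceGroupName -ClusterName $clusterName
"https://$($cluster.HttpEndpoint)/jupyter"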

Run Apache Spark SQL statements

SQL (Structured Query Language) is the most common and widely used language for querying and defining data. Spark SQL functions as an extension to Apache Spark for processing structured data, using the familiar SQL syntax.

  1. Verify the kernel is ready. The kernel is ready when you see a hollow circle next to the kernel name in the notebook. A solid circle indicates that the kernel is busy.

    Kernel status

    When you start the notebook for the first time, the kernel performs some tasks in the background. Wait for the kernel to be ready.

  2. Paste the following code into an empty cell, and then press SHIFT + ENTER to run the code. The command lists the Hive tables on the cluster:

    %%sql
    SHOW TABLES
    

    When you use a Jupyter Notebook with your Spark cluster in HDInsight, you get a preset sqlContext that you can use to run Hive queries using Spark SQL. %%sql tells Jupyter Notebook to use the preset sqlContext to run the Hive query. The query lists the Hive tables on the cluster, including hivesampletable, a sample table that comes with all HDInsight clusters by default. It takes about 30 seconds to get the results. The output looks like:

    Apache Hive query in Spark on HDInsight

    Every time you run a query in Jupyter, your web browser window title shows a (Busy) status along with the notebook title. You also see a solid circle next to the PySpark text in the upper-right corner.

  3. Run another query to see the data in hivesampletable.

    %%sql
    SELECT * FROM hivesampletable LIMIT 10
    

    The screen refreshes to show the query output.

    Hive query output in HDInsight Spark

  4. From the File menu on the notebook, select Close and Halt. Shutting down the notebook releases the cluster resources.

Clean up resources

HDInsight saves your data in Azure Storage or Azure Data Lake Storage, so you can safely delete a cluster when it is not in use. You are also charged for an HDInsight cluster even when it is not in use. Because the charges for the cluster are many times more than the charges for storage, it makes economic sense to delete clusters when they are not in use. If you plan to work on the tutorial listed in Next steps right away, you might want to keep the cluster.

Switch back to the Azure portal, and select Delete.

Delete an HDInsight cluster from the Azure portal

You can also select the resource group name to open the resource group page, and then select Delete resource group. By deleting the resource group, you delete both the HDInsight cluster and the default storage account.

Piecemeal clean up with the PowerShell Az module

# Removes the specified HDInsight cluster from the current subscription.
Remove-AzHDInsightCluster `
    -ResourceGroupName $resourceGroupName `
    -ClusterName $clusterName

# Removes the specified storage container.
Remove-AzStorageContainer `
    -Name $clusterName `
    -Context $defaultStorageContext

# Removes a Storage account from Azure.
Remove-AzStorageAccount `
    -ResourceGroupName $resourceGroupName `
    -Name $defaultStorageAccountName

# Removes a resource group.
Remove-AzResourceGroup `
    -Name $resourceGroupName

Next steps

In this quickstart, you learned how to create an Apache Spark cluster in HDInsight and run a basic Spark SQL query. Advance to the next tutorial to learn how to use an HDInsight cluster to run interactive queries on sample data.