Quickstart: Create an Apache Spark cluster in Azure HDInsight using Azure CLI

In this quickstart, you learn how to create an Apache Spark cluster in Azure HDInsight using the Azure command-line interface (CLI). Azure HDInsight is a managed, full-spectrum, open-source analytics service for enterprises. The Apache Spark framework for HDInsight enables fast data analytics and cluster computing using in-memory processing. The Azure CLI is Microsoft's cross-platform command-line experience for managing Azure resources.

If you're using multiple clusters together, you'll want to create a virtual network, and if you're using a Spark cluster you'll also want to use the Hive Warehouse Connector. For more information, see "Plan a virtual network for Azure HDInsight" and "Integrate Apache Spark and Apache Hive with the Hive Warehouse Connector".

Prerequisites

Create an Apache Spark cluster

  1. Sign in to your Azure subscription.

    az login
    
    # If you have multiple subscriptions, set the one to use
    # az account set --subscription "SUBSCRIPTIONID"
    
  2. Set environment variables. The use of variables in this quickstart is based on Bash; other shells need slight variations. Replace RESOURCEGROUPNAME, LOCATION, CLUSTERNAME, STORAGEACCOUNTNAME, and PASSWORD in the code snippet below with the desired values. Then enter the commands to set the environment variables.

    export resourceGroupName=RESOURCEGROUPNAME
    export location=LOCATION
    export clusterName=CLUSTERNAME
    export AZURE_STORAGE_ACCOUNT=STORAGEACCOUNTNAME
    export httpCredential='PASSWORD'
    export sshCredentials='PASSWORD'
    
    export AZURE_STORAGE_CONTAINER=$clusterName
    export clusterSizeInNodes=1
    export clusterVersion=3.6
    export clusterType=spark
    export componentVersion=Spark=2.3
    
  3. Create the resource group by entering the command below:

    az group create \
        --location $location \
        --name $resourceGroupName
    
  4. Create an Azure storage account by entering the command below:

    az storage account create \
        --name $AZURE_STORAGE_ACCOUNT \
        --resource-group $resourceGroupName \
        --https-only true \
        --kind StorageV2 \
        --location $location \
        --sku Standard_LRS
    
  5. Extract the primary key from the Azure storage account and store it in a variable by entering the command below:

    export AZURE_STORAGE_KEY=$(az storage account keys list \
        --account-name $AZURE_STORAGE_ACCOUNT \
        --resource-group $resourceGroupName \
        --query '[0].value' -o tsv)
    
  6. Create an Azure storage container by entering the command below:

    az storage container create \
        --name $AZURE_STORAGE_CONTAINER \
        --account-key $AZURE_STORAGE_KEY \
        --account-name $AZURE_STORAGE_ACCOUNT
    
  7. Create the Apache Spark cluster by entering the following command:

    az hdinsight create \
        --name $clusterName \
        --resource-group $resourceGroupName \
        --type $clusterType \
        --component-version $componentVersion \
        --http-password "$httpCredential" \
        --http-user admin \
        --location $location \
        --workernode-count $clusterSizeInNodes \
        --ssh-password "$sshCredentials" \
        --ssh-user sshuser \
        --storage-account $AZURE_STORAGE_ACCOUNT \
        --storage-account-key $AZURE_STORAGE_KEY \
        --storage-container $AZURE_STORAGE_CONTAINER \
        --version $clusterVersion
    
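Several of the values set in step 2 have to satisfy Azure naming rules, and a mistake there only surfaces when a later command fails. Storage account names must be 3 to 24 characters of lowercase letters and digits. Because this quickstart reuses the cluster name as the blob container name, the cluster name also has to be a valid container name: 3 to 63 characters of lowercase letters, digits, and hyphens, starting and ending with a letter or digit, with no consecutive hyphens. The following is a minimal sketch of a local pre-flight check. The function names `is_valid_storage_account_name` and `is_valid_container_name` are hypothetical helpers, not Azure CLI commands, and the sketch doesn't cover HDInsight's own cluster-name or password rules.

```shell
# Hypothetical pre-flight helpers; they run locally and never call Azure.

# Storage account names: 3-24 characters, lowercase letters and digits only.
is_valid_storage_account_name() (
    LC_ALL=C   # keep [a-z] ASCII-only; the subshell body keeps this change local
    name=$1
    [ "${#name}" -ge 3 ] && [ "${#name}" -le 24 ] || return 1
    case $name in
        *[!a-z0-9]*) return 1 ;;   # disallowed character
    esac
    return 0
)

# Blob container names: 3-63 characters of lowercase letters, digits, and
# hyphens; must start and end with a letter or digit; no consecutive hyphens.
is_valid_container_name() (
    LC_ALL=C
    name=$1
    [ "${#name}" -ge 3 ] && [ "${#name}" -le 63 ] || return 1
    case $name in
        *[!a-z0-9-]*) return 1 ;;   # disallowed character
        -*|*-)        return 1 ;;   # leading or trailing hyphen
        *--*)         return 1 ;;   # consecutive hyphens
    esac
    return 0
)

# Example usage with the variables from step 2 (commented out so that
# pasting this block has no side effects):
# is_valid_storage_account_name "$AZURE_STORAGE_ACCOUNT" || echo "Invalid storage account name" >&2
# is_valid_container_name "$AZURE_STORAGE_CONTAINER" || echo "Invalid container name" >&2
```

Running both checks right after setting the variables catches, for example, a cluster name with uppercase letters before the container creation in step 6 rejects it.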

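Once step 7 returns, it can be useful to confirm that the cluster is ready before submitting work. A minimal sketch follows, assuming the state is reported in the `properties.clusterState` field of the `az hdinsight show` output and that a ready cluster reports `Running`. `get_cluster_state` and `cluster_is_running` are hypothetical helper names, not part of the Azure CLI.

```shell
# Hypothetical helper: print the state of an HDInsight cluster.
# Arguments: $1 = cluster name, $2 = resource group name.
# Assumption: the state is exposed at properties.clusterState.
get_cluster_state() {
    az hdinsight show \
        --name "$1" \
        --resource-group "$2" \
        --query "properties.clusterState" \
        --output tsv
}

# Hypothetical helper: succeed only if the cluster reports "Running".
cluster_is_running() {
    [ "$(get_cluster_state "$1" "$2")" = "Running" ]
}

# Example usage with the variables from step 2 (commented out so that
# pasting this block does not call Azure):
# cluster_is_running "$clusterName" "$resourceGroupName" && echo "Cluster is ready"
```
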
Clean up resources

After you complete the quickstart, you may want to delete the cluster. With HDInsight, your data is stored in Azure Storage, so you can safely delete a cluster when it is not in use. You are also charged for an HDInsight cluster even when it is not in use. Because the charges for the cluster are many times higher than the charges for storage, it makes economic sense to delete clusters when they are not in use.

Enter all or some of the following commands to remove resources:

    # Remove cluster
    az hdinsight delete \
        --name $clusterName \
        --resource-group $resourceGroupName

    # Remove storage container
    az storage container delete \
        --account-name $AZURE_STORAGE_ACCOUNT \
        --name $AZURE_STORAGE_CONTAINER

    # Remove storage account
    az storage account delete \
        --name $AZURE_STORAGE_ACCOUNT \
        --resource-group $resourceGroupName

    # Remove resource group
    az group delete \
        --name $resourceGroupName
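
If the resource group created in step 3 holds nothing except the resources from this quickstart, deleting the group removes the cluster, the storage account, and the container in one operation. The sketch below wraps that in a hypothetical helper, `delete_quickstart_resources`, which first checks that the group exists. `--yes` skips the confirmation prompt and `--no-wait` returns without waiting for the deletion to finish, so use it only when you're sure nothing else in the group is needed.

```shell
# Hypothetical helper: delete a resource group and everything in it.
# Argument: $1 = resource group name.
delete_quickstart_resources() {
    if [ -z "$1" ]; then
        echo "usage: delete_quickstart_resources RESOURCE_GROUP" >&2
        return 2
    fi
    # `az group exists` prints "true" or "false".
    if [ "$(az group exists --name "$1")" != "true" ]; then
        echo "Resource group '$1' not found; nothing to delete."
        return 0
    fi
    # Deleting the group also deletes every resource it contains.
    az group delete --name "$1" --yes --no-wait
}

# Example usage with the variable from step 2 (commented out so that
# pasting this block deletes nothing):
# delete_quickstart_resources "$resourceGroupName"
```
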

Next steps

In this quickstart, you learned how to create an Apache Spark cluster in Azure HDInsight using Azure CLI. Advance to the next tutorial to learn how to use an HDInsight cluster to run interactive queries on sample data.