Quickstart: Create an Apache Spark cluster in Azure HDInsight using Azure CLI

2025/04/07

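In this quickstart, you learn how to use Azure CLI to create an Apache Spark cluster in Azure HDInsight. Azure HDInsight is a managed, full-spectrum, open-source analytics service for enterprises. The Apache Spark framework for HDInsight uses in-memory processing to enable fast data analytics and cluster computing. Azure CLI is Azure's cross-platform command-line experience for managing Azure resources.

If you're using multiple clusters together, you can create a virtual network, and if you're using a Spark cluster, you can also use the Hive Warehouse Connector. For more information, see Plan a virtual network for Azure HDInsight and Integrate Apache Spark and Apache Hive with the Hive Warehouse Connector.

If you don't have an Azure trial subscription, create one before you begin.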

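Prerequisites

To run CLI reference commands locally, install the Azure CLI. If you're running on Windows or macOS, consider running Azure CLI in a Docker container. For more information, see How to run the Azure CLI in a Docker container.
- If you're using a local installation, sign in to the Azure CLI by using the az login command. To finish the authentication process, follow the steps displayed in your terminal. For other sign-in options, see Sign in with the Azure CLI.
- When you're prompted, install the Azure CLI extension on first use. For more information about extensions, see Use extensions with the Azure CLI.
- Run az version to find the installed version and dependent libraries. To upgrade to the latest version, run az upgrade.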
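
The prerequisite check above can also be scripted. The following is a minimal sketch, not part of the original quickstart: `require_cmd` is a hypothetical helper name introduced here, and the check only confirms that an `az` executable is on the PATH, not that you are signed in.

```shell
# require_cmd NAME: succeed if NAME is an executable on the PATH.
# (Hypothetical helper, introduced for illustration only.)
require_cmd() {
    command -v "$1" >/dev/null 2>&1
}

if require_cmd az; then
    # Prints the installed CLI version and its dependent libraries.
    az version
else
    echo "Azure CLI (az) not found; install it before continuing." >&2
fi
```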

Create an Apache Spark cluster

Sign in to your Azure subscription.

az cloud set -n AzureChinaCloud
az login
# az cloud set -n AzureCloud   # switches back to the public (global) Azure cloud

# If you have multiple subscriptions, set the one to use
# az account set --subscription "SUBSCRIPTIONID"

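Set environment variables. The use of variables in this quickstart is based on Bash; other environments need slight variations. Replace RESOURCEGROUPNAME, LOCATION, CLUSTERNAME, STORAGEACCOUNTNAME, and PASSWORD in the following code snippet with the desired values. Then enter the CLI commands to set the environment variables.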

export resourceGroupName=RESOURCEGROUPNAME
export location=LOCATION
export clusterName=CLUSTERNAME
export AZURE_STORAGE_ACCOUNT=STORAGEACCOUNTNAME
export httpCredential='PASSWORD'
export sshCredentials='PASSWORD'

export AZURE_STORAGE_CONTAINER=$clusterName
export clusterSizeInNodes=1
export clusterVersion=4.0
export clusterType=spark
export componentVersion=Spark=2.3
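
Because the storage container is named after the cluster in the snippet above, the cluster name also has to satisfy the container naming rules. The following is a minimal sketch of a pre-flight check, not part of the original quickstart. The function names are hypothetical; the rules encoded are the Azure Storage naming rules for accounts (3 to 24 lowercase letters and digits) and containers (3 to 63 lowercase letters, digits, and single hyphens). HDInsight applies its own cluster-name and password rules, which this sketch does not cover.

```shell
# Storage account names: 3-24 characters, lowercase letters and digits only.
valid_storage_account_name() {
    printf '%s\n' "$1" | grep -Eq '^[a-z0-9]{3,24}$'
}

# Container names: 3-63 characters; lowercase letters, digits, and hyphens;
# must start and end with a letter or digit; no consecutive hyphens.
valid_container_name() {
    name=$1
    [ "${#name}" -ge 3 ] && [ "${#name}" -le 63 ] || return 1
    case $name in
        *--*) return 1 ;;
    esac
    printf '%s\n' "$name" | grep -Eq '^[a-z0-9]([a-z0-9-]*[a-z0-9])?$'
}

# Warn (without stopping) if the values set above look invalid.
valid_storage_account_name "${AZURE_STORAGE_ACCOUNT:-}" ||
    echo "AZURE_STORAGE_ACCOUNT is not a valid storage account name." >&2
valid_container_name "${AZURE_STORAGE_CONTAINER:-}" ||
    echo "AZURE_STORAGE_CONTAINER is not a valid container name." >&2
```

Running this right after the exports catches, for example, an uppercase cluster name before the container creation step rejects it.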

Create the resource group by entering the following command:

az group create \
    --location $location \
    --name $resourceGroupName

Create an Azure Storage account by entering the following command:

az storage account create \
    --name $AZURE_STORAGE_ACCOUNT \
    --resource-group $resourceGroupName \
    --https-only true \
    --kind StorageV2 \
    --location $location \
    --sku Standard_LRS

Extract the primary key from the Azure Storage account and store it in a variable by entering the following command:

export AZURE_STORAGE_KEY=$(az storage account keys list \
    --account-name $AZURE_STORAGE_ACCOUNT \
    --resource-group $resourceGroupName \
    --query '[0].value' -o tsv)
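
If the key lookup fails (for example, because of a mistyped account name), AZURE_STORAGE_KEY ends up empty and the later commands fail with less obvious errors. The following is a minimal sketch of a guard, not part of the original quickstart; `require_vars` is a hypothetical helper introduced here for illustration.

```shell
# require_vars NAME...: fail if any named variable is unset or empty.
# (Hypothetical helper, introduced for illustration only.)
require_vars() {
    missing=0
    for var_name in "$@"; do
        # Indirect lookup of the variable whose name is in var_name.
        eval "var_value=\${$var_name-}"
        if [ -z "$var_value" ]; then
            echo "Variable $var_name is empty or unset." >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example: stop before creating the container if the key lookup failed.
# require_vars AZURE_STORAGE_ACCOUNT AZURE_STORAGE_KEY || exit 1
```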

Create an Azure Storage container by entering the following command:

az storage container create \
    --name $AZURE_STORAGE_CONTAINER \
    --account-key $AZURE_STORAGE_KEY \
    --account-name $AZURE_STORAGE_ACCOUNT

Create the Apache Spark cluster by entering the following command:

az hdinsight create \
    --name $clusterName \
    --resource-group $resourceGroupName \
    --type $clusterType \
    --component-version $componentVersion \
    --http-password $httpCredential \
    --http-user admin \
    --location $location \
    --workernode-count $clusterSizeInNodes \
    --ssh-password $sshCredentials \
    --ssh-user sshuser \
    --storage-account $AZURE_STORAGE_ACCOUNT \
    --storage-account-key $AZURE_STORAGE_KEY \
    --storage-container $AZURE_STORAGE_CONTAINER \
    --version $clusterVersion
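
Cluster creation can take a while. To confirm from a script that the cluster is ready, one option is to poll its state. The following is a minimal sketch, not part of the original quickstart: `wait_for_value` is a hypothetical helper, and the commented usage assumes that `az hdinsight show` reports the state under `properties.clusterState` with the value `Running` once the cluster is ready. Check both assumptions against the output of your CLI version.

```shell
# wait_for_value EXPECTED ATTEMPTS DELAY CMD...
# Runs CMD up to ATTEMPTS times, sleeping DELAY seconds after each miss.
# Succeeds as soon as CMD prints exactly EXPECTED; fails otherwise.
# (Hypothetical helper, introduced for illustration only.)
wait_for_value() {
    expected=$1
    attempts=$2
    delay=$3
    shift 3
    i=0
    while [ "$i" -lt "$attempts" ]; do
        actual=$("$@" 2>/dev/null) || actual=""
        if [ "$actual" = "$expected" ]; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}

# Example (assumed state path and value): poll every 30 seconds, up to 60 times.
# wait_for_value Running 60 30 \
#     az hdinsight show \
#         --name "$clusterName" \
#         --resource-group "$resourceGroupName" \
#         --query properties.clusterState -o tsv
```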

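Clean up resources

After you complete the quickstart, you may want to delete the cluster. With HDInsight, your data is stored in Azure Storage, so you can safely delete a cluster when it isn't in use. You're also charged for an HDInsight cluster even when it isn't in use. Because the charges for the cluster are many times more than the charges for storage, deleting clusters that aren't in use saves money.

Enter all or some of the following commands to remove resources: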

# Remove cluster
az hdinsight delete \
    --name $clusterName \
    --resource-group $resourceGroupName

# Remove storage container
az storage container delete \
    --account-name $AZURE_STORAGE_ACCOUNT \
    --name $AZURE_STORAGE_CONTAINER

# Remove storage account
az storage account delete \
    --name $AZURE_STORAGE_ACCOUNT \
    --resource-group $resourceGroupName

# Remove resource group
az group delete \
    --name $resourceGroupName
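
Deleting the resource group removes every resource in it, including the cluster and the storage account, so it can be worth previewing the cleanup commands before running them. The following is a minimal sketch, not part of the original quickstart; `run` and `DRY_RUN` are hypothetical names introduced here. The `--yes` and `--no-wait` flags in the commented example skip the confirmation prompt and return without waiting for the deletion to finish.

```shell
# run CMD...: print the command when DRY_RUN=1, otherwise execute it.
# (Hypothetical helper, introduced for illustration only.)
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Example: preview the resource group deletion without executing it.
# DRY_RUN=1
# run az group delete --name "$resourceGroupName" --yes --no-wait
```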

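Next steps

In this quickstart, you learned how to create an Apache Spark cluster in Azure HDInsight using Azure CLI. Advance to the next tutorial to learn how to use an HDInsight cluster to run interactive queries on sample data.

Run interactive queries on Apache Spark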
