Create a cluster with Data Lake Storage Gen2 using Azure CLI
To create an HDInsight cluster that uses Data Lake Storage Gen2 for storage, follow these steps.
Prerequisites
If you're unfamiliar with Azure Data Lake Storage Gen2, check out the overview section.
If you don't already have an Azure account, sign up for a trial subscription before continuing.
To run the CLI script examples:
- Install the latest version of the Azure CLI (2.0.13 or later) if you prefer to use a local CLI console. Sign in to Azure using `az login` with an account that is associated with the Azure subscription under which you want to deploy the user-assigned managed identity.
Warning
Billing for HDInsight clusters is prorated per minute, whether you use them or not. Be sure to delete your cluster after you finish using it. See how to delete an HDInsight cluster.
Download a sample template file and a sample parameters file. Before you use the template and the Azure CLI code snippet below, replace the following placeholders with their correct values:
Placeholder | Description |
---|---|
`<SUBSCRIPTION_ID>` | The ID of your Azure subscription. |
`<RESOURCEGROUPNAME>` | The resource group where you want the new cluster and storage account created. |
`<MANAGEDIDENTITYNAME>` | The name of the managed identity that will be given permissions on your storage account with Azure Data Lake Storage Gen2. |
`<STORAGEACCOUNTNAME>` | The new storage account with Azure Data Lake Storage Gen2 that will be created. |
`<FILESYSTEMNAME>` | The name of the filesystem that this cluster should use in the storage account. |
`<CLUSTERNAME>` | The name of your HDInsight cluster. |
`<PASSWORD>` | Your chosen password for signing in to the cluster using SSH and the Ambari dashboard. |
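Before substituting a value for `<STORAGEACCOUNTNAME>`, it can help to check it locally. The sketch below is not part of the original walkthrough: the function name `is_valid_storage_account_name` is hypothetical, and it only encodes the general Azure rule that storage account names are 3 to 24 characters long and contain only lowercase letters and digits. It does not check whether the name is already taken.

```shell
# Hypothetical helper (an assumption, not from the article):
# returns 0 when the name satisfies the storage account naming rule.
is_valid_storage_account_name() {
  name="$1"
  len=${#name}
  # Length must be between 3 and 24 characters
  if [ "$len" -lt 3 ] || [ "$len" -gt 24 ]; then
    return 1
  fi
  # Reject any character that is not a lowercase letter or a digit
  case "$name" in
    *[!abcdefghijklmnopqrstuvwxyz0123456789]*) return 1 ;;
  esac
  return 0
}
```

For example, `is_valid_storage_account_name "mystorage01" && echo "name looks valid"` prints the message only when the check passes.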
The code snippet below does the following initial steps:
- Logs in to your Azure account.
- Sets the active subscription where the create operations will be done.
- Creates a new resource group for the new deployment activities.
- Creates a user-assigned managed identity.
- Adds an extension to the Azure CLI to use features for Data Lake Storage Gen2.
- Creates a new storage account with Data Lake Storage Gen2 by using the `--hierarchical-namespace true` flag.
# Target the Azure China cloud, then sign in
az cloud set -n AzureChinaCloud
az login
# To return to public Azure later, run: az cloud set -n AzureCloud
# Set the active subscription
az account set --subscription <SUBSCRIPTION_ID>
# Create resource group
az group create --name <RESOURCEGROUPNAME> --location chinaeast
# Create managed identity
az identity create -g <RESOURCEGROUPNAME> -n <MANAGEDIDENTITYNAME>
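# Add the Azure CLI extension for Data Lake Storage Gen2 features
az extension add --name storage-preview
# Create a storage account with the hierarchical namespace enabled
az storage account create --name <STORAGEACCOUNTNAME> \
    --resource-group <RESOURCEGROUPNAME> \
    --location chinaeast --sku Standard_LRS \
    --kind StorageV2 --hierarchical-namespace true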
Next, sign in to the portal. Add the new user-assigned managed identity to the Storage Blob Data Owner role on the storage account. This step is described in step 3 under Using the Azure portal.
Important
Ensure that the user-assigned identity has the Storage Blob Data Owner role on your storage account; otherwise, cluster creation will fail.
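The article performs the role assignment in the portal. As an alternative, the same assignment can probably be scripted with the Azure CLI. The following is a minimal sketch under stated assumptions, not the article's documented method: the function name `assign_blob_data_owner` is hypothetical, and the `--assignee-object-id` and `--assignee-principal-type` options assume a reasonably recent Azure CLI, likely newer than the 2.0.13 minimum listed in the prerequisites.

```shell
# Hypothetical helper (an assumption, not from the article).
# Usage: assign_blob_data_owner <resource-group> <identity-name> <storage-account>
assign_blob_data_owner() {
  rg="$1"
  identity="$2"
  account="$3"

  # Object ID of the user-assigned managed identity's service principal
  principal_id=$(az identity show --resource-group "$rg" --name "$identity" \
    --query principalId --output tsv) || return 1

  # Full resource ID of the storage account, used as the role scope
  scope=$(az storage account show --resource-group "$rg" --name "$account" \
    --query id --output tsv) || return 1

  # Grant Storage Blob Data Owner on the storage account to the identity
  az role assignment create \
    --assignee-object-id "$principal_id" \
    --assignee-principal-type ServicePrincipal \
    --role "Storage Blob Data Owner" \
    --scope "$scope"
}
```

A call might look like `assign_blob_data_owner "<RESOURCEGROUPNAME>" "<MANAGEDIDENTITYNAME>" "<STORAGEACCOUNTNAME>"`. Role assignments can take a short time to propagate, so it may be worth confirming the assignment in the portal before deploying.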
# After the role is assigned to the managed identity, deploy the template
az deployment group create --name HDInsightADLSGen2Deployment \
    --resource-group <RESOURCEGROUPNAME> \
    --template-file hdinsight-adls-gen2-template.json \
    --parameters parameters.json
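To check on the deployment afterwards, its provisioning state can be queried. This is a small sketch rather than a step from the original article; the wrapper name `deployment_state` is hypothetical, while the underlying `az deployment group show` command and its `--query` and `--output` options are standard Azure CLI.

```shell
# Hypothetical helper (an assumption, not from the article).
# Usage: deployment_state <resource-group> <deployment-name>
# Prints the provisioning state of a resource group deployment.
deployment_state() {
  az deployment group show \
    --resource-group "$1" \
    --name "$2" \
    --query properties.provisioningState \
    --output tsv
}
```

For the deployment above, `deployment_state "<RESOURCEGROUPNAME>" HDInsightADLSGen2Deployment` should print a state such as `Succeeded` or `Failed`.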
Clean up resources
After you complete the article, you may want to delete the cluster. With HDInsight, your data is stored in Azure Storage, so you can safely delete a cluster when it isn't in use. You're charged for an HDInsight cluster even when it's not in use. Because the charges for the cluster are many times more than the charges for storage, it makes economic sense to delete clusters when they aren't in use.
Enter all or some of the following commands to remove resources:
# Remove cluster
az hdinsight delete \
    --name <CLUSTERNAME> \
    --resource-group <RESOURCEGROUPNAME>
# Remove storage container (the filesystem used by the cluster)
az storage container delete \
    --account-name <STORAGEACCOUNTNAME> \
    --name <FILESYSTEMNAME>
# Remove storage account
az storage account delete \
    --name <STORAGEACCOUNTNAME> \
    --resource-group <RESOURCEGROUPNAME>
# Remove resource group
az group delete \
    --name <RESOURCEGROUPNAME>
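If you removed the resource group, you can confirm that it is gone. The sketch below is an illustration, not part of the original article; the function name `group_is_gone` is hypothetical. It relies on `az group exists`, which prints `true` or `false`.

```shell
# Hypothetical helper (an assumption, not from the article).
# Usage: group_is_gone <resource-group>
# Returns 0 when the resource group no longer exists.
group_is_gone() {
  [ "$(az group exists --name "$1")" = "false" ]
}
```

For example, `group_is_gone "<RESOURCEGROUPNAME>" && echo "resource group deleted"` prints the message only after the deletion has completed.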
Troubleshoot
If you run into issues with creating HDInsight clusters, see access control requirements.
Next steps
You've successfully created an HDInsight cluster. Now learn how to work with your cluster.
Apache Spark clusters
- Customize Linux-based HDInsight clusters by using script actions
- Create a standalone application using Scala
- Run jobs remotely on an Apache Spark cluster using Apache Livy
- Apache Spark with BI: Perform interactive data analysis using Spark in HDInsight with BI tools
- Apache Spark with Machine Learning: Use Spark in HDInsight to predict food inspection results