Using the HDFS CLI with Data Lake Storage
You can access and manage the data in your storage account by using a command line interface just as you would with a Hadoop Distributed File System (HDFS). This article provides some examples that will help you get started.
HDInsight provides access to the distributed container that is locally attached to the compute nodes. You can access this container by using the shell that directly interacts with the HDFS and the other file systems that Hadoop supports.
For more information on HDFS CLI, see the official documentation and the HDFS Permissions Guide
Note
If you're using Azure Databricks instead of HDInsight, and you want to interact with your data by using a command line interface, you can use the Databricks CLI to interact with the Databricks file system. See Databricks CLI.
Use the HDFS CLI with an HDInsight Hadoop cluster on Linux
First, establish remote access to services. If you pick SSH the sample PowerShell code would look as follows:
#Connect to the cluster via SSH.
ssh sshuser@clustername-ssh.azurehdinsight.cn
#Execute basic HDFS commands. Display the hierarchy.
hdfs dfs -ls /
#Create a sample directory.
hdfs dfs -mkdir /samplefolder
The connection string can be found at the "SSH + Cluster login" section of the HDInsight cluster blade in Azure portal. SSH credentials were specified at the time of the cluster creation.
Important
HDInsight cluster billing starts after a cluster is created and stops when the cluster is deleted. Billing is pro-rated per minute, so you should always delete your cluster when it is no longer in use. To learn how to delete a cluster, see our article on the topic. However, data stored in a storage account with Data Lake Storage enabled persists even after an HDInsight cluster is deleted.
Create a container
hdfs dfs -D "fs.azure.createRemoteFileSystemDuringInitialization=true" -ls abfs://<container-name>@<storage-account-name>.dfs.core.chinacloudapi.cn/
Replace the
<container-name>
placeholder with the name that you want to give your container.Replace the
<storage-account-name>
placeholder with the name of your storage account.
Get a list of files or directories
hdfs dfs -ls <path>
Replace the <path>
placeholder with the URI of the container or container folder.
For example: hdfs dfs -ls abfs://my-file-system@mystorageaccount.dfs.core.chinacloudapi.cn/my-directory-name
Create a directory
hdfs dfs -mkdir [-p] <path>
Replace the <path>
placeholder with the root container name or a folder within your container.
For example: hdfs dfs -mkdir abfs://my-file-system@mystorageaccount.dfs.core.chinacloudapi.cn/
Delete a file or directory
hdfs dfs -rm <path>
Replace the <path>
placeholder with the URI of the file or folder that you want to delete.
For example: hdfs dfs -rmdir abfs://my-file-system@mystorageaccount.dfs.core.chinacloudapi.cn/my-directory-name/my-file-name
Display the Access Control Lists (ACLs) of files and directories
hdfs dfs -getfacl [-R] <path>
Example:
hdfs dfs -getfacl -R /dir
See getfacl
Set ACLs of files and directories
hdfs dfs -setfacl [-R] [-b|-k -m|-x <acl_spec> <path>]|[--set <acl_spec> <path>]
Example:
hdfs dfs -setfacl -m user:hadoop:rw- /file
See setfacl
Change the owner of files
hdfs dfs -chown [-R] <new_owner>:<users_group> <URI>
See chown
Change group association of files
hdfs dfs -chgrp [-R] <group> <URI>
See chgrp
Change the permissions of files
hdfs dfs -chmod [-R] <mode> <URI>
See chmod
You can view the complete list of commands on the Apache Hadoop 2.4.1 File System Shell Guide Website.