Work in the Apache Hadoop ecosystem on HDInsight from a Windows PC
Learn about the development and management options available on a Windows PC for working in the Apache Hadoop ecosystem on HDInsight.
HDInsight is based on Apache Hadoop and Hadoop components, open-source technologies developed on Linux. HDInsight version 3.4 and higher uses the Ubuntu Linux distribution as the underlying OS for the cluster. However, you can work with HDInsight from a Windows client or Windows development environment.
Use PowerShell for deployment and management tasks
Azure PowerShell is a scripting environment that you can use to control and automate deployment and management tasks in HDInsight from Windows.
Examples of tasks you can do with PowerShell:
- Create clusters using PowerShell.
- Run Apache Hive queries using PowerShell.
- Manage clusters with PowerShell.
To get the latest version, follow the steps to install and configure Azure PowerShell.
Utilities you can run in a browser
The following utilities have a web UI that runs in a browser:
- Apache Ambari Web UI is a management and monitoring utility available in the Azure portal that can be used to manage different kinds of jobs.
Before you go to the following examples, install and try Data Lake Tools for Visual Studio.
Visual Studio and the .NET SDK
You can use Visual Studio with the .NET SDK to manage clusters and develop big data applications. You can use other IDEs for the following tasks, but examples are shown in Visual Studio.
Examples of tasks you can do with the .NET SDK in Visual Studio:
- Azure HDInsight SDK for .NET.
- Run Apache Hive queries using the .NET SDK.
- Use C# user-defined functions with Apache Hive and Apache Pig streaming on Apache Hadoop.
IntelliJ IDEA and Eclipse IDE for Spark clusters
Both IntelliJ IDEA and the Eclipse IDE can be used to:
- Develop and submit a Scala Spark application on an HDInsight Spark cluster.
- Access Spark cluster resources.
- Develop and run a Scala Spark application locally.
These articles show how:
- IntelliJ IDEA: Create Apache Spark applications using the Azure Toolkit for IntelliJ plug-in and the Scala SDK.
- Eclipse IDE or Scala IDE for Eclipse: Create Apache Spark applications using the Azure Toolkit for Eclipse.
Notebooks on Spark for data scientists
Apache Spark clusters in HDInsight include Apache Zeppelin notebooks and kernels that can be used with Jupyter Notebooks.
- Learn how to use kernels on Apache Spark clusters with Jupyter Notebooks to test Spark applications
- Learn how to use Apache Zeppelin notebooks on Apache Spark clusters to run Spark jobs
Run Linux-based tools and technologies on Windows
If you come across a situation where you must use a tool or technology that is only available on Linux, consider the following options:
- Bash on Ubuntu on Windows 10 provides a Linux subsystem on Windows. Bash allows you to directly run Linux utilities without having to maintain a dedicated Linux installation. See Windows Subsystem for Linux Installation Guide for Windows 10 for installation steps. Other Unix shells work as well.
- Docker for Windows provides access to many Linux-based tools and can be run directly from Windows. For example, you can use Docker to run the Beeline client for Hive directly from Windows. You can also use Docker to run a local Jupyter Notebook and connect remotely to Spark on HDInsight. See Get started with Docker for Windows.
- MobaXTerm allows you to graphically browse the cluster file system over an SSH connection.
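Whichever option you use to run the Beeline client, it needs a JDBC connection string for the cluster's Hive endpoint. The following Python sketch only assembles that string and the matching argument list; it does not connect to anything. The cluster name `mycluster`, the login `admin`, and both helper names are hypothetical. The URL layout (port 443, HTTP transport, `/hive2` path) is an assumption based on the public HDInsight gateway convention, so verify it against your own cluster before relying on it.

```python
import shlex
from typing import List


def hive_jdbc_url(cluster_name: str, database: str = "default") -> str:
    """Build a Hive JDBC URL for the public HDInsight gateway (assumed layout)."""
    # Reject empty names or names with characters other than letters, digits, hyphens.
    if not cluster_name or not cluster_name.replace("-", "").isalnum():
        raise ValueError(f"unexpected cluster name: {cluster_name!r}")
    host = f"{cluster_name}.azurehdinsight.net"
    return (
        f"jdbc:hive2://{host}:443/{database};"
        "ssl=true;transportMode=http;httpPath=/hive2"
    )


def beeline_args(cluster_name: str, user: str) -> List[str]:
    """Return the argument list for a Beeline invocation.

    The password is deliberately left out; supply it however your
    environment requires.
    """
    return ["beeline", "-u", hive_jdbc_url(cluster_name), "-n", user]


if __name__ == "__main__":
    # Hypothetical cluster and login names, for illustration only.
    print(shlex.join(beeline_args("mycluster", "admin")))
```

`shlex.join` applies POSIX shell quoting, which suits a command pasted into a Linux shell inside Docker or the Windows Subsystem for Linux; the semicolons in the URL would otherwise be read as command separators.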
Cross-platform tools
The Azure command-line interface (CLI) is Azure's cross-platform command-line experience for managing Azure resources. For more information, see Azure Command-Line Interface (CLI).
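As one illustration of scripting the CLI from a Windows client, the Python sketch below runs `az hdinsight list` and summarizes the JSON it returns. It assumes the Azure CLI is installed, on `PATH`, and already signed in with `az login`. The helper names and the sample payload are hypothetical, and the fields read from the JSON (`name` and `properties.clusterState`) are assumptions about the output shape that you should confirm against your own output.

```python
import json
import shutil
import subprocess
from typing import List, Tuple


def summarize_clusters(cli_json: str) -> List[Tuple[str, str]]:
    """Extract (name, state) pairs from the JSON text of `az hdinsight list`."""
    summary = []
    for cluster in json.loads(cli_json):
        # Field names are assumed; fall back to placeholders if they are absent.
        props = cluster.get("properties") or {}
        summary.append((cluster.get("name", "?"), props.get("clusterState", "unknown")))
    return summary


def list_clusters() -> List[Tuple[str, str]]:
    """Run the Azure CLI and summarize the clusters in the current subscription."""
    az = shutil.which("az")  # on Windows this also matches az.cmd
    if az is None:
        raise RuntimeError("Azure CLI not found on PATH")
    result = subprocess.run(
        [az, "hdinsight", "list", "--output", "json"],
        capture_output=True, text=True, check=True,
    )
    return summarize_clusters(result.stdout)


if __name__ == "__main__":
    # Hypothetical payload so the parsing logic can be tried without a subscription.
    sample = '[{"name": "mycluster", "properties": {"clusterState": "Running"}}]'
    print(summarize_clusters(sample))
```

Keeping the parsing separate from the subprocess call lets you check the summary logic offline and swap in `list_clusters()` once the CLI is configured.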
Next steps
If you're new to working with Linux-based clusters, see the following articles: