Apache Spark guidelines

This article provides guidelines for using Apache Spark on Azure HDInsight.

How do I run or submit Spark jobs?

| Option | Documents |
| --- | --- |
| Visual Studio Code | Use Spark & Hive Tools for Visual Studio Code |
| Jupyter Notebooks | Tutorial: Load data and run queries on an Apache Spark cluster in Azure HDInsight |
| IntelliJ | Tutorial: Use Azure Toolkit for IntelliJ to create Apache Spark applications for an HDInsight cluster |
| IntelliJ | Tutorial: Create a Scala Maven application for Apache Spark in HDInsight using IntelliJ |
| Zeppelin notebooks | Use Apache Zeppelin notebooks with Apache Spark cluster on Azure HDInsight |
| Remote job submission with Livy | Use Apache Spark REST API to submit remote jobs to an HDInsight Spark cluster |
| Apache Oozie | Oozie is a workflow and coordination system that manages Hadoop jobs. |
| Apache Livy | You can use Livy to run interactive Spark shells or submit batch jobs to be run on Spark. |
| Azure Data Factory for Apache Spark | The Spark activity in a Data Factory pipeline executes a Spark program on your own or on-demand HDInsight cluster. |
| Azure Data Factory for Apache Hive | The HDInsight Hive activity in a Data Factory pipeline executes Hive queries on your own or on-demand HDInsight cluster. |
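For remote submission, Livy exposes a REST endpoint on the cluster at `/livy/batches`. The sketch below builds the JSON body for a batch job submission; the cluster name, storage paths, class name, and credentials are all placeholders you would replace with your own values, and the actual HTTP call (shown in comments) requires the `requests` package and your cluster admin login.

```python
import json

# Hypothetical cluster endpoint -- replace <clustername> with your own.
LIVY_URL = "https://<clustername>.azurehdinsight.net/livy/batches"

def build_batch_payload(jar_path, class_name, args=None):
    """Build the JSON body Livy expects when submitting a Spark batch job."""
    payload = {"file": jar_path, "className": class_name}
    if args:
        payload["args"] = args
    return payload

# Example payload for a jar stored in the cluster's default Azure Storage
# account (placeholder container, account, and class name).
payload = build_batch_payload(
    "wasbs://mycontainer@mystorage.blob.core.windows.net/jars/app.jar",
    "com.example.SparkApp",
    args=["--input", "wasbs://mycontainer@mystorage.blob.core.windows.net/data"],
)
print(json.dumps(payload))

# To actually submit (needs the `requests` package and cluster credentials):
# import requests
# resp = requests.post(
#     LIVY_URL,
#     json=payload,
#     auth=("admin", "<password>"),
#     headers={"Content-Type": "application/json"},
# )
# The response body includes a batch id you can poll at /livy/batches/{id}.
```

After submission, polling `GET /livy/batches/{id}` returns the job state (for example `starting`, `running`, or `success`).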

How do I monitor and debug Spark jobs?

How do I make my Spark jobs run more efficiently?

How do I connect to other Azure Services?

What are my storage options?

Next steps