Create your first workflow with an Azure Databricks job
This article demonstrates an Azure Databricks job that orchestrates tasks to read and process a sample dataset. In this quickstart, you:
- Create a new notebook and add code to retrieve a sample dataset containing popular baby names by year.
- Save the sample dataset to Unity Catalog.
- Create a new notebook and add code to read the dataset from Unity Catalog, filter it by year, and display the results.
- Create a new job and configure two tasks using the notebooks.
- Run the job and view the results.
You must have cluster creation permission to create job compute or permissions to all-purpose compute resources.
You must have a volume in Unity Catalog. This article uses a volume named my-volume
in a schema named default
within a catalog named main
. Also, you must have the following permissions in Unity Catalog:
READ VOLUME
andWRITE VOLUME
, orALL PRIVILEGES
, for themy-volume
volume.USE SCHEMA
orALL PRIVILEGES
for thedefault
schema.USE CATALOG
orALL PRIVILEGES
for themain
catalog.
To set these permissions, see your Databricks administrator or Unity Catalog privileges and securable objects.
To create a notebook to retrieve the sample dataset and save it to Unity Catalog:
Go to your Azure Databricks landing page and click New in the sidebar and select Notebook. Databricks creates and opens a new, blank notebook in your default folder. The default language is the language you most recently used, and the notebook is automatically attached to the compute resource that you most recently used.
If necessary, change the default language to Python.
Copy the following Python code and paste it into the first cell of the notebook.
import requests response = requests.get('https://health.data.ny.gov/api/views/jxy9-yhdk/rows.csv') csvfile = response.content.decode('utf-8') dbutils.fs.put("/Volumes/main/default/my-volume/babynames.csv", csvfile, True)
To create a notebook to read and present the data for filtering:
Go to your Azure Databricks landing page and click New in the sidebar and select Notebook. Databricks creates and opens a new, blank notebook in your default folder. The default language is the language you most recently used, and the notebook is automatically attached to the compute resource that you most recently used.
If necessary, change the default language to Python.
Copy the following Python code and paste it into the first cell of the notebook.
babynames = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("/Volumes/main/default/my-volume/babynames.csv") babynames.createOrReplaceTempView("babynames_table") years = spark.sql("select distinct(Year) from babynames_table").toPandas()['Year'].tolist() years.sort() dbutils.widgets.dropdown("year", "2014", [str(x) for x in years]) display(babynames.filter(babynames.Year == dbutils.widgets.get("year")))
Click Workflows in the sidebar.
Click .
The Tasks tab displays with the create task dialog.
Replace Add a name for your job… with your job name.
In the Task name field, enter a name for the task; for example, retrieve-baby-names.
In the Type drop-down menu, select Notebook.
Use the file browser to find the first notebook you created, click the notebook name, and click Confirm.
Click Create task.
Click below the task you just created to add another task.
In the Task name field, enter a name for the task; for example, filter-baby-names.
In the Type drop-down menu, select Notebook.
Use the file browser to find the second notebook you created, click the notebook name, and click Confirm.
Click Add under Parameters. In the Key field, enter
year
. In the Value field, enter2014
.Click Create task.
To run the job immediately, click in the upper right corner. You can also run the job by clicking the Runs tab and clicking Run Now in the Active Runs table.
Click the Runs tab and click the link for the run in the Active Runs table or in the Completed Runs (past 60 days) table.
Click either task to see the output and details. For example, click the filter-baby-names task to view the output and run details for the filter task:
To re-run the job and filter baby names for a different year:
- Click next to Run Now and select Run Now with Different Parameters or click Run Now with Different Parameters in the Active Runs table.
- In the Value field, enter
2015
. - Click Run.