Run tests with pytest using the Databricks extension for Visual Studio Code
This article describes how to run tests with pytest by using the Databricks extension for Visual Studio Code. See What is the Databricks extension for Visual Studio Code?
You can run pytest on local code that does not need a connection to a cluster in a remote Azure Databricks workspace. For example, you might use pytest to test functions that accept and return PySpark DataFrames in local memory. To get started with pytest and run it locally, see Get Started in the pytest documentation.
To run pytest on code in a remote Azure Databricks workspace, do the following in your Visual Studio Code project:
Step 1: Create the tests
Add a Python file with the following code, which contains the tests to run. This example assumes that the file is named spark_test.py and is at the root of your Visual Studio Code project. The file contains a pytest fixture, which makes the cluster's SparkSession (the entry point to Spark functionality on the cluster) available to the tests, and a single test that checks whether the specified cell in the table contains the specified value. You can add your own tests to this file as needed.
from pyspark.sql import SparkSession
import pytest

@pytest.fixture
def spark() -> SparkSession:
    # Create a SparkSession (the entry point to Spark functionality) on
    # the cluster in the remote Databricks workspace. Unit tests do not
    # have access to this SparkSession by default.
    return SparkSession.builder.getOrCreate()

# Now add your unit tests.

# For example, here is a unit test that must be run on the
# cluster in the remote Databricks workspace.
# This example determines whether the specified cell in the
# specified table contains the specified value. For example,
# the third column in the first row should contain the word "Ideal":
#
# +-----+-------+-------+-------+---------+-------+-------+-------+------+------+------+
# | _c0 | carat | cut   | color | clarity | depth | table | price | x    | y    | z    |
# +-----+-------+-------+-------+---------+-------+-------+-------+------+------+------+
# | 1   | 0.23  | Ideal | E     | SI2     | 61.5  | 55    | 326   | 3.95 | 3.98 | 2.43 |
# +-----+-------+-------+-------+---------+-------+-------+-------+------+------+------+
# ...
#
def test_spark(spark):
    spark.sql('USE default')
    data = spark.sql('SELECT * FROM diamonds')
    assert data.collect()[0][2] == 'Ideal'
Step 2: Create the pytest runner
Add a Python file with the following code, which instructs pytest to run your tests from the previous step. This example assumes that the file is named pytest_databricks.py and is at the root of your Visual Studio Code project.
import pytest
import os
import sys
# Run all tests in the connected directory in the remote Databricks workspace.
# By default, pytest searches for tests in all files with filenames matching
# "test_*.py" or "*_test.py". Within each of these files, pytest runs each
# function with a function name beginning with "test_".
# Get the path to the directory for this file in the workspace.
dir_root = os.path.dirname(os.path.realpath(__file__))
# Switch to the root directory.
os.chdir(dir_root)
# Skip writing .pyc files to the bytecode cache on the cluster.
sys.dont_write_bytecode = True
# Now run pytest from the root directory, using the
# arguments that are supplied by your custom run configuration in
# your Visual Studio Code project. In this case, the custom run
# configuration JSON must contain these unique "program" and
# "args" objects:
#
# ...
# {
#   ...
#   "program": "${workspaceFolder}/path/to/this/file/in/workspace",
#   "args": ["/path/to/_test.py-files"]
# }
# ...
#
retcode = pytest.main(sys.argv[1:])
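The runner stores the value returned by pytest.main in retcode but does not use it. As a small sketch, the helper below translates that value into a readable message. The helper name describe_exit_code and the message wording are made up for this example; the numeric codes follow the exit codes listed in the pytest documentation.

```python
# Exit codes documented by pytest, mapped to short descriptions.
PYTEST_EXIT_CODES = {
    0: "all tests passed",
    1: "some tests failed",
    2: "test run interrupted by the user",
    3: "internal error",
    4: "command-line usage error",
    5: "no tests collected",
}


def describe_exit_code(retcode) -> str:
    """Return a short description for a pytest exit code.

    Accepts a plain int or the enum value that pytest.main returns.
    """
    code = int(retcode)
    return PYTEST_EXIT_CODES.get(code, f"unknown exit code {code}")
```

For example, adding print(describe_exit_code(retcode)) after the call to pytest.main would report "no tests collected" when the path in args does not contain any matching test files.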
Step 3: Create a custom run configuration
To instruct pytest to run your tests, you must create a custom run configuration. Use the existing Databricks cluster-based run configuration to create your own custom run configuration, as follows:
- On the main menu, click Run > Add configuration.
- In the Command Palette, select Databricks.
  Visual Studio Code adds a .vscode/launch.json file to your project, if this file does not already exist.
- Change the starter run configuration as follows, and then save the file:
  - Change this run configuration's name from Run on Databricks to some unique display name for this configuration, in this example Unit Tests (on Databricks).
  - Change program from ${file} to the path in the project that contains the test runner, in this example ${workspaceFolder}/pytest_databricks.py.
  - Change args from [] to the path in the project that contains the files with your tests, in this example ["."].

Your launch.json file should look like this:

{
  // Use IntelliSense to learn about possible attributes.
  // Hover to view descriptions of existing attributes.
  // For more information, visit: https://go.microsoft.com/fwlink/?linkid=830387
  "version": "0.2.0",
  "configurations": [
    {
      "type": "databricks",
      "request": "launch",
      "name": "Unit Tests (on Databricks)",
      "program": "${workspaceFolder}/pytest_databricks.py",
      "args": ["."],
      "env": {}
    }
  ]
}
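To make the relationship between the three changed fields explicit, the following sketch builds the same run configuration as a Python dictionary and prints it as JSON. The function name databricks_run_config is hypothetical, and the script is only an illustration of the structure; the extension does not require you to generate launch.json this way.

```python
import json


def databricks_run_config(name, runner, test_paths):
    """Build one run configuration entry (hypothetical helper).

    name:       display name shown in the Run and Debug list
    runner:     path of the test runner, relative to the project root
    test_paths: paths that are passed to pytest as arguments
    """
    return {
        "type": "databricks",
        "request": "launch",
        "name": name,
        "program": "${workspaceFolder}/" + runner,
        "args": list(test_paths),
        "env": {},
    }


launch = {
    "version": "0.2.0",
    "configurations": [
        databricks_run_config(
            "Unit Tests (on Databricks)", "pytest_databricks.py", ["."]
        )
    ],
}

# Print the configuration in the same shape as launch.json (without comments).
print(json.dumps(launch, indent=2))
```

The values in args are what the runner from Step 2 receives in sys.argv[1:] and forwards to pytest.main.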
Step 4: Run the tests
Make sure that pytest is already installed on the cluster first. For example, with the cluster's settings page open in your Azure Databricks workspace, do the following:
- On the Libraries tab, if pytest is visible, then pytest is already installed. If pytest is not visible, click Install new.
- For Library Source, click PyPI.
- For Package, enter pytest.
- Click Install.
- Wait until Status changes from Pending to Installed.
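As an alternative check, the following sketch uses the Python standard library to report whether a package is installed in the current Python environment. The function name installed_version is hypothetical, and the sketch assumes that you run it in the cluster's Python environment, for example from a notebook cell attached to the cluster.

```python
from importlib import metadata


def installed_version(package: str):
    """Return the installed version of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Prints a version string if pytest is installed, otherwise None.
print(installed_version("pytest"))
```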
To run the tests, do the following from your Visual Studio Code project:
- On the main menu, click View > Run.
- In the Run and Debug list, click Unit Tests (on Databricks), if it is not already selected.
- Click the green arrow (Start Debugging) icon.
The pytest results display in the Debug Console (View > Debug Console on the main menu). For example, these results show that one test was found in the spark_test.py file, and the dot (.) means that the test ran and passed. (A failing test would show an F.)
<date>, <time> - Creating execution context on cluster <cluster-id> ...
<date>, <time> - Synchronizing code to /Workspace/path/to/directory ...
<date>, <time> - Running /pytest_databricks.py ...
============================= test session starts ==============================
platform linux -- Python <version>, pytest-<version>, pluggy-<version>
rootdir: /Workspace/path/to/directory
collected 1 item
spark_test.py . [100%]
============================== 1 passed in 3.25s ===============================
<date>, <time> - Done (took 10818ms)