Databricks Runtime 9.1 LTS for Machine Learning
Databricks released this image and declared it Long Term Support (LTS) in September 2021.
Databricks Runtime 9.1 LTS for Machine Learning provides a ready-to-go environment for machine learning and data science based on Databricks Runtime 9.1 LTS. Databricks Runtime ML contains many popular machine learning libraries, including TensorFlow, PyTorch, and XGBoost. Databricks Runtime ML includes AutoML, a tool to automatically train machine learning pipelines. Databricks Runtime ML also supports distributed deep learning training using Horovod.
Note
LTS means this version is under long-term support. See Databricks Runtime LTS version lifecycle.
For more information, including instructions for creating a Databricks Runtime ML cluster, see AI and Machine Learning on Databricks.
New features and improvements
AutoML
The following improvements are available in Databricks Runtime 9.1 LTS ML and above.
AutoML supports larger datasets by sampling
AutoML now samples datasets that might exceed memory constraints, allowing it to run on larger datasets with less risk of out-of-memory errors. For details, see Sampling large datasets.
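The idea behind sampling can be illustrated with a short sketch. This is not AutoML's actual algorithm; the function name and the notion of a fixed memory budget are assumptions made for illustration only.

```python
def sampling_fraction(estimated_bytes: int, memory_budget_bytes: int) -> float:
    """Return the fraction of rows to keep so the sample fits the budget.

    Hypothetical helper: AutoML's real sizing heuristics are not described here.
    """
    if estimated_bytes <= 0:
        raise ValueError("estimated_bytes must be positive")
    if memory_budget_bytes <= 0:
        raise ValueError("memory_budget_bytes must be positive")
    # Never keep more than the whole dataset.
    return min(1.0, memory_budget_bytes / estimated_bytes)


# A dataset estimated at 40 GiB, against an assumed 10 GiB budget.
fraction = sampling_fraction(40 * 1024**3, 10 * 1024**3)
```

With a Spark DataFrame, a fraction like this could be passed to DataFrame.sample(fraction=fraction, seed=...) to draw the sample.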
AutoML preprocesses columns based on semantic type
AutoML detects certain columns that have a semantic type that differs from their Spark or pandas data type. AutoML then converts and applies data preprocessing steps based on the detected semantic type. Specifically, AutoML performs the following conversions:
- String and integer columns that represent date or timestamp data are converted to a timestamp type.
- String columns that represent numeric data are converted to a numeric type.
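As a rough illustration of semantic type detection, the sketch below classifies a column of strings using only the Python standard library. It is a simplified stand-in, not AutoML's detection logic; the function names and the ISO 8601 restriction are assumptions.

```python
from datetime import datetime


def _is_number(text: str) -> bool:
    """True if the string parses as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def _is_iso_timestamp(text: str) -> bool:
    """True if the string parses as an ISO 8601 date or timestamp."""
    try:
        datetime.fromisoformat(text)
        return True
    except ValueError:
        return False


def infer_semantic_type(values):
    """Classify a column of strings as 'numeric', 'timestamp', or 'string'."""
    # Ignore missing and blank entries when deciding the type.
    non_null = [v for v in values if v is not None and v.strip() != ""]
    if not non_null:
        return "string"
    if all(_is_number(v) for v in non_null):
        return "numeric"
    if all(_is_iso_timestamp(v) for v in non_null):
        return "timestamp"
    return "string"
```

A column is converted only when every non-empty value parses, so a single stray value such as "n/a" keeps the column as a string in this sketch.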
Improvements to AutoML generated notebooks
Preprocessing steps for date and timestamp columns are now incorporated in the databricks-automl-runtime package, simplifying the notebooks generated by AutoML training. databricks-automl-runtime is included in Databricks Runtime 9.1 LTS ML and above, and is also available on PyPI.
Feature store
The following improvements are available in Databricks Runtime 9.1 LTS ML and above.
- When you create a TrainingSet, you can now set label=None to support unsupervised learning applications.
- You can now specify more than one feature in a single FeatureLookup.
- You can now specify a custom path for feature tables. Use the path parameter in create_feature_table(). The default is the database location.
- New supported PySpark data types: ArrayType and ShortType.
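The first two items might be combined as in the following sketch. The table name, feature names, and lookup key are hypothetical, and the feature_names keyword is an assumption about the client's signature; check the Feature Store API reference for the version you are running.

```python
# Hypothetical names used only for illustration.
FEATURE_TABLE = "recommender.customer_features"
FEATURE_NAMES = ["total_purchases_30d", "avg_basket_size"]
LOOKUP_KEY = "customer_id"


def build_unsupervised_training_set(fs, df):
    """Join several features through one FeatureLookup, with no label column.

    fs is expected to be a FeatureStoreClient and df a Spark DataFrame that
    contains the LOOKUP_KEY column.
    """
    # Imported lazily: the package ships with Databricks Runtime ML.
    from databricks.feature_store import FeatureLookup

    lookups = [
        FeatureLookup(
            table_name=FEATURE_TABLE,
            feature_names=FEATURE_NAMES,  # more than one feature per lookup
            lookup_key=LOOKUP_KEY,
        )
    ]
    # label=None requests a TrainingSet without a label, e.g. for clustering.
    return fs.create_training_set(df, feature_lookups=lookups, label=None)
```

On a Databricks Runtime ML cluster, the function would be called with a FeatureStoreClient instance and a DataFrame of customer IDs.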
MLflow
The following improvements are available starting in MLflow version 1.20.2, which is included in Databricks Runtime 9.1 LTS ML.
- Autologging for scikit-learn now records post-training metrics whenever a scikit-learn evaluation API, such as sklearn.metrics.mean_squared_error, is called.
- Autologging for PySpark ML now records post-training metrics whenever a model evaluation API, such as Evaluator.evaluate(), is called.
- mlflow.*.log_model and mlflow.*.save_model now have pip_requirements and extra_pip_requirements arguments so that you can directly specify the pip requirements of the model to log or save.
- mlflow.*.log_model and mlflow.*.save_model now automatically infer the pip requirements of the model to log or save based on the current software environment.
- stdMetrics entries are now recorded as training metrics during PySpark CrossValidator autologging.
- PyTorch Lightning autologging now supports distributed execution.
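The new requirements arguments could be used as in this minimal sketch, which assumes a fitted scikit-learn model and an active MLflow tracking setup. The helper names pin and log_sklearn_model are hypothetical; the pinned pandas version is taken from the library tables in these notes.

```python
def pin(package: str, version: str) -> str:
    """Format an exact pip requirement, e.g. 'pandas==1.2.4'."""
    return f"{package}=={version}"


# Requirements to add on top of the ones MLflow infers automatically.
EXTRA_REQUIREMENTS = [pin("pandas", "1.2.4")]


def log_sklearn_model(model, artifact_path="model"):
    """Log a fitted scikit-learn model with additional pip requirements."""
    # Imported lazily so the helper above stays usable without MLflow.
    import mlflow
    import mlflow.sklearn

    with mlflow.start_run():
        mlflow.sklearn.log_model(
            model,
            artifact_path,
            extra_pip_requirements=EXTRA_REQUIREMENTS,
        )
```

Passing pip_requirements instead replaces the inferred list rather than extending it, which can be useful when the serving environment must match a known set of packages exactly.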
Databricks Autologging (Public Preview)
The Databricks Autologging Public Preview has been expanded to new regions. Databricks Autologging is a no-code solution that provides automatic experiment tracking for machine learning training sessions on Azure Databricks. With Databricks Autologging, model parameters, metrics, files, and lineage information are automatically captured when you train models from a variety of popular machine learning libraries. Training sessions are recorded as MLflow Tracking Runs. Model files are also tracked so you can easily log them to the MLflow Model Registry and deploy them for real-time scoring with MLflow Model Serving.
For more information about Databricks Autologging, see Databricks Autologging.
Major changes to Databricks Runtime ML Python environment
Python packages upgraded
- automl 1.1.1 => 1.2.1
- feature_store 0.3.3 => 0.3.4.1
- holidays 0.10.5.2 => 0.11.2
- keras 2.5.0 => 2.6.0
- mlflow 1.19.0 => 1.20.2
- petastorm 0.11.1 => 0.11.2
- plotly 4.14.3 => 5.1.0
- spark-tensorflow-distributor 0.1.0 => 1.0.0
- sparkdl 2.2.0_db1 => 2.2.0_db3
- tensorboard 2.5.0 => 2.6.0
- tensorflow 2.5.0 => 2.6.0
Python packages added
- databricks-automl-runtime 0.1.0
System environment
The system environment in Databricks Runtime 9.1 LTS ML differs from Databricks Runtime 9.1 LTS as follows:
- DBUtils: Databricks Runtime ML does not include Library utility (dbutils.library) (legacy). Use %pip commands instead. See Notebook-scoped Python libraries.
- For GPU clusters, Databricks Runtime ML includes the following NVIDIA GPU libraries:
  - CUDA 11.0
  - cuDNN 8.1.0.77
  - NCCL 2.10.3
  - TensorRT 7.2.2
Libraries
The following sections list the libraries included in Databricks Runtime 9.1 LTS ML that differ from those included in Databricks Runtime 9.1 LTS.
Top-tier libraries
Databricks Runtime 9.1 LTS ML includes the following top-tier libraries:
- AutoML
- GraphFrames
- Horovod and HorovodRunner
- MLflow
- PyTorch
- spark-tensorflow-connector
- TensorFlow
- TensorBoard
Python libraries
Databricks Runtime 9.1 LTS ML uses Virtualenv for Python package management and includes many popular ML packages.
In addition to the packages specified in the following sections, Databricks Runtime 9.1 LTS ML also includes the following packages:
- hyperopt 0.2.5.db2
- sparkdl 2.2.0_db3
- feature_store 0.3.4.1
- automl 1.2.1
Python libraries on CPU clusters
Library | Version | Library | Version | Library | Version |
---|---|---|---|---|---|
absl-py | 0.11.0 | Antergos Linux | 2015.10 (ISO-Rolling) | appdirs | 1.4.4 |
argon2-cffi | 20.1.0 | astor | 0.8.1 | astunparse | 1.6.3 |
async-generator | 1.10 | attrs | 20.3.0 | backcall | 0.2.0 |
bcrypt | 3.2.0 | bleach | 3.3.0 | boto3 | 1.16.7 |
botocore | 1.19.7 | Bottleneck | 1.3.2 | cachetools | 4.2.2 |
certifi | 2020.12.5 | cffi | 1.14.5 | chardet | 4.0.0 |
clang | 5.0 | click | 7.1.2 | cloudpickle | 1.6.0 |
cmdstanpy | 0.9.68 | configparser | 5.0.1 | convertdate | 2.3.2 |
cryptography | 3.4.7 | cycler | 0.10.0 | Cython | 0.29.23 |
databricks-automl-runtime | 0.1.0 | databricks-cli | 0.14.3 | dbus-python | 1.2.16 |
decorator | 5.0.6 | defusedxml | 0.7.1 | dill | 0.3.2 |
diskcache | 5.2.1 | distlib | 0.3.2 | distro-info | 0.23ubuntu1 |
entrypoints | 0.3 | ephem | 4.0.0.2 | facets-overview | 1.0.0 |
filelock | 3.0.12 | Flask | 1.1.2 | flatbuffers | 1.12 |
fsspec | 0.9.0 | future | 0.18.2 | gast | 0.4.0 |
gitdb | 4.0.7 | GitPython | 3.1.12 | google-auth | 1.22.1 |
google-auth-oauthlib | 0.4.2 | google-pasta | 0.2.0 | grpcio | 1.39.0 |
gunicorn | 20.0.4 | h5py | 3.1.0 | hijri-converter | 2.2.1 |
holidays | 0.11.2 | horovod | 0.22.1 | htmlmin | 0.1.12 |
idna | 2.10 | ImageHash | 4.2.1 | importlib-metadata | 3.10.0 |
ipykernel | 5.3.4 | ipython | 7.22.0 | ipython-genutils | 0.2.0 |
ipywidgets | 7.6.3 | isodate | 0.6.0 | itsdangerous | 1.1.0 |
jedi | 0.17.2 | Jinja2 | 2.11.3 | jmespath | 0.10.0 |
joblib | 1.0.1 | joblibspark | 0.3.0 | jsonschema | 3.2.0 |
jupyter-client | 6.1.12 | jupyter-core | 4.7.1 | jupyterlab-pygments | 0.1.2 |
jupyterlab-widgets | 1.0.0 | keras | 2.6.0 | Keras-Preprocessing | 1.1.2 |
kiwisolver | 1.3.1 | koalas | 1.8.1 | korean-lunar-calendar | 0.2.1 |
lightgbm | 3.1.1 | llvmlite | 0.37.0 | LunarCalendar | 0.0.9 |
Mako | 1.1.3 | Markdown | 3.3.3 | MarkupSafe | 1.1.1 |
matplotlib | 3.4.2 | missingno | 0.5.0 | mistune | 0.8.4 |
mleap | 0.17.0 | mlflow-skinny | 1.20.2 | multimethod | 1.4 |
nbclient | 0.5.3 | nbconvert | 6.0.7 | nbformat | 5.1.3 |
nest-asyncio | 1.5.1 | networkx | 2.5 | nltk | 3.6.1 |
notebook | 6.3.0 | numba | 0.54.0 | numpy | 1.19.2 |
oauthlib | 3.1.0 | opt-einsum | 3.3.0 | packaging | 20.9 |
pandas | 1.2.4 | pandas-profiling | 3.0.0 | pandocfilters | 1.4.3 |
paramiko | 2.7.2 | parso | 0.7.0 | patsy | 0.5.1 |
petastorm | 0.11.2 | pexpect | 4.8.0 | phik | 0.12.0 |
pickleshare | 0.7.5 | Pillow | 8.2.0 | pip | 21.0.1 |
plotly | 5.1.0 | prometheus-client | 0.10.1 | prompt-toolkit | 3.0.17 |
prophet | 1.0.1 | protobuf | 3.17.2 | psutil | 5.8.0 |
psycopg2 | 2.8.5 | ptyprocess | 0.7.0 | pyarrow | 4.0.0 |
pyasn1 | 0.4.8 | pyasn1-modules | 0.2.8 | pycparser | 2.20 |
pydantic | 1.8.2 | Pygments | 2.8.1 | PyGObject | 3.36.0 |
PyMeeus | 0.5.11 | PyNaCl | 1.3.0 | pyodbc | 4.0.30 |
pyparsing | 2.4.7 | pyrsistent | 0.17.3 | pystan | 2.19.1.1 |
python-apt | 2.0.0+ubuntu0.20.4.6 | python-dateutil | 2.8.1 | python-editor | 1.0.4 |
pytz | 2020.5 | PyWavelets | 1.1.1 | PyYAML | 5.4.1 |
pyzmq | 20.0.0 | regex | 2021.4.4 | requests | 2.25.1 |
requests-oauthlib | 1.3.0 | requests-unixsocket | 0.2.0 | rsa | 4.7.2 |
s3transfer | 0.3.7 | scikit-learn | 0.24.1 | scipy | 1.6.2 |
seaborn | 0.11.1 | Send2Trash | 1.5.0 | setuptools | 52.0.0 |
setuptools-git | 1.2 | shap | 0.39.0 | simplejson | 3.17.2 |
six | 1.15.0 | slicer | 0.0.7 | smmap | 3.0.5 |
spark-tensorflow-distributor | 1.0.0 | sqlparse | 0.4.1 | ssh-import-id | 5.10 |
statsmodels | 0.12.2 | tabulate | 0.8.7 | tangled-up-in-unicode | 0.1.0 |
tenacity | 6.2.0 | tensorboard | 2.6.0 | tensorboard-data-server | 0.6.1 |
tensorboard-plugin-wit | 1.8.0 | tensorflow-cpu | 2.6.0 | tensorflow-estimator | 2.6.0 |
termcolor | 1.1.0 | terminado | 0.9.4 | testpath | 0.4.4 |
threadpoolctl | 2.1.0 | torch | 1.9.0+cpu | torchvision | 0.10.0+cpu |
tornado | 6.1 | tqdm | 4.59.0 | traitlets | 5.0.5 |
typing-extensions | 3.7.4.3 | ujson | 4.0.2 | unattended-upgrades | 0.1 |
urllib3 | 1.25.11 | virtualenv | 20.4.1 | visions | 0.7.1 |
wcwidth | 0.2.5 | webencodings | 0.5.1 | websocket-client | 0.57.0 |
Werkzeug | 1.0.1 | wheel | 0.36.2 | widgetsnbextension | 3.5.1 |
wrapt | 1.12.1 | xgboost | 1.4.2 | zipp | 3.4.1 |
Python libraries on GPU clusters
Library | Version | Library | Version | Library | Version |
---|---|---|---|---|---|
absl-py | 0.11.0 | Antergos Linux | 2015.10 (ISO-Rolling) | appdirs | 1.4.4 |
argon2-cffi | 20.1.0 | astor | 0.8.1 | astunparse | 1.6.3 |
async-generator | 1.10 | attrs | 20.3.0 | backcall | 0.2.0 |
bcrypt | 3.2.0 | bleach | 3.3.0 | boto3 | 1.16.7 |
botocore | 1.19.7 | Bottleneck | 1.3.2 | cachetools | 4.2.2 |
certifi | 2020.12.5 | cffi | 1.14.5 | chardet | 4.0.0 |
clang | 5.0 | click | 7.1.2 | cloudpickle | 1.6.0 |
cmdstanpy | 0.9.68 | configparser | 5.0.1 | convertdate | 2.3.2 |
cryptography | 3.4.7 | cycler | 0.10.0 | Cython | 0.29.23 |
databricks-automl-runtime | 0.1.0 | databricks-cli | 0.14.3 | dbus-python | 1.2.16 |
decorator | 5.0.6 | defusedxml | 0.7.1 | dill | 0.3.2 |
diskcache | 5.2.1 | distlib | 0.3.2 | distro-info | 0.23ubuntu1 |
entrypoints | 0.3 | ephem | 4.0.0.2 | facets-overview | 1.0.0 |
filelock | 3.0.12 | Flask | 1.1.2 | flatbuffers | 1.12 |
fsspec | 0.9.0 | future | 0.18.2 | gast | 0.4.0 |
gitdb | 4.0.7 | GitPython | 3.1.12 | google-auth | 1.22.1 |
google-auth-oauthlib | 0.4.2 | google-pasta | 0.2.0 | grpcio | 1.39.0 |
gunicorn | 20.0.4 | h5py | 3.1.0 | hijri-converter | 2.2.1 |
holidays | 0.11.2 | horovod | 0.22.1 | htmlmin | 0.1.12 |
idna | 2.10 | ImageHash | 4.2.1 | importlib-metadata | 3.10.0 |
ipykernel | 5.3.4 | ipython | 7.22.0 | ipython-genutils | 0.2.0 |
ipywidgets | 7.6.3 | isodate | 0.6.0 | itsdangerous | 1.1.0 |
jedi | 0.17.2 | Jinja2 | 2.11.3 | jmespath | 0.10.0 |
joblib | 1.0.1 | joblibspark | 0.3.0 | jsonschema | 3.2.0 |
jupyter-client | 6.1.12 | jupyter-core | 4.7.1 | jupyterlab-pygments | 0.1.2 |
jupyterlab-widgets | 1.0.0 | keras | 2.6.0 | Keras-Preprocessing | 1.1.2 |
kiwisolver | 1.3.1 | koalas | 1.8.1 | korean-lunar-calendar | 0.2.1 |
lightgbm | 3.1.1 | llvmlite | 0.37.0 | LunarCalendar | 0.0.9 |
Mako | 1.1.3 | Markdown | 3.3.3 | MarkupSafe | 1.1.1 |
matplotlib | 3.4.2 | missingno | 0.5.0 | mistune | 0.8.4 |
mleap | 0.17.0 | mlflow-skinny | 1.20.2 | multimethod | 1.4 |
nbclient | 0.5.3 | nbconvert | 6.0.7 | nbformat | 5.1.3 |
nest-asyncio | 1.5.1 | networkx | 2.5 | nltk | 3.6.1 |
notebook | 6.3.0 | numba | 0.54.0 | numpy | 1.19.2 |
oauthlib | 3.1.0 | opt-einsum | 3.3.0 | packaging | 20.9 |
pandas | 1.2.4 | pandas-profiling | 3.0.0 | pandocfilters | 1.4.3 |
paramiko | 2.7.2 | parso | 0.7.0 | patsy | 0.5.1 |
petastorm | 0.11.2 | pexpect | 4.8.0 | phik | 0.12.0 |
pickleshare | 0.7.5 | Pillow | 8.2.0 | pip | 21.0.1 |
plotly | 5.1.0 | prompt-toolkit | 3.0.17 | prophet | 1.0.1 |
protobuf | 3.17.2 | psutil | 5.8.0 | psycopg2 | 2.8.5 |
ptyprocess | 0.7.0 | pyarrow | 4.0.0 | pyasn1 | 0.4.8 |
pyasn1-modules | 0.2.8 | pycparser | 2.20 | pydantic | 1.8.2 |
Pygments | 2.8.1 | PyGObject | 3.36.0 | PyMeeus | 0.5.11 |
PyNaCl | 1.3.0 | pyodbc | 4.0.30 | pyparsing | 2.4.7 |
pyrsistent | 0.17.3 | pystan | 2.19.1.1 | python-apt | 2.0.0+ubuntu0.20.4.6 |
python-dateutil | 2.8.1 | python-editor | 1.0.4 | pytz | 2020.5 |
PyWavelets | 1.1.1 | PyYAML | 5.4.1 | pyzmq | 20.0.0 |
regex | 2021.4.4 | requests | 2.25.1 | requests-oauthlib | 1.3.0 |
requests-unixsocket | 0.2.0 | rsa | 4.7.2 | s3transfer | 0.3.7 |
scikit-learn | 0.24.1 | scipy | 1.6.2 | seaborn | 0.11.1 |
Send2Trash | 1.5.0 | setuptools | 52.0.0 | setuptools-git | 1.2 |
shap | 0.39.0 | simplejson | 3.17.2 | six | 1.15.0 |
slicer | 0.0.7 | smmap | 3.0.5 | spark-tensorflow-distributor | 1.0.0 |
sqlparse | 0.4.1 | ssh-import-id | 5.10 | statsmodels | 0.12.2 |
tabulate | 0.8.7 | tangled-up-in-unicode | 0.1.0 | tenacity | 6.2.0 |
tensorboard | 2.6.0 | tensorboard-data-server | 0.6.1 | tensorboard-plugin-wit | 1.8.0 |
tensorflow | 2.6.0 | tensorflow-estimator | 2.6.0 | termcolor | 1.1.0 |
terminado | 0.9.4 | testpath | 0.4.4 | threadpoolctl | 2.1.0 |
torch | 1.9.0+cu111 | torchvision | 0.10.0+cu111 | tornado | 6.1 |
tqdm | 4.59.0 | traitlets | 5.0.5 | typing-extensions | 3.7.4.3 |
ujson | 4.0.2 | unattended-upgrades | 0.1 | urllib3 | 1.25.11 |
virtualenv | 20.4.1 | visions | 0.7.1 | wcwidth | 0.2.5 |
webencodings | 0.5.1 | websocket-client | 0.57.0 | Werkzeug | 1.0.1 |
wheel | 0.36.2 | widgetsnbextension | 3.5.1 | wrapt | 1.12.1 |
xgboost | 1.4.2 | zipp | 3.4.1 |
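
To confirm that an environment matches the versions listed in the tables above, a check along the following lines may help. It is a sketch using only the standard library (Python 3.8 or later); the function name find_mismatches is hypothetical, and the EXPECTED dictionary holds just a few entries copied from the tables.

```python
from importlib import metadata

# A few pins copied from the tables above; extend as needed.
EXPECTED = {
    "numpy": "1.19.2",
    "pandas": "1.2.4",
    "xgboost": "1.4.2",
}


def find_mismatches(expected):
    """Return {package: (expected_version, installed_version_or_None)}.

    Only packages whose installed version differs from the expected one,
    or that are not installed at all, appear in the result.
    """
    mismatches = {}
    for name, wanted in expected.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed
        if installed != wanted:
            mismatches[name] = (wanted, installed)
    return mismatches
```

Calling find_mismatches(EXPECTED) in a notebook returns an empty dictionary when every listed package is installed at the expected version.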
Spark packages containing Python modules
Spark Package | Python Module | Version |
---|---|---|
graphframes | graphframes | 0.8.1-db3-spark3.1 |
R libraries
The R libraries are identical to the R Libraries in Databricks Runtime 9.1 LTS.
Java and Scala libraries (Scala 2.12 cluster)
In addition to Java and Scala libraries in Databricks Runtime 9.1 LTS, Databricks Runtime 9.1 LTS ML contains the following JARs:
CPU clusters
Group ID | Artifact ID | Version |
---|---|---|
com.typesafe.akka | akka-actor_2.12 | 2.5.23 |
ml.combust.mleap | mleap-databricks-runtime_2.12 | 0.17.0-4882dc3 |
ml.dmlc | xgboost4j-spark_2.12 | 1.4.1 |
ml.dmlc | xgboost4j_2.12 | 1.4.1 |
org.graphframes | graphframes_2.12 | 0.8.1-db2-spark3.1 |
org.mlflow | mlflow-client | 1.20.2 |
org.mlflow | mlflow-spark | 1.20.2 |
org.scala-lang.modules | scala-java8-compat_2.12 | 0.8.0 |
org.tensorflow | spark-tensorflow-connector_2.12 | 1.15.0 |
GPU clusters
Group ID | Artifact ID | Version |
---|---|---|
com.typesafe.akka | akka-actor_2.12 | 2.5.23 |
ml.combust.mleap | mleap-databricks-runtime_2.12 | 0.17.0-4882dc3 |
ml.dmlc | xgboost4j-gpu_2.12 | 1.4.1 |
ml.dmlc | xgboost4j-spark-gpu_2.12 | 1.4.1 |
org.graphframes | graphframes_2.12 | 0.8.1-db2-spark3.1 |
org.mlflow | mlflow-client | 1.20.2 |
org.mlflow | mlflow-spark | 1.20.2 |
org.scala-lang.modules | scala-java8-compat_2.12 | 0.8.0 |
org.tensorflow | spark-tensorflow-connector_2.12 | 1.15.0 |