What is AutoML?
Databricks AutoML simplifies the process of applying machine learning to your datasets by automatically finding the best algorithm and hyperparameter configuration for you.
Provide your dataset and specify the type of machine learning problem, then AutoML does the following:
- Cleans and prepares your data.
- Orchestrates distributed model training and hyperparameter tuning across multiple algorithms.
- Finds the best model using open source evaluation algorithms from scikit-learn, xgboost, LightGBM, Prophet, and ARIMA.
- Presents the results. AutoML also generates source code notebooks for each trial, allowing you to review, reproduce, and modify the code as needed.
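The "finds the best model" step above can be pictured with a minimal sketch (purely illustrative, not the actual AutoML internals): each finished trial reports a validation metric, and the top-scoring trial is surfaced as the best model.

```python
# Illustrative only: AutoML-style "pick the best trial" by validation metric.
# The trial records and metric name here are invented for this example.
trials = [
    {"algorithm": "random_forest", "val_f1": 0.81},
    {"algorithm": "xgboost", "val_f1": 0.86},
    {"algorithm": "logistic_regression", "val_f1": 0.78},
]

# Surface the trial with the highest validation score
best = max(trials, key=lambda t: t["val_f1"])
print(best["algorithm"])  # → xgboost
```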
Get started with AutoML experiments through a low-code UI or the Python API.
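As a sketch of the Python API route (hedged: the argument values are placeholders, and the `databricks.automl` module is importable only on a cluster running Databricks Runtime ML):

```python
# Sketch: launching a classification experiment via the AutoML Python API.
# Runs only on a Databricks Runtime ML cluster; the arguments shown are
# example values, not recommendations.
def run_automl_classification(train_df, target_col):
    from databricks import automl  # available in Databricks Runtime ML

    summary = automl.classify(
        dataset=train_df,       # Spark or pandas DataFrame
        target_col=target_col,  # name of the label column
        timeout_minutes=30,     # stop the experiment after this long
    )
    # summary.best_trial describes the top model and its generated notebook
    return summary
```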
Requirements
- Databricks Runtime 9.1 ML or above. For the general availability (GA) version, Databricks Runtime 10.4 LTS ML or above.
- For time series forecasting, Databricks Runtime 10.0 ML or above.
- With Databricks Runtime 9.1 LTS ML and above, AutoML depends on the `databricks-automl-runtime` package, which contains components that are useful outside of AutoML and also helps simplify the notebooks generated by AutoML training. `databricks-automl-runtime` is available on PyPI.
Install no libraries on the cluster other than those preinstalled in Databricks Runtime for Machine Learning.
- Any modification (removal, upgrade, or downgrade) of the preinstalled library versions results in run failures due to incompatibility.
To access files in your workspace, you must have network ports 1017 and 1021 open for AutoML experiments. To open these ports or confirm they are open, review your cloud VPN firewall configuration and security group rules or contact your local cloud administrator. For additional information on workspace configuration and deployment, see Create a workspace.
Use a compute resource with a supported compute access mode. Not all compute access modes support Unity Catalog:

Compute access mode | AutoML support | Unity Catalog support |
---|---|---|
Single user | Supported (must be the designated single user for the cluster) | Supported |
Shared access mode | Unsupported | Unsupported |
No isolation shared | Supported | Unsupported |
AutoML algorithms
Databricks AutoML trains and evaluates models based on the algorithms in the following table.
Note
For classification and regression models, the decision tree, random forests, logistic regression, and linear regression with stochastic gradient descent algorithms are based on scikit-learn.
Classification models | Regression models | Forecasting models |
---|---|---|
Decision trees | Decision trees | Prophet |
Random forests | Random forests | Auto-ARIMA (available in Databricks Runtime 10.3 ML and above) |
Logistic regression | Linear regression with stochastic gradient descent | |
XGBoost | XGBoost | |
LightGBM | LightGBM | |
Trial notebook generation
AutoML generates notebooks of the source code behind trials so you can review, reproduce, and modify the code as needed.
For forecasting experiments, AutoML-generated notebooks are automatically imported to your workspace for all trials of your experiment.
For classification and regression experiments, AutoML-generated notebooks for data exploration and the best trial in your experiment are automatically imported to your workspace. Generated notebooks for the other trials are saved as MLflow artifacts on DBFS instead of being imported into your workspace. For all trials besides the best trial, the `notebook_path` and `notebook_url` in the `TrialInfo` Python API are not set. If you need to use these notebooks, you can manually import them into your workspace with the AutoML experiment UI or the `databricks.automl.import_notebook` Python API.
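A hedged sketch of that manual import (the artifact URI and destination path are placeholders, and `databricks.automl` is only available on a Databricks Runtime ML cluster):

```python
# Sketch: importing a trial notebook saved as an MLflow artifact into the
# workspace. Only runnable on Databricks Runtime ML; arguments are
# illustrative placeholders.
def import_trial_notebook(artifact_uri, dest_path):
    from databricks import automl  # Databricks Runtime ML only

    # Copies the saved notebook artifact to the given workspace path
    result = automl.import_notebook(artifact_uri, dest_path)
    return result
```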
If you only use the data exploration notebook or best trial notebook generated by AutoML, the Source column in the AutoML experiment UI contains the link to the generated notebook for the best trial.
Other generated notebooks are not automatically imported into the workspace. You can find them by clicking into each MLflow run; the IPython notebook is saved in the Artifacts section of the run page. If downloading artifacts is enabled by your workspace administrators, you can download a notebook and import it into the workspace.
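One way to fetch such a notebook programmatically is a plain MLflow artifact download (a sketch, assuming MLflow is configured against the workspace; the run ID and artifact path arguments are placeholders):

```python
# Sketch: downloading a trial notebook stored as an MLflow run artifact.
# Requires MLflow access to the workspace; arguments are illustrative.
def download_trial_notebook(run_id, artifact_path):
    import mlflow

    # Copies the artifact to a local directory and returns the local path
    return mlflow.artifacts.download_artifacts(
        run_id=run_id, artifact_path=artifact_path
    )
```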
Shapley values (SHAP) for model explainability
Note
For Databricks Runtime 11.1 ML and below, SHAP plots are not generated if the dataset contains a `datetime` column.
The notebooks produced by AutoML regression and classification runs include code to calculate Shapley values. Shapley values are based on game theory and estimate the importance of each feature to a model's predictions.
AutoML notebooks calculate Shapley values using the SHAP package. Because these calculations are highly memory-intensive, the calculations are not performed by default.
To calculate and display Shapley values:

1. Go to the Feature importance section in an AutoML-generated trial notebook.
2. Set `shap_enabled = True`.
3. Re-run the notebook.
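For context, a hedged sketch of what the Shapley-value step looks like (it assumes the `shap` package, a fitted model, and a sample of the data; the names here are placeholders, not the exact code AutoML generates):

```python
shap_enabled = True  # flag checked by the generated notebook before running SHAP

def explain_with_shap(model, X_sample):
    # SHAP is memory-intensive, which is why it is disabled by default.
    import shap  # preinstalled in Databricks Runtime ML

    # Estimate each feature's contribution to the model's predictions
    explainer = shap.KernelExplainer(model.predict, X_sample)
    shap_values = explainer.shap_values(X_sample)
    shap.summary_plot(shap_values, X_sample)  # feature-importance plot
    return shap_values
```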