Train ML models with Azure Databricks AutoML Python API

This article demonstrates how to train a model with Azure Databricks AutoML using the AutoML Python API. See Azure Databricks AutoML Python API reference for more details. The API provides functions to start classification, regression, and forecasting AutoML runs. Each function call trains a set of models and generates a trial notebook for each model.

See Requirements for AutoML experiments.

Set up an experiment using the AutoML API

The following steps generally describe how to set up an AutoML experiment using the API:

  1. Create a notebook and attach it to a cluster running Databricks Runtime ML.

  2. Identify the table you want to use from your existing data source, or upload a data file to DBFS and create a table.

  3. To start an AutoML run, use the automl.regress(), automl.classify(), or automl.forecast() function and pass the table, along with any other training parameters. To see all functions and parameters, see Azure Databricks AutoML Python API reference.

    For example:

    from databricks import automl

    summary = automl.regress(dataset=train_pdf, target_col="col_to_predict")

  4. When the AutoML run begins, an MLflow experiment URL appears in the console. Use this URL to monitor the run's progress. Refresh the MLflow experiment to see the trials as they are completed.

  5. After the AutoML run completes:

  • Use the links in the output summary to navigate to the MLflow experiment or the notebook that generated the best results.
  • Use the link to the data exploration notebook to gain insights into the data passed to AutoML. You can also attach this notebook to the same cluster and re-run it to reproduce the results or do additional data analysis.
  • Use the summary object returned from the AutoML call to explore more details about the trials or to load a model trained by a given trial. Learn more about the AutoMLSummary object.
  • Clone any generated notebook from the trials and re-run it by attaching it to the same cluster to reproduce the results. You can also edit the cloned notebooks and re-run them to train additional models and log them to the same experiment.
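
As a rough sketch of steps 3 to 5, the following wraps the run and the summary inspection in two small functions. It assumes the databricks.automl module that ships with Databricks Runtime ML and the summary attributes experiment, trials, and best_trial described in the AutoML Python API reference. The function names run_regression and summarize_best_trial and the 30-minute timeout are illustrative choices, not part of the API.

```python
from typing import Any, Dict


def run_regression(train_df: Any, target_col: str, timeout_minutes: int = 30) -> Any:
    """Start an AutoML regression run and return its summary object."""
    # Imported lazily: this module is only available on Databricks Runtime ML.
    from databricks import automl

    return automl.regress(
        dataset=train_df,
        target_col=target_col,
        timeout_minutes=timeout_minutes,  # illustrative limit, adjust as needed
    )


def summarize_best_trial(summary: Any) -> Dict[str, Any]:
    """Collect a few fields from the summary for quick inspection."""
    best = summary.best_trial
    return {
        "experiment_id": summary.experiment.experiment_id,
        "num_trials": len(summary.trials),
        "best_run_id": best.mlflow_run_id,
        "best_notebook": best.notebook_path,
        "best_metrics": best.metrics,
    }


# In a notebook attached to a Databricks Runtime ML cluster:
# summary = run_regression(train_pdf, "col_to_predict")
# print(summarize_best_trial(summary))
# model = summary.best_trial.load_model()
```

Keeping the inspection logic separate from the call that starts the run makes it possible to reuse it on any summary object, including one returned by automl.classify().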

Import a notebook

To import a notebook saved as an MLflow artifact, use the databricks.automl.import_notebook Python API. For more information, see Import notebook.
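
A minimal sketch of such an import follows. It assumes that import_notebook takes the trial's artifact_uri and an absolute workspace path, and that the returned object exposes path and url, as described in the API reference. The helper name import_trial_notebook, its importer parameter (added only so the helper can be exercised outside Databricks), and the example workspace path are hypothetical.

```python
from typing import Any, Callable, Optional


def import_trial_notebook(
    trial: Any,
    workspace_path: str,
    importer: Optional[Callable[..., Any]] = None,
) -> Any:
    """Import the notebook saved as an MLflow artifact for one trial."""
    if importer is None:
        # Imported lazily: only available on Databricks Runtime ML.
        from databricks import automl

        importer = automl.import_notebook
    # The first argument is the artifact URI of the trial's notebook,
    # the second is the workspace location to import it to.
    return importer(trial.artifact_uri, workspace_path)


# In a notebook, with `summary` returned by an AutoML call
# (the path below is a placeholder):
# result = import_trial_notebook(summary.trials[0], "/Users/<your-user>/automl")
# print(result.path, result.url)
```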

Register and deploy a model

You can register and deploy your AutoML-trained model just like any registered model in the MLflow model registry; see Log, load, register, and deploy MLflow models.
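
For instance, registering the best trial's model might look like the sketch below. It uses mlflow.register_model with a runs:/ URI and assumes that AutoML logged the model under the artifact path "model", so verify the artifact path in your run before relying on it. The helper names and the model name in the usage comment are placeholders.

```python
from typing import Any


def best_model_uri(summary: Any, artifact_path: str = "model") -> str:
    """Build a runs:/ URI for the model logged by the best trial.

    The default artifact path "model" is an assumption; check the MLflow run.
    """
    return f"runs:/{summary.best_trial.mlflow_run_id}/{artifact_path}"


def register_best_model(summary: Any, model_name: str) -> Any:
    """Register the best trial's model in the MLflow model registry."""
    # Imported lazily so the URI helper can be used without MLflow installed.
    import mlflow

    return mlflow.register_model(best_model_uri(summary), model_name)


# In a notebook, with `summary` returned by an AutoML call:
# version = register_best_model(summary, "my_automl_model")
# print(version.name, version.version)
```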

No module named pandas.core.indexes.numeric

When serving a model built using AutoML with Model Serving, you may get the error: No module named 'pandas.core.indexes.numeric'.

This is due to an incompatible pandas version between AutoML and the model serving endpoint environment. You can resolve this error by running the add-pandas-dependency.py script. The script edits the requirements.txt and conda.yaml for your logged model to include the appropriate pandas dependency version: pandas==1.5.3.

  1. Modify the script to include the run_id of the MLflow run where your model was logged.
  2. Re-register the model in the MLflow model registry.
  3. Try serving the new version of the MLflow model.
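
To illustrate the kind of edit involved, the sketch below pins pandas in the text of a requirements.txt file. It is not the add-pandas-dependency.py script itself, and it does not touch conda.yaml or the logged model; the function name pin_pandas is hypothetical, and only the version pandas==1.5.3 comes from the text above.

```python
import re

PANDAS_PIN = "pandas==1.5.3"  # version named in this article


def pin_pandas(requirements_text: str, pin: str = PANDAS_PIN) -> str:
    """Return requirements text with exactly one pandas entry, set to `pin`."""
    # Match lines for the pandas package itself (with or without a version
    # specifier), but not other packages such as pandas-stubs.
    is_pandas = re.compile(r"^\s*pandas\s*([=<>!~\[;].*)?$", re.IGNORECASE)
    kept = [
        line for line in requirements_text.splitlines() if not is_pandas.match(line)
    ]
    kept.append(pin)
    return "\n".join(kept) + "\n"


# Example with made-up package versions:
# print(pin_pandas("mlflow==2.9.2\npandas==2.0.3\n"))
```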

Notebook examples

Review these notebooks to get started with AutoML.

The following notebook shows how to do classification with AutoML.

AutoML classification example notebook

The following notebook shows how to do regression with AutoML.

AutoML regression example notebook

The following notebook shows how to do forecasting with AutoML.

AutoML forecasting example notebook

Next steps

Azure Databricks AutoML Python API reference.