Configure automated ML experiments in Python

APPLIES TO: Basic edition, Enterprise edition (Upgrade to Enterprise edition)

In this guide, learn how to define various configuration settings of your automated machine learning experiments with the Azure Machine Learning SDK. Automated machine learning picks an algorithm and hyperparameters for you and generates a model ready for deployment. There are several options that you can use to configure automated machine learning experiments.

To view examples of automated machine learning experiments, see Tutorial: Train a classification model with automated machine learning or Train models with automated machine learning in the cloud.

Configuration options available in automated machine learning:

  • Select your experiment type: classification, regression, or time series forecasting
  • Data source, formats, and fetch data
  • Choose your compute target: local or remote
  • Automated machine learning experiment settings
  • Run an automated machine learning experiment
  • Explore model metrics
  • Register and deploy model

If you prefer a no-code experience, you can also create your automated machine learning experiments in Azure Machine Learning studio.

Prerequisites

For this article you need:

Select your experiment type

Before you begin your experiment, determine the kind of machine learning problem you are solving. Automated machine learning supports the task types classification, regression, and forecasting. Learn more about task types.

The following code uses the task parameter in the AutoMLConfig constructor to specify the experiment type as classification.

from azureml.train.automl import AutoMLConfig

# task can be one of classification, regression, forecasting
automl_config = AutoMLConfig(task = "classification")

Data source and format

Automated machine learning supports data that resides on your local desktop or in the cloud, such as in Azure Blob Storage. The data can be read into a Pandas DataFrame or an Azure Machine Learning TabularDataset. Learn more about datasets.

Requirements for training data:

  • Data must be in tabular form.
  • The value to predict (the target column) must be in the data.

For remote experiments, training data must be accessible from the remote compute. AutoML only accepts Azure Machine Learning TabularDatasets when working on a remote compute.

Azure Machine Learning datasets expose functionality to:

  • Easily transfer data from static files or URL sources into your workspace.
  • Make your data available to training scripts when running on cloud compute resources. See How to train with datasets for an example of using the Dataset class to mount data to your remote compute target.

The following code creates a TabularDataset from a web URL. See Create a TabularDataset for code examples on how to create datasets from other sources like local files and datastores.

from azureml.core.dataset import Dataset
data = "https://automlsamplenotebookdata.blob.core.windows.net/automl-sample-notebook-data/creditcard.csv"
dataset = Dataset.Tabular.from_delimited_files(data)

For local compute experiments, we recommend pandas DataFrames for faster processing times.

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("your-local-file.csv")
train_data, test_data = train_test_split(df, test_size=0.1, random_state=42)
label = "label-col-name"

Training, validation, and test data

You can specify separate training and validation sets directly in the AutoMLConfig constructor. Learn more about how to configure data splits and cross-validation for your AutoML experiments.

If you do not explicitly specify a validation_data or n_cross_validations parameter, AutoML applies default techniques to determine how validation is performed. This determination depends on the number of rows in the dataset assigned to your training_data parameter.

Training data size | Validation technique
--- | ---
Larger than 20,000 rows | Train/validation data split is applied. The default is to take 10% of the initial training data set as the validation set. In turn, that validation set is used for metrics calculation.
Smaller than 20,000 rows | Cross-validation approach is applied. The default number of folds depends on the number of rows. If the dataset has fewer than 1,000 rows, 10 folds are used. If the dataset has between 1,000 and 20,000 rows, three folds are used.
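
The defaults in the table above can be read as a simple decision rule. The following is a minimal sketch of that rule in plain Python, not the SDK's implementation. The function name default_validation and the returned dictionary keys are hypothetical, and treating a dataset of exactly 20,000 rows as part of the three-fold range is an assumption, because the table does not say which side that boundary falls on.

```python
def default_validation(n_rows: int) -> dict:
    """Return the default validation technique for a given training row count.

    Hypothetical helper that mirrors the table above; it is not part of the SDK.
    """
    if n_rows > 20_000:
        # Large data: hold out 10% of the training data as the validation set.
        return {"technique": "train_validation_split", "validation_size": 0.1}
    if n_rows < 1_000:
        # Small data: 10-fold cross-validation.
        return {"technique": "cross_validation", "folds": 10}
    # 1,000 to 20,000 rows: 3-fold cross-validation.
    # Assumption: exactly 20,000 rows is treated as part of this range.
    return {"technique": "cross_validation", "folds": 3}


for rows in (500, 5_000, 50_000):
    print(rows, default_validation(rows))
```

Setting validation_data or n_cross_validations explicitly in AutoMLConfig bypasses these defaults.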

At this time, you need to provide your own test data for model evaluation. For a code example of bringing your own test data for model evaluation, see the Test section of this Jupyter notebook.

Compute to run experiment

Next, determine where the model will be trained. An automated machine learning training experiment can run on the following compute options. Learn the pros and cons of local and remote compute options.

  • Your local machine, such as a local desktop or laptop: generally used when you have a small dataset and you are still in the exploration stage. See this notebook for a local compute example.

  • A remote machine in the cloud: Azure Machine Learning Managed Compute is a managed service that lets you train machine learning models on clusters of Azure virtual machines.

    See this notebook for a remote example using Azure Machine Learning Managed Compute.

  • An Azure Databricks cluster in your Azure subscription. You can find more details in Set up an Azure Databricks cluster for automated ML. See this GitHub site for examples of notebooks with Azure Databricks.

Configure your experiment settings

There are several options that you can use to configure your automated machine learning experiment. These parameters are set by instantiating an AutoMLConfig object. See the AutoMLConfig class for a full list of parameters.

Some examples include:

  1. Classification experiment using AUC weighted as the primary metric, with the experiment timeout set to 30 minutes and 2 cross-validation folds.

        automl_classifier = AutoMLConfig(
            task='classification',
            primary_metric='AUC_weighted',
            experiment_timeout_minutes=30,
            blocked_models=['XGBoostClassifier'],
            training_data=train_data,
            label_column_name=label,
            n_cross_validations=2)
    
  2. The following example is a regression experiment set to end after 60 minutes with five cross-validation folds.

       automl_regressor = AutoMLConfig(
           task='regression',
           experiment_timeout_minutes=60,
           allowed_models=['KNN'],
           primary_metric='r2_score',
           training_data=train_data,
           label_column_name=label,
           n_cross_validations=5)
    
  3. Forecasting tasks require additional setup. See the Auto-train a time-series forecast model article for more details.

    import logging

    # time_column_name, time_series_id_column_names, n_test_periods, and
    # project_folder are assumed to be defined earlier for your dataset.
    time_series_settings = {
        'time_column_name': time_column_name,
        'time_series_id_column_names': time_series_id_column_names,
        'drop_column_names': ['logQuantity'],
        'forecast_horizon': n_test_periods
    }

    automl_config = AutoMLConfig(task='forecasting',
                                 debug_log='automl_oj_sales_errors.log',
                                 primary_metric='normalized_root_mean_squared_error',
                                 experiment_timeout_minutes=20,
                                 training_data=train_data,
                                 label_column_name=label,
                                 n_cross_validations=5,
                                 path=project_folder,
                                 verbosity=logging.INFO,
                                 **time_series_settings)
    

Supported models

Automated machine learning tries different models and algorithms during the automation and tuning process. As a user, there is no need for you to specify the algorithm.

The three task parameter values determine the list of algorithms, or models, to apply. The third task type, forecasting, uses an algorithm pool similar to that of regression tasks. Use the allowed_models or blocked_models parameters to further modify iterations with the available models to include or exclude. The list of supported models can be found in the SupportedModels class for Classification, Forecasting, and Regression.
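
To make the include/exclude semantics concrete, here is a minimal sketch in plain Python of how an allow list and a block list narrow a candidate pool. It is an illustration under stated assumptions, not the SDK's selection logic: the select_models helper is hypothetical, and the three-model pool is only an example rather than the full list from the SupportedModels class.

```python
from typing import Iterable, List, Optional


def select_models(candidates: Iterable[str],
                  allowed_models: Optional[List[str]] = None,
                  blocked_models: Optional[List[str]] = None) -> List[str]:
    """Filter a candidate pool with an optional allow list and block list.

    Hypothetical helper for illustration; AutoML applies its own logic.
    """
    selected = list(candidates)
    if allowed_models is not None:
        # Keep only the models that were explicitly allowed.
        selected = [m for m in selected if m in allowed_models]
    if blocked_models is not None:
        # Drop every model that was explicitly blocked.
        selected = [m for m in selected if m not in blocked_models]
    return selected


# Illustrative pool only; see the SupportedModels class for the real names.
pool = ["LogisticRegression", "KNN", "XGBoostClassifier"]

print(select_models(pool, blocked_models=["XGBoostClassifier"]))
print(select_models(pool, allowed_models=["KNN"]))
```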

Primary metric

The primary_metric parameter determines the metric to be used during model training for optimization. The metrics you can select depend on the task type you choose; the following table shows the valid primary metrics for each task type.

Learn about the specific definitions of these metrics in Understand automated machine learning results.

Classification | Regression | Time Series Forecasting
--- | --- | ---
accuracy | spearman_correlation | spearman_correlation
AUC_weighted | normalized_root_mean_squared_error | normalized_root_mean_squared_error
average_precision_score_weighted | r2_score | r2_score
norm_macro_recall | normalized_mean_absolute_error | normalized_mean_absolute_error
precision_score_weighted | |
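
Because the valid metric depends on the task type, it can help to check the combination before building an AutoMLConfig. The sketch below encodes the table above as a dictionary; the VALID_PRIMARY_METRICS name and the is_valid_primary_metric helper are hypothetical and not part of the SDK.

```python
# Valid primary metrics per task type, copied from the table above.
VALID_PRIMARY_METRICS = {
    "classification": [
        "accuracy",
        "AUC_weighted",
        "average_precision_score_weighted",
        "norm_macro_recall",
        "precision_score_weighted",
    ],
    "regression": [
        "spearman_correlation",
        "normalized_root_mean_squared_error",
        "r2_score",
        "normalized_mean_absolute_error",
    ],
    "forecasting": [
        "spearman_correlation",
        "normalized_root_mean_squared_error",
        "r2_score",
        "normalized_mean_absolute_error",
    ],
}


def is_valid_primary_metric(task: str, metric: str) -> bool:
    """Return True if the metric is listed for the task type (hypothetical helper)."""
    return metric in VALID_PRIMARY_METRICS.get(task, [])


print(is_valid_primary_metric("classification", "AUC_weighted"))
print(is_valid_primary_metric("regression", "accuracy"))
```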

Data featurization

In every automated machine learning experiment, your data is automatically scaled and normalized to help certain algorithms that are sensitive to features on different scales. This scaling and normalization is referred to as featurization. See Featurization in AutoML for more detail and code examples.

When configuring your experiments in your AutoMLConfig object, you can enable or disable the featurization setting. The following table shows the accepted settings for featurization in the AutoMLConfig object.

Featurization configuration | Description
--- | ---
"featurization": 'auto' | Indicates that as part of preprocessing, data guardrails and featurization steps are performed automatically. Default setting.
"featurization": 'off' | Indicates the featurization step shouldn't be done automatically.
"featurization": 'FeaturizationConfig' | Indicates a customized featurization step should be used. Learn how to customize featurization.

Note

Automated machine learning featurization steps (feature normalization, handling missing data, converting text to numeric, and so on) become part of the underlying model. When you use the model for predictions, the same featurization steps applied during training are applied to your input data automatically.
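
The idea that featurization becomes part of the model can be illustrated with a toy example. The sketch below is plain Python and does not use the SDK; the ThresholdModelWithScaling class is hypothetical. It learns min-max scaling parameters during fit and reuses them inside predict, so the caller passes raw values at prediction time.

```python
class ThresholdModelWithScaling:
    """Toy model that stores its scaling parameters as part of the fitted state."""

    def fit(self, xs):
        # Learn the scaling range from the training data only.
        self.lo, self.hi = min(xs), max(xs)
        return self

    def _featurize(self, xs):
        # Apply the scaling learned during training (min-max to [0, 1]).
        span = (self.hi - self.lo) or 1.0
        return [(x - self.lo) / span for x in xs]

    def predict(self, xs):
        # The caller passes raw values; featurization happens inside the model.
        return [1 if s >= 0.5 else 0 for s in self._featurize(xs)]


model = ThresholdModelWithScaling().fit([10.0, 20.0, 30.0, 40.0])
predictions = model.predict([12.0, 38.0])
print(predictions)
```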

Ensemble configuration

Ensemble models are enabled by default, and appear as the final run iterations in an AutoML run. Currently VotingEnsemble and StackEnsemble are supported.

Voting implements soft voting, which uses weighted averages. The stacking implementation uses two layers: the first layer has the same models as the voting ensemble, and the second-layer model is used to find the optimal combination of the models from the first layer.
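
As a rough illustration of soft voting, the sketch below averages per-model class probabilities with weights and picks the class with the highest averaged probability. It is a generic example in plain Python, not the VotingEnsemble implementation. The soft_vote helper, the probabilities, and the weights are all made up for illustration, and the article does not describe how AutoML chooses its weights.

```python
def soft_vote(probabilities, weights):
    """Weighted average of per-model class probabilities.

    probabilities: one list of class probabilities per model.
    weights: one non-negative weight per model.
    Returns the averaged probabilities and the index of the winning class.
    """
    if len(probabilities) != len(weights):
        raise ValueError("Need exactly one weight per model")
    total = sum(weights)
    n_classes = len(probabilities[0])
    averaged = [
        sum(w * p[c] for p, w in zip(probabilities, weights)) / total
        for c in range(n_classes)
    ]
    winner = max(range(n_classes), key=averaged.__getitem__)
    return averaged, winner


# Three hypothetical models scoring one sample over two classes.
model_probs = [[0.6, 0.4], [0.3, 0.7], [0.2, 0.8]]
weights = [0.5, 0.25, 0.25]

averaged, predicted = soft_vote(model_probs, weights)
print(averaged, predicted)
```

Note that the first model alone would predict class 0, but the weighted average across all three models favors class 1.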

If you are using ONNX models, or have model explainability enabled, stacking is disabled and only voting is used.

Ensemble training can be disabled by using the enable_voting_ensemble and enable_stack_ensemble boolean parameters.

automl_classifier = AutoMLConfig(
        task='classification',
        primary_metric='AUC_weighted',
        experiment_timeout_minutes=30,
        training_data=train_data,
        label_column_name=label,
        n_cross_validations=5,
        enable_voting_ensemble=False,
        enable_stack_ensemble=False
        )

To alter the default ensemble behavior, there are multiple default arguments that can be provided as kwargs in an AutoMLConfig object.

Important

The following parameters aren't explicit parameters of the AutoMLConfig class.

  • ensemble_download_models_timeout_sec: During VotingEnsemble and StackEnsemble model generation, multiple fitted models from the previous child runs are downloaded. If you encounter the error AutoMLEnsembleException: Could not find any models for running ensembling, you may need to provide more time for the models to be downloaded. The default value is 300 seconds for downloading these models in parallel, and there is no maximum timeout limit. Configure this parameter with a value higher than 300 seconds if more time is needed.

    Note

    If the timeout is reached and some models have been downloaded, the ensembling proceeds with as many models as it has downloaded. Not all of the models need to finish downloading within that timeout.
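
The timeout behavior described above can be sketched as a small simulation. This is plain Python with made-up download durations, not SDK code. The models_available_for_ensembling helper is hypothetical, and raising an error when nothing was downloaded in time is an assumption based on the error message quoted above.

```python
def models_available_for_ensembling(download_seconds, timeout_sec=300):
    """Return the models whose download finishes within the timeout.

    download_seconds maps a model name to its (hypothetical) download time.
    Downloads are assumed to run in parallel, so each is compared to the
    timeout on its own.
    """
    ready = [name for name, secs in download_seconds.items() if secs <= timeout_sec]
    if not ready:
        # Assumption: with no downloaded models, ensembling cannot proceed.
        raise RuntimeError("Could not find any models for running ensembling")
    return ready


durations = {"model_a": 120, "model_b": 280, "model_c": 450}

print(models_available_for_ensembling(durations))
print(models_available_for_ensembling(durations, timeout_sec=600))
```

With the default of 300 seconds, ensembling would proceed with two of the three models in this example; raising the timeout to 600 seconds makes all three available.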

The following parameters only apply to StackEnsemble models:

  • stack_meta_learner_type: The meta-learner is a model trained on the output of the individual heterogeneous models. Default meta-learners are LogisticRegression for classification tasks (or LogisticRegressionCV if cross-validation is enabled) and ElasticNet for regression/forecasting tasks (or ElasticNetCV if cross-validation is enabled). This parameter can be one of the following strings: LogisticRegression, LogisticRegressionCV, LightGBMClassifier, ElasticNet, ElasticNetCV, LightGBMRegressor, or LinearRegression.

  • stack_meta_learner_train_percentage: Specifies the proportion of the training set (when choosing the train and validation type of training) to be reserved for training the meta-learner. The default value is 0.2.

  • stack_meta_learner_kwargs: Optional parameters to pass to the initializer of the meta-learner. These parameters and parameter types mirror the parameters and parameter types of the corresponding model constructor, and are forwarded to the model constructor.
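
Because these settings are passed as kwargs rather than as explicit parameters, a typo can be easy to miss. The sketch below checks a settings dictionary against the values documented above before it is unpacked into AutoMLConfig. The validate_ensemble_settings helper is hypothetical, and requiring the percentage to lie strictly between 0 and 1 is an assumption based on its description as a proportion.

```python
# Meta-learner names accepted by stack_meta_learner_type, as listed above.
META_LEARNER_TYPES = {
    "LogisticRegression", "LogisticRegressionCV", "LightGBMClassifier",
    "ElasticNet", "ElasticNetCV", "LightGBMRegressor", "LinearRegression",
}


def validate_ensemble_settings(settings: dict) -> list:
    """Return a list of problems found in the ensemble settings (empty if none)."""
    errors = []

    learner = settings.get("stack_meta_learner_type")
    if learner is not None and learner not in META_LEARNER_TYPES:
        errors.append(f"Unknown stack_meta_learner_type: {learner!r}")

    # Documented default is 0.2; the (0, 1) range check is an assumption.
    pct = settings.get("stack_meta_learner_train_percentage", 0.2)
    if not 0 < pct < 1:
        errors.append("stack_meta_learner_train_percentage must be between 0 and 1")

    # Documented default is 300 seconds, with no maximum.
    timeout = settings.get("ensemble_download_models_timeout_sec", 300)
    if timeout <= 0:
        errors.append("ensemble_download_models_timeout_sec must be positive")

    return errors


good = {"stack_meta_learner_type": "LogisticRegressionCV",
        "stack_meta_learner_train_percentage": 0.3}
bad = {"stack_meta_learner_type": "RandomForest"}

print(validate_ensemble_settings(good))
print(validate_ensemble_settings(bad))
```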

The following code shows an example of specifying custom ensemble behavior in an AutoMLConfig object.

ensemble_settings = {
    "ensemble_download_models_timeout_sec": 600,
    "stack_meta_learner_type": "LogisticRegressionCV",
    "stack_meta_learner_train_percentage": 0.3,
    "stack_meta_learner_kwargs": {
        "refit": True,
        "fit_intercept": False,
        "class_weight": "balanced",
        "multi_class": "auto",
        "n_jobs": -1
    }
}

automl_classifier = AutoMLConfig(
        task='classification',
        primary_metric='AUC_weighted',
        experiment_timeout_minutes=30,
        training_data=train_data,
        label_column_name=label,
        n_cross_validations=5,
        **ensemble_settings
        )

Run experiment

For automated ML, you create an Experiment object, which is a named object in a Workspace used to run experiments.

from azureml.core.experiment import Experiment
from azureml.core.workspace import Workspace

ws = Workspace.from_config()

# Choose a name for the experiment and specify the project folder.
experiment_name = 'automl-classification'
project_folder = './sample_projects/automl-classification'

experiment = Experiment(ws, experiment_name)

Submit the experiment to run and generate a model. Pass the AutoMLConfig to the submit method to generate the model.

run = experiment.submit(automl_config, show_output=True)

Note

Dependencies are first installed on a new machine. It may take up to 10 minutes before output is shown. Setting show_output to True results in output being shown on the console.

Exit criteria

There are a few options you can define to end your experiment.

Criteria | Description
--- | ---
No criteria | If you do not define any exit parameters, the experiment continues until no further progress is made on your primary metric.
After a length of time | Use experiment_timeout_minutes in your settings to define how long, in minutes, your experiment should continue to run. To help avoid experiment timeout failures, there is a minimum of 15 minutes, or 60 minutes if your row-by-column size exceeds 10 million.
A score has been reached | Using experiment_exit_score completes the experiment after a specified primary metric score has been reached.
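
The exit criteria above can be sketched as two small checks. This is plain Python for illustration only, not how the service evaluates them. The minimum_timeout_minutes and should_exit helpers and the higher_is_better flag are hypothetical; the flag is needed because some primary metrics, such as the normalized error metrics, improve as they decrease.

```python
def minimum_timeout_minutes(n_rows: int, n_columns: int) -> int:
    """Minimum experiment timeout: 15 minutes, or 60 if rows x columns exceeds 10 million."""
    return 60 if n_rows * n_columns > 10_000_000 else 15


def should_exit(elapsed_minutes, best_score,
                experiment_timeout_minutes=None,
                experiment_exit_score=None,
                higher_is_better=True):
    """Return True once a configured exit criterion is met (hypothetical helper)."""
    if experiment_timeout_minutes is not None and elapsed_minutes >= experiment_timeout_minutes:
        return True
    if experiment_exit_score is not None:
        if higher_is_better:
            return best_score >= experiment_exit_score
        return best_score <= experiment_exit_score
    return False


print(minimum_timeout_minutes(1_000, 50))
print(minimum_timeout_minutes(2_000_000, 10))
print(should_exit(5, 0.95, experiment_timeout_minutes=30, experiment_exit_score=0.9))
```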

Explore models and metrics

You can view your training results in a widget or inline if you are in a notebook. See Track and evaluate models for more details.

See Understand automated machine learning results for definitions and examples of the performance charts and metrics provided for each run.

To get a featurization summary and understand what features were added to a particular model, see Featurization transparency.

Register and deploy models

For details on how to download or register a model for deployment to a web service, see how and where to deploy a model.

Model interpretability

Model interpretability allows you to understand why your models made predictions, and the underlying feature importance values. The SDK includes various packages for enabling model interpretability features, both at training and inference time, for local and deployed models.

See the how-to for code samples on how to enable interpretability features specifically within automated machine learning experiments.

For general information on how model explanations and feature importance can be enabled in other areas of the SDK outside of automated machine learning, see the concept article on interpretability.

Note

The ForecastTCN model is not currently supported by the Explanation Client. If this model is returned as the best model, no explanation dashboard is returned, and on-demand explanation runs are not supported.

Next steps