Configure automated ML experiments in Python

APPLIES TO: Basic edition, Enterprise edition (Upgrade to Enterprise edition)

In this guide, learn how to define various configuration settings of your automated machine learning experiments with the Azure Machine Learning SDK. Automated machine learning picks an algorithm and hyperparameters for you and generates a model ready for deployment. There are several options that you can use to configure automated machine learning experiments.

To view examples of automated machine learning experiments, see Tutorial: Train a classification model with automated machine learning or Train models with automated machine learning in the cloud.

Configuration options available in automated machine learning:

  • Select your experiment type: classification, regression, or time series forecasting
  • Data source, formats, and fetch data
  • Choose your compute target: local or remote
  • Automated machine learning experiment settings
  • Run an automated machine learning experiment
  • Explore model metrics
  • Register and deploy model

If you prefer a no-code experience, you can also create your automated machine learning experiments in Azure Machine Learning studio.

Select your experiment type

Before you begin your experiment, determine the kind of machine learning problem you are solving. Automated machine learning supports the task types classification, regression, and forecasting. Learn more about task types.

Automated machine learning supports the following algorithms during the automation and tuning process. You do not need to specify the algorithm yourself.

Note

If you plan to export your automated ML models to ONNX, only the algorithms marked with an * can be converted to the ONNX format. Learn more about converting models to ONNX.

Also note that ONNX only supports classification and regression tasks at this time.

Classification | Regression | Time Series Forecasting
Logistic Regression* | Elastic Net* | Elastic Net
Light GBM* | Light GBM* | Light GBM
Gradient Boosting* | Gradient Boosting* | Gradient Boosting
Decision Tree* | Decision Tree* | Decision Tree
K Nearest Neighbors* | K Nearest Neighbors* | K Nearest Neighbors
Linear SVC* | LARS Lasso* | LARS Lasso
Support Vector Classification (SVC)* | Stochastic Gradient Descent (SGD)* | Stochastic Gradient Descent (SGD)
Random Forest* | Random Forest* | Random Forest
Extremely Randomized Trees* | Extremely Randomized Trees* | Extremely Randomized Trees
Xgboost* | Xgboost* | Xgboost
Averaged Perceptron Classifier | Online Gradient Descent Regressor | Auto-ARIMA
Naive Bayes* | | Prophet
Stochastic Gradient Descent (SGD)* | | ForecastTCN
Linear SVM Classifier* | |

Use the task parameter in the AutoMLConfig constructor to specify your experiment type.

from azureml.train.automl import AutoMLConfig

# task can be one of classification, regression, forecasting
automl_config = AutoMLConfig(task = "classification")

Data source and format

Automated machine learning supports data that resides on your local desktop or in the cloud, such as in Azure Blob Storage. The data can be read into a Pandas DataFrame or an Azure Machine Learning TabularDataset. Learn more about datasets.

Requirements for training data:

  • Data must be in tabular form.
  • The value to predict (the target column) must be in the data.

The following code examples demonstrate how to store the data in these formats.

  • TabularDataset

    from azureml.opendatasets import Diabetes
    
    tabular_dataset = Diabetes.get_tabular_dataset()
    # random_split returns (first, second); the first part holds roughly
    # `percentage` of the rows, so 0.9 keeps about 90% for training.
    train_dataset, test_dataset = tabular_dataset.random_split(percentage=0.9, seed=42)
    label = "Y"
    
  • Pandas dataframe

    import pandas as pd
    from sklearn.model_selection import train_test_split
    
    df = pd.read_csv("your-local-file.csv")
    train_data, test_data = train_test_split(df, test_size=0.1, random_state=42)
    label = "label-col-name"
    

Fetch data for running experiment on remote compute

For remote executions, training data must be accessible from the remote compute. The Dataset class in the SDK exposes functionality to:

  • easily transfer data from static files or URL sources into your workspace
  • make your data available to training scripts when running on cloud compute resources

See the how-to for an example of using the Dataset class to mount data to your compute target.

Train and validation data

You can specify separate train and validation sets directly in the AutoMLConfig constructor.

K-Folds Cross Validation

Use the n_cross_validations setting to specify the number of cross validations. The training data set is randomly split into n_cross_validations folds of equal size. During each cross validation round, one of the folds is used to validate the model trained on the remaining folds. This process repeats for n_cross_validations rounds until each fold has been used once as the validation set. The average scores across all n_cross_validations rounds are reported, and the corresponding model is retrained on the whole training data set.
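The splitting procedure described above can be sketched in plain Python. This is a minimal illustration of k-fold cross validation, not the service's own implementation; the helper name `k_fold_indices` and the `seed` argument are assumptions made for this example.

```python
import random


def k_fold_indices(n_rows, n_cross_validations, seed=0):
    """Yield (train_idx, valid_idx) pairs for k-fold cross validation."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # random split, reproducible via seed
    # Deal the shuffled rows into folds whose sizes differ by at most one.
    folds = [indices[i::n_cross_validations] for i in range(n_cross_validations)]
    for k, valid_idx in enumerate(folds):
        # Every fold except the k-th is used for training in round k.
        train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train_idx, valid_idx


for round_no, (train_idx, valid_idx) in enumerate(k_fold_indices(10, 5)):
    print(round_no, sorted(valid_idx))
```

Each row index appears in exactly one validation fold, which is what distinguishes this scheme from repeated random sub-sampling.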

Learn more about how automated ML applies cross validation to prevent over-fitting models.

Monte Carlo Cross Validation (Repeated Random Sub-Sampling)

Use validation_size to specify the percentage of the training dataset that should be used for validation, and use n_cross_validations to specify the number of cross validations. During each cross validation round, a subset of size validation_size is randomly selected to validate the model trained on the remaining data. Finally, the average scores across all n_cross_validations rounds are reported, and the corresponding model is retrained on the whole training data set. Monte Carlo cross validation is not supported for time series forecasting.
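As a rough sketch of repeated random sub-sampling, the following plain-Python generator draws a fresh validation subset in every round. The function name `monte_carlo_splits`, the rounding of the subset size, and the `seed` argument are assumptions for illustration only and do not reflect how the service implements this internally.

```python
import random


def monte_carlo_splits(n_rows, validation_size, n_cross_validations, seed=0):
    """Yield (train_idx, valid_idx) pairs using repeated random sub-sampling."""
    rng = random.Random(seed)
    n_valid = max(1, round(n_rows * validation_size))
    for _ in range(n_cross_validations):
        # Each round draws a fresh random validation subset, so the same
        # row may be used for validation in more than one round.
        valid_idx = rng.sample(range(n_rows), n_valid)
        valid_set = set(valid_idx)
        train_idx = [i for i in range(n_rows) if i not in valid_set]
        yield train_idx, valid_idx


for train_idx, valid_idx in monte_carlo_splits(10, 0.2, 3):
    print(sorted(valid_idx))
```

Unlike k-fold splitting, nothing guarantees that every row is validated exactly once.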

Custom validation dataset

Use a custom validation dataset if a random split is not acceptable, which is usually the case for time series data or imbalanced data. You can specify your own validation dataset. The model is then evaluated against the validation dataset you specify instead of a random one.
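For time series data, one common way to build such a validation set is to hold out the most recent rows rather than a random sample. The sketch below shows that idea in plain Python; `time_ordered_split`, the `day` and `sales` fields, and the 20% default are hypothetical names and values chosen for the example.

```python
def time_ordered_split(rows, time_key, validation_fraction=0.2):
    """Split rows so that the most recent ones form the validation set."""
    ordered = sorted(rows, key=lambda row: row[time_key])
    n_valid = max(1, int(len(ordered) * validation_fraction))
    return ordered[:-n_valid], ordered[-n_valid:]


rows = [{"day": d, "sales": 100 + d} for d in (3, 1, 5, 2, 4)]
train_rows, valid_rows = time_ordered_split(rows, "day")
print([r["day"] for r in train_rows], [r["day"] for r in valid_rows])
```

The two parts would then be supplied as the separate training and validation sets when configuring the experiment.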

Compute to run experiment

Next, determine where the model will be trained. An automated machine learning training experiment can run on the following compute options:

  • Your local machine, such as a local desktop or laptop: generally used when you have a small dataset and are still in the exploration stage.

  • A remote machine in the cloud: Azure Machine Learning Managed Compute is a managed service that lets you train machine learning models on clusters of Azure virtual machines.

    See this GitHub site for examples of notebooks with local and remote compute targets.

  • An Azure Databricks cluster in your Azure subscription. For more details, see Set up an Azure Databricks cluster for automated ML.

    See this GitHub site for examples of notebooks with Azure Databricks.

Configure your experiment settings

There are several options that you can use to configure your automated machine learning experiment. These parameters are set by instantiating an AutoMLConfig object. See the AutoMLConfig class for a full list of parameters.

Some examples include:

  1. Classification experiment using AUC weighted as the primary metric, with the experiment timeout set to 30 minutes and 2 cross-validation folds.

        automl_classifier = AutoMLConfig(
            task='classification',
            primary_metric='AUC_weighted',
            experiment_timeout_minutes=30,
            blacklist_models=['XGBoostClassifier'],
            training_data=train_data,
            label_column_name=label,
            n_cross_validations=2)
    
  2. Regression experiment set to end after 60 minutes, with five cross-validation folds.

        automl_regressor = AutoMLConfig(
            task='regression',
            experiment_timeout_minutes=60,
            whitelist_models=['KNN'],
            primary_metric='r2_score',
            training_data=train_data,
            label_column_name=label,
            n_cross_validations=5)
    

The three task parameter values (the third task type is forecasting, which uses an algorithm pool similar to that of regression tasks) determine the list of models to apply. Use the whitelist_models or blacklist_models parameters to further restrict iterations to the available models you want to include or exclude. The list of supported models can be found in the SupportedModels class for Classification, Forecasting, and Regression.
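Conceptually, the include and exclude lists narrow the candidate pool before any iteration runs. The following plain-Python sketch illustrates that filtering; `select_models` is a hypothetical helper, and the candidate names other than 'KNN' and 'XGBoostClassifier' are placeholders rather than confirmed SupportedModels values.

```python
def select_models(candidates, whitelist_models=None, blacklist_models=None):
    """Return the candidate models left after applying include/exclude lists."""
    selected = list(candidates)
    if whitelist_models is not None:
        # Keep only models that were explicitly included.
        selected = [m for m in selected if m in whitelist_models]
    if blacklist_models is not None:
        # Drop models that were explicitly excluded.
        selected = [m for m in selected if m not in blacklist_models]
    return selected


# Placeholder pool for illustration only.
pool = ['KNN', 'XGBoostClassifier', 'ModelA', 'ModelB']
print(select_models(pool, blacklist_models=['XGBoostClassifier']))
print(select_models(pool, whitelist_models=['KNN']))
```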

To help avoid experiment timeout failures, automated ML's validation service requires that experiment_timeout_minutes be set to a minimum of 15 minutes, or 60 minutes if your row-by-column size exceeds 10 million.
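The rule above can be checked before submitting an experiment. This small helper is only a sketch: `minimum_experiment_timeout` is a hypothetical name, and it assumes that "row-by-column size" means the number of rows multiplied by the number of columns.

```python
def minimum_experiment_timeout(n_rows, n_columns):
    """Return the smallest experiment_timeout_minutes the rule above allows."""
    # Assumption: "row-by-column size" is read as rows multiplied by columns.
    cells = n_rows * n_columns
    return 60 if cells > 10_000_000 else 15


print(minimum_experiment_timeout(1_000, 10))
print(minimum_experiment_timeout(2_000_000, 10))
```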

Primary Metric

The primary metric determines the metric to be used during model training for optimization. The metrics you can select depend on the task type you choose. The following table shows valid primary metrics for each task type.

Classification | Regression | Time Series Forecasting
accuracy | spearman_correlation | spearman_correlation
AUC_weighted | normalized_root_mean_squared_error | normalized_root_mean_squared_error
average_precision_score_weighted | r2_score | r2_score
norm_macro_recall | normalized_mean_absolute_error | normalized_mean_absolute_error
precision_score_weighted | |

Learn about the specific definitions of these metrics in Understand automated machine learning results.
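To give a feel for what a normalized error metric measures, here is a minimal sketch of a normalized root mean squared error. It assumes normalization by the range of the observed target values; the exact definition that automated ML uses is given in the linked article, so treat this as an illustration rather than the reference formula.

```python
import math


def normalized_rmse(y_true, y_pred):
    """RMSE divided by the range of the true values (assumed normalization)."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    value_range = max(y_true) - min(y_true)
    return math.sqrt(mse) / value_range


print(normalized_rmse([0.0, 10.0], [1.0, 9.0]))
```

Because the error is scaled by the spread of the target, values are comparable across datasets with different units.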

Data featurization

In every automated machine learning experiment, your data is automatically scaled and normalized to help certain algorithms that are sensitive to features on different scales. You can also enable additional featurization, such as missing-value imputation, encoding, and transforms.

When configuring your experiments in your AutoMLConfig object, you can enable or disable the featurization setting. The following table shows the accepted settings for featurization in the AutoMLConfig class.

Featurization Configuration | Description
"featurization": 'auto' | Indicates that, as part of preprocessing, data guardrails and featurization steps are performed automatically. Default setting.
"featurization": 'off' | Indicates that the featurization step should not be done automatically.
"featurization": 'FeaturizationConfig' | Indicates that a customized featurization step should be used. Learn how to customize featurization.

Note

Automated machine learning featurization steps (feature normalization, handling missing data, converting text to numeric, and so on) become part of the underlying model. When you use the model for predictions, the same featurization steps applied during training are applied to your input data automatically.

Time Series Forecasting

The time series forecasting task requires additional parameters in the configuration object:

  1. time_column_name: Required parameter that defines the name of the column in your training data containing a valid time series.
  2. max_horizon: Defines the length of time you want to predict out, based on the periodicity of the training data. For example, if you have training data with daily time grains, you define how many days ahead you want the model to forecast.
  3. grain_column_names: Defines the names of the columns that contain individual time series data in your training data. For example, if you are forecasting sales of a particular brand by store, you would define the store and brand columns as your grain columns. Separate time series and forecasts are created for each grain/grouping.

For examples of the settings used below, see the sample notebook.

import logging

from azureml.train.automl import AutoMLConfig

# data, time_column_name, n_test_periods, train_data, label, and project_folder
# are assumed to be defined earlier, as in the sample notebook.

# Setting Store and Brand as grains for training.
grain_column_names = ['Store', 'Brand']

# Count the individual time series defined by the grain columns.
nseries = data.groupby(grain_column_names).ngroups
print('Data contains {0} individual time-series.'.format(nseries))

# Forecasting-specific settings, passed to AutoMLConfig as keyword arguments.
time_series_settings = {
    'time_column_name': time_column_name,
    'grain_column_names': grain_column_names,
    'drop_column_names': ['logQuantity'],
    'max_horizon': n_test_periods
}

automl_config = AutoMLConfig(task='forecasting',
                             debug_log='automl_oj_sales_errors.log',
                             primary_metric='normalized_root_mean_squared_error',
                             experiment_timeout_minutes=20,
                             training_data=train_data,
                             label_column_name=label,
                             n_cross_validations=5,
                             path=project_folder,
                             verbosity=logging.INFO,
                             **time_series_settings)

Ensemble configuration

Ensemble models are enabled by default, and appear as the final run iterations in an automated machine learning run. The currently supported ensemble methods are voting and stacking. Voting is implemented as soft voting using weighted averages. Stacking uses a two-layer implementation: the first layer has the same models as the voting ensemble, and the second-layer model is used to find the optimal combination of the models from the first layer. If you are using ONNX models, or have model explainability enabled, stacking is disabled and only voting is used.
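Soft voting with weighted averages can be illustrated in a few lines of plain Python. This is a sketch of the general technique only; `soft_vote`, the example probabilities, and the weights are made up for illustration and say nothing about how automated ML chooses its weights.

```python
def soft_vote(probabilities, weights):
    """Combine per-model class probabilities with a weighted average."""
    total = sum(weights)
    n_classes = len(probabilities[0])
    averaged = [
        sum(w * p[c] for w, p in zip(weights, probabilities)) / total
        for c in range(n_classes)
    ]
    # The ensemble predicts the class with the highest averaged probability.
    predicted_class = max(range(n_classes), key=averaged.__getitem__)
    return averaged, predicted_class


# Class probabilities from three hypothetical models for a single sample.
model_probs = [[0.6, 0.4], [0.3, 0.7], [0.2, 0.8]]
averaged, predicted = soft_vote(model_probs, weights=[0.5, 0.25, 0.25])
print(averaged, predicted)
```

A stacking ensemble would replace the fixed weighted average with a second-layer model trained on the first-layer outputs.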

There are multiple default arguments that can be provided as kwargs in an AutoMLConfig object to alter the default ensemble behavior.

  • ensemble_download_models_timeout_sec: During VotingEnsemble and StackEnsemble model generation, multiple fitted models from the previous child runs are downloaded. If you encounter the error AutoMLEnsembleException: Could not find any models for running ensembling, you may need to provide more time for the models to be downloaded. The default value is 300 seconds for downloading these models in parallel, and there is no maximum timeout limit. If more time is needed, configure this parameter with a value higher than 300 seconds.

    Note

    If the timeout is reached and some models have been downloaded, ensembling proceeds with as many models as it has downloaded. Not all models need to be downloaded within the timeout for ensembling to finish.

The following parameters only apply to StackEnsemble models:

  • stack_meta_learner_type: The meta-learner is a model trained on the output of the individual heterogeneous models. Default meta-learners are LogisticRegression for classification tasks (or LogisticRegressionCV if cross-validation is enabled) and ElasticNet for regression/forecasting tasks (or ElasticNetCV if cross-validation is enabled). This parameter can be one of the following strings: LogisticRegression, LogisticRegressionCV, LightGBMClassifier, ElasticNet, ElasticNetCV, LightGBMRegressor, or LinearRegression.

  • stack_meta_learner_train_percentage: Specifies the proportion of the training set (when choosing the train and validation type of training) to be reserved for training the meta-learner. The default value is 0.2.

  • stack_meta_learner_kwargs: Optional parameters to pass to the initializer of the meta-learner. These parameters and parameter types mirror those of the corresponding model constructor, and are forwarded to the model constructor.

The following code shows an example of specifying custom ensemble behavior in an AutoMLConfig object.

ensemble_settings = {
    "ensemble_download_models_timeout_sec": 600,
    "stack_meta_learner_type": "LogisticRegressionCV",
    "stack_meta_learner_train_percentage": 0.3,
    "stack_meta_learner_kwargs": {
        "refit": True,
        "fit_intercept": False,
        "class_weight": "balanced",
        "multi_class": "auto",
        "n_jobs": -1
    }
}

automl_classifier = AutoMLConfig(
        task='classification',
        primary_metric='AUC_weighted',
        experiment_timeout_minutes=30,
        training_data=train_data,
        label_column_name=label,
        n_cross_validations=5,
        **ensemble_settings
        )

Ensemble training is enabled by default, but it can be disabled by using the enable_voting_ensemble and enable_stack_ensemble boolean parameters.

automl_classifier = AutoMLConfig(
        task='classification',
        primary_metric='AUC_weighted',
        experiment_timeout_minutes=30,
        training_data=train_data,
        label_column_name=label,
        n_cross_validations=5,
        enable_voting_ensemble=False,
        enable_stack_ensemble=False
        )

Run experiment

For automated ML, you create an Experiment object, which is a named object in a Workspace used to run experiments.

from azureml.core.experiment import Experiment
from azureml.core.workspace import Workspace

ws = Workspace.from_config()

# Choose a name for the experiment and specify the project folder.
experiment_name = 'automl-classification'
project_folder = './sample_projects/automl-classification'

experiment = Experiment(ws, experiment_name)

Submit the experiment to run and generate a model. Pass the AutoMLConfig to the submit method to generate the model.

run = experiment.submit(automl_config, show_output=True)

Note

Dependencies are first installed on a new machine. It may take up to 10 minutes before output is shown. Setting show_output to True results in output being shown on the console.

Exit criteria

There are a few options you can define to end your experiment.

  1. No criteria: If you do not define any exit parameters, the experiment continues until no further progress is made on your primary metric.
  2. Exit after a length of time: Using experiment_timeout_minutes in your settings lets you define how long, in minutes, an experiment should continue to run.
  3. Exit after a score has been reached: Using experiment_exit_score completes the experiment after a specified primary metric score has been reached.
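The timeout and score criteria can be pictured as a simple check made between iterations. The function below is a conceptual sketch, not the service's logic: `should_stop` is a hypothetical helper, it assumes a primary metric where higher is better, and it does not model the "no further progress" case.

```python
def should_stop(elapsed_minutes, best_score,
                experiment_timeout_minutes=None, experiment_exit_score=None):
    """Return True when a configured exit criterion is met."""
    if (experiment_timeout_minutes is not None
            and elapsed_minutes >= experiment_timeout_minutes):
        return True
    # Assumption: higher is better for the primary metric (for example AUC_weighted).
    if experiment_exit_score is not None and best_score >= experiment_exit_score:
        return True
    return False


print(should_stop(30, 0.80, experiment_timeout_minutes=30))
print(should_stop(5, 0.95, experiment_exit_score=0.9))
```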

Explore model metrics

You can view your training results in a widget or inline if you are in a notebook. See Track and evaluate models for more details.

For details on how to download or register a model for deployment to a web service, see how and where to deploy a model.

Understand automated ML models

Any model produced using automated ML includes the following steps:

  • Automated feature engineering (if "featurization": 'auto')
  • Scaling/normalization and an algorithm with hyperparameter values

You can get this information from the fitted_model output of automated ML.

automl_config = AutoMLConfig(…)
automl_run = experiment.submit(automl_config …)
best_run, fitted_model = automl_run.get_output()

Automated feature engineering

See the list of preprocessing and automated feature engineering that happens when "featurization": 'auto'.

Consider this example:

  • There are four input features: A (Numeric), B (Numeric), C (Numeric), D (DateTime)
  • Numeric feature C is dropped because it is an ID column with all unique values
  • Numeric features A and B have missing values and are therefore imputed with the mean
  • DateTime feature D is featurized into 11 different engineered features
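The imputation step in this example can be pictured with a short plain-Python sketch: each numeric column with missing values yields an imputed column plus an indicator of which entries were missing. The helper `impute_with_marker` and the sample values are hypothetical, and the real featurizer may differ in detail.

```python
def impute_with_marker(values):
    """Mean-impute missing values and flag which entries were missing."""
    present = [v for v in values if v is not None]
    mean = sum(present) / len(present)
    imputed = [mean if v is None else v for v in values]
    was_null = [int(v is None) for v in values]
    return imputed, was_null


# Feature A with one missing entry.
a_imputed, a_wasnull = impute_with_marker([1.0, None, 3.0])
print(a_imputed, a_wasnull)
```

This is why one raw numeric feature with missing values can account for two engineered features.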

Use these two APIs on the first step of the fitted model to understand more. See this sample notebook.

  • API 1: get_engineered_feature_names() returns a list of engineered feature names.

    Usage:

    fitted_model.named_steps['timeseriestransformer'].get_engineered_feature_names()
    
    Output: ['A', 'B', 'A_WASNULL', 'B_WASNULL', 'year', 'half', 'quarter', 'month', 'day', 'hour', 'am_pm', 'hour12', 'wday', 'qday', 'week']
    

    This list includes all engineered feature names.

    Note

    Use 'timeseriestransformer' for task='forecasting'; otherwise use 'datatransformer' for a 'regression' or 'classification' task.

  • API 2: get_featurization_summary() returns a featurization summary for all the input features.

    Usage:

    fitted_model.named_steps['timeseriestransformer'].get_featurization_summary()
    

    Note

    Use 'timeseriestransformer' for task='forecasting'; otherwise use 'datatransformer' for a 'regression' or 'classification' task.

    Output:

    [{'RawFeatureName': 'A',
      'TypeDetected': 'Numeric',
      'Dropped': 'No',
      'EngineeredFeatureCount': 2,
      'Transformations': ['MeanImputer', 'ImputationMarker']},
     {'RawFeatureName': 'B',
      'TypeDetected': 'Numeric',
      'Dropped': 'No',
      'EngineeredFeatureCount': 2,
      'Transformations': ['MeanImputer', 'ImputationMarker']},
     {'RawFeatureName': 'C',
      'TypeDetected': 'Numeric',
      'Dropped': 'Yes',
      'EngineeredFeatureCount': 0,
      'Transformations': []},
     {'RawFeatureName': 'D',
      'TypeDetected': 'DateTime',
      'Dropped': 'No',
      'EngineeredFeatureCount': 11,
      'Transformations': ['DateTime','DateTime','DateTime','DateTime','DateTime','DateTime','DateTime','DateTime','DateTime','DateTime','DateTime']}]
    

    Where:

    Output | Definition
    RawFeatureName | Input feature/column name from the dataset provided.
    TypeDetected | Detected datatype of the input feature.
    Dropped | Indicates whether the input feature was dropped or used.
    EngineeredFeatureCount | Number of features generated through automated feature engineering transforms.
    Transformations | List of transformations applied to input features to generate engineered features.

Scaling/Normalization and algorithm with hyperparameter values:

To understand the scaling/normalization and algorithm/hyperparameter values for a pipeline, use fitted_model.steps. Learn more about scaling/normalization. Here is a sample output:

[('RobustScaler', RobustScaler(copy=True, quantile_range=[10, 90], with_centering=True, with_scaling=True)), ('LogisticRegression', LogisticRegression(C=0.18420699693267145, class_weight='balanced', dual=False, fit_intercept=True, intercept_scaling=1, max_iter=100, multi_class='multinomial', n_jobs=1, penalty='l2', random_state=None, solver='newton-cg', tol=0.0001, verbose=0, warm_start=False))]

To get more details, use this helper function:

from pprint import pprint


def print_model(model, prefix=""):
    for step in model.steps:
        print(prefix + step[0])
        if hasattr(step[1], 'estimators') and hasattr(step[1], 'weights'):
            pprint({'estimators': list(
                e[0] for e in step[1].estimators), 'weights': step[1].weights})
            print()
            for estimator in step[1].estimators:
                print_model(estimator[1], estimator[0] + ' - ')
        else:
            pprint(step[1].get_params())
            print()


print_model(fitted_model)

The following sample output is for a pipeline using a specific algorithm (LogisticRegression with RobustScaler, in this case).

RobustScaler
{'copy': True,
'quantile_range': [10, 90],
'with_centering': True,
'with_scaling': True}

LogisticRegression
{'C': 0.18420699693267145,
'class_weight': 'balanced',
'dual': False,
'fit_intercept': True,
'intercept_scaling': 1,
'max_iter': 100,
'multi_class': 'multinomial',
'n_jobs': 1,
'penalty': 'l2',
'random_state': None,
'solver': 'newton-cg',
'tol': 0.0001,
'verbose': 0,
'warm_start': False}

Predict class probability

Models produced using automated ML all have wrapper objects that mirror functionality from their open-source origin class. Most classification model wrapper objects returned by automated ML implement the predict_proba() function, which accepts an array-like or sparse matrix data sample of your features (X values), and returns an n-dimensional array of each sample and its respective class probability.

Assuming you have retrieved the best run and fitted model using the same calls from above, you can call predict_proba() directly from the fitted model, supplying an X_test sample in the appropriate format depending on the model type.

best_run, fitted_model = automl_run.get_output()
class_prob = fitted_model.predict_proba(X_test)

If the underlying model does not support the predict_proba() function or the format is incorrect, a model class-specific exception is thrown. See the RandomForestClassifier and XGBoost reference docs for examples of how this function is implemented for different model types.
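Once you have a table of class probabilities, turning it into predicted labels is a matter of picking the most probable class per row. The sketch below uses plain lists in place of the array that predict_proba() returns; `most_likely_classes` and the label names are hypothetical, and the column order is assumed to match the order of `class_labels`.

```python
def most_likely_classes(class_prob, class_labels):
    """Map each row of class probabilities to the label with the highest value."""
    return [
        class_labels[max(range(len(row)), key=row.__getitem__)]
        for row in class_prob
    ]


# Two samples, two classes; the values stand in for predict_proba() output.
example_prob = [[0.9, 0.1], [0.2, 0.8]]
print(most_likely_classes(example_prob, ["no", "yes"]))
```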

Model interpretability

Model interpretability allows you to understand why your models made predictions, and the underlying feature importance values. The SDK includes various packages for enabling model interpretability features, both at training and inference time, for local and deployed models.

See the how-to for code samples on how to enable interpretability features specifically within automated machine learning experiments.

For general information on how model explanations and feature importance can be enabled in other areas of the SDK outside of automated machine learning, see the concept article on interpretability.

Next steps