Prevent overfitting and imbalanced data with automated machine learning

Over-fitting and imbalanced data are common pitfalls when you build machine learning models. By default, Azure Machine Learning's automated machine learning provides charts and metrics to help you identify these risks, and implements best practices to help mitigate them.

Identify over-fitting

Over-fitting in machine learning occurs when a model fits the training data too well, and as a result can't accurately predict on unseen test data. In other words, the model has simply memorized specific patterns and noise in the training data, and is not flexible enough to make predictions on real data.

Consider the following trained models and their corresponding train and test accuracies.

Model   Train accuracy   Test accuracy
A       99.9%            95%
B       87%              87%
C       99.9%            45%

Considering model A, there is a common misconception that if test accuracy on unseen data is lower than training accuracy, the model is over-fitted. However, test accuracy is almost always lower than training accuracy, and the distinction between over-fit and appropriately fit comes down to how much lower.

When comparing models A and B, model A is the better model because it has higher test accuracy. Although its test accuracy of 95% is slightly lower than its training accuracy, the difference is not large enough to suggest over-fitting. You wouldn't choose model B simply because its train and test accuracies are closer together.

Model C represents a clear case of over-fitting; the training accuracy is very high but the test accuracy isn't anywhere near as high. This distinction is subjective, but comes from knowledge of your problem and data, and what magnitudes of error are acceptable.
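The comparison above can be sketched in a few lines of Python. This is only an illustration: the `MAX_GAP` cutoff and the `summarize` helper are assumptions made for this example, not something automated ML applies, and the acceptable gap depends on your problem.

```python
# (train accuracy, test accuracy) for the three models in the table above.
models = {"A": (0.999, 0.95), "B": (0.87, 0.87), "C": (0.999, 0.45)}

# Assumption: a 10-point gap is treated as suspicious. Pick a value that
# matches the error magnitudes acceptable for your own problem.
MAX_GAP = 0.10


def summarize(models, max_gap=MAX_GAP):
    """Return the train/test gap and an over-fitting flag for each model."""
    summary = {}
    for name, (train_acc, test_acc) in models.items():
        gap = train_acc - test_acc
        summary[name] = {"gap": round(gap, 3), "suspect_overfit": gap > max_gap}
    return summary


summary = summarize(models)

# Among models that pass the gap check, prefer the highest test accuracy.
candidates = [name for name, row in summary.items() if not row["suspect_overfit"]]
best = max(candidates, key=lambda name: models[name][1])

for name, row in summary.items():
    print(name, row)
print("selected:", best)
```

With these numbers, only model C is flagged, and model A is selected over model B because the choice is driven by test accuracy rather than by the size of the gap.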

Prevent over-fitting

In the most egregious cases, an over-fitted model will assume that the feature value combinations seen during training will always result in the exact same output for the target.

The best way to prevent over-fitting is to follow ML best practices, including:

  • Using more training data, and eliminating statistical bias
  • Preventing target leakage
  • Using fewer features
  • Regularization and hyperparameter optimization
  • Model complexity limitations
  • Cross-validation

In the context of automated ML, the first three items above are best practices you implement. The last three items are best practices automated ML implements by default to protect against over-fitting. In settings other than automated ML, all six best practices are worth following to avoid over-fitting models.

Best practices you implement

Using more data is the simplest and best possible way to prevent over-fitting, and as an added bonus typically increases accuracy. When you use more data, it becomes harder for the model to memorize exact patterns, and it is forced to reach solutions that are more flexible to accommodate more conditions. It's also important to recognize statistical bias, to ensure your training data doesn't include isolated patterns that won't exist in live-prediction data. This scenario can be difficult to solve, because there may be no over-fitting between your train and test sets, yet over-fitting may still appear when the model is compared against live data.

Target leakage is a similar issue, where you may not see over-fitting between train/test sets, but rather it appears at prediction-time. Target leakage occurs when your model "cheats" during training by having access to data that it shouldn't normally have at prediction-time. For example, if your problem is to predict on Monday what a commodity price will be on Friday, but one of your features accidentally includes data from Thursday, that is data the model won't have at prediction-time since it cannot see into the future. Target leakage is an easy mistake to miss, but is often characterized by abnormally high accuracy for your problem. If you are attempting to predict stock price and trained a model at 95% accuracy, there is likely target leakage somewhere in your features.
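One way to catch the Monday/Thursday case is to compare the date each feature refers to with the date the prediction is made. The following is a minimal sketch; the `leaky_features` helper and the feature names are hypothetical and exist only for this example.

```python
from datetime import date


def leaky_features(feature_dates, prediction_date):
    """Return names of features whose data is dated after the prediction is made."""
    return sorted(
        name for name, as_of in feature_dates.items() if as_of > prediction_date
    )


# The prediction is made on a Monday for the price on the following Friday.
prediction_date = date(2024, 1, 1)

# Hypothetical features and the date of the data each one is built from.
feature_dates = {
    "last_close": date(2023, 12, 29),     # previous Friday: available on Monday
    "weekend_news": date(2023, 12, 31),   # Sunday: available on Monday
    "thursday_price": date(2024, 1, 4),   # Thursday: not yet known on Monday
}

print(leaky_features(feature_dates, prediction_date))
```

A check like this only works if you track where each feature comes from; leakage that enters through a join or an aggregate over a later time window needs the same comparison applied to the underlying rows.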

Removing features can also help with over-fitting by preventing the model from having too many fields to use to memorize specific patterns, thus causing it to be more flexible. It can be difficult to measure quantitatively, but if you can remove features and retain the same accuracy, you have likely made the model more flexible and have reduced the risk of over-fitting.

Best practices automated ML implements

Regularization is the process of adding a penalty to the cost function being minimized, so that complex and over-fitted models are penalized. There are different types of regularization functions, but in general they all penalize model coefficient size, variance, and complexity. Automated ML uses L1 (Lasso), L2 (Ridge), and ElasticNet (L1 and L2 simultaneously) in different combinations with different model hyperparameter settings that control over-fitting. In simple terms, automated ML will vary how much a model is regularized and choose the best result.
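To make the penalty idea concrete, here is a minimal sketch of a regularized cost for a linear model. The `penalized_loss` function and its `l1`/`l2` strength parameters are assumptions for illustration; real libraries scale these terms differently, and this is not automated ML's internal implementation.

```python
def penalized_loss(errors, weights, l1=0.0, l2=0.0):
    """Mean squared error plus L1 and L2 penalties on the model coefficients.

    l1 > 0 and l2 == 0 gives a Lasso-style cost, l2 > 0 and l1 == 0 gives a
    Ridge-style cost, and both > 0 gives an ElasticNet-style cost.
    """
    mse = sum(e * e for e in errors) / len(errors)
    l1_penalty = l1 * sum(abs(w) for w in weights)
    l2_penalty = l2 * sum(w * w for w in weights)
    return mse + l1_penalty + l2_penalty


errors = [1.0, -1.0]    # residuals of some fitted model
weights = [3.0, -4.0]   # its coefficients

print(penalized_loss(errors, weights))                   # no regularization
print(penalized_loss(errors, weights, l1=0.1))           # Lasso-style
print(penalized_loss(errors, weights, l2=0.01))          # Ridge-style
print(penalized_loss(errors, weights, l1=0.1, l2=0.01))  # ElasticNet-style
```

Two models with the same prediction error get different costs if one relies on larger coefficients, which is how the penalty steers the search toward simpler models.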

Automated ML also implements explicit model complexity limitations to prevent over-fitting. In most cases this implementation is specifically for decision tree or forest algorithms, where individual tree max-depth is limited, and the total number of trees used in forest or ensemble techniques is limited.

Cross-validation (CV) is the process of taking many subsets of your full training data and training a model on each subset. The idea is that a model could get "lucky" and have great accuracy with one subset, but by using many subsets the model won't achieve this high accuracy every time. When doing CV, you provide a validation holdout dataset and specify your CV folds (number of subsets), and automated ML will train your model and tune hyperparameters to minimize error on your validation set. One CV fold could be over-fit, but using many of them reduces the probability that your final model is over-fit. The tradeoff is that CV results in longer training times and thus greater cost, because instead of training a model once, you train it once for each of the n CV subsets.
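The fold mechanics can be sketched without any library. The `k_fold_indices` helper below is hypothetical and only shows how n folds partition the rows; automated ML performs its own splitting, and a real setup would usually shuffle or stratify the rows first.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_idx, valid_idx) pairs; each row is used for validation once."""
    if not 2 <= n_folds <= n_samples:
        raise ValueError("n_folds must be between 2 and n_samples")
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one.
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)
        valid_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, valid_idx
        start += size


folds = list(k_fold_indices(n_samples=10, n_folds=3))
for train_idx, valid_idx in folds:
    # A model would be trained on train_idx and scored on valid_idx here.
    print("train:", train_idx, "validate:", valid_idx)
```

Each of the n folds requires a separate training run, which is where the extra training time and cost come from.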

Note

Cross-validation is not enabled by default; it must be configured in automated ML settings. However, after cross-validation is configured and a validation data set has been provided, the process is automated for you.

Identify models with imbalanced data

Imbalanced data is commonly found in data for machine learning classification scenarios, and refers to data that contains a disproportionate ratio of observations in each class. This imbalance can make a model's accuracy look better than it really is, because the input data is biased towards one class, and the trained model learns to mimic that bias.
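A small made-up example shows how accuracy can mislead on imbalanced data. The 95/5 class split and the always-predict-majority "model" below are assumptions chosen for illustration.

```python
# Made-up data set: 95 negative rows (0) and 5 positive rows (1).
labels = [0] * 95 + [1] * 5

# A degenerate "model" that has only learned the bias and always predicts 0.
predictions = [0] * len(labels)

accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)

# Recall on the minority class: the share of true positives that were found.
positives = [p for p, y in zip(predictions, labels) if y == 1]
minority_recall = sum(p == 1 for p in positives) / len(positives)

print(f"accuracy: {accuracy:.2f}")
print(f"minority-class recall: {minority_recall:.2f}")
```

The model scores 95% accuracy while never identifying a single positive row, which is why per-class views of the results are needed.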

Automated ML runs automatically generate the following charts, which can help you understand the correctness of your model's classifications and identify models potentially impacted by imbalanced data.

Chart              Description
Confusion matrix   Evaluates the correctly classified labels against the actual labels of the data.
Precision-recall   Evaluates the ratio of correct labels against the ratio of found label instances of the data.
ROC curves         Evaluates the ratio of correct labels against the ratio of false-positive labels.

Handle imbalanced data

As part of its goal of simplifying the machine learning workflow, automated ML has built-in capabilities to help deal with imbalanced data, such as:

  • A weight column: automated ML supports a column of weights as input, causing rows in the data to be weighted up or down, which can be used to make a class more or less "important".

  • The algorithms used by automated ML detect imbalance when the number of samples in the minority class is equal to or fewer than 20% of the number of samples in the majority class, where minority class refers to the one with the fewest samples and majority class refers to the one with the most samples. AutoML then runs an experiment with sub-sampled data to check whether using class weights would remedy this problem and improve performance. If the experiment shows better performance, the remedy is applied.

  • Use a performance metric that deals better with imbalanced data. For example, AUC_weighted is a primary metric that calculates the contribution of every class based on the relative number of samples representing that class, and hence is more robust against imbalance.
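The 20% detection rule and the weight column from the list above can be sketched as follows. Both helper functions are hypothetical, and the inverse-frequency weighting is a common heuristic assumed here for illustration; the source does not say which weights automated ML computes internally.

```python
from collections import Counter


def is_imbalanced(labels, threshold=0.20):
    """True when the smallest class has at most `threshold` times the largest class."""
    counts = Counter(labels)
    return min(counts.values()) <= threshold * max(counts.values())


def balanced_row_weights(labels):
    """One weight per row, inversely proportional to the frequency of its class.

    Assumption: this heuristic makes every class contribute the same total weight.
    """
    counts = Counter(labels)
    n_rows, n_classes = len(labels), len(counts)
    return [n_rows / (n_classes * counts[label]) for label in labels]


labels = ["fraud"] * 10 + ["ok"] * 90

print(is_imbalanced(labels))   # 10 rows against a cutoff of 0.20 * 90 = 18 rows
weights = balanced_row_weights(labels)
print(weights[0], weights[-1])  # weight of a "fraud" row, weight of an "ok" row
```

A list like `weights` is the kind of per-row value you could store in a weight column so that rows of the rare class count for more during training.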

The following techniques are additional options to handle imbalanced data outside of automated ML.

  • Resampling to even out the class imbalance, either by up-sampling the smaller classes or down-sampling the larger classes. These methods require expertise to apply and to analyze the results.

  • Review performance metrics for imbalanced data. For example, the F1 score is the harmonic mean of precision and recall. Precision measures a classifier's exactness, where higher precision indicates fewer false positives, while recall measures a classifier's completeness, where higher recall indicates fewer false negatives.
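The precision, recall, and F1 definitions above translate directly into code. This is a minimal sketch for a binary problem; the `precision_recall_f1` helper and the sample labels are made up for illustration.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for the class given by `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    # Fewer false positives -> higher precision.
    precision = tp / (tp + fp) if tp + fp else 0.0
    # Fewer false negatives -> higher recall.
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Here the classifier is right on 7 of 10 rows, but it finds only half of the positive rows, and the F1 score of about 0.57 reflects that.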

Next steps

See examples and learn how to build models using automated machine learning: