How to select algorithms for Azure Machine Learning

A common question is “Which machine learning algorithm should I use?” The algorithm you select depends primarily on two aspects of your data science scenario:

  • What do you want to do with your data? Specifically, what business question do you want to answer by learning from your past data?

  • What are the requirements of your data science scenario? Specifically, what accuracy, training time, linearity, number of parameters, and number of features does your solution support?

[Figure: Considerations for choosing algorithms: What do you want to know?]

Business scenarios and the Machine Learning Algorithm Cheat Sheet

The Azure Machine Learning Algorithm Cheat Sheet helps you with the first consideration: what do you want to do with your data? On the cheat sheet, look for the task you want to do, and then find an Azure Machine Learning designer algorithm for your predictive analytics solution.

Machine Learning designer provides a comprehensive portfolio of algorithms, such as Multiclass Decision Forest, Recommendation systems, Neural Network Regression, Multiclass Neural Network, and K-Means Clustering. Each algorithm is designed to address a different type of machine learning problem. See the Machine Learning designer algorithm and module reference for a complete list, along with documentation about how each algorithm works and how to tune its parameters.

Note

To download the Machine Learning Algorithm Cheat Sheet, go to Azure Machine Learning Algorithm Cheat Sheet.

Along with the guidance in the Azure Machine Learning Algorithm Cheat Sheet, keep other requirements in mind when choosing a machine learning algorithm for your solution. The additional factors to consider are accuracy, training time, linearity, number of parameters, and number of features.

Comparison of machine learning algorithms

Some learning algorithms make particular assumptions about the structure of the data or the desired results. If you can find one that fits your needs, it can give you more useful results, more accurate predictions, or faster training times.

The following table summarizes some of the most important characteristics of algorithms from the classification, regression, and clustering families:

| Algorithm | Accuracy | Training time | Linearity | Parameters | Notes |
|---|---|---|---|---|---|
| Classification family | | | | | |
| Two-class logistic regression | Good | Fast | Yes | 4 | |
| Two-class decision forest | Excellent | Moderate | No | 5 | Shows slower scoring times. Avoid combining it with One-vs-All Multiclass, because thread locking while accumulating tree predictions slows scoring further |
| Two-class boosted decision tree | Excellent | Moderate | No | 6 | Large memory footprint |
| Two-class neural network | Good | Moderate | No | 8 | |
| Two-class averaged perceptron | Good | Moderate | Yes | 4 | |
| Two-class support vector machine | Good | Fast | Yes | 5 | Good for large feature sets |
| Multiclass logistic regression | Good | Fast | Yes | 4 | |
| Multiclass decision forest | Excellent | Moderate | No | 5 | Shows slower scoring times |
| Multiclass boosted decision tree | Excellent | Moderate | No | 6 | Tends to improve accuracy with some small risk of less coverage |
| Multiclass neural network | Good | Moderate | No | 8 | |
| One-vs-all multiclass | - | - | - | - | See properties of the two-class method selected |
| Regression family | | | | | |
| Linear regression | Good | Fast | Yes | 4 | |
| Decision forest regression | Excellent | Moderate | No | 5 | |
| Boosted decision tree regression | Excellent | Moderate | No | 6 | Large memory footprint |
| Neural network regression | Good | Moderate | No | 8 | |
| Clustering family | | | | | |
| K-means clustering | Excellent | Moderate | Yes | 8 | A clustering algorithm |

Requirements for a data science scenario

Once you know what you want to do with your data, you need to determine the additional requirements for your solution.

Make choices, and possibly trade-offs, among the following requirements:

  • Accuracy
  • Training time
  • Linearity
  • Number of parameters
  • Number of features

Accuracy

Accuracy in machine learning measures the effectiveness of a model as the proportion of correct results (true positives plus true negatives) among all cases. In Machine Learning designer, the Evaluate Model module computes a set of industry-standard evaluation metrics. You can use this module to measure the accuracy of a trained model.
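The definition above can be illustrated with a few lines of Python. This is only a sketch of the arithmetic, not the Evaluate Model module itself; the function name `accuracy` and the label lists are hypothetical.

```python
def accuracy(actual, predicted):
    """Return the fraction of predictions that match the actual labels."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("at least one case is required")
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual)


# Hypothetical labels for eight scored cases (1 = positive, 0 = negative).
actual_labels = [1, 0, 1, 1, 0, 0, 1, 0]
predicted_labels = [1, 0, 0, 1, 0, 1, 1, 0]

# Six of the eight predictions match, so this prints 0.75.
print(accuracy(actual_labels, predicted_labels))
```

Accuracy on its own can be misleading when one class is rare, which is one reason to look at the other metrics that an evaluation step reports.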

Getting the most accurate answer possible isn’t always necessary. Sometimes an approximation is adequate, depending on what you want to use it for. If that is the case, you may be able to cut your processing time dramatically by sticking with more approximate methods. Approximate methods also naturally tend to avoid overfitting.

There are three ways to use the Evaluate Model module:

  • Generate scores over your training data in order to evaluate the model
  • Generate scores on the model, but compare those scores to scores on a reserved testing set
  • Compare scores for two different but related models, using the same set of data

For a complete list of metrics and approaches you can use to evaluate the accuracy of machine learning models, see Evaluate Model module.

Training time

In supervised learning, training means using historical data to build a machine learning model that minimizes errors. The number of minutes or hours necessary to train a model varies a great deal between algorithms. Training time is often closely tied to accuracy: more accurate models typically take longer to train.

In addition, some algorithms are more sensitive to the number of data points than others. You might choose a specific algorithm because you have a time limitation, especially when the dataset is large.

In Machine Learning designer, creating and using a machine learning model is typically a three-step process:

  1. Configure a model by choosing a particular type of algorithm and then defining its parameters or hyperparameters.

  2. Provide a dataset that is labeled and has data compatible with the algorithm. Connect both the data and the model to the Train Model module.

  3. After training is completed, use the trained model with one of the scoring modules to make predictions on new data.
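The three steps above can be sketched in plain Python. The designer is a visual tool, so this is not its API: the `Perceptron` class, its method names, and the data are hypothetical and only mirror the configure, train, and score stages with a tiny linear classifier.

```python
class Perceptron:
    """A tiny two-class linear classifier, used only to illustrate the workflow."""

    def __init__(self, learning_rate=1.0, max_iterations=20):
        # Step 1: configure the model by fixing its hyperparameters.
        self.learning_rate = learning_rate
        self.max_iterations = max_iterations
        self.weights = None
        self.bias = 0.0

    def _predict_row(self, row):
        activation = self.bias + sum(w * x for w, x in zip(self.weights, row))
        return 1 if activation > 0 else 0

    def train(self, rows, labels):
        # Step 2: fit the weights to a labeled dataset (labels are 0 or 1).
        self.weights = [0.0] * len(rows[0])
        self.bias = 0.0
        for _ in range(self.max_iterations):
            for row, label in zip(rows, labels):
                error = label - self._predict_row(row)
                self.bias += self.learning_rate * error
                self.weights = [w + self.learning_rate * error * x
                                for w, x in zip(self.weights, row)]
        return self

    def score(self, rows):
        # Step 3: use the trained model to predict labels for new data.
        return [self._predict_row(row) for row in rows]


# Hypothetical, linearly separable training data.
train_rows = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
              [3.0, 3.0], [4.0, 2.0], [2.0, 4.0]]
train_labels = [0, 0, 0, 1, 1, 1]

model = Perceptron(learning_rate=1.0, max_iterations=20).train(train_rows, train_labels)
print(model.score([[0.5, 0.0], [3.5, 3.5]]))  # prints [0, 1]
```

Keeping configuration separate from training makes it easy to train the same configuration on different datasets, or different configurations on the same dataset, and compare the results.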

Linearity

Linearity in statistics and machine learning means that the relationship between a variable and the quantity it predicts can be described by a straight line. For example, linear classification algorithms assume that classes can be separated by a straight line (or its higher-dimensional analog).

Many machine learning algorithms make use of linearity. In Azure Machine Learning designer, they include the algorithms marked Yes in the Linearity column of the table above, such as logistic regression, support vector machines, and linear regression.

Linear regression algorithms assume that data trends follow a straight line. This assumption works well for some problems, but for others it reduces accuracy. Despite their drawbacks, linear algorithms are popular as a first strategy. They tend to be algorithmically simple and fast to train.

[Figure: Nonlinear class boundary. Relying on a linear classification algorithm would result in low accuracy.]

[Figure: Data with a nonlinear trend. Using a linear regression method would generate much larger errors than necessary.]
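The cost of a wrong linearity assumption can be seen in a small sketch. It fits an ordinary least-squares line to two hypothetical datasets, one with a linear trend and one with a curved trend, and compares the mean squared error. The helper names and the data are assumptions made for illustration.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def mean_squared_error(xs, ys, slope, intercept):
    """Average squared distance between the data and the fitted line."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys)) / len(xs)


xs = [-3, -2, -1, 0, 1, 2, 3]
linear_ys = [2 * x + 1 for x in xs]   # data with a linear trend
curved_ys = [x * x for x in xs]       # data with a nonlinear trend

linear_mse = mean_squared_error(xs, linear_ys, *fit_line(xs, linear_ys))
curved_mse = mean_squared_error(xs, curved_ys, *fit_line(xs, curved_ys))

print(linear_mse)  # 0.0: the line matches the linear data exactly
print(curved_mse)  # 12.0: the best line is flat and misses the curve
```

On the curved data the best-fitting line has zero slope, so it predicts the mean everywhere and captures none of the trend.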

Number of parameters

Parameters are the knobs a data scientist gets to turn when setting up an algorithm. They are numbers that affect the algorithm’s behavior, such as error tolerance or number of iterations, or options between variants of how the algorithm behaves. The training time and accuracy of the algorithm can sometimes be sensitive to getting just the right settings. Typically, algorithms with large numbers of parameters require the most trial and error to find a good combination.

Alternatively, you can use the Tune Model Hyperparameters module in Machine Learning designer. The goal of this module is to determine the optimum hyperparameters for a machine learning model. The module builds and tests multiple models by using different combinations of settings. It then compares metrics over all models to identify the best combination of settings.

While this is a great way to make sure you’ve spanned the parameter space, the time required to train a model increases exponentially with the number of parameters. The upside is that having many parameters typically indicates that an algorithm has greater flexibility. It can often achieve very good accuracy, provided you can find the right combination of parameter settings.
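The exponential growth is easy to see by counting combinations. The sketch below assumes an exhaustive grid sweep, where every value of each parameter is tried with every value of the others; the parameter names and values are hypothetical and are not taken from any designer module.

```python
from itertools import product


def count_grid_runs(grid):
    """Count the models an exhaustive grid sweep would train."""
    return sum(1 for _ in product(*grid.values()))


# Hypothetical hyperparameter grid with three candidate values per parameter.
grid = {
    "learning_rate": [0.01, 0.1, 1.0],
    "iterations": [10, 100, 1000],
    "regularization": [0.0, 0.1, 1.0],
}

# Add one parameter at a time and count the runs: 3, then 9, then 27.
names = list(grid)
for k in range(1, len(names) + 1):
    partial_grid = {name: grid[name] for name in names[:k]}
    print(k, count_grid_runs(partial_grid))
```

With v candidate values for each of p parameters, a full grid needs v^p training runs, which is why sampling only part of the grid is a common compromise when there are many parameters.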

Number of features

In machine learning, a feature is a quantifiable variable of the phenomenon you are trying to analyze. For certain types of data, the number of features can be very large compared to the number of data points. This is often the case with genetics or textual data.

A large number of features can bog down some learning algorithms, making training time unfeasibly long. Support vector machines are particularly well suited to scenarios with a high number of features. For this reason, they have been used in many applications, from information retrieval to text and image classification. Support vector machines can be used for both classification and regression tasks.

Feature selection refers to the process of applying statistical tests to inputs, given a specified output. The goal is to determine which columns are more predictive of the output. The Filter Based Feature Selection module in Machine Learning designer provides multiple feature selection algorithms to choose from. The module includes correlation methods such as Pearson correlation and chi-squared values.
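As a rough illustration of filter-based selection, the sketch below ranks columns by the absolute value of their Pearson correlation with the output. It is a plain-Python approximation of the idea, not the module's implementation; the column names and values are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 for a constant column."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


# Hypothetical input columns and the output column to predict.
columns = {
    "hours_studied": [1, 2, 3, 4, 5],
    "shoe_size": [40, 38, 41, 38, 40],
    "constant": [1, 1, 1, 1, 1],
}
exam_score = [52, 58, 61, 70, 74]

# Rank columns from most to least correlated with the output.
ranked = sorted(
    ((name, pearson(values, exam_score)) for name, values in columns.items()),
    key=lambda item: abs(item[1]),
    reverse=True,
)
for name, score in ranked:
    print(f"{name}: {score:+.3f}")
```

Pearson correlation only detects linear relationships, so a column with a strong but nonlinear effect on the output can still receive a low score.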

You can also use the Permutation Feature Importance module to compute a set of feature importance scores for your dataset. You can then use these scores to help you determine the best features to use in a model.
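The idea behind permutation importance can be sketched as follows: shuffle one column at a time and measure how much the model's accuracy drops. This is a simplified illustration under stated assumptions, not the module's implementation; the `predict` function stands in for an already-trained model, and all names and data are hypothetical.

```python
import random


def model_accuracy(rows, labels, predict):
    """Fraction of rows for which the model predicts the correct label."""
    return sum(1 for row, y in zip(rows, labels) if predict(row) == y) / len(rows)


def permutation_importance(rows, labels, predict, column, repeats=10, seed=0):
    """Average drop in accuracy after randomly shuffling one column."""
    rng = random.Random(seed)
    baseline = model_accuracy(rows, labels, predict)
    drops = []
    for _ in range(repeats):
        shuffled = [row[column] for row in rows]
        rng.shuffle(shuffled)
        permuted = [row[:column] + [value] + row[column + 1:]
                    for row, value in zip(rows, shuffled)]
        drops.append(baseline - model_accuracy(permuted, labels, predict))
    return sum(drops) / repeats


def predict(row):
    # Hypothetical trained model: it looks only at column 0.
    return 1 if row[0] > 0.5 else 0


rows = [[0.1, 7.0], [0.2, 3.0], [0.3, 9.0], [0.4, 1.0],
        [0.6, 4.0], [0.7, 8.0], [0.8, 2.0], [0.9, 6.0]]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

importances = {
    "signal": permutation_importance(rows, labels, predict, column=0),
    "noise": permutation_importance(rows, labels, predict, column=1),
}
print(importances)
```

Because the stand-in model ignores column 1, shuffling that column never changes a prediction and its importance is exactly zero, while shuffling column 0 can only lower the accuracy.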

Next steps