Two-Class Boosted Decision Tree module

This article describes a module in Azure Machine Learning designer.

Use this module to create a machine learning model that is based on the boosted decision trees algorithm.

A boosted decision tree is an ensemble learning method in which the second tree corrects for the errors of the first tree, the third tree corrects for the errors of the first and second trees, and so forth. Predictions are based on the entire ensemble of trees taken together.
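
The error-correcting idea can be illustrated with a deliberately simplified sketch. It assumes one-split trees (stumps), a single numeric feature, and squared-error residuals. It is not the LightGBM implementation this module uses, and every name in it (`fit_stump`, `BoostedModel`, `fit_boosted`) is hypothetical. Each new stump is fitted to whatever error the ensemble built so far still makes, and its contribution is scaled by a learning rate.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# A stump is (threshold, value_if_x_le_threshold, value_if_x_gt_threshold).
Stump = Tuple[float, float, float]


def fit_stump(xs: List[float], targets: List[float]) -> Stump:
    """Return the one-split tree with the lowest squared error on targets."""
    mean = sum(targets) / len(targets)
    best: Stump = (max(xs), mean, mean)  # fallback when no split helps
    best_err = sum((t - mean) ** 2 for t in targets)
    points = sorted(set(xs))
    for lo, hi in zip(points, points[1:]):
        threshold = (lo + hi) / 2
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        err = sum((t - left_mean) ** 2 for t in left)
        err += sum((t - right_mean) ** 2 for t in right)
        if err < best_err:
            best, best_err = (threshold, left_mean, right_mean), err
    return best


@dataclass
class BoostedModel:
    base_score: float
    learning_rate: float
    stumps: List[Stump] = field(default_factory=list)

    def score(self, x: float) -> float:
        """Sum the base score and the scaled output of every stump."""
        total = self.base_score
        for threshold, left_value, right_value in self.stumps:
            step = left_value if x <= threshold else right_value
            total += self.learning_rate * step
        return total

    def predict(self, x: float) -> int:
        """Turn the raw score into one of the two classes."""
        return 1 if self.score(x) >= 0.5 else 0


def fit_boosted(xs: List[float], ys: List[int], n_trees: int,
                learning_rate: float) -> BoostedModel:
    model = BoostedModel(base_score=sum(ys) / len(ys),
                         learning_rate=learning_rate)
    for _ in range(n_trees):
        # Each new stump is trained on the errors of the ensemble so far.
        residuals = [y - model.score(x) for x, y in zip(xs, ys)]
        model.stumps.append(fit_stump(xs, residuals))
    return model


def mean_squared_error(model: BoostedModel, xs: List[float],
                       ys: List[int]) -> float:
    return sum((y - model.score(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Toy data (made up for illustration): a pattern one split cannot capture.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [0, 0, 1, 1, 0, 0, 1, 1]

one_tree = fit_boosted(xs, ys, n_trees=1, learning_rate=0.5)
many_trees = fit_boosted(xs, ys, n_trees=20, learning_rate=0.5)
print(mean_squared_error(one_tree, xs, ys),
      mean_squared_error(many_trees, xs, ys))
```

On this toy data the training error of the 20-stump ensemble is lower than that of the single stump, which is the behavior the description above refers to. A real learner would also guard against overfitting, which this sketch ignores.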

Generally, when properly configured, boosted decision trees are among the easiest methods for getting top performance on a wide variety of machine learning tasks. However, they are also one of the more memory-intensive learners, and the current implementation holds everything in memory. Therefore, a boosted decision tree model might not be able to process the large datasets that some linear learners can handle.

This module is based on the LightGBM algorithm.

How to configure

This module creates an untrained classification model. Because classification is a supervised learning method, training the model requires a tagged dataset that includes a label column with a value for every row.

You can train this type of model by using the Train Model module.

  1. In Azure Machine Learning, add the Two-Class Boosted Decision Tree module to your pipeline.

  2. Specify how you want the model to be trained by setting the Create trainer mode option.

    • Single Parameter: If you know how you want to configure the model, you can provide a specific set of values as arguments.

    • Parameter Range: If you are not sure of the best parameters, you can find the optimal parameters by using the Tune Model Hyperparameters module. You provide a range of values, and the trainer iterates over multiple combinations of the settings to determine the combination that produces the best result.

  3. For Maximum number of leaves per tree, indicate the maximum number of terminal nodes (leaves) that can be created in any tree.

    By increasing this value, you potentially increase the size of the tree and get better precision, at the risk of overfitting and longer training time.

  4. For Minimum number of samples per leaf node, indicate the number of cases required to create any terminal node (leaf) in a tree.

    By increasing this value, you increase the threshold for creating new rules. For example, with the default value of 1, even a single case can cause a new rule to be created. If you increase the value to 5, the training data must contain at least five cases that meet the same conditions.

  5. For Learning rate, type a number between 0 and 1 that defines the step size while learning.

    The learning rate determines how fast or slow the learner converges on the optimal solution. If the step size is too large, you might overshoot the optimal solution. If the step size is too small, training takes longer to converge on the best solution.

  6. For Number of trees constructed, indicate the total number of decision trees to create in the ensemble. By creating more decision trees, you can potentially get better coverage, but training time will increase.

    If you set the value to 1, only one tree is produced (the tree with the initial set of parameters) and no further iterations are performed.

  7. For Random number seed, optionally type a non-negative integer to use as the random seed value. Specifying a seed ensures reproducibility across runs that have the same data and parameters.

    The random seed is set to 0 by default, which means the initial seed value is obtained from the system clock. Successive runs that rely on this clock-derived seed can produce different results.

  8. Train the model:

    • If you set Create trainer mode to Single Parameter, connect a tagged dataset and the Train Model module.

    • If you set Create trainer mode to Parameter Range, connect a tagged dataset and train the model by using Tune Model Hyperparameters.

    Note

    If you pass a parameter range to Train Model, it uses only the default value in the single parameter list.

    If you pass a single set of parameter values to the Tune Model Hyperparameters module when it expects a range of settings for each parameter, it ignores the values and uses the default values for the learner.

    If you select the Parameter Range option and enter a single value for any parameter, that single value is used throughout the sweep, even if other parameters change across a range of values.

Results

After training is complete:

  • To save a snapshot of the trained model, select the Outputs tab in the right panel of the Train Model module. Select the Register dataset icon to save the model as a reusable module.

  • To use the model for scoring, add the Score Model module to a pipeline.

Next steps

See the set of modules available to Azure Machine Learning.