Fast Forest Quantile Regression

This article describes a module in Azure Machine Learning designer.

Use this module to create a fast forest quantile regression model in a pipeline. Fast forest quantile regression is useful when you want to understand the distribution of the predicted value, rather than get a single mean prediction. This method has many applications, including:

  • Predicting prices

  • Estimating student performance, or applying growth charts to assess child development

  • Discovering predictive relationships in cases where there is only a weak relationship between variables

This regression algorithm is a supervised learning method, which means it requires a labeled dataset that includes a label column. Because it is a regression algorithm, the label column must contain only numerical values.

More about quantile regression

There are many different types of regression. Simply put, regression means fitting a model to a target expressed as a numeric vector. However, statisticians have been developing increasingly advanced methods for regression.

The simplest definition of a quantile is a value that divides a set of data into equal-sized groups; thus, the quantile values mark the boundaries between groups. Statistically speaking, quantiles are values taken at regular intervals from the inverse of the cumulative distribution function (CDF) of a random variable.
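As a minimal sketch of this definition, the following Python snippet computes the quartiles of a small sample with the standard library's statistics.quantiles function. The sample values are hypothetical and chosen only for illustration, and the "inclusive" method is just one of several interpolation conventions, so other tools may report slightly different cut points.

```python
import statistics

# Hypothetical sample data, used only to illustrate the definition.
values = [2, 4, 4, 5, 7, 9, 10, 12, 15, 21, 30]

# n=4 requests the three cut points that divide the sorted data into
# four groups of (roughly) equal size: the quartiles.
quartiles = statistics.quantiles(values, n=4, method="inclusive")

# Count how many observations fall at or below each boundary.
counts_at_or_below = [sum(1 for v in values if v <= q) for q in quartiles]

print("quartiles:", quartiles)
print("observations at or below each quartile:", counts_at_or_below)
```

The middle cut point is the median, which is why a list such as 0.25; 0.5; 0.75 describes the quartiles.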

Whereas linear regression models attempt to predict the value of a numeric variable using a single estimate, the mean, sometimes you need to predict the range or the entire distribution of the target variable. Techniques such as Bayesian regression and quantile regression have been developed for this purpose.

Quantile regression helps you understand the distribution of the predicted value. Tree-based quantile regression models, such as the one used in this module, have the additional advantage that they can be used to predict non-parametric distributions.
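To make the contrast between a mean and a quantile estimate concrete, the sketch below uses the pinball (quantile) loss, a standard way to score a prediction for a given quantile level. This is an illustration of the general idea only; it is an assumption for teaching purposes and does not claim to show how this module is implemented internally. The data and the helper names pinball_loss and best_constant are hypothetical.

```python
import statistics


def pinball_loss(y_true, y_pred, q):
    """Average pinball loss of predictions y_pred at quantile level q."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        diff = y - p
        # Under-predictions are weighted by q, over-predictions by (1 - q).
        total += q * diff if diff >= 0 else (q - 1) * diff
    return total / len(y_true)


def best_constant(y_true, q, candidates):
    """Return the candidate constant prediction with the lowest pinball loss."""
    return min(
        candidates,
        key=lambda c: pinball_loss(y_true, [c] * len(y_true), q),
    )


# Hypothetical targets with one large outlier.
targets = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 100]
candidates = sorted(set(targets))

estimates = {q: best_constant(targets, q, candidates) for q in (0.1, 0.5, 0.9)}

print("mean:", statistics.mean(targets))
print("quantile estimates:", estimates)
```

The mean is pulled upward by the outlier, while the three quantile estimates describe where most of the values lie, which is the kind of information a single mean prediction hides.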

How to configure Fast Forest Quantile Regression

  1. Add the Fast Forest Quantile Regression module to your pipeline in the designer. You can find this module under Machine Learning Algorithms, in the Regression category.

  2. In the right pane of the Fast Forest Quantile Regression module, specify how you want the model to be trained by setting the Create trainer mode option.

    • Single Parameter: If you know how you want to configure the model, provide a specific set of values as arguments. When you train the model, use Train Model.

    • Parameter Range: If you are not sure of the best parameters, do a parameter sweep using the Tune Model Hyperparameters module. The trainer iterates over the multiple values you specify to find the optimal configuration.

  3. For Number of Trees, type the maximum number of trees that can be created in the ensemble. Creating more trees generally leads to greater accuracy, but at the cost of longer training time.

  4. For Number of Leaves, type the maximum number of leaves, or terminal nodes, that can be created in any tree.

  5. For Minimum number of training instances required to form a leaf, specify the minimum number of examples that are required to create any terminal node (leaf) in a tree.

    By increasing this value, you increase the threshold for creating new rules. For example, with the default value of 1, even a single case can cause a new rule to be created. If you increase the value to 5, the training data must contain at least 5 cases that meet the same conditions.

  6. For Bagging fraction, specify a number between 0 and 1 that represents the fraction of samples to use when building each group of quantiles. Samples are chosen randomly, with replacement.

  7. For Split fraction, type a number between 0 and 1 that represents the fraction of features to use in each split of the tree. The features used are always chosen randomly.

  8. For Quantiles to be estimated, type a semicolon-separated list of the quantiles for which you want the model to train and create predictions.

    For example, if you want to build a model that estimates quartiles, you would type 0.25; 0.5; 0.75.

  9. Optionally, type a value for Random number seed to seed the random number generator used by the model. The default is 0, meaning a random seed is chosen.

    You should provide a value if you need to reproduce results across successive runs on the same data.

  10. Connect the training dataset and the untrained model to one of the training modules:

    • If you set Create trainer mode to Single Parameter, use the Train Model module.

    • If you set Create trainer mode to Parameter Range, use the Tune Model Hyperparameters module.

    Warning

    • If you pass a parameter range to Train Model, it uses only the first value in the parameter range list.

    • If you pass a single set of parameter values to the Tune Model Hyperparameters module when it expects a range of settings for each parameter, it ignores the values and uses the default values for the learner.

    • If you select the Parameter Range option and enter a single value for any parameter, that single value is used throughout the sweep, even if other parameters change across a range of values.

  11. Submit the pipeline.
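The sampling-related settings in the steps above can be sketched in plain Python. This is a hedged illustration of what the options mean, not the module's actual code: the helper names parse_quantiles, bagging_sample, and split_features are hypothetical, the rounding rule for sample sizes is an assumption, and the check that each quantile lies strictly between 0 and 1 is an assumption about valid input.

```python
import random


def parse_quantiles(text):
    """Parse a semicolon-separated list such as '0.25; 0.5; 0.75'."""
    quantiles = [float(part) for part in text.split(";") if part.strip()]
    for q in quantiles:
        # Assumption: a quantile level must lie strictly between 0 and 1.
        if not 0.0 < q < 1.0:
            raise ValueError(f"quantile must be between 0 and 1, got {q}")
    return quantiles


def bagging_sample(rows, fraction, rng):
    """Draw a fraction of the rows at random, with replacement."""
    size = max(1, round(fraction * len(rows)))
    return rng.choices(rows, k=size)


def split_features(feature_names, fraction, rng):
    """Pick a random subset of the features to consider for one split."""
    size = max(1, round(fraction * len(feature_names)))
    return rng.sample(feature_names, size)


# Hypothetical settings and data, for illustration only.
quantile_levels = parse_quantiles("0.25; 0.5; 0.75")
rows = list(range(10))
features = ["age", "height", "weight", "income"]

# A fixed, nonzero seed makes the random choices repeatable across runs.
first_run = bagging_sample(rows, 0.7, random.Random(42))
second_run = bagging_sample(rows, 0.7, random.Random(42))
chosen_features = split_features(features, 0.5, random.Random(42))

print(quantile_levels)
print(first_run == second_run)
print(chosen_features)
```

Because sampling is done with replacement, the same row can appear more than once in a bagged sample, and reusing the same seed reproduces the same sample.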

Results

After training is complete:

  • To save a snapshot of the trained model, select the training module, then switch to the Outputs+logs tab in the right panel. Click the Register dataset icon. You can find the saved model as a module in the module tree.

Next steps

See the set of modules available to Azure Machine Learning.