Two-Class Logistic Regression module

This article describes a module in Azure Machine Learning designer (preview).

Use this module to create a logistic regression model that predicts one of two (and only two) outcomes.

Logistic regression is a well-known statistical technique used to model many kinds of problems. This algorithm is a supervised learning method; therefore, you must provide a dataset that already contains the outcomes to train the model.

About logistic regression

Logistic regression is a well-known statistical method for predicting the probability of an outcome, and it is especially popular for classification tasks. The algorithm predicts the probability that an event occurs by fitting data to a logistic function.

In this module, the classification algorithm is optimized for dichotomous (binary) variables. If you need to classify multiple outcomes, use the Multiclass Logistic Regression module.
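As a rough illustration of "fitting data to a logistic function", the sketch below shows how a fitted model turns a weighted sum of feature values into a probability and then into one of two labels. The coefficients, feature values, and the 0.5 cutoff are hypothetical values chosen for illustration; they are not taken from the module.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function: maps any real-valued score to a probability in (0, 1)."""
    # Branch on the sign of z so that math.exp never overflows.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_proba(weights, bias, features):
    """Probability of the positive class for a single example."""
    score = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(score)

# Hypothetical coefficients and one example, for illustration only.
p = predict_proba(weights=[0.8, -0.4], bias=0.1, features=[1.0, 2.0])
label = "Yes" if p >= 0.5 else "No"
print(f"probability={p:.3f}, label={label}")
```

Training amounts to choosing the weights and bias so that these probabilities match the outcomes in the labeled dataset as closely as possible.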

How to configure

To train this model, you must provide a dataset that contains a label or class column. Because this module is intended for two-class problems, the label or class column must contain exactly two values.

For example, the label column might be [Voted], with possible values of "Yes" or "No". Or it might be [Credit Risk], with possible values of "High" or "Low".
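Before wiring a dataset into the pipeline, it can help to confirm that the label column really holds exactly two distinct values. The helper below is a minimal sketch of such a pre-check; the function name `check_two_class` is hypothetical and is not part of the designer, and how the module itself validates labels is not described here.

```python
def check_two_class(labels):
    """Return the two distinct label values, or raise if there are not exactly two."""
    distinct = sorted(set(labels))
    if len(distinct) != 2:
        raise ValueError(
            f"expected exactly 2 distinct label values, found {len(distinct)}: {distinct}"
        )
    return distinct

# Example with a [Voted]-style column.
voted = ["Yes", "No", "Yes", "Yes", "No"]
print(check_two_class(voted))
```

A column with a third value (for example "High", "Low", and "Medium") would raise a `ValueError`, which is a signal to use a multiclass model instead.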

  1. Add the Two-Class Logistic Regression module to your pipeline.

  2. Specify how you want the model to be trained by setting the Create trainer mode option.

    • Single Parameter: If you know how you want to configure the model, you can provide a specific set of values as arguments.

    • Parameter Range: If you are not sure of the best parameters, you can find the optimal parameters by using the Tune Model Hyperparameters module. You provide a range of values, and the trainer iterates over multiple combinations of the settings to determine the combination that produces the best result.

  3. For Optimization tolerance, specify a threshold value to use when optimizing the model. If the improvement between iterations falls below the specified threshold, the algorithm is considered to have converged on a solution, and training stops.

  4. For L1 regularization weight and L2 regularization weight, type a value to use for the regularization parameters L1 and L2. A non-zero value is recommended for both.

    Regularization is a method for preventing overfitting by penalizing models with extreme coefficient values. It works by adding a penalty that is associated with the coefficient values to the error of the hypothesis. Thus, an accurate model with extreme coefficient values is penalized more, whereas a less accurate model with more conservative values is penalized less.

    L1 and L2 regularization have different effects and uses.

    • L1 can be applied to sparse models, which is useful when working with high-dimensional data.

    • In contrast, L2 regularization is preferable for data that is not sparse.

    This algorithm supports a linear combination of L1 and L2 regularization values: that is, if x = L1 and y = L2, then ax + by = c defines the linear span of the regularization terms.

    Note

    Want to learn more about L1 and L2 regularization? The following article discusses how L1 and L2 regularization differ and how they affect model fitting, with code samples for logistic regression and neural network models: L1 and L2 Regularization for Machine Learning

    Different linear combinations of L1 and L2 terms have been devised for logistic regression models: for example, elastic net regularization. We suggest that you reference these combinations to define a linear combination that is effective in your model.

  5. For Memory size for L-BFGS, specify the amount of memory to use for L-BFGS optimization.

    L-BFGS stands for "limited memory Broyden-Fletcher-Goldfarb-Shanno". It is an optimization algorithm that is popular for parameter estimation. This parameter indicates the number of past positions and gradients to store for the computation of the next step.

    This optimization parameter limits the amount of memory that is used to compute the next step and direction. When you specify less memory, training is faster but less accurate.

  6. For Random number seed, type an integer value. Defining a seed value is important if you want the results to be reproducible over multiple runs of the same pipeline.

  7. Add a labeled dataset to the pipeline, and train the model:

    • If you set Create trainer mode to Single Parameter, connect a labeled dataset and the Train Model module.

    • If you set Create trainer mode to Parameter Range, connect a labeled dataset and train the model by using Tune Model Hyperparameters.

    Note

    If you pass a parameter range to Train Model, it uses only the default value in the single parameter list.

    If you pass a single set of parameter values to the Tune Model Hyperparameters module when it expects a range of settings for each parameter, it ignores the values and uses the default values for the learner.

    If you select the Parameter Range option and enter a single value for any parameter, that single value is used throughout the sweep, even if other parameters change across a range of values.

  8. Submit the pipeline.
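The settings in the steps above can be made concrete with a small standard-library sketch. It is not the module's implementation: plain gradient descent stands in for L-BFGS (so the memory-size setting has no counterpart here), and the function names, learning rate, iteration cap, penalty form (L1 weight times the sum of absolute coefficients plus half the L2 weight times the sum of squared coefficients), and sample values are all assumptions made for illustration. What it does show is a tolerance-based stopping rule, a combined L1 and L2 penalty, a fixed random seed, and a parameter sweep in which a single-valued parameter stays constant.

```python
import math
import random
from itertools import product

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def penalized_loss(w, b, X, y, l1, l2):
    """Mean log loss plus L1 and L2 penalties on the weights (bias unpenalized)."""
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for xi, yi in zip(X, y):
        p = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi)))
        total -= yi * math.log(p + eps) + (1 - yi) * math.log(1.0 - p + eps)
    penalty = l1 * sum(abs(wj) for wj in w) + 0.5 * l2 * sum(wj * wj for wj in w)
    return total / len(X) + penalty

def train(X, y, l1=0.01, l2=0.01, tolerance=1e-7, seed=0, lr=0.1, max_iter=10000):
    """Gradient descent (a stand-in for L-BFGS) with a tolerance-based stop."""
    rng = random.Random(seed)  # same seed -> same initial weights -> same result
    n = len(X[0])
    w = [rng.uniform(-0.01, 0.01) for _ in range(n)]
    b = 0.0
    prev = penalized_loss(w, b, X, y, l1, l2)
    steps = 0
    for steps in range(1, max_iter + 1):
        grad_w, grad_b = [0.0] * n, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            grad_b += err
            for j in range(n):
                grad_w[j] += err * xi[j]
        m = len(X)
        for j in range(n):
            sign = (w[j] > 0) - (w[j] < 0)  # subgradient of |w_j|
            w[j] -= lr * (grad_w[j] / m + l1 * sign + l2 * w[j])
        b -= lr * grad_b / m
        cur = penalized_loss(w, b, X, y, l1, l2)
        if abs(prev - cur) < tolerance:  # improvement below threshold: stop
            break
        prev = cur
    return w, b, steps

def sweep(X, y, grid):
    """Try every combination in `grid`; a one-element list stays fixed throughout."""
    names = list(grid)
    best = None
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        w, b, _ = train(X, y, **params)
        # Scored on training log loss for brevity; a real sweep would score
        # each combination on held-out data.
        score = penalized_loss(w, b, X, y, 0.0, 0.0)
        if best is None or score < best[0]:
            best = (score, params)
    return best

# Tiny illustrative dataset: one feature, two classes.
X = [[0.0], [1.0], [2.0], [3.0]]
y = [0, 0, 1, 1]

w, b, steps = train(X, y, seed=42)                                # single parameter set
best_score, best_params = sweep(X, y, {"l1": [0.0, 0.1], "l2": [0.01]})  # parameter range
```

Calling `train` twice with the same seed returns identical weights, which is the reproducibility that the Random number seed setting is meant to provide. In the `sweep` call, `l2` has a single value and is therefore held at 0.01 while `l1` varies.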

Results

After training is complete:

  • To make predictions on new data, use the trained model and new data as input to the Score Model module.

Next steps

See the set of modules available to Azure Machine Learning.