PCA-Based Anomaly Detection module

This article describes how to use the PCA-Based Anomaly Detection module in Azure Machine Learning designer to create an anomaly detection model based on principal component analysis (PCA).

This module helps you build a model in scenarios where it's easy to get training data from one class, such as valid transactions, but difficult to get sufficient samples of the targeted anomalies.

For example, to detect fraudulent transactions, you often don't have enough examples of fraud to train on. But you might have many examples of good transactions. The PCA-Based Anomaly Detection module solves the problem by analyzing available features to determine what constitutes a "normal" class. The module then applies distance metrics to identify cases that represent anomalies. This approach lets you train a model by using existing imbalanced data.

More about principal component analysis

PCA is an established technique in machine learning. It's frequently used in exploratory data analysis because it reveals the inner structure of the data and explains the variance in the data.

PCA works by analyzing data that contains multiple variables. It looks for correlations among the variables and determines the combination of values that best captures differences in outcomes. These combined feature values are used to create a more compact feature space called the principal components.

For anomaly detection, each new input is analyzed. The anomaly detection algorithm computes its projection on the eigenvectors, together with a normalized reconstruction error. The normalized error is used as the anomaly score. The higher the error, the more anomalous the instance is.
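
The scoring step described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the module's implementation: the function names `fit_pca` and `reconstruction_errors`, the synthetic data, and the choice to normalize each reconstruction error by the largest error seen on the training data are all assumptions made for the example.

```python
import numpy as np


def fit_pca(X, n_components):
    """Learn the mean and the top principal directions from 'normal' data."""
    mean = X.mean(axis=0)
    # Rows of vt are eigenvectors of the covariance matrix of the centered
    # data, ordered by decreasing eigenvalue.
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]


def reconstruction_errors(X, mean, components):
    """Squared distance between each row and its projection on the subspace."""
    centered = X - mean
    projected = centered @ components.T       # coordinates in the subspace
    reconstructed = projected @ components    # back in the original space
    return np.sum((centered - reconstructed) ** 2, axis=1)


# Synthetic "normal" data that lies close to a line in 3-D space.
rng = np.random.default_rng(0)
t = rng.normal(size=(200, 1))
X_train = t @ np.array([[1.0, 2.0, -1.0]]) + 0.01 * rng.normal(size=(200, 3))

mean, components = fit_pca(X_train, n_components=1)

# Assumption: normalize by the largest training error so scores are comparable.
scale = reconstruction_errors(X_train, mean, components).max()

X_new = np.array([
    [1.0, 2.0, -1.0],   # on the line: looks normal
    [1.0, -2.0, 1.0],   # far from the line: looks anomalous
])
scores = reconstruction_errors(X_new, mean, components) / scale
print(scores)
```

The second row sits far from the subspace learned from the training data, so its score comes out much larger than the first row's.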

For more information about how PCA works, and about the implementation for anomaly detection, see these papers:

How to configure PCA-Based Anomaly Detection

  1. Add the PCA-Based Anomaly Detection module to your pipeline in the designer. You can find this module in the Anomaly Detection category.

  2. In the right panel of the module, select the Training mode option. Indicate whether you want to train the model by using a specific set of parameters, or use a parameter sweep to find the best parameters.

    If you know how you want to configure the model, select the Single Parameter option, and provide a specific set of values as arguments.

  3. For Number of components to use in PCA, specify the number of output features or components that you want.

    The decision of how many components to include is an important part of experiment design that uses PCA. General guidance is that you should not include the same number of PCA components as there are variables. Instead, you should start with a smaller number of components and increase them until some criterion is met.

    The best results are obtained when the number of output components is less than the number of feature columns available in the dataset.

  4. Specify the amount of oversampling to perform during randomized PCA training. In anomaly detection problems, imbalanced data makes it difficult to apply standard PCA techniques. By specifying some amount of oversampling, you can increase the number of target instances.

    If you specify 1, no oversampling is performed. If you specify any value higher than 1, additional samples are generated to use in training the model.

    There are two options, depending on whether you're using a parameter sweep or not:

    • Oversampling parameter for randomized PCA: Type a single whole number that represents the ratio of oversampling of the minority class over the normal class. (This option is available when you're using the Single Parameter training method.)

    Note

    You can't view the oversampled dataset. For more information on how oversampling is used with PCA, see Technical notes.

  5. Select the Enable input feature mean normalization option to normalize all input features to a mean of zero. Normalization or scaling to zero is generally recommended for PCA, because the goal of PCA is to maximize variance among variables.

    This option is selected by default. Deselect it if values have already been normalized through a different method or scale.

  6. Connect a tagged training dataset and one of the training modules.

    If you set the Create trainer mode option to Single Parameter, use the Train Anomaly Detection Model module.

  7. Submit the pipeline.
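
Steps 3 and 5 can be explored offline before you configure the module. The sketch below is only an illustration and is not part of the designer: it centers the features to a mean of zero and then picks the smallest number of components whose cumulative explained variance reaches a threshold. The helper name `choose_n_components`, the 95% threshold, and the synthetic data are assumptions.

```python
import numpy as np


def choose_n_components(X, threshold=0.95):
    """Smallest k whose top-k components explain `threshold` of the variance."""
    centered = X - X.mean(axis=0)  # mean normalization (step 5)
    # Squared singular values are proportional to the covariance eigenvalues.
    singular_values = np.linalg.svd(centered, compute_uv=False)
    explained = singular_values ** 2
    cumulative = np.cumsum(explained) / explained.sum()
    # First index where the cumulative ratio reaches the threshold, as a count.
    return int(np.searchsorted(cumulative, threshold) + 1)


# Synthetic data: 5 feature columns driven by only 2 underlying factors.
rng = np.random.default_rng(1)
factors = rng.normal(size=(500, 2))
mixing = rng.normal(size=(2, 5))
X = factors @ mixing + 0.05 * rng.normal(size=(500, 5))

k = choose_n_components(X)
print(f"components to keep: {k} of {X.shape[1]} feature columns")
```

Because the synthetic columns are combinations of two factors, the selected count stays well below the number of feature columns, which matches the guidance in step 3. A cumulative-variance threshold is just one possible stopping criterion.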

Results

When training is complete, you can save the trained model. Or you can connect it to the Score Model module to predict anomaly scores.

To evaluate the results of an anomaly detection model:

  1. Ensure that a score column is available in both datasets.

    If you try to evaluate an anomaly detection model and get the error "There is no score column in scored dataset to compare," you're using a typical evaluation dataset that contains a label column but no probability scores. Choose a dataset that matches the schema output for anomaly detection models, which includes Scored Labels and Scored Probabilities columns.

  2. Ensure that label columns are marked.

    Sometimes the metadata associated with the label column is removed in the pipeline graph. If this happens, when you use the Evaluate Model module to compare the results of two anomaly detection models, you might get the error "There is no label column in scored dataset." Or you might get the error "There is no label column in scored dataset to compare."

    You can avoid these errors by adding the Edit Metadata module before the Evaluate Model module. Use the column selector to choose the class column, and in the Fields list, select Label.

  3. Use the Execute Python Script module to adjust label column categories to 1 (positive, normal) and 0 (negative, abnormal).

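    # Replace 'XXX' (label column name) and YY (anomaly label value) with your own values.
    label_column_name = 'XXX'
    anomaly_label_category = YY
    # Map the anomaly category to 0 and every other label to 1.
    dataframe1[label_column_name] = dataframe1[label_column_name].apply(lambda x: 0 if x == anomaly_label_category else 1)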
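
In the Execute Python Script module, this snippet runs inside the module's `azureml_main` entry point. The following sketch shows it in that context. The column name `Class`, the anomaly value `1`, and the sample rows are hypothetical stand-ins for the placeholders in the snippet.

```python
import pandas as pd


def azureml_main(dataframe1=None, dataframe2=None):
    label_column_name = "Class"    # hypothetical label column
    anomaly_label_category = 1     # hypothetical value that marks anomalies
    # Map anomalies to 0 and everything else to 1.
    dataframe1[label_column_name] = dataframe1[label_column_name].apply(
        lambda x: 0 if x == anomaly_label_category else 1
    )
    # The module expects a sequence of DataFrames as the return value.
    return dataframe1,


# Local check with made-up rows; in the designer, the module calls azureml_main.
sample = pd.DataFrame({"amount": [10.0, 2500.0, 12.5], "Class": [0, 1, 0]})
(result,) = azureml_main(sample)
print(result["Class"].tolist())  # [1, 0, 1]
```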
    

Technical notes

This algorithm uses PCA to approximate the subspace that contains the normal class. The subspace is spanned by eigenvectors associated with the top eigenvalues of the data covariance matrix.

For each new input, the anomaly detector first computes its projection on the eigenvectors, and then computes the normalized reconstruction error. This error is the anomaly score. The higher the error, the more anomalous the instance. For details on how the normal space is computed, see Wikipedia: Principal component analysis.
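
In symbols, the projection and reconstruction can be written as follows. The notation is an assumption introduced here for illustration: \mu is the mean of the training data, V_k is the matrix whose columns are the eigenvectors for the top k eigenvalues, and x is a new input. This article doesn't specify how the error is normalized, so only the unnormalized error is shown.

```latex
\begin{aligned}
z &= V_k^{\top}(x - \mu), \\
\hat{x} &= \mu + V_k z, \\
e(x) &= \lVert x - \hat{x} \rVert_2^2 .
\end{aligned}
```

An input that lies in the normal subspace has e(x) close to zero, and the error grows as the input moves away from that subspace.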

Next steps

See the set of modules available to Azure Machine Learning.

See Exceptions and error codes for the designer for a list of errors specific to the designer modules.