Detect data drift (preview) on datasets

Learn how to monitor data drift and set alerts when drift is high.

With Azure Machine Learning dataset monitors (preview), you can:

  • Analyze drift in your data to understand how it changes over time.
  • Monitor model data for differences between training and serving datasets. Start by collecting model data from deployed models.
  • Monitor new data for differences between any baseline and target dataset.
  • Profile features in data to track how statistical properties change over time.
  • Set up alerts on data drift for early warning of potential issues.
  • [Create a new dataset version](how-to-version-track-datasets) when you determine the data has drifted too much.

An Azure Machine Learning dataset is used to create the monitor. The dataset must include a timestamp column.

You can view data drift metrics with the Python SDK or in Azure Machine Learning studio. Other metrics and insights are available through the Azure Application Insights resource associated with the Azure Machine Learning workspace.

Important

Data drift detection for datasets is currently in public preview. The preview version is provided without a service level agreement, and it's not recommended for production workloads. Certain features might not be supported or might have constrained capabilities.

Prerequisites

To create and work with dataset monitors, you need:

What is data drift?

Data drift is one of the top reasons model accuracy degrades over time. For machine learning models, data drift is the change in model input data that leads to model performance degradation. Monitoring data drift helps detect these model performance issues.

Causes of data drift include:

  • Upstream process changes, such as a sensor being replaced that changes the units of measurement from inches to centimeters.
  • Data quality issues, such as a broken sensor always reading 0.
  • Natural drift in the data, such as mean temperature changing with the seasons.
  • Change in relation between features, or covariate shift.

Azure Machine Learning simplifies drift detection by computing a single metric that abstracts away the complexity of the datasets being compared. These datasets may have hundreds of features and tens of thousands of rows. Once drift is detected, you drill down into which features are causing the drift. You then inspect feature-level metrics to debug and isolate the root cause of the drift.

This top-down approach makes it easier to monitor data than traditional rules-based techniques. Rules-based techniques, such as an allowed data range or allowed unique values, can be time consuming and error prone.
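For contrast, here is a minimal sketch of the kind of rules-based check described above. The `check_rules` helper, the rule format, and the column names are all hypothetical and are not part of any Azure SDK; the point is only that every column needs its own hand-written rule, which is where the maintenance cost comes from.

```python
def check_rules(rows, rules):
    """Return (row_index, column, value) for every value that breaks a rule.

    rows:  list of dicts, one per record (hypothetical input format)
    rules: dict mapping a column name to either
           {"range": (low, high)} or {"allowed": set_of_values}
    """
    violations = []
    for index, row in enumerate(rows):
        for column, rule in rules.items():
            value = row.get(column)
            if "range" in rule:
                low, high = rule["range"]
                # a missing value also counts as a violation
                if value is None or not (low <= value <= high):
                    violations.append((index, column, value))
            elif "allowed" in rule and value not in rule["allowed"]:
                violations.append((index, column, value))
    return violations


rules = {
    "temperature": {"range": (-60.0, 60.0)},  # allowed data range
    "state": {"allowed": {"WA", "OR"}},       # allowed unique values
}
rows = [
    {"temperature": 21.5, "state": "WA"},
    {"temperature": 0.0, "state": "CA"},      # unexpected category
    {"temperature": 140.0, "state": "OR"},    # out of range
]
violations = check_rules(rows, rules)
```

Note that a broken sensor that always reads 0 passes the range rule here, which is one way such rules miss drift.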

In Azure Machine Learning, you use dataset monitors to detect and alert for data drift.

Dataset monitors

With a dataset monitor you can:

  • Detect and alert to data drift on new data in a dataset.
  • Analyze historical data for drift.
  • Profile new data over time.

The data drift algorithm provides an overall measure of change in the data and indicates which features warrant further investigation. Dataset monitors produce a number of other metrics by profiling new data in the timeseries dataset.

Custom alerting can be set up on all metrics generated by the monitor through Azure Application Insights. Dataset monitors can be used to quickly catch data issues and reduce the time needed to debug them by identifying likely causes.

Conceptually, there are three primary scenarios for setting up dataset monitors in Azure Machine Learning.

| Scenario | Description |
| -------- | ----------- |
| Monitor a model's serving data for drift from the training data | Results from this scenario can be interpreted as monitoring a proxy for the model's accuracy, since model accuracy degrades when the serving data drifts from the training data. |
| Monitor a time series dataset for drift from a previous time period | This scenario is more general, and can be used to monitor datasets involved upstream or downstream of model building. The target dataset must have a timestamp column. The baseline dataset can be any tabular dataset that has features in common with the target dataset. |
| Perform analysis on past data | This scenario can be used to understand historical data and inform decisions in settings for dataset monitors. |

Dataset monitors depend on the following Azure services.

| Azure service | Description |
| ------------- | ----------- |
| Dataset | Drift uses Machine Learning datasets to retrieve training data and compare data for model training. A data profile is generated to produce some of the reported metrics, such as min, max, distinct values, and distinct values count. |
| Azure Machine Learning pipeline and compute | The drift calculation job is hosted in an Azure Machine Learning pipeline. The job is triggered on demand or by schedule to run on a compute configured at drift monitor creation time. |
| Application Insights | Drift emits metrics to the Application Insights resource belonging to the machine learning workspace. |
| Azure Blob storage | Drift emits metrics in JSON format to Azure Blob storage. |

Baseline and target datasets

You monitor Azure Machine Learning datasets for data drift. When you create a dataset monitor, you reference your:

  • Baseline dataset - usually the training dataset for a model.
  • Target dataset - usually model input data - is compared over time to your baseline dataset. This comparison means that your target dataset must have a timestamp column specified.

The monitor compares the baseline and target datasets.

Create target dataset

The target dataset needs the timeseries trait set on it by specifying the timestamp column, either from a column in the data or from a virtual column derived from the path pattern of the files. Create the dataset with a timestamp through the Python SDK or Azure Machine Learning studio. A column representing a "timestamp" must be specified to add the timeseries trait to the dataset. If your data is partitioned into a folder structure with time info, such as '{yyyy/MM/dd}', create a virtual column through the path pattern setting and set it as the "partition timestamp" to make better use of the time series functionality.

The Dataset class with_timestamp_columns() method defines the timestamp column for the dataset.

from azureml.core import Workspace, Dataset, Datastore

# get workspace object
ws = Workspace.from_config()

# get datastore object 
dstore = Datastore.get(ws, 'your datastore name')

# specify datastore paths
dstore_paths = [(dstore, 'weather/*/*/*/*/data.parquet')]

# specify partition format
partition_format = 'weather/{state}/{date:yyyy/MM/dd}/data.parquet'

# create the Tabular dataset with 'state' and 'date' as virtual columns 
dset = Dataset.Tabular.from_parquet_files(path=dstore_paths, partition_format=partition_format)

# assign the timestamp attribute to a real or virtual column in the dataset
dset = dset.with_timestamp_columns('date')

# register the dataset as the target dataset
dset = dset.register(ws, 'target')

For a full example of using the timeseries trait of datasets, see the example notebook or the datasets SDK documentation.
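To make the idea of a virtual column concrete, the following standalone sketch shows what a partition format such as 'weather/{state}/{date:yyyy/MM/dd}/data.parquet' expresses: values for `state` and `date` are read out of each file path. This is an illustration only, written with a hand-made regular expression; it is not how the SDK parses partition formats, and the `virtual_columns` helper is hypothetical.

```python
import re
from datetime import datetime

# Hand-written equivalent of 'weather/{state}/{date:yyyy/MM/dd}/data.parquet'
# (an assumption made for illustration, not the SDK's parser).
PATH_PATTERN = re.compile(
    r"^weather/(?P<state>[^/]+)/(?P<date>\d{4}/\d{2}/\d{2})/data\.parquet$"
)


def virtual_columns(path):
    """Return the virtual column values encoded in a file path, or None."""
    match = PATH_PATTERN.match(path)
    if match is None:
        return None
    return {
        "state": match.group("state"),
        "date": datetime.strptime(match.group("date"), "%Y/%m/%d"),
    }


columns = virtual_columns("weather/WA/2019/01/31/data.parquet")
```

Every row read from that file would carry the same `state` and `date` values, which is why `date` can serve as the timestamp column even though it is not stored inside the file.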

Create dataset monitor

Create a dataset monitor to detect and alert to data drift on a new dataset. Use either the Python SDK or Azure Machine Learning studio.

See the Python SDK reference documentation on data drift for full details.

The following example shows how to create a dataset monitor using the Python SDK:

from azureml.core import Workspace, Dataset
from azureml.datadrift import DataDriftDetector
from datetime import datetime

# get the workspace object
ws = Workspace.from_config()

# get the target dataset
target = Dataset.get_by_name(ws, 'target')

# set the baseline dataset
baseline = target.time_before(datetime(2019, 2, 1))

# set up feature list
features = ['latitude', 'longitude', 'elevation', 'windAngle', 'windSpeed', 'temperature', 'snowDepth', 'stationName', 'countryOrRegion']

# set up data drift detector
monitor = DataDriftDetector.create_from_datasets(ws, 'drift-monitor', baseline, target, 
                                                      compute_target='cpu-cluster', 
                                                      frequency='Week', 
                                                      feature_list=None, 
                                                      drift_threshold=.6, 
                                                      latency=24)

# get data drift detector by name
monitor = DataDriftDetector.get_by_name(ws, 'drift-monitor')

# update data drift detector
monitor = monitor.update(feature_list=features)

# run a backfill for January through May
backfill1 = monitor.backfill(datetime(2019, 1, 1), datetime(2019, 5, 1))

# run a backfill for May through today
backfill2 = monitor.backfill(datetime(2019, 5, 1), datetime.today())

# disable the pipeline schedule for the data drift detector
monitor = monitor.disable_schedule()

# enable the pipeline schedule for the data drift detector
monitor = monitor.enable_schedule()

Tip

For a full example of setting up a timeseries dataset and data drift detector, see our example notebook.

Understand data drift results

This section shows you the results of monitoring a dataset, found in the Datasets / Dataset monitors page in Azure Machine Learning studio. You can update the settings as well as analyze existing data for a specific time period on this page.

Start with the top-level insights into the magnitude of data drift and a highlight of features to be further investigated.

[Image: Drift overview]

| Metric | Description |
| ------ | ----------- |
| Data drift magnitude | A percentage of drift between the baseline and target dataset over time. Ranging from 0 to 100, 0 indicates identical datasets and 100 indicates the Azure Machine Learning data drift model can completely tell the two datasets apart. Because this magnitude is generated with machine learning techniques, some noise in the exact percentage is expected. |
| Top drifting features | Shows the features from the dataset that have drifted the most and are therefore contributing the most to the Drift Magnitude metric. Due to covariate shift, the underlying distribution of a feature does not necessarily need to change to have relatively high feature importance. |
| Threshold | Data drift magnitude beyond the set threshold triggers alerts. The threshold can be configured in the monitor settings. |
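One way to build intuition for a 0 to 100 score where 100 means "a model can completely tell the two datasets apart" is the toy sketch below. It is not the Azure Machine Learning drift model, whose details are not given here. As a stand-in, it fits the simplest possible classifier (a single threshold on one numeric feature) to separate baseline rows from target rows, then rescales its accuracy so that chance level (0.5) maps to 0 and perfect separation maps to 100. The function name and the rescaling are assumptions made for illustration, and the sketch assumes both samples have the same size.

```python
def toy_drift_magnitude(baseline, target):
    """Toy drift score in [0, 100] for one numeric feature.

    Assumes len(baseline) == len(target), so that chance accuracy is 0.5.
    Accuracy is measured on the same data the threshold is chosen from,
    so small samples will show some noise above 0.
    """
    labeled = [(x, False) for x in baseline] + [(x, True) for x in target]
    best_accuracy = 0.5
    for threshold in sorted({x for x, _ in labeled}):
        # classifier: predict "target" when the value exceeds the threshold
        correct = sum((x > threshold) == is_target for x, is_target in labeled)
        accuracy = correct / len(labeled)
        # also consider the mirrored classifier (predict "target" when x <= threshold)
        best_accuracy = max(best_accuracy, accuracy, 1 - accuracy)
    return round((2 * best_accuracy - 1) * 100, 1)


identical = toy_drift_magnitude([1, 2, 3, 4], [1, 2, 3, 4])
separated = toy_drift_magnitude([1, 2, 3], [10, 11, 12])
drift_threshold = 60  # hypothetical alert threshold on the 0-100 scale
should_alert = separated > drift_threshold
```

With identical samples no threshold does better than chance, so the score is 0; with fully separated samples the score is 100 and would exceed the hypothetical threshold.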

Drift magnitude trend

See how the baseline dataset differs from the target dataset in the specified time period. The closer to 100%, the more the two datasets differ.

[Image: Drift magnitude trend]

Drift magnitude by features

This section contains feature-level insights into the change in the selected feature's distribution, as well as other statistics, over time.

The target dataset is also profiled over time. For each feature, the statistical distance between its baseline distribution and its distribution in the target dataset is tracked over time. Conceptually, this is similar to the data drift magnitude. However, this statistical distance is for an individual feature rather than all features. Min, max, and mean are also available.

In Azure Machine Learning studio, click a bar in the graph to see the feature-level details for that date. By default, you see the baseline dataset's distribution and the most recent run's distribution of the same feature.

[Image: Drift magnitude by features]

These metrics can also be retrieved in the Python SDK through the get_metrics() method on a DataDriftDetector object.

Feature details

Finally, scroll down to view details for each individual feature. Use the dropdowns above the chart to select the feature, and additionally select the metric you want to view.

[Image: Numeric feature graph and comparison]

Metrics in the chart depend on the type of feature.

  • Numeric features

    | Metric | Description |
    | ------ | ----------- |
    | Wasserstein distance | Minimum amount of work to transform baseline distribution into the target distribution. |
    | Mean value | Average value of the feature. |
    | Min value | Minimum value of the feature. |
    | Max value | Maximum value of the feature. |

  • Categorical features

    | Metric | Description |
    | ------ | ----------- |
    | Euclidean distance | Computed for categorical columns. Euclidean distance is computed on two vectors, generated from the empirical distribution of the same categorical column in the two datasets. 0 indicates there is no difference in the empirical distributions. The more it deviates from 0, the more this column has drifted. Trends can be observed from a time series plot of this metric and can be helpful in uncovering a drifting feature. |
    | Unique values | Number of unique values (cardinality) of the feature. |
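The two distance metrics above can be sketched with their textbook definitions. This is a minimal standalone illustration, not the service's implementation: the function names are hypothetical, and the Wasserstein version assumes two samples of equal size, in which case the distance reduces to the mean absolute difference between the sorted values.

```python
import math
from collections import Counter


def wasserstein_1d(baseline, target):
    """1-D Wasserstein distance between two equal-size numeric samples."""
    if len(baseline) != len(target) or not baseline:
        raise ValueError("this sketch needs two non-empty samples of equal size")
    pairs = zip(sorted(baseline), sorted(target))
    return sum(abs(b - t) for b, t in pairs) / len(baseline)


def euclidean_categorical(baseline, target):
    """Euclidean distance between the empirical category frequencies."""
    baseline_counts, target_counts = Counter(baseline), Counter(target)
    distance_squared = 0.0
    for category in set(baseline_counts) | set(target_counts):
        # a Counter returns 0 for a category it has not seen
        baseline_freq = baseline_counts[category] / len(baseline)
        target_freq = target_counts[category] / len(target)
        distance_squared += (baseline_freq - target_freq) ** 2
    return math.sqrt(distance_squared)


shifted = wasserstein_1d([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
same_mix = euclidean_categorical(["a", "b", "a"], ["a", "a", "b"])
disjoint = euclidean_categorical(["a", "a"], ["b", "b"])
```

Shifting every value by 1 gives a Wasserstein distance of 1, and two columns with the same category mix give a Euclidean distance of 0, matching the "0 indicates no difference" reading in the table.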

On this chart, select a single date to compare the feature distribution between the target and this date for the displayed feature. For numeric features, this shows two probability distributions. If the feature is categorical, a bar chart is shown.

[Image: Select a date to compare to target]

Metrics, alerts, and events

Metrics can be queried in the Azure Application Insights resource associated with your machine learning workspace. You have access to all features of Application Insights, including setting up custom alert rules and action groups to trigger an action such as an email, SMS, push, or voice notification, or an Azure Function. Refer to the complete Application Insights documentation for details.

To get started, navigate to the Azure portal and select your workspace's Overview page. The associated Application Insights resource is on the far right:

[Image: Azure portal overview]

Select Logs (Analytics) under Monitoring on the left pane:

[Image: Application Insights overview]

The dataset monitor metrics are stored as customMetrics. You can write and run a query after setting up a dataset monitor to view them:

[Image: Log analytics query]

After identifying metrics to set up alert rules, create a new alert rule:

[Image: New alert rule]

You can use an existing action group, or create a new one to define the action to be taken when the set conditions are met:

[Image: New action group]

Troubleshooting

Limitations and known issues for data drift monitors:

  • The time range when analyzing historical data is limited to 31 intervals of the monitor's frequency setting.

  • Monitors are limited to 200 features, unless no feature list is specified (in which case all features are used).

  • Compute size must be large enough to handle the data.

  • Ensure your dataset has data within the start and end date for a given monitor run.

  • Dataset monitors only work on datasets that contain 50 rows or more.

  • Columns, or features, in the dataset are classified as categorical or numeric based on the conditions in the following table. If the feature does not meet these conditions - for instance, a column of type string with >100 unique values - the feature is dropped from the data drift algorithm, but is still profiled.

    | Feature type | Data type | Condition | Limitations |
    | ------------ | --------- | --------- | ----------- |
    | Categorical | string, bool, int, float | The number of unique values in the feature is less than 100 and less than 5% of the number of rows. | Null is treated as its own category. |
    | Numerical | int, float | The values in the feature are of a numerical data type and do not meet the condition for a categorical feature. | Feature dropped if >15% of values are null. |
  • When you have created a data drift monitor but cannot see data on the Dataset monitors page in Azure Machine Learning studio, try the following.

    1. Check if you have selected the right date range at the top of the page.
    2. On the Dataset Monitors tab, select the experiment link to check run status. This link is on the far right of the table.
    3. If the run completed successfully, check the driver logs to see how many metrics have been generated or whether there are any warning messages. Find driver logs in the Output + logs tab after you click on an experiment.
  • If the SDK backfill() function does not generate the expected output, it may be due to an authentication issue. When you create the compute to pass into this function, do not use Run.get_context().experiment.workspace.compute_targets. Instead, use ServicePrincipalAuthentication such as the following to create the compute that you pass into that backfill() function:

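    from azureml.core import Workspace
    from azureml.core.authentication import ServicePrincipalAuthentication

    # authenticate with a service principal instead of the run context
    auth = ServicePrincipalAuthentication(
            tenant_id=tenant_id,
            service_principal_id=app_id,
            service_principal_password=client_secret
            )
    ws = Workspace.get("xxx", auth=auth, subscription_id="xxx", resource_group="xxx")
    compute = ws.compute_targets.get("xxx")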
    
  • From the Model Data Collector, it can take up to (but usually less than) 10 minutes for data to arrive in your blob storage account. In a script or notebook, wait 10 minutes to ensure that the cells below run successfully.

    import time
    time.sleep(600)
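The feature classification rules from the table earlier in this list (categorical versus numerical, and when a feature is dropped) can be sketched as a small function. This is only a reading of the stated conditions, not the service's code: the `classify_feature` name is hypothetical, nulls are represented as `None`, and counting null as one of the unique values is an assumption based on "Null is treated as its own category".

```python
def classify_feature(values):
    """Return 'categorical', 'numeric', or 'dropped' for one column."""
    row_count = len(values)
    non_null = [v for v in values if v is not None]
    # assumption: None counts as a category of its own
    unique_count = len(set(values))

    supported = all(isinstance(v, (str, bool, int, float)) for v in non_null)
    if supported and unique_count < 100 and unique_count < 0.05 * row_count:
        return "categorical"

    # bool is a subclass of int in Python, so exclude it explicitly
    numeric = bool(non_null) and all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in non_null
    )
    if numeric:
        null_fraction = (row_count - len(non_null)) / row_count
        return "dropped" if null_fraction > 0.15 else "numeric"

    # e.g. a string column with too many unique values
    return "dropped"


few_categories = classify_feature(["rain", "sun"] * 50)
continuous = classify_feature([float(i) for i in range(1000)])
too_many_nulls = classify_feature([float(i) for i in range(80)] + [None] * 20)
high_cardinality_text = classify_feature([str(i) for i in range(200)])
```

A dropped feature is excluded from the drift calculation only; per the note above, it is still profiled.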
    

Next steps