Collect data for models in production

APPLIES TO: Basic edition, Enterprise edition

This article shows how to collect data from an Azure Machine Learning model deployed on an Azure Kubernetes Service (AKS) cluster. The collected data is then stored in Azure Blob storage.

Once collection is enabled, the data you collect helps you:

  • Monitor data drift on the production data you collect.

  • Analyze collected data using Power BI.

  • Make better decisions about when to retrain or optimize your model.

  • Retrain your model with the collected data.

What is collected and where it goes

The following data can be collected:

  • Model input data from web services deployed in an AKS cluster. Voice audio, images, and video are not collected.

  • Model predictions using production input data.

Note

Preaggregation and precalculations on this data are not currently part of the collection service.

The output is saved in Blob storage. Because the data is added to Blob storage, you can choose your favorite tool to run the analysis.

The path to the output data in the blob follows this syntax:

/modeldata/<subscriptionid>/<resourcegroup>/<workspace>/<webservice>/<model>/<version>/<designation>/<year>/<month>/<day>/data.csv
# example: /modeldata/1a2b3c4d-5e6f-7g8h-9i10-j11k12l13m14/myresourcegrp/myWorkspace/aks-w-collv9/best_model/10/inputs/2018/12/31/data.csv
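
As a minimal sketch, the path syntax above can be assembled in Python. The helper name `model_data_path` and its parameters are illustrative and not part of the SDK. Writing month and day without zero padding is an assumption based on the March filter path (`<year>/3`) shown later in this article, so check it against your own container.

```python
from datetime import date


def model_data_path(subscription_id, resource_group, workspace, webservice,
                    model, version, designation, day):
    """Build the blob path for one day's collected data (illustrative helper)."""
    parts = [
        "modeldata", subscription_id, resource_group, workspace, webservice,
        model, str(version), designation,
        # Assumption: month and day are not zero-padded.
        str(day.year), str(day.month), str(day.day),
        "data.csv",
    ]
    return "/" + "/".join(parts)


# Reproduces the example path shown above.
example = model_data_path(
    "1a2b3c4d-5e6f-7g8h-9i10-j11k12l13m14", "myresourcegrp", "myWorkspace",
    "aks-w-collv9", "best_model", 10, "inputs", date(2018, 12, 31))
print(example)
```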

Note

In versions of the Azure Machine Learning SDK for Python earlier than version 0.1.0a16, the designation argument is named identifier. If you developed your code with an earlier version, you need to update it accordingly.

Prerequisites

Enable data collection

You can enable data collection regardless of the model you deploy through Azure Machine Learning or other tools.

To enable data collection, you need to:

  1. Open the scoring file.

  2. Add the following code at the top of the file:

    from azureml.monitoring import ModelDataCollector
    
  3. Declare your data collection variables in your init function:

    global inputs_dc, prediction_dc
    inputs_dc = ModelDataCollector("best_model", designation="inputs", feature_names=["feat1", "feat2", "feat3", "feat4", "feat5", "feat6"])
    prediction_dc = ModelDataCollector("best_model", designation="predictions", feature_names=["prediction1", "prediction2"])
    

    CorrelationId is an optional parameter. You don't need to use it if your model doesn't require it. Using CorrelationId does help you map the collected data to other data more easily, such as LoanNumber or CustomerId.

    The designation parameter (named identifier in SDK versions earlier than 0.1.0a16) is later used for building the folder structure in your blob. You can use it to differentiate raw data from processed data.

  4. Add the following lines of code to the run(input_df) function:

    data = np.array(data)  # assumes "import numpy as np" at the top of the scoring file
    result = model.predict(data)
    inputs_dc.collect(data)  # this call saves the input data to Azure Blob storage
    prediction_dc.collect(result)  # this call saves the prediction data to Azure Blob storage
    
  5. Data collection is not automatically set to true when you deploy a service in AKS. Update your configuration file, as in the following example:

    aks_config = AksWebservice.deploy_configuration(collect_model_data=True)
    

    You can also enable Application Insights for service monitoring by changing this configuration:

    aks_config = AksWebservice.deploy_configuration(collect_model_data=True, enable_app_insights=True)
    
  6. To create a new image and deploy the machine learning model, see How to deploy and where.
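
Before deploying, it can help to exercise the init and run logic from steps 3 and 4 locally. The following is a sketch under stated assumptions, not the SDK's implementation: `FakeDataCollector`, `fake_predict`, and the `collector_cls` argument are hypothetical names introduced here. The stand-in mirrors only the constructor arguments and the `collect` call used in this article, and its row-length check is a convenience of the stand-in, not a claim about how `ModelDataCollector` behaves.

```python
class FakeDataCollector:
    """Hypothetical local stand-in for ModelDataCollector; keeps rows in memory."""

    def __init__(self, model_name, designation="default", feature_names=None):
        self.model_name = model_name
        self.designation = designation
        self.feature_names = list(feature_names or [])
        self.rows = []

    def collect(self, input_data):
        # Record each row; reject rows whose length differs from feature_names.
        for row in input_data:
            row = list(row)
            if self.feature_names and len(row) != len(self.feature_names):
                raise ValueError(
                    f"{self.designation}: expected {len(self.feature_names)} "
                    f"values per row, got {len(row)}")
            self.rows.append(row)


def init(collector_cls=FakeDataCollector):
    # Same declarations as step 3, with the collector class made swappable.
    global inputs_dc, prediction_dc
    inputs_dc = collector_cls("best_model", designation="inputs",
                              feature_names=["feat1", "feat2", "feat3",
                                             "feat4", "feat5", "feat6"])
    prediction_dc = collector_cls("best_model", designation="predictions",
                                  feature_names=["prediction1", "prediction2"])


def fake_predict(data):
    # Placeholder for model.predict: two made-up outputs per input row.
    return [[sum(row), max(row)] for row in data]


def run(data):
    result = fake_predict(data)
    inputs_dc.collect(data)        # records the input rows
    prediction_dc.collect(result)  # records the prediction rows
    return result


init()
output = run([[1, 2, 3, 4, 5, 6]])
print(output)
print(inputs_dc.rows, prediction_dc.rows)
```

In the deployed scoring file, the collectors would be the `ModelDataCollector` instances from step 3 and `fake_predict` would be the real model's predict call.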

Disable data collection

You can stop collecting data at any time. Use Python code to disable data collection.

# replace <service_name> with the name of the web service
<service_name>.update(collect_model_data=False)

Validate and analyze your data

You can choose a tool of your preference to analyze the data collected in your Blob storage.

Quickly access your blob data

  1. Sign in to Azure Machine Learning.

  2. Open your workspace.

  3. Select Storage.

  4. Follow the path to the blob's output data with this syntax:

    /modeldata/<subscriptionid>/<resourcegroup>/<workspace>/<webservice>/<model>/<version>/<designation>/<year>/<month>/<day>/data.csv
    # example: /modeldata/1a2b3c4d-5e6f-7g8h-9i10-j11k12l13m14/myresourcegrp/myWorkspace/aks-w-collv9/best_model/10/inputs/2018/12/31/data.csv
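
When browsing many blobs, it can be convenient to split each path back into its components and select by designation or date. This is a rough sketch that assumes paths follow the syntax above exactly. `parse_model_data_path` and `select_paths` are hypothetical helper names, and the sample paths are made up for illustration.

```python
FIELDS = ("subscriptionid", "resourcegroup", "workspace", "webservice",
          "model", "version", "designation", "year", "month", "day")


def parse_model_data_path(path):
    """Split a collected-data blob path into its named components."""
    parts = path.strip("/").split("/")
    if (len(parts) != len(FIELDS) + 2 or parts[0] != "modeldata"
            or parts[-1] != "data.csv"):
        raise ValueError(f"not a model data path: {path}")
    return dict(zip(FIELDS, parts[1:-1]))


def select_paths(paths, designation, year, month):
    """Keep only the paths for one designation and one calendar month."""
    selected = []
    for path in paths:
        info = parse_model_data_path(path)
        if (info["designation"] == designation
                and int(info["year"]) == year
                and int(info["month"]) == month):
            selected.append(path)
    return selected


# Made-up sample paths for illustration only.
base = "/modeldata/sub-id/myresourcegrp/myWorkspace/aks-w-collv9/best_model/10"
sample_paths = [
    base + "/inputs/2018/12/31/data.csv",
    base + "/predictions/2018/12/31/data.csv",
    base + "/inputs/2019/3/1/data.csv",
]
info = parse_model_data_path(sample_paths[0])
december_inputs = select_paths(sample_paths, "inputs", 2018, 12)
print(info["model"], info["version"], december_inputs)
```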
    

Analyze model data using Power BI

  1. Download and open Power BI Desktop.

  2. Select Get Data and select Azure Blob Storage.

  3. Add your storage account name and enter your storage key. You can find this information by selecting Settings > Access keys in your storage account.

  4. Select the modeldata container and select Edit.

  5. In the query editor, click under the Name column and add your storage account.

  6. Enter your model path into the filter. If you want to look only into files from a specific year or month, just expand the filter path. For example, to look only into March data, use this filter path:

    /modeldata/<subscriptionid>/<resourcegroupname>/<workspacename>/<webservicename>/<modelname>/<modelversion>/<designation>/<year>/3

  7. Filter the data that is relevant to you based on Name values. If you stored predictions and inputs, you need to create a query for each.

  8. Select the downward double arrows next to the Content column heading to combine the files.

  9. Select OK. The data preloads.

  10. Select Close and Apply.

  11. If you added inputs and predictions, your tables are automatically ordered by RequestId values.

  12. Start building your custom reports on your model data.
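
Outside Power BI, the same pairing of inputs and predictions by RequestId can be sketched in plain Python. This article does not describe the column layout of data.csv, so everything in the sample data other than the RequestId column name is an assumption, and `join_on_request_id` is a hypothetical helper.

```python
import csv
import io

# Made-up sample content; real data.csv columns may differ.
INPUTS_CSV = """RequestId,feat1,feat2
req-1,0.5,1.5
req-2,0.7,2.1
"""
PREDICTIONS_CSV = """RequestId,prediction1
req-2,0
req-1,1
"""


def join_on_request_id(inputs_text, predictions_text):
    """Pair each input row with the prediction row sharing its RequestId."""
    predictions = {row["RequestId"]: row
                   for row in csv.DictReader(io.StringIO(predictions_text))}
    joined = []
    for row in csv.DictReader(io.StringIO(inputs_text)):
        match = predictions.get(row["RequestId"])
        if match is None:
            continue  # skip inputs with no recorded prediction
        merged = dict(row)
        merged.update(match)
        joined.append(merged)
    return joined


rows = join_on_request_id(INPUTS_CSV, PREDICTIONS_CSV)
for row in rows:
    print(row)
```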

Next steps

Detect data drift on the data you have collected.