Modeling stage of the Team Data Science Process lifecycle

This article outlines the goals, tasks, and deliverables associated with the modeling stage of the Team Data Science Process (TDSP). The TDSP provides a recommended lifecycle that you can use to structure your data-science projects. The lifecycle outlines the major stages that projects typically execute, often iteratively:

  1. Business understanding
  2. Data acquisition and understanding
  3. Modeling
  4. Deployment
  5. Customer acceptance

Here is a visual representation of the TDSP lifecycle:

[Figure: TDSP lifecycle]

Goals

  • Determine the optimal data features for the machine-learning model.
  • Create an informative machine-learning model that predicts the target most accurately.
  • Create a machine-learning model that's suitable for production.

How to do it

This stage addresses three main tasks:

  • Feature engineering: Create data features from the raw data to facilitate model training.
  • Model training: Find the model that answers the question most accurately by comparing the success metrics of the candidate models.
  • Determine if your model is suitable for production.

Feature engineering

Feature engineering involves the inclusion, aggregation, and transformation of raw variables to create the features used in the analysis. If you want insight into what is driving a model, you need to understand how the features relate to each other and how the machine-learning algorithms use those features.

This step requires a creative combination of domain expertise and the insights obtained from the data exploration step. Feature engineering is a balancing act: you want to find and include informative variables while avoiding too many unrelated ones. Informative variables improve your result; unrelated variables introduce unnecessary noise into the model. You also need to generate these features for any new data obtained during scoring. As a result, the generation of these features can depend only on data that's available at the time of scoring.
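As an illustration of the points above, the following is a minimal sketch of aggregating and transforming raw variables into features while using only data that would be available at scoring time. The transaction records, the `build_features` function, the `as_of` cutoff, and the feature names are all hypothetical and are not part of the TDSP itself.

```python
from collections import defaultdict
from datetime import date
import math

# Hypothetical raw transactions: (customer_id, purchase_date, amount).
RAW = [
    ("c1", date(2024, 1, 5), 20.0),
    ("c1", date(2024, 2, 10), 35.0),
    ("c1", date(2024, 3, 1), 500.0),
    ("c2", date(2024, 2, 20), 12.5),
]


def build_features(rows, as_of):
    """Aggregate raw rows into one feature dict per customer.

    Only rows dated strictly before `as_of` are used, so the same function
    can be run at scoring time without looking into the future.
    """
    amounts = defaultdict(list)
    last_seen = {}
    for customer, day, amount in rows:
        if day >= as_of:
            continue  # not available at scoring time
        amounts[customer].append(amount)
        last_seen[customer] = max(last_seen.get(customer, day), day)

    features = {}
    for customer, values in amounts.items():
        total = sum(values)
        features[customer] = {
            "n_purchases": len(values),          # aggregation
            "mean_amount": total / len(values),  # aggregation
            "log_total": math.log1p(total),      # transformation
            "days_since_last": (as_of - last_seen[customer]).days,
        }
    return features


features = build_features(RAW, as_of=date(2024, 2, 15))
```

Because the cutoff is an explicit argument, the same code path can serve both training (with a historical `as_of`) and scoring (with the current date), which is one way to keep the two feature sets consistent.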

For technical guidance on feature engineering when you use various Azure data technologies, see Feature engineering in the data science process.

Model training

Depending on the type of question that you're trying to answer, there are many modeling algorithms available. For guidance on choosing the algorithms, see How to choose algorithms for Microsoft Azure Machine Learning. Although that article uses Azure Machine Learning, the guidance it provides is useful for any machine-learning project.

The process for model training includes the following steps:

  • Split the input data randomly for modeling into a training data set and a test data set.
  • Build the models by using the training data set.
  • Evaluate the models on both the training and the test data sets. Use a series of competing machine-learning algorithms along with their associated tuning parameters (known as a parameter sweep) that are geared toward answering the question of interest with the current data.
  • Determine the “best” solution to answer the question by comparing the success metrics between alternative methods.
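The steps above can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions: the synthetic data from `make_data`, the k-nearest-neighbors classifier used as a stand-in model, the candidate values of `k`, and accuracy as the success metric are all hypothetical choices, not the TDSP's own tooling. For k-nearest neighbors, "building" the model amounts to keeping the training rows.

```python
import random


def make_data(n, rng):
    """Hypothetical two-class data: the label is 1 when x + y > 1."""
    rows = []
    for _ in range(n):
        x, y = rng.random(), rng.random()
        rows.append(((x, y), int(x + y > 1.0)))
    return rows


def knn_predict(train, point, k):
    """Majority vote among the k training points nearest to `point`."""
    nearest = sorted(
        train,
        key=lambda row: (row[0][0] - point[0]) ** 2 + (row[0][1] - point[1]) ** 2,
    )[:k]
    votes = sum(label for _, label in nearest)
    return int(votes * 2 > k)


def accuracy(train, rows, k):
    """Fraction of `rows` whose label is predicted correctly."""
    hits = sum(knn_predict(train, point, k) == label for point, label in rows)
    return hits / len(rows)


rng = random.Random(0)
data = make_data(200, rng)

# Step 1: split the input data randomly into training and test sets.
rng.shuffle(data)
train, test = data[:150], data[150:]

# Steps 2-3: build one model per candidate setting (a small parameter
# sweep over odd k, so votes cannot tie) and evaluate each on both sets.
results = []
for k in (1, 5, 15):
    results.append({
        "k": k,
        "train_accuracy": accuracy(train, train, k),
        "test_accuracy": accuracy(train, test, k),
    })

# Step 4: pick the "best" candidate by comparing the success metric.
best = max(results, key=lambda r: r["test_accuracy"])
```

In practice, the sweep is often scored on a separate validation split or with cross-validation, so that the test set remains an unbiased estimate of the chosen model's performance.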


Avoid leakage: You can cause data leakage if you include data from outside the training data set that allows a model or machine-learning algorithm to make unrealistically good predictions. Leakage is a common reason why data scientists get nervous when they get predictive results that seem too good to be true. These dependencies can be hard to detect. Avoiding leakage often requires iterating between building an analysis data set, creating a model, and evaluating the accuracy of the results.
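One common way leakage creeps in is through preprocessing statistics that are computed over all rows rather than over the training rows only. The sketch below contrasts the two. The feature values and the `fit_scaler` and `apply_scaler` helpers are hypothetical, and standardization is just one example of a step that can leak.

```python
import statistics


def fit_scaler(values):
    """Learn centering and scaling statistics from the given values only."""
    return statistics.mean(values), statistics.pstdev(values)


def apply_scaler(values, scaler):
    """Standardize values with previously learned statistics."""
    mean, std = scaler
    return [(v - mean) / std for v in values]


# Hypothetical feature column, already split into training and test rows.
train_values = [1.0, 2.0, 3.0, 4.0]
test_values = [10.0, 12.0]

# Leaky: statistics computed over training + test rows let information
# about the test set influence how the training data is prepared.
leaky_scaler = fit_scaler(train_values + test_values)

# Safer: fit on the training rows only, then reuse those statistics for
# the test rows (and for any new data at scoring time).
scaler = fit_scaler(train_values)
train_scaled = apply_scaler(train_values, scaler)
test_scaled = apply_scaler(test_values, scaler)
```

Here the leaky statistics differ noticeably from the training-only ones because the test rows lie well outside the training range, which is exactly the kind of information a model should not see before evaluation.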

We provide an automated modeling and reporting tool with TDSP that's able to run through multiple algorithms and parameter sweeps to produce a baseline model. It also produces a baseline modeling report that summarizes the performance of each model and parameter combination, including variable importance. This process is also iterative, because it can drive further feature engineering.


The artifacts produced in this stage include:

  • Feature sets: The features developed for the modeling are described in the Feature sets section of the Data definition report. It contains pointers to the code that generates the features and a description of how each feature was generated.
  • Model report: For each model that's tried, a standard, template-based report that provides details on each experiment is produced.
  • Checkpoint decision: Evaluate whether the model performs sufficiently for production. Some key questions to ask are:
    • Does the model answer the question with sufficient confidence given the test data?
    • Should you try any alternative approaches? Should you collect additional data, do more feature engineering, or experiment with other algorithms?
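The checkpoint questions above are ultimately a judgment call, but part of the check can be made explicit. The following is a hypothetical sketch only: the `checkpoint_decision` function, its thresholds (`min_test_score`, `max_gap`), and the example scores are assumptions for illustration, and real acceptance criteria should come from the project's business requirements.

```python
def checkpoint_decision(train_score, test_score, min_test_score, max_gap):
    """Return a suggested next step based on two hypothetical thresholds."""
    if test_score < min_test_score:
        return "iterate: test performance is below the agreed target"
    if train_score - test_score > max_gap:
        return "iterate: large train/test gap suggests overfitting"
    return "proceed: candidate for production review"


# Example scores are made up for illustration.
decision = checkpoint_decision(
    train_score=0.93, test_score=0.90, min_test_score=0.85, max_gap=0.05
)
```

An "iterate" outcome maps back to the second question: collect additional data, do more feature engineering, or experiment with other algorithms.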

Next steps

Here are links to each step in the lifecycle of the TDSP:

  1. Business understanding
  2. Data acquisition and understanding
  3. Modeling
  4. Deployment
  5. Customer acceptance

We provide full end-to-end walkthroughs that demonstrate all the steps in the process for specific scenarios. The Example walkthroughs article provides a list of the scenarios with links and thumbnail descriptions. The walkthroughs illustrate how to combine cloud, on-premises tools, and services into a workflow or pipeline to create an intelligent application.

For examples of how to execute steps in the TDSP by using Azure Machine Learning Studio, see Use the TDSP with Azure Machine Learning.