The business understanding stage of the Team Data Science Process lifecycle

This article outlines the goals, tasks, and deliverables associated with the business understanding stage of the Team Data Science Process (TDSP). The TDSP provides a recommended lifecycle that you can use to structure your data-science projects. The lifecycle outlines the major stages that projects typically execute, often iteratively:

  1. Business understanding
  2. Data acquisition and understanding
  3. Modeling
  4. Deployment
  5. Customer acceptance

Here is a visual representation of the TDSP lifecycle:

[Figure: TDSP lifecycle]


Goals

  • Specify the key variables that are to serve as the model targets and whose related metrics are used to determine the success of the project.
  • Identify the relevant data sources that the business has access to or needs to obtain.

How to do it

This stage addresses two main tasks:

  • Define objectives: Work with your customer and other stakeholders to understand and identify the business problems. Formulate questions that define the business goals and that data science techniques can target.
  • Identify data sources: Find the relevant data that helps you answer the questions that define the objectives of the project.

Define objectives

  1. A central objective of this step is to identify the key business variables that the analysis needs to predict. We refer to these variables as the model targets, and we use the metrics associated with them to determine the success of the project. Two examples of such targets are sales forecasts and the probability of an order being fraudulent.

  2. Define the project goals by asking and refining "sharp" questions that are relevant, specific, and unambiguous. Data science is a process that uses names and numbers to answer such questions. You typically use data science or machine learning to answer five types of questions:

    • How much or how many? (regression)
    • Which category? (classification)
    • Which group? (clustering)
    • Is this weird? (anomaly detection)
    • Which option should be taken? (recommendation)

    Determine which of these questions you're asking and how answering it achieves your business goals.

  3. Define the project team by specifying the roles and responsibilities of its members. Develop a high-level milestone plan that you iterate on as you discover more information.

  4. Define the success metrics. For example, you might want to predict customer churn, and you need an accuracy rate of "x" percent by the end of this three-month project. With these predictions, you can offer customer promotions to reduce churn. The metrics must be SMART:

    • Specific
    • Measurable
    • Achievable
    • Relevant
    • Time-bound
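
The question type and a success metric from the steps above can be written down in a form that is easy to check later. The following is a minimal sketch, assuming a hypothetical churn classifier. The names `QuestionType`, `SuccessMetric`, and `is_met`, the 0.80 target, and the dates are illustrative assumptions and are not part of the TDSP.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum


class QuestionType(Enum):
    """The five question types, mapped to their technique families."""
    HOW_MUCH = "regression"
    WHICH_CATEGORY = "classification"
    WHICH_GROUP = "clustering"
    IS_THIS_WEIRD = "anomaly detection"
    WHICH_OPTION = "recommendation"


@dataclass(frozen=True)
class SuccessMetric:
    """Hypothetical record of one success metric for the project charter."""
    name: str               # Specific: what is measured
    target: float           # Measurable: threshold the model must reach
    deadline: date          # Time-bound: when the target must be reached
    question: QuestionType  # The sharp question the metric belongs to

    def is_met(self, observed: float, on: date) -> bool:
        """True when the observed value reaches the target by the deadline."""
        return observed >= self.target and on <= self.deadline


churn_metric = SuccessMetric(
    name="churn prediction accuracy",
    target=0.80,  # stands in for "x" percent; an illustrative value
    deadline=date(2024, 3, 31),
    question=QuestionType.WHICH_CATEGORY,
)

print(churn_metric.is_met(observed=0.83, on=date(2024, 3, 15)))  # True
print(churn_metric.is_met(observed=0.83, on=date(2024, 4, 15)))  # False
```

Whether a metric is Achievable and Relevant remains a judgment made with the customer and stakeholders; the sketch only captures the parts that can be checked mechanically.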

Identify data sources

Identify data sources that contain known examples of answers to your sharp questions. Look for the following data:

  • Data that's relevant to the question. Do you have measures of the target and features that are related to the target?
  • Data that's an accurate measure of your model target and the features of interest.

For example, you might find that the existing systems need to collect and log additional kinds of data to address the problem and achieve the project goals. In this situation, you might want to look for external data sources or update your systems to collect new data.
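
One way to check whether a candidate source actually contains measures of the target and the features of interest is to count how often each column is populated. The following is a minimal sketch, assuming the data is available as a list of dictionaries. The function `coverage_report`, the sample `orders` rows, and all column names are hypothetical.

```python
from typing import Any, Dict, Sequence


def coverage_report(
    rows: Sequence[Dict[str, Any]],
    target: str,
    features: Sequence[str],
) -> Dict[str, float]:
    """Return, per column, the fraction of rows that hold a non-missing value."""
    total = len(rows)
    report: Dict[str, float] = {}
    for column in [target, *features]:
        present = sum(1 for row in rows if row.get(column) is not None)
        report[column] = present / total if total else 0.0
    return report


# Hypothetical sample of order records; the last row has no fraud label.
orders = [
    {"order_id": 1, "amount": 20.0, "country": "DE", "is_fraud": False},
    {"order_id": 2, "amount": 950.0, "country": None, "is_fraud": True},
    {"order_id": 3, "amount": 35.5, "country": "US", "is_fraud": False},
    {"order_id": 4, "amount": 12.0, "country": "US"},
]

report = coverage_report(
    orders, target="is_fraud", features=["amount", "country", "device"]
)
for column, fraction in report.items():
    print(f"{column}: {fraction:.2f}")
```

In this sample the `device` column has a coverage of 0.00, which is the kind of gap that points to updating a system to log new data or to finding an external source.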


Here are the deliverables in this stage:

  • Charter document: A standard template is provided in the TDSP project structure definition. The charter document is a living document. You update it throughout the project as you make new discoveries and as business requirements change. The key is to iterate on this document, adding more detail as you progress through the discovery process. Keep the customer and other stakeholders involved in making the changes, and clearly communicate the reasons for the changes to them.
  • Data sources: The Raw data sources section of the Data definitions report, found in the TDSP project Data report folder, contains the data sources. This section specifies the original and destination locations for the raw data. In later stages, you fill in additional details, such as the scripts that move the data to your analytic environment.
  • Data dictionaries: This document provides descriptions of the data that the client provides. These descriptions include information about the schema (the data types and information on the validation rules, if any) and the entity-relation diagrams, if available.
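
A data dictionary can also be kept in a machine-readable form alongside the written document, so that the described data types and validation rules can be checked against incoming rows. The following is a minimal sketch under that assumption. The `FieldSpec` class, the `validate` function, and the three fields are hypothetical and are not a format that the TDSP prescribes.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional


@dataclass(frozen=True)
class FieldSpec:
    """One data dictionary entry: description, data type, optional rule."""
    description: str
    dtype: type
    rule: Optional[Callable[[Any], bool]] = None  # validation rule, if any


# Hypothetical dictionary for an orders table.
data_dictionary: Dict[str, FieldSpec] = {
    "order_id": FieldSpec("Unique order identifier", int),
    "amount": FieldSpec("Order total", float, lambda v: v >= 0),
    "country": FieldSpec("Two-letter country code", str, lambda v: len(v) == 2),
}


def validate(row: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems found in one row."""
    problems: List[str] = []
    for name, spec in data_dictionary.items():
        if name not in row:
            problems.append(f"{name}: missing")
        elif not isinstance(row[name], spec.dtype):
            problems.append(f"{name}: expected {spec.dtype.__name__}")
        elif spec.rule is not None and not spec.rule(row[name]):
            problems.append(f"{name}: failed validation rule")
    return problems


print(validate({"order_id": 1, "amount": 20.0, "country": "DE"}))
# []
print(validate({"order_id": 2, "amount": -5.0, "country": "USA"}))
# ['amount: failed validation rule', 'country: failed validation rule']
```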

Next steps

Here are links to each step in the lifecycle of the TDSP:

  1. Business understanding
  2. Data acquisition and understanding
  3. Modeling
  4. Deployment
  5. Customer acceptance

We provide full end-to-end walkthroughs that demonstrate all the steps in the process for specific scenarios. The Example walkthroughs article provides a list of the scenarios with links and thumbnail descriptions. The walkthroughs illustrate how to combine cloud, on-premises tools, and services into a workflow or pipeline to create an intelligent application.