What is the Team Data Science Process?

The Team Data Science Process (TDSP) is an agile, iterative data science methodology for delivering predictive analytics solutions and intelligent applications efficiently. TDSP improves team collaboration and learning by suggesting how team roles work best together. It includes best practices and structures from Microsoft and other industry leaders to help teams implement data science initiatives successfully. The goal is to help companies fully realize the benefits of their analytics program.

This article provides an overview of TDSP and its main components. The description here is generic, so the process can be implemented with a variety of tools. Additional linked topics describe in more detail the project tasks and roles involved in the lifecycle of the process. Guidance is also provided on how to implement TDSP with the specific set of Microsoft tools and infrastructure that we use in our own teams.

Key components of the TDSP

TDSP has the following key components:

  • A data science lifecycle definition
  • A standardized project structure
  • Infrastructure and resources recommended for data science projects
  • Tools and utilities recommended for project execution

Data science lifecycle

The Team Data Science Process (TDSP) provides a lifecycle to structure the development of your data science projects. The lifecycle outlines the complete set of steps that successful projects follow.

If you are using another data science lifecycle, such as CRISP-DM, KDD, or your organization's own custom process, you can still use the task-based TDSP in the context of those development lifecycles. At a high level, these different methodologies have much in common.

This lifecycle is designed for data science projects that ship as part of intelligent applications. These applications deploy machine learning or artificial intelligence models for predictive analytics. Exploratory data science projects and ad hoc analytics projects can also benefit from this process, although some of the steps described may not be needed in those cases.

The lifecycle outlines the major stages that projects typically execute, often iteratively:

  • Business Understanding
  • Data Acquisition and Understanding
  • Modeling
  • Deployment

Here is a visual representation of the Team Data Science Process lifecycle.


The goals, tasks, and documentation artifacts for each stage of the lifecycle in TDSP are described in the Team Data Science Process lifecycle topic. These tasks and artifacts are associated with project roles:

  • Solution architect
  • Project manager
  • Data engineer
  • Data scientist
  • Application developer
  • Project lead

The following diagram provides a grid view of the tasks (in blue) and artifacts (in green) associated with each stage of the lifecycle (on the horizontal axis) for these roles (on the vertical axis).


Standardized project structure

Having all projects share a directory structure and use templates for project documents makes it easy for team members to find information about their projects. All code and documents are stored in a version control system (VCS) such as Git, TFS, or Subversion to enable team collaboration. Tracking tasks and features in an agile project tracking system such as Jira, Rally, or Azure DevOps allows closer tracking of the code for individual features. Such tracking also enables teams to obtain better cost estimates. TDSP recommends creating a separate repository for each project on the VCS for versioning, information security, and collaboration. The standardized structure for all projects helps build institutional knowledge across the organization.

We provide templates for the folder structure and required documents in standard locations. This folder structure organizes the files that contain code for data exploration and feature extraction, and that record model iterations. These templates make it easier for team members to understand work done by others and to add new members to teams. Document templates are written in Markdown, so they are easy to view and update. Use templates to provide checklists with key questions for each project to ensure that the problem is well defined and that deliverables meet the expected quality. Examples include:

  • a project charter to document the business problem and scope of the project
  • data reports to document the structure and statistics of the raw data
  • model reports to document the derived features
  • model performance metrics such as ROC curves or MSE
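
The two metrics named in the last item can be computed in a few lines. The following is a minimal sketch in plain Python, not part of the TDSP templates; the function names (`mean_squared_error`, `roc_points`, `area_under_curve`) are hypothetical and chosen only for this illustration.

```python
from typing import List, Sequence, Tuple


def mean_squared_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Average of the squared differences between targets and predictions."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def roc_points(labels: Sequence[int], scores: Sequence[float]) -> List[Tuple[float, float]]:
    """Return (false positive rate, true positive rate) pairs.

    One point is produced per distinct score threshold, from the strictest
    threshold to the most permissive, starting at (0.0, 0.0).
    """
    positives = sum(1 for label in labels if label == 1)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative label")
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for l, s in zip(labels, scores) if s >= threshold and l == 1)
        fp = sum(1 for l, s in zip(labels, scores) if s >= threshold and l == 0)
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points: Sequence[Tuple[float, float]]) -> float:
    """Trapezoidal area under points ordered by increasing x."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


if __name__ == "__main__":
    print(mean_squared_error([3.0, -0.5, 2.0, 7.0], [2.5, 0.0, 2.0, 8.0]))
    curve = roc_points([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
    print(curve, area_under_curve(curve))
```

In a model report, the list returned by `roc_points` would typically be plotted, while the MSE and the area under the curve would be recorded as single numbers for each model iteration.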


The directory structure can be cloned from GitHub.
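
For teams that want to see what a shared skeleton amounts to before cloning the template, the idea can be sketched as a small script. This is only an illustration: the folder names in `DEFAULT_LAYOUT` and the `scaffold_project` helper are assumptions made for this example, and the actual template on GitHub may be organized differently.

```python
from pathlib import Path
from typing import Iterable, List

# Hypothetical layout for illustration only; the real template may differ.
DEFAULT_LAYOUT = (
    "code/data_acquisition_and_understanding",
    "code/modeling",
    "code/deployment",
    "docs/project",
    "docs/data_report",
    "docs/model",
    "sample_data",
)


def scaffold_project(root: Path, layout: Iterable[str] = DEFAULT_LAYOUT) -> List[Path]:
    """Create the folders in `layout` under `root`, each with a placeholder README.

    Existing folders and README files are left untouched, so the function
    can be run again safely on a project that already exists.
    """
    created = []
    for relative in layout:
        folder = root / relative
        folder.mkdir(parents=True, exist_ok=True)
        readme = folder / "README.md"
        if not readme.exists():
            readme.write_text(
                f"# {relative}\n\nDescribe the contents of this folder.\n",
                encoding="utf-8",
            )
        created.append(folder)
    return created


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        for folder in scaffold_project(Path(tmp) / "example_project"):
            print(folder)
```

Because every project starts from the same layout, a team member who knows one repository knows where to look in all of them.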

Infrastructure and resources for data science projects

TDSP provides recommendations for managing shared analytics and storage infrastructure such as:

  • cloud file systems for storing datasets
  • databases
  • big data (SQL or Spark) clusters
  • machine learning service

The analytics and storage infrastructure, where raw and processed datasets are stored, may be in the cloud or on-premises. This infrastructure enables reproducible analysis. It also avoids duplication, which can lead to inconsistencies and unnecessary infrastructure costs. Tools are provided to provision the shared resources, track them, and let each team member connect to them securely. It is also good practice to have project members create a consistent compute environment, so that different team members can replicate and validate experiments.
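
One lightweight way to check that compute environments are consistent is to compare a fingerprint of the interpreter and package versions across team members. The sketch below uses only the Python standard library; it is not a TDSP tool, and the helper names (`environment_snapshot`, `snapshot_digest`) and the example package list are assumptions for illustration.

```python
import hashlib
import json
import platform
from importlib import metadata
from typing import Dict, Iterable


def environment_snapshot(packages: Iterable[str]) -> Dict[str, object]:
    """Record the Python version and the installed version of each package."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return {"python": platform.python_version(), "packages": versions}


def snapshot_digest(snapshot: Dict[str, object]) -> str:
    """Hash a snapshot so two environments can be compared with one string."""
    canonical = json.dumps(snapshot, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    # The package list is an example; use whatever the project depends on.
    snap = environment_snapshot(["numpy", "pandas", "scikit-learn"])
    print(json.dumps(snap, indent=2))
    print(snapshot_digest(snap))
```

If two team members obtain the same digest for the same package list, their interpreter and package versions match; a differing digest is a prompt to compare the snapshots before trying to replicate an experiment.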

Here is an example of a team working on multiple projects and sharing various cloud analytics infrastructure components.


Tools and utilities for project execution

Introducing processes in most organizations is challenging. Tools that implement the data science process and lifecycle lower the barriers to adoption and make adoption more consistent. TDSP provides an initial set of tools and scripts to jump-start its adoption within a team. These tools also help automate common tasks in the data science lifecycle, such as data exploration and baseline modeling. A well-defined structure lets individuals contribute shared tools and utilities to their team's shared code repository, so that other projects within the team or organization can reuse them. Microsoft provides extensive tooling in Azure Machine Learning that supports both open-source tools (Python, R, ONNX, and common deep-learning frameworks) and Microsoft's own tooling (AutoML).
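
To give a sense of the kind of task such utilities automate, here is a minimal sketch of a data exploration summary and a baseline model in plain Python. It does not reproduce the TDSP utilities themselves; `summarize` and `MeanBaseline` are hypothetical names introduced only for this example.

```python
import statistics
from typing import Dict, List, Sequence


def summarize(values: Sequence[float]) -> Dict[str, float]:
    """Basic descriptive statistics for one numeric column."""
    if len(values) == 0:
        raise ValueError("values must not be empty")
    return {
        "count": len(values),
        "mean": statistics.fmean(values),
        "median": statistics.median(values),
        "min": min(values),
        "max": max(values),
    }


class MeanBaseline:
    """Baseline regressor that always predicts the mean of the training targets."""

    def fit(self, targets: Sequence[float]) -> "MeanBaseline":
        self.mean_ = statistics.fmean(targets)
        return self

    def predict(self, n_rows: int) -> List[float]:
        return [self.mean_] * n_rows


if __name__ == "__main__":
    column = [1.0, 2.0, 3.0, 4.0]
    print(summarize(column))
    print(MeanBaseline().fit(column).predict(2))
```

A baseline this simple is useful mainly as a reference point: a candidate model that cannot beat it on held-out data is not yet adding value.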

Next steps

Team Data Science Process: Roles and tasks outlines the key personnel roles and their associated tasks for a data science team that standardizes on this process.