Agile development of data science projects

This document describes how developers can execute a data science project in a systematic, version-controlled, and collaborative way within a project team by using the Team Data Science Process (TDSP). The TDSP is a framework developed by Microsoft that provides a structured sequence of activities for efficiently executing cloud-based predictive analytics solutions. For an outline of the roles and tasks handled by a data science team that standardizes on the TDSP, see Team Data Science Process roles and tasks.

This article includes instructions on how to:

  • Do sprint planning for work items involved in a project.
  • Add work items to sprints.
  • Create and use an agile-derived work item template that specifically aligns with TDSP lifecycle stages.

The following instructions outline the steps needed to set up a TDSP team environment using Azure Boards and Azure Repos in Azure DevOps. The instructions use Azure DevOps because that is how TDSP is implemented at Microsoft. If your group uses a different code hosting platform, the team lead tasks generally don't change, but the way to complete them does. For example, linking a work item with a Git branch might not work the same way in GitHub as it does in Azure Repos.

The following figure illustrates a typical sprint planning, coding, and source-control workflow for a data science project:

Team Data Science Process

Work item types

In the TDSP sprint planning framework, there are four frequently used work item types: Features, User Stories, Tasks, and Bugs. The backlog for all work items is at the project level, not the Git repository level.

Here are the definitions for the work item types:

  • Feature: A Feature corresponds to a project engagement. Different engagements with a client are different Features, and it's best to consider different phases of a project as different Features. If you choose a schema such as <ClientName>-<EngagementName> to name your Features, you can easily recognize the context of the project and engagement from the names themselves.

  • User Story: User Stories are work items needed to complete a Feature end-to-end. Examples of User Stories include:

    • Get data
    • Explore data
    • Generate features
    • Build models
    • Operationalize models
    • Retrain models
  • Task: Tasks are assignable work items that need to be done to complete a specific User Story. For example, Tasks in the User Story Get data could be:

    • Get SQL Server credentials
    • Upload data to SQL Data Warehouse
  • Bug: Bugs are issues in existing code or documents that must be fixed to complete a Task. If Bugs are caused by missing work items, they can be escalated to User Stories or Tasks.
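The hierarchy described above can be sketched as a small data model. This is only an illustration, not part of TDSP or Azure Boards: the class names, the feature_name helper, and the "Contoso" and "ChurnPrediction" values are hypothetical, while the User Story and Task titles come from the examples in the list.

```python
from dataclasses import dataclass, field
from typing import List


def feature_name(client_name: str, engagement_name: str) -> str:
    """Build a Feature title using the <ClientName>-<EngagementName> scheme."""
    return f"{client_name}-{engagement_name}"


@dataclass
class Task:
    title: str


@dataclass
class UserStory:
    title: str
    tasks: List[Task] = field(default_factory=list)


@dataclass
class Feature:
    title: str
    stories: List[UserStory] = field(default_factory=list)


# Hypothetical engagement; the client and engagement names are made up.
feature = Feature(
    title=feature_name("Contoso", "ChurnPrediction"),
    stories=[
        UserStory(
            title="Get data",
            tasks=[
                Task("Get SQL Server credentials"),
                Task("Upload data to SQL Data Warehouse"),
            ],
        ),
        UserStory(title="Explore data"),
    ],
)

# Print the Feature and a task count for each of its User Stories.
print(feature.title)
for story in feature.stories:
    print(f"  {story.title}: {len(story.tasks)} task(s)")
```

Keeping the naming scheme in one helper makes it easier to apply consistently across engagements.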

Data scientists may feel more comfortable using an agile template that replaces Features, User Stories, and Tasks with TDSP lifecycle stages and substages. To create an agile-derived template that specifically aligns with the TDSP lifecycle stages, see Use an agile TDSP work template.

Note

TDSP borrows the concepts of Features, User Stories, Tasks, and Bugs from software code management (SCM). The TDSP concepts might differ slightly from their conventional SCM definitions.

Plan sprints

Many data scientists are engaged with multiple projects, which can take months to complete and proceed at different paces. Sprint planning is useful for project prioritization, and for resource planning and allocation. In Azure Boards, you can easily create, manage, and track work items for your projects, and conduct sprint planning to ensure projects are moving forward as expected.

For more information about sprint planning, see Scrum sprints.

For more information about sprint planning in Azure Boards, see Assign backlog items to a sprint.

Add a Feature to the backlog

After your project and project code repository are created, you can add a Feature to the backlog to represent the work for your project.

  1. From your project page, select Boards > Backlogs in the left navigation.

  2. On the Backlog tab, if the work item type in the top bar is Stories, open the drop-down list and select Features. Then select New Work Item.

    Select New Work Item

  3. Enter a title for the Feature, usually your project name, and then select Add to top.

    Enter a title and select Add to top

  4. From the Backlog list, select and open the new Feature. Fill in the description, assign a team member, and set planning parameters.

    You can also link the Feature to the project's Azure Repos code repository by selecting Add link under the Development section.

    After you edit the Feature, select Save & Close.

    Edit the Feature and select Save & Close

Add a User Story to the Feature

Under the Feature, you can add User Stories to describe the major steps needed to complete the project.

To add a new User Story to a Feature:

  1. On the Backlog tab, select the + to the left of the Feature.

    Add a new User Story under the Feature

  2. Give the User Story a title, and edit details such as assignment, status, description, comments, planning, and priority.

    You can also link the User Story to a branch of the project's Azure Repos code repository by selecting Add link under the Development section. Select the repository and branch you want to link the work item to, and then select OK.

    Add link

  3. When you're finished editing the User Story, select Save & Close.

Add a Task to a User Story

Tasks are the specific, detailed steps needed to complete each User Story. After all Tasks of a User Story are completed, the User Story should be complete too.
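The completion rule above can be expressed as a simple check. This is a minimal sketch: the function name is hypothetical, and the state names (New, Active, Closed, Removed) are assumed to match the Task states of the stock Agile process, so adjust them if your process uses different states.

```python
from typing import Iterable


def story_can_close(task_states: Iterable[str]) -> bool:
    """Return True when every remaining Task of a User Story is Closed.

    Assumption: Tasks in the "Removed" state are ignored, and a User Story
    with no remaining Tasks is not treated as complete.
    """
    remaining = [state for state in task_states if state != "Removed"]
    return bool(remaining) and all(state == "Closed" for state in remaining)


# Hypothetical task states for two User Stories.
print(story_can_close(["Closed", "Closed", "Removed"]))
print(story_can_close(["Closed", "Active"]))
```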

To add a Task to a User Story, select the + next to the User Story item, and select Task. Fill in the title and other information in the Task.

Add a Task to a User Story

After you create Features, User Stories, and Tasks, you can view them in the Backlogs or Boards views to track their status.

Backlogs view

Boards view
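The same work items can also be created programmatically instead of through the Azure Boards UI. The sketch below only builds the request and sends nothing. It assumes the Azure DevOps REST API work item creation endpoint and its JSON Patch body format as I understand them, so check the endpoint shape and API version against the current REST API reference before relying on it. The function name and the organization, project, and parent ID values are hypothetical, and authentication is omitted.

```python
import json
from typing import Dict, Optional, Tuple
from urllib.parse import quote

API_VERSION = "7.0"  # assumed API version; verify against the REST reference


def build_create_request(
    organization: str,
    project: str,
    work_item_type: str,
    title: str,
    parent_id: Optional[int] = None,
) -> Tuple[str, Dict[str, str], str]:
    """Build the URL, headers, and JSON Patch body for creating a work item."""
    base = f"https://dev.azure.com/{quote(organization)}"
    # The work item type goes in the path after a literal "$".
    url = (
        f"{base}/{quote(project)}/_apis/wit/workitems/"
        f"${quote(work_item_type)}?api-version={API_VERSION}"
    )
    patch = [{"op": "add", "path": "/fields/System.Title", "value": title}]
    if parent_id is not None:
        # Link the new item to its parent (for example, a Task to a User Story).
        patch.append(
            {
                "op": "add",
                "path": "/relations/-",
                "value": {
                    "rel": "System.LinkTypes.Hierarchy-Reverse",
                    "url": f"{base}/_apis/wit/workItems/{parent_id}",
                },
            }
        )
    headers = {"Content-Type": "application/json-patch+json"}
    return url, headers, json.dumps(patch)


# Hypothetical organization, project, and parent work item ID.
url, headers, body = build_create_request(
    "contoso", "churn", "Task", "Get SQL Server credentials", parent_id=12
)
print(url)
print(body)
```

Building the request separately from sending it keeps the sketch testable without network access or credentials.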

Use an agile TDSP work template

Data scientists may feel more comfortable using an agile template that replaces Features, User Stories, and Tasks with TDSP lifecycle stages and substages. In Azure Boards, you can create an agile-derived template that uses TDSP lifecycle stages to create and track work items. The following steps walk through setting up a data science-specific agile process template and creating data science work items based on the template.

Set up an Agile Data Science Process template

  1. From your Azure DevOps organization main page, select Organization settings from the left navigation.

  2. In the Organization Settings left navigation, under Boards, select Process.

  3. In the All processes pane, select the ... next to Agile, and then select Create inherited process.

    Create an inherited process from Agile

  4. In the Create inherited process from Agile dialog, enter the name AgileDataScienceProcess, and select Create process.

    Create the AgileDataScienceProcess process

  5. In All processes, select the new AgileDataScienceProcess.

  6. On the Work item types tab, disable Epic, Feature, User Story, and Task by selecting the ... next to each item and then selecting Disable.

    Disable work item types

  7. In All processes, select the Backlog levels tab. Under Portfolio backlogs, select the ... next to Epic (disabled), and then select Edit/Rename.

  8. In the Edit backlog level dialog box:

    1. Under Name, replace Epic with TDSP Projects.
    2. Under Work item types on this backlog level, select New work item type, enter TDSP Project, and select Add.
    3. Under Default work item type, open the drop-down list and select TDSP Project.
    4. Select Save.

    Set the Portfolio backlog level

  9. Follow the same steps to rename Features to TDSP Stages, and add the following new work item types:

    • Business Understanding
    • Data Acquisition
    • Modeling
    • Deployment
  10. Under Requirement backlog, rename Stories to TDSP Substages, add the new work item type TDSP Substage, and set the default work item type to TDSP Substage.

  11. Under Iteration backlog, add a new work item type TDSP Task, and set it to be the default work item type.

After you complete the steps, the backlog levels should look like this:

TDSP template backlog levels
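For reference, the result of the renaming steps can be summarized as a small lookup table. This is only a sketch of the configuration described in the steps above, not an export from Azure DevOps. The BacklogLevel type and its field names are hypothetical, and a value of None marks something the steps don't specify.

```python
from typing import List, NamedTuple, Optional


class BacklogLevel(NamedTuple):
    original_name: str           # name before the steps above
    tdsp_name: Optional[str]     # new name, or None if the steps don't rename it
    work_item_types: List[str]   # work item types added at this level
    default_type: Optional[str]  # default type, or None if the steps don't set one


TDSP_BACKLOG_LEVELS = [
    BacklogLevel("Epic", "TDSP Projects", ["TDSP Project"], "TDSP Project"),
    BacklogLevel(
        "Features",
        "TDSP Stages",
        ["Business Understanding", "Data Acquisition", "Modeling", "Deployment"],
        None,
    ),
    BacklogLevel("Stories", "TDSP Substages", ["TDSP Substage"], "TDSP Substage"),
    BacklogLevel("Iteration backlog", None, ["TDSP Task"], "TDSP Task"),
]

# Print each level under the name it has after the steps are complete.
for level in TDSP_BACKLOG_LEVELS:
    shown_name = level.tdsp_name or level.original_name
    print(f"{shown_name}: {', '.join(level.work_item_types)}")
```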

Create Agile Data Science Process work items

You can use the data science process template to create TDSP projects and track work items that correspond to TDSP lifecycle stages.

  1. From your Azure DevOps organization main page, select New project.

  2. In the Create new project dialog, give your project a name, and then select Advanced.

  3. Under Work item process, open the drop-down list and select AgileDataScienceProcess, and then select Create.

    Create a TDSP project

  4. In the newly created project, select Boards > Backlogs in the left navigation.

  5. To make TDSP Projects visible, select the Configure team settings icon. In the Settings screen, select the TDSP Projects check box, and then select Save and close.

    Select the TDSP Projects check box

  6. To create a data science-specific TDSP Project, select TDSP Projects in the top bar, and then select New work item.

  7. In the popup, give the TDSP Project work item a name, and select Add to top.

    Create a data science project work item

  8. To add a work item under the TDSP Project, select the + next to the project, and then select the type of work item to create.

    Select a data science work item type

  9. Fill in the details in the new work item, and select Save & Close.

  10. Continue to select the + symbols next to work items to add new TDSP Stages, Substages, and Tasks.

Here is an example of how the data science project work items should appear in the Backlogs view:

Data science project work items in the Backlogs view
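As a rough text analogue of that view, the sketch below prints a nested set of TDSP work items as an indented tree. The project, stage, substage, and task titles are hypothetical and are not taken from the screenshot. Only the work item type names come from the template steps in this article.

```python
from typing import List, Tuple

# Each node is (work item type, title, children).
Node = Tuple[str, str, list]


def render(node: Node, depth: int = 0) -> List[str]:
    """Return one indented line per work item, parents before children."""
    item_type, title, children = node
    lines = [f"{'  ' * depth}[{item_type}] {title}"]
    for child in children:
        lines.extend(render(child, depth + 1))
    return lines


# Hypothetical TDSP Project with one stage, one substage, and two tasks.
example: Node = (
    "TDSP Project",
    "Contoso-ChurnPrediction",
    [
        (
            "Data Acquisition",
            "Acquire customer data",
            [
                (
                    "TDSP Substage",
                    "Ingest data",
                    [
                        ("TDSP Task", "Get SQL Server credentials", []),
                        ("TDSP Task", "Upload data to SQL Data Warehouse", []),
                    ],
                )
            ],
        )
    ],
)

print("\n".join(render(example)))
```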

Next steps

Collaborative coding with Git describes how to do collaborative code development for data science projects using Git as the shared code development framework, and how to link these coding activities to the work planned with the agile process.

Example walkthroughs lists walkthroughs of specific scenarios, with links and thumbnail descriptions. The linked scenarios illustrate how to combine cloud and on-premises tools and services into workflows or pipelines to create intelligent applications.

Additional resources on agile processes: