Data science code testing on Azure with the Team Data Science Process and Azure DevOps Services

This article gives preliminary guidelines for testing code in a data science workflow. Such testing gives data scientists a systematic and efficient way to check the quality and expected outcome of their code. To show how code testing can be done, we use a previously published Team Data Science Process (TDSP) project that is built on the UCI Adult Income dataset.

Introduction to code testing

"Unit testing" is a longstanding practice in software development. But for data science, it's often not clear what "unit testing" means or how you should test code for the different stages of a data science lifecycle, such as:

  • Data preparation
  • Data quality examination
  • Modeling
  • Model deployment

This article replaces the term "unit testing" with "code testing." Here, code testing means writing functions that assess whether the code for a given step of the data science lifecycle produces results "as expected." The person who writes the test defines what "as expected" means, depending on the purpose of the function being tested, for example a data quality check or modeling.

This article provides references as useful resources.

Azure DevOps for the testing framework

This article describes how to perform and automate testing by using Azure DevOps. You might decide to use alternative tools. We also show how to set up an automatic build by using Azure DevOps and build agents. For build agents, we use Azure Data Science Virtual Machines (DSVMs).

Flow of code testing

The overall workflow of testing code in a data science project looks like this:

[Figure: Flow chart of code testing]

Detailed steps

Use the following steps to set up and run code testing and an automated build by using a build agent and Azure DevOps:

  1. Create a project in the Visual Studio desktop application:

    [Figure: "Create new project" screen in Visual Studio]

    After you create your project, you'll find it in Solution Explorer in the right pane:

    [Figure: Steps for creating a project]


  2. Feed your project code into the Azure DevOps project code repository:

    [Figure: Project code repository]

  3. Suppose you've done some data preparation work, such as data ingestion, feature engineering, and creating label columns. You want to make sure your code is generating the results that you expect. Here are some checks that you can use to test whether the data-processing code is working properly:

    • Check that column names are right:

      [Figure: Code for matching column names]

    • Check that response levels are right:

      [Figure: Code for matching levels]

    • Check that the response percentage is reasonable:

      [Figure: Code for checking the response percentage]

    • Check the missing-value rate of each column in the data:

      [Figure: Code for checking the missing-value rate]
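The four checks above could be sketched roughly as follows. This is an illustrative sketch, not the code from the original project: the function names, the sample rows, the column names, and the thresholds are all assumptions, and plain Python rows are used instead of a data frame so the example has no dependencies.

```python
# Hypothetical helpers: names, columns, and thresholds are illustrative only.

def check_column_names(rows, expected_columns):
    """Return True if every row has exactly the expected column names."""
    expected = set(expected_columns)
    return all(set(row) == expected for row in rows)


def check_response_levels(rows, label_column, expected_levels):
    """Return True if the label column contains exactly the expected levels."""
    observed = {row[label_column] for row in rows}
    return observed == set(expected_levels)


def check_response_percentage(rows, label_column, positive_level, low, high):
    """Return True if the share of positive labels lies within [low, high]."""
    positives = sum(1 for row in rows if row[label_column] == positive_level)
    return low <= positives / len(rows) <= high


def missing_rate_by_column(rows, columns):
    """Return the fraction of missing (None or empty string) values per column."""
    return {
        col: sum(1 for row in rows if row.get(col) in (None, "")) / len(rows)
        for col in columns
    }


# Tiny made-up sample, shaped loosely like an income dataset.
sample_rows = [
    {"age": 39, "education": "Bachelors", "income": "<=50K"},
    {"age": 50, "education": None, "income": "<=50K"},
    {"age": 38, "education": "HS-grad", "income": ">50K"},
    {"age": 53, "education": "11th", "income": "<=50K"},
]
columns = ["age", "education", "income"]

columns_ok = check_column_names(sample_rows, columns)
levels_ok = check_response_levels(sample_rows, "income", ["<=50K", ">50K"])
percentage_ok = check_response_percentage(sample_rows, "income", ">50K", 0.1, 0.5)
missing_rates = missing_rate_by_column(sample_rows, columns)
```

Each function returns a plain value rather than raising, so the same helpers can be reused inside whatever test framework the project adopts. What counts as a "reasonable" percentage or an acceptable missing-value rate is a project decision made by the person writing the test.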

  4. After you've done the data processing and feature engineering work, and you've trained a good model, make sure that the model you trained can score new datasets correctly. You can use the following two tests to check the prediction levels and the distribution of label values:

    • Check prediction levels:

      [Figure: Code for checking prediction levels]

    • Check the distribution of prediction values:

      [Figure: Code for checking prediction values]
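A minimal sketch of the two prediction checks might look like the following. The function names, the example predictions, and the accepted range for the positive share are assumptions made for illustration; the original project's code may differ.

```python
from collections import Counter

# Hypothetical helpers: names and thresholds are illustrative only.

def check_prediction_levels(predictions, expected_levels):
    """Return True if every predicted label is one of the expected levels."""
    return set(predictions) <= set(expected_levels)


def prediction_distribution(predictions):
    """Return the share of each predicted level as a dict of fractions."""
    counts = Counter(predictions)
    total = len(predictions)
    return {level: count / total for level, count in counts.items()}


def check_prediction_distribution(predictions, positive_level, low, high):
    """Return True if the positive share of predictions lies within [low, high]."""
    share = prediction_distribution(predictions).get(positive_level, 0.0)
    return low <= share <= high


# Made-up model output for five scored rows.
predictions = ["<=50K", "<=50K", ">50K", "<=50K", ">50K"]

levels_ok = check_prediction_levels(predictions, ["<=50K", ">50K"])
distribution = prediction_distribution(predictions)
distribution_ok = check_prediction_distribution(predictions, ">50K", 0.15, 0.5)
```

The level check here uses a subset test, so a batch in which the model predicts only one class still passes. If every level must appear in every scored batch, compare the two sets for equality instead.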

  5. Put all test functions together into a Python script called test_funcs.py:

    [Figure: Python script for test functions]

  6. After the test code is prepared, you can set up the testing environment in Visual Studio.

    Create a Python file called test1.py. In this file, create a class that includes all the tests you want to run. The following example shows six prepared tests:

    [Figure: Python file with a list of tests in a class]

  7. Those tests can be automatically discovered if your class inherits from unittest.TestCase, that is, if you put unittest.TestCase in parentheses after the class name. Open Test Explorer in the right pane, and select Run All. All the tests run sequentially, and Test Explorer reports whether each test succeeded.

    [Figure: Running the tests]
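As a rough illustration of how discovery works, the sketch below defines a small class that inherits from unittest.TestCase and asks the standard unittest loader which methods it would collect. The class name, the test names, and the sample labels are hypothetical; a real test1.py would call the project's own test functions.

```python
import unittest


class TestDataChecks(unittest.TestCase):
    """Hypothetical test class; a real test1.py would call the project's checks."""

    def setUp(self):
        # Made-up labels standing in for the prepared dataset.
        self.labels = ["<=50K", "<=50K", ">50K", "<=50K"]

    def test_response_levels(self):
        self.assertEqual(set(self.labels), {"<=50K", ">50K"})

    def test_response_percentage(self):
        share = self.labels.count(">50K") / len(self.labels)
        self.assertTrue(0.1 <= share <= 0.5)


# Only methods whose names start with "test" are collected by the loader.
loader = unittest.TestLoader()
discovered = loader.getTestCaseNames(TestDataChecks)
suite = loader.loadTestsFromTestCase(TestDataChecks)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Methods that don't start with "test", such as setUp, are treated as fixtures or helpers and are not reported as separate tests.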

  8. Check in your code to the project repository by using Git commands. Your most recent work will be reflected shortly in Azure DevOps.

    [Figure: Git commands for checking in code]

    [Figure: Most recent work in Azure DevOps]

  9. Set up automatic build and test in Azure DevOps:

    a. In the project repository, select Build and Release, and then select +New to create a new build process.

    ![Selections for starting a new build process](./media/code-test/create_new_build.PNG)
    

    b. Follow the prompts to select your source code location, project name, repository, and branch information.

    ![Source, name, repository, and branch information](./media/code-test/fill_in_build_info.PNG)
    

    c. Select a template. Because there's no Python project template, start by selecting Empty process.

    ![List of templates and "Empty process" button](./media/code-test/start_empty_process_template.PNG)
    

    d. Name the build and select the agent. You can choose the default here if you want to use a DSVM to complete the build process. For more information about setting up agents, see Build and release agents.

    ![Build and agent selections](./media/code-test/select_agent.PNG)
    

    e. Select + in the left pane to add a task for this build phase. Because we're going to run the Python script test1.py to complete all the checks, this task uses a PowerShell command to run Python code.

    !["Add tasks" pane with PowerShell selected](./media/code-test/add_task_powershell.PNG)
    

    f. In the PowerShell details, fill in the required information, such as the name and version of PowerShell. Choose Inline Script as the type.

    In the box under **Inline Script**, you can type **python test1.py**. Make sure the environment variable is set up correctly for Python. If you need a different version or kernel of Python, you can explicitly specify the path as shown in the figure:
    
    ![PowerShell details](./media/code-test/powershell_scripts.PNG)
    

    g. Select Save & queue to complete the build pipeline process.

    !["Save & queue" button](./media/code-test/save_and_queue_build_definition.PNG)
    

Now every time a new commit is pushed to the code repository, the build process starts automatically. You can define any branch as the one that triggers the build. The process runs the test1.py file on the agent machine to make sure that everything defined in the code runs correctly.
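For an automated run, it helps if the test script reports failure through its process exit code. The sketch below shows one way to do that with the standard unittest runner. The class and function names are hypothetical, and how a given build task interprets exit codes or output on standard error depends on how that task is configured.

```python
import unittest


class TestSmoke(unittest.TestCase):
    """Placeholder test; a hypothetical stand-in for the checks in test1.py."""

    def test_label_levels(self):
        self.assertEqual({"<=50K", ">50K"}, {"<=50K", ">50K"})


def run_tests(test_case):
    """Run one TestCase class and return a process exit code (0 means success)."""
    suite = unittest.TestLoader().loadTestsFromTestCase(test_case)
    result = unittest.TextTestRunner(verbosity=1).run(suite)
    return 0 if result.wasSuccessful() else 1


exit_code = run_tests(TestSmoke)
# In a real script, finish with: raise SystemExit(exit_code)
```

Calling unittest.main() at the end of the script has a similar effect, because by default it exits with a nonzero status when any test fails.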

If alerts are set up correctly, you'll be notified by email when the build is finished. You can also check the build status in Azure DevOps. If the build fails, you can check its details to find out which piece is broken.

[Figure: Email notification of build success]

[Figure: Azure DevOps notification of build success]

Next steps

  • See the UCI income prediction repository for concrete examples of unit tests for data science scenarios.
  • Follow the preceding outline and examples from the UCI income prediction scenario in your own data science projects.

References