Data acquisition and understanding stage of the Team Data Science Process

This article outlines the goals, tasks, and deliverables associated with the data acquisition and understanding stage of the Team Data Science Process (TDSP). The TDSP provides a recommended lifecycle that you can use to structure your data science projects. The lifecycle outlines the major stages that projects typically execute, often iteratively:

  1. Business understanding
  2. Data acquisition and understanding
  3. Modeling
  4. Deployment
  5. Customer acceptance

Here is a visual representation of the TDSP lifecycle:

[Figure: TDSP lifecycle]

Goals

  • Produce a clean, high-quality data set whose relationship to the target variables is understood. Locate the data set in the appropriate analytics environment so that you are ready to model.
  • Develop a solution architecture for the data pipeline that refreshes and scores the data regularly.

How to do it

This stage addresses three main tasks:

  • Ingest the data into the target analytics environment.
  • Explore the data to determine whether the data quality is adequate to answer the question.
  • Set up a data pipeline to score new or regularly refreshed data.

Ingest the data

Set up the process to move the data from the source locations to the target locations where you run analytics operations, such as training and predictions. For technical details and options on how to move the data with various Azure data services, see Load data into storage environments for analytics.
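
As a minimal, service-agnostic sketch of the ingestion step, the following Python snippet copies rows from a CSV source into a SQLite table. The in-memory SQLite database only stands in for the target analytics environment, and the table name, column names, and sample data are hypothetical. A real project would more likely use one of the Azure data services referenced above.

```python
import csv
import io
import sqlite3


def ingest_csv(source, conn, table):
    """Copy rows from a CSV file-like object into a SQLite table.

    Every column is stored as TEXT; type conversion is left to the
    exploration step. Rows whose field count does not match the header
    are skipped. Returns the number of rows loaded.
    """
    reader = csv.reader(source)
    header = next(reader)
    columns = ", ".join(f'"{name}" TEXT' for name in header)
    placeholders = ", ".join("?" for _ in header)
    conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({columns})')
    rows = [row for row in reader if len(row) == len(header)]
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', rows)
    conn.commit()
    return len(rows)


# Hypothetical source data; in practice this would be a file or an export.
SAMPLE_CSV = "customer_id,age,churned\n1,34,0\n2,,1\n3,51,0\n"

conn = sqlite3.connect(":memory:")  # stands in for the target environment
loaded = ingest_csv(io.StringIO(SAMPLE_CSV), conn, "customers")
```

Storing everything as text keeps the load step simple and defers decisions about types and missing values to data exploration, where they can be audited.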

Explore the data

Before you train your models, you need to develop a sound understanding of the data. Real-world data sets are often noisy, are missing values, or have a host of other discrepancies. You can use data summarization and visualization to audit the quality of your data and to get the information you need to process the data before it's ready for modeling. This process is often iterative.

TDSP provides an automated utility, called IDEAR, to help visualize the data and prepare data summary reports. We recommend that you start with IDEAR to explore the data interactively and develop an initial understanding of it without writing code. Then you can write custom code for data exploration and visualization. For guidance on cleaning the data, see Tasks to prepare data for enhanced machine learning.
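
Custom exploration code can start very small. The following sketch is not IDEAR and does not reproduce its output; it is a hypothetical helper that computes a few basic quality statistics (missing values, distinct values, and numeric range) for each column of a tabular data set held as a list of dictionaries. The function name and the sample records are assumptions made for illustration.

```python
import statistics


def summarize(rows, missing=("", None)):
    """Return per-column quality statistics for a list of dict records."""
    columns = sorted({key for row in rows for key in row})
    report = {}
    for col in columns:
        values = [row.get(col) for row in rows]
        present = [v for v in values if v not in missing]
        stats = {
            "missing": len(values) - len(present),
            "distinct": len(set(present)),
        }
        # Add numeric statistics only if every present value parses as a number.
        numeric = []
        for v in present:
            try:
                numeric.append(float(v))
            except (TypeError, ValueError):
                break
        if present and len(numeric) == len(present):
            stats.update(
                min=min(numeric),
                max=max(numeric),
                mean=statistics.fmean(numeric),
            )
        report[col] = stats
    return report


# Hypothetical records, as they might look right after ingestion.
records = [
    {"age": "34", "plan": "basic"},
    {"age": "", "plan": "premium"},
    {"age": "51", "plan": "basic"},
]
report = summarize(records)
```

Here `report["age"]` records one missing value alongside the numeric range, while `report["plan"]` contains only the missing and distinct counts because the column is not numeric.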

After you're satisfied with the quality of the cleansed data, the next step is to better understand the patterns that are inherent in the data. This understanding helps you choose and develop an appropriate predictive model for your target. Look for evidence of how well connected the data is to the target. Then determine whether there is sufficient data to move forward with the next modeling steps. Again, this process is often iterative. You might need to find new data sources with more accurate or more relevant data to augment the data set initially identified in the previous stage.
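
One simple way to look for evidence of a connection between the data and the target is to rank numeric features by their correlation with it. The sketch below uses the Pearson correlation coefficient, which is a choice made for this example rather than a method the TDSP prescribes. It captures only linear relationships, and the feature names and values are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)


def rank_features(features, target):
    """Sort features by absolute correlation with the target, strongest first."""
    scores = {name: pearson(values, target) for name, values in features.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


# Hypothetical feature columns and target values.
features = {
    "tenure_months": [1, 3, 5, 7, 9],
    "support_calls": [4, 1, 3, 0, 2],
    "region_code": [2, 2, 2, 2, 2],
}
target = [10, 20, 30, 40, 50]
ranking = rank_features(features, target)
```

A feature that ranks low here might still matter through a nonlinear relationship or an interaction, so a ranking like this is a starting point for exploration rather than a final selection.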

Set up a data pipeline

In addition to the initial ingestion and cleaning of the data, you typically need to set up a process to score new data or refresh the data regularly as part of an ongoing learning process. You do this by setting up a data pipeline or workflow. The article Move data from an on-premises SQL Server instance to Azure SQL Database with Azure Data Factory gives an example of how to set up a pipeline with Azure Data Factory.

In this stage, you develop a solution architecture for the data pipeline. You develop the pipeline itself in parallel with the next stage of the data science project. Depending on your business needs and the constraints of the existing systems that this solution integrates with, the pipeline can be one of the following:

  • Batch-based
  • Streaming or real time
  • A hybrid
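
The difference between these pipeline types can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a production design: `toy_model` is a hypothetical stand-in for a trained model, and treating fixed-size micro-batches as the hybrid case is only one possible reading of that option.

```python
from itertools import islice


def score_batch(records, model):
    """Batch-based: score a complete collection of records in one pass."""
    return [model(record) for record in records]


def score_stream(records, model):
    """Streaming: yield each score as soon as its record arrives."""
    for record in records:
        yield model(record)


def score_micro_batches(records, model, size):
    """Hybrid (one interpretation): read a stream, score fixed-size chunks."""
    iterator = iter(records)
    while chunk := list(islice(iterator, size)):
        yield score_batch(chunk, model)


def toy_model(record):
    """Hypothetical stand-in for a trained model."""
    return 1 if record["support_calls"] > 2 else 0


# Hypothetical new data to score.
new_data = [{"support_calls": n} for n in (0, 5, 3, 1, 4)]
batch_scores = score_batch(new_data, toy_model)
stream_scores = list(score_stream(iter(new_data), toy_model))
chunked_scores = list(score_micro_batches(new_data, toy_model, size=2))
```

All three functions produce the same scores. They differ in when results become available and in how much data must be held in memory at once, which is usually what business needs and system constraints decide.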

Artifacts

The following are the deliverables in this stage:

  • Data quality report: This report includes data summaries, the relationships between each attribute and the target, variable ranking, and more. The IDEAR tool provided as part of TDSP can quickly generate this report on any tabular data set, such as a CSV file or a relational table.
  • Solution architecture: The solution architecture can be a diagram or description of the data pipeline that you use to run scoring or predictions on new data after you have built a model. It also contains the pipeline to retrain your model based on new data. Store the document in the Project directory when you use the TDSP directory structure template.
  • Checkpoint decision: Before you begin full feature engineering and model building, you can reevaluate the project to determine whether the expected value is sufficient to continue pursuing it. You might, for example, be ready to proceed, need to collect more data, or abandon the project because the data needed to answer the question does not exist.

Next steps

Here are links to each step in the lifecycle of the TDSP:

  1. Business understanding
  2. Data acquisition and understanding
  3. Modeling
  4. Deployment
  5. Customer acceptance

We also provide full end-to-end walkthroughs that demonstrate all the steps in the process for specific scenarios. The Example walkthroughs article provides a list of the scenarios with links and thumbnail descriptions. The walkthroughs illustrate how to combine cloud tools, on-premises tools, and services into a workflow or pipeline to create an intelligent application.

For examples of how to execute steps in the TDSP by using Azure Machine Learning Studio, see Use the TDSP with Azure Machine Learning.