Team Data Science Process for data scientists

This article provides guidance on a set of objectives that are typically used to implement comprehensive data science solutions with Azure technologies. You are guided through:

  • understanding an analytics workload
  • using the Team Data Science Process
  • using Azure Machine Learning
  • understanding the foundations of data transfer and storage
  • providing data source documentation
  • using tools for analytics processing

These training materials cover the Team Data Science Process (TDSP) and the Microsoft and open-source software and toolkits that are helpful for envisioning, executing, and delivering data science solutions.

Lesson Path

You can use the items in the following table to guide your own self-study. Read the Description column to follow the path, click on the Topic links for study references, and check your skills using the Knowledge Check column.

| Objective | Topic | Description | Knowledge Check |
| --- | --- | --- | --- |
| Understand the processes for developing analytic projects | An introduction to the Team Data Science Process | We begin with an overview of the Team Data Science Process (TDSP). This process guides you through each step of an analytics project. Read through each of these sections to learn more about the process and how you can implement it. | Review and download the TDSP Project Structure artifacts to your local machine for your project. |
| | Agile Development | The Team Data Science Process works well with many different programming methodologies. In this Learning Path, we use Agile software development. Read through the “What is Agile Development?” and “Building Agile Culture” articles, which cover the basics of working with Agile. There are also other references at this site where you can learn more. | Explain Continuous Integration and Continuous Delivery to a colleague. |
| | DevOps for Data Science | Developer Operations (DevOps) involves people, processes, and platforms you can use to work through a project and integrate your solution into an organization's standard IT. This integration is essential for adoption, safety, and security. In this online course, you learn about DevOps practices and some of the toolchain options you have. | Prepare a 30-minute presentation to a technical audience on how DevOps is essential for analytics projects. |
| Understand the technologies for data storage and processing | Microsoft Business Analytics and AI | We focus on a few technologies in this Learning Path that you can use to create an analytics solution, but Microsoft has many more. To understand the options you have, it’s important to review the platforms and features available in Microsoft Azure, the Azure Stack, and on-premises options. Review this resource to learn about the various tools you have available to answer analytics questions. | Download and review the presentation materials from this workshop. |
| Set up and configure your training, development, and production environments | Microsoft Azure | Now let’s create an account in Microsoft Azure for training and learn how to create development and test environments. These free training resources get you started. Complete the “Beginner” and “Intermediate” paths. | If you do not have an Azure account, create one. Log in to the Microsoft Azure portal and create one Resource Group for training. |
| | The Microsoft Azure Command-Line Interface (CLI) | There are multiple ways of working with Microsoft Azure: graphical tools like VSCode and Visual Studio, web interfaces such as the Azure portal, and the command line, such as Azure PowerShell commands and functions. In this article, we cover the Command-Line Interface (CLI), which you can use locally on your workstation, in Windows and other operating systems, as well as in the Azure portal. | Set your default subscription with the Azure CLI. |
| | Microsoft Azure Storage | You need a place to store your data. In this article, you learn about Microsoft Azure’s storage options, how to create a storage account, and how to copy or move data to the cloud. Read through this introduction to learn more. | Create a Storage Account in your training Resource Group, create a container for a Blob object, and upload and download data. |
| | Microsoft Azure Active Directory | Microsoft Azure Active Directory (AAD) forms the basis of securing your application. In this article, you learn more about accounts, rights, and permissions. Active Directory and security are complex topics, so just read through this resource to understand the fundamentals. | Add one user to Azure Active Directory. NOTE: You may not have permissions for this action if you are not the administrator for the subscription. If that’s the case, simply review this tutorial to learn more. |
| | The Microsoft Azure Data Science Virtual Machine | You can install the tools for working with Data Science locally on multiple operating systems. But the Microsoft Azure Data Science Virtual Machine (DSVM) contains all of the tools you need and plenty of project samples to work with. In this article, you learn more about the DSVM and how to work through its examples. This resource explains the Data Science Virtual Machine, how you can create one, and a few options for developing code with it. It also contains all the software you need to complete this learning path, so make sure you complete the Knowledge Check for this topic. | Create a Data Science Virtual Machine and work through at least one lab. |
| Install and understand the tools and technologies for working with Data Science solutions | Working with git | To follow our DevOps process with the TDSP, we need a version-control system. Microsoft Azure Machine Learning uses git, a popular open-source distributed repository system. In this article, you learn more about how to install, configure, and work with git and a central repository, GitHub. | Clone this GitHub project for your learning path project structure. |
| | VSCode | VSCode is a cross-platform Integrated Development Environment (IDE) that you can use with multiple languages and Azure tools. You can use this single environment to create your entire solution. Watch these introductory videos to get started. | Install VSCode, and work through the VS Code features in the Interactive Editor Playground. |
| | Programming with Python | In this solution we use Python, one of the most popular languages in Data Science. This article covers the basics of writing analytic code with Python, and resources to learn more. Work through sections 1-9 of this reference, then check your knowledge. | Add one entity to an Azure Table using Python. |
| | Working with Notebooks | Notebooks are a way of combining text and code in the same document. Azure Machine Learning works with Notebooks, so it is beneficial to understand how to use them. Read through this tutorial and give it a try in the Knowledge Check section. | Open this page, and click on the “Welcome to Python.ipynb” link. Work through the examples on that page. |
| | Machine Learning | Creating advanced analytic solutions involves working with data and using Machine Learning, which also forms the basis of working with Artificial Intelligence and Deep Learning. This course teaches you more about Machine Learning. For a comprehensive course on Data Science, check out this certification. | Locate a resource on Machine Learning Algorithms. (Hint: Search on “azure machine learning algorithm cheat sheet”) |
| | scikit-learn | The scikit-learn set of tools allows you to perform data science tasks in Python. We use this framework in our solution. This article covers the basics and explains where you can learn more. | Using the Iris dataset, persist an SVM model using Pickle. |
| | Working with Docker | Docker is a distributed platform used to build, ship, and run applications, and is used frequently in Azure Machine Learning. This article covers the basics of this technology and explains where you can go to learn more. | Open Visual Studio Code, and install the Docker Extension. Create a simple Node Docker container. |
| | HDInsight | HDInsight is the Hadoop open-source infrastructure, available as a service in Microsoft Azure. Your Machine Learning algorithms may involve large sets of data, and HDInsight can store, transfer, and process data at large scale. This article covers working with HDInsight. | Create a small HDInsight cluster. Use HiveQL statements to project columns onto an /example/data/sample.log file. Alternatively, you can complete this knowledge check on your local system. |
| Create a Data Processing Flow from Business Requirements | Determining the Question, following the TDSP | With the development environment installed and configured, and an understanding of the technologies and processes in place, it’s time to put everything together using the TDSP to perform an analysis. We start by defining the question and selecting the data sources, then follow the rest of the steps in the Team Data Science Process. Keep the DevOps process in mind as we work through these steps. In this article, you learn how to take the requirements from your organization and create a data flow map through your application to define your solution using the Team Data Science Process. | |
| Use Azure Machine Learning to create a predictive solution | Azure Machine Learning | Microsoft Azure Machine Learning uses AI for data wrangling and feature engineering, manages experiments, and tracks model runs. All of this works in a single environment, and most functions can run locally or in Azure. You can use PyTorch, TensorFlow, and other frameworks to create your experiments. In this article, we focus on a complete example of this process, using everything you’ve learned so far. | |
| Use Power BI to visualize results | Power BI | Power BI is Microsoft’s data visualization tool. It is available on multiple platforms, from the web to mobile devices and desktop computers. In this article, you learn how to work with the output of the solution you’ve created by accessing the results from Azure storage and creating visualizations using Power BI. | Complete this tutorial on Power BI. Then connect Power BI to the Blob CSV created in an experiment run. |
| Monitor your Solution | Application Insights | There are multiple tools you can use to monitor your end solution. Azure Application Insights makes it easy to integrate built-in monitoring into your solution. | Set up Application Insights to monitor an application. |
| | Azure Monitor logs | Another method to monitor your application is to integrate it into your DevOps process. The Azure Monitor logs system provides a rich set of features to help you watch your analytic solutions after you deploy them. | Complete this tutorial on using Azure Monitor logs. |
| Complete this Learning Path | | Congratulations! You’ve completed this learning path. | |
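
The scikit-learn knowledge check in the lesson path above (persisting an SVM model trained on the Iris dataset with Pickle) can be sketched roughly as follows. This is a minimal illustration, not the learning path's reference solution: it assumes scikit-learn is installed, leaves the classifier at its default hyperparameters, and uses variable names (`clf`, `blob`, `restored`, `same`) chosen here for illustration only.

```python
import pickle

from sklearn import datasets, svm

# Load the Iris dataset that ships with scikit-learn.
X, y = datasets.load_iris(return_X_y=True)

# Train a support vector classifier (default hyperparameters, an assumption).
clf = svm.SVC()
clf.fit(X, y)

# Persist the fitted model as bytes, then restore it from those bytes.
# pickle.dump / pickle.load would do the same against an open file.
blob = pickle.dumps(clf)
restored = pickle.loads(blob)

# The restored model should reproduce the original model's predictions.
same = bool((restored.predict(X) == clf.predict(X)).all())
print("predictions match:", same)
```

Pickled models should only be loaded from sources you trust, and ideally with the same scikit-learn version that produced them.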
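
The HDInsight knowledge check above asks you to project columns onto a log file with HiveQL. As a rough local analogy only (this is plain Python, not HiveQL, and it does not touch a cluster), the idea of imposing named columns on space-delimited text at read time can be sketched like this. The log lines, their layout, and the column names are all hypothetical assumptions made for illustration; the real /example/data/sample.log file may differ.

```python
from collections import Counter

# Hypothetical log lines in a space-delimited layout (an assumption,
# not the actual contents of /example/data/sample.log).
SAMPLE_LINES = [
    "2012-02-03 18:35:34 SampleClass6 [INFO] everything normal for id 577725851",
    "2012-02-03 18:35:34 SampleClass4 [FATAL] system problem at id 1991281254",
    "2012-02-03 18:35:34 SampleClass3 [DEBUG] detail for id 1304807656",
    "2012-02-03 18:35:35 SampleClass1 [ERROR] incorrect id 1886438513",
]

# Illustrative column names projected onto each line.
COLUMNS = ("date", "time", "source", "severity", "message")


def project(line):
    """Split a raw line into named columns; the message keeps its spaces."""
    parts = line.split(" ", len(COLUMNS) - 1)
    return dict(zip(COLUMNS, parts))


rows = [project(line) for line in SAMPLE_LINES]

# Roughly what a GROUP BY on the severity column would compute.
severity_counts = Counter(row["severity"] for row in rows)

# Roughly what a WHERE severity = '[ERROR]' filter would return.
errors = [row for row in rows if row["severity"] == "[ERROR]"]

print(dict(severity_counts))
print(errors)
```

The raw text is never rewritten; the column structure exists only in the reader, which is the same schema-on-read idea the HiveQL exercise explores at scale.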

Next steps

Team Data Science Process for Developer Operations: This article explores the Developer Operations (DevOps) functions that are specific to an Advanced Analytics and Cognitive Services solution implementation.