Platforms and tools for data science projects

Microsoft provides a full spectrum of analytics resources for both cloud and on-premises platforms. They can be deployed to make the execution of your data science projects efficient and scalable. The Team Data Science Process (TDSP) provides guidance for teams implementing data science projects in a trackable, version-controlled, and collaborative way. For an outline of the personnel roles, and their associated tasks, that are handled by a data science team standardizing on this process, see Team Data Science Process roles and tasks.

The analytics resources available to data science teams that use the TDSP include:

  • Data Science Virtual Machines (both Windows and Linux CentOS)
  • HDInsight Spark clusters
  • Azure Synapse Analytics
  • Azure Data Lake
  • HDInsight Hive clusters
  • Azure File Storage
  • SQL Server 2019 R and Python Services

This document briefly describes these resources and provides links to the tutorials and walkthroughs the TDSP teams have published. They can help you learn how to use each resource step by step and start building your intelligent applications. More information on these resources is available on their product pages.

Data Science Virtual Machine (DSVM)

The Data Science Virtual Machine, offered by Microsoft on both Windows and Linux, contains popular tools for data science modeling and development activities. It includes tools such as:

  • Microsoft R Server Developer Edition
  • Anaconda Python distribution
  • Jupyter notebooks for Python and R
  • Visual Studio Community Edition with Python and R Tools on Windows / Eclipse on Linux
  • Power BI Desktop for Windows
  • SQL Server 2016 Developer Edition on Windows / Postgres on Linux

It also includes ML and AI tools such as xgboost, mxnet, and Vowpal Wabbit.

Currently, the DSVM is available on the Windows and Linux CentOS operating systems. Choose the size of your DSVM (the number of CPU cores and the amount of memory) based on the needs of the data science projects that you plan to execute on it.

To learn how to execute some of the common data science tasks on the DSVM efficiently, see 10 things you can do on the Data Science Virtual Machine.

Azure HDInsight Spark clusters

Apache Spark is an open-source parallel processing framework that supports in-memory processing to boost the performance of big-data analytics applications. The Spark processing engine is built for speed, ease of use, and sophisticated analytics. Spark's in-memory computation capabilities make it a good choice for iterative algorithms in machine learning and for graph computations. Spark is also compatible with Azure Blob storage (WASB), so your existing data stored in Azure can easily be processed using Spark.

When you create a Spark cluster in HDInsight, you create Azure compute resources with Spark installed and configured. It takes about 10 minutes to create a Spark cluster in HDInsight. Store the data to be processed in Azure Blob storage. For information on using Azure Blob storage with a cluster, see Use HDFS-compatible Azure Blob storage with Hadoop in HDInsight.
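Data kept in Blob storage is addressed from Spark through the `wasb://` URI scheme. As a minimal sketch (the container, storage-account, and file names below are placeholders, not values from any walkthrough), a small helper makes the URI shape explicit:

```python
def wasb_uri(container: str, account: str, path: str) -> str:
    """Build a WASB URI that Spark on HDInsight can read from.

    The wasb:// scheme addresses HDFS-compatible Azure Blob storage:
    wasb://<container>@<account>.blob.core.windows.net/<path>
    """
    return f"wasb://{container}@{account}.blob.core.windows.net/{path.lstrip('/')}"

# Placeholder names for illustration; a Spark job might then call
# spark.read.csv(wasb_uri("data", "mystorage", "raw/trips.csv"))
print(wasb_uri("data", "mystorage", "raw/trips.csv"))
```

Keeping the URI construction in one place avoids the common mistake of swapping the container and account segments.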

The TDSP team from Microsoft has published two end-to-end walkthroughs on how to use Azure HDInsight Spark clusters to build data science solutions, one using Python and the other Scala. For more information on Azure HDInsight Spark clusters, see Overview: Apache Spark on HDInsight Linux. To learn how to build a data science solution using Python on an Azure HDInsight Spark cluster, see Overview of Data Science using Spark on Azure HDInsight. To learn how to build a data science solution using Scala on an Azure HDInsight Spark cluster, see Data Science using Scala and Spark on Azure.

Azure SQL Data Warehouse

Azure SQL Data Warehouse allows you to scale compute resources easily, in seconds, without over-provisioning or over-paying. It also offers the unique option to pause the use of compute resources, giving you the freedom to better manage your cloud costs. The ability to deploy scalable compute resources makes it possible to bring all your data into Azure SQL Data Warehouse. Storage costs are minimal, and you can run compute only on the parts of the datasets that you want to analyze.

For more information on Azure SQL Data Warehouse, see the SQL Data Warehouse website. To learn how to build end-to-end advanced analytics solutions with SQL Data Warehouse, see The Team Data Science Process in action: using SQL Data Warehouse.

Azure Data Lake

Azure Data Lake is an enterprise-wide repository for every type of data, collected in a single location before any formal requirements or schema are imposed. This flexibility allows every type of data to be kept in a data lake, regardless of its size, its structure, or how fast it is ingested. Organizations can then use Hadoop or advanced analytics to find patterns in these data lakes. Data lakes can also serve as a repository for lower-cost data preparation before the data is curated and moved into a data warehouse.

For more information on Azure Data Lake, see Introducing Azure Data Lake. To learn how to build a scalable end-to-end data science solution with Azure Data Lake, see Scalable Data Science in Azure Data Lake: An end-to-end Walkthrough.

Azure HDInsight Hive (Hadoop) clusters

Apache Hive is a data warehouse system for Hadoop that enables data summarization, querying, and analysis of data using HiveQL, a query language similar to SQL. Hive can be used to explore your data interactively or to create reusable batch processing jobs.

Hive allows you to project structure onto largely unstructured data. After you define the structure, you can use Hive to query that data in a Hadoop cluster without having to use, or even know, Java or MapReduce. HiveQL (the Hive query language) allows you to write queries with statements that are similar to T-SQL.

For data scientists, Hive can run Python user-defined functions (UDFs) in Hive queries to process records. This ability extends the capability of Hive queries in data analysis considerably. Specifically, it allows data scientists to conduct scalable feature engineering in languages they are mostly familiar with: the SQL-like HiveQL and Python.
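A Python UDF used this way is typically a streaming script: Hive's `TRANSFORM` clause pipes tab-separated rows to the script's standard input and reads transformed rows back from its standard output. A minimal sketch, where the column layout and the hour-of-day feature are illustrative assumptions rather than part of any specific walkthrough:

```python
def add_hour_feature(line: str) -> str:
    """Append an hour-of-day column to one tab-separated Hive record.

    Assumes the second column holds a timestamp like '2016-05-01 13:45:00'.
    In a Hive query this script would run via:
        SELECT TRANSFORM (cols...) USING 'python udf.py' AS (cols..., hour)
    with records arriving one per line on sys.stdin.
    """
    fields = line.rstrip("\n").split("\t")
    timestamp = fields[1]                       # e.g. '2016-05-01 13:45:00'
    hour = timestamp.split(" ")[1].split(":")[0]
    return "\t".join(fields + [hour])

# One sample record for illustration; inside Hive you would instead loop:
#   for record in sys.stdin: print(add_hour_feature(record))
sample = "42\t2016-05-01 13:45:00"
print(add_hour_feature(sample))
```

Because the script only sees plain text lines, the same feature-engineering code can be unit-tested locally before being shipped to the cluster.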

For more information on Azure HDInsight Hive clusters, see Use Hive and HiveQL with Hadoop in HDInsight. To learn how to build a scalable end-to-end data science solution with Azure HDInsight Hive clusters, see The Team Data Science Process in action: using HDInsight Hadoop clusters.

Azure File Storage

Azure File Storage is a service that offers file shares in the cloud using the standard Server Message Block (SMB) protocol. Both SMB 2.1 and SMB 3.0 are supported. With Azure File storage, you can migrate legacy applications that rely on file shares to Azure quickly and without costly rewrites. Applications running in Azure virtual machines or cloud services, or from on-premises clients, can mount a file share in the cloud, just as a desktop application mounts a typical SMB share. Any number of application components can then mount and access the File storage share simultaneously.

Especially useful for data science projects is the ability to create an Azure file store as the place to share project data with your project team members. Each of them then has access to the same copy of the data in Azure File storage. They can also use this file storage to share feature sets generated during the execution of the project. If the project is a client engagement, your clients can create an Azure file store under their own Azure subscription to share the project data and features with you. In this way, the client retains full control of the project data assets. For more information on Azure File Storage, see Get started with Azure File storage on Windows and How to use Azure File Storage with Linux.
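An Azure file share is reached through a well-known SMB endpoint, `<account>.file.core.windows.net`. As a small sketch (the account and share names below are placeholders), the remote paths that a Windows `net use` or Linux `mount -t cifs` command would target can be derived consistently for the whole team:

```python
def windows_unc_path(account: str, share: str) -> str:
    """UNC path a Windows client mounts with 'net use'."""
    return f"\\\\{account}.file.core.windows.net\\{share}"

def linux_cifs_source(account: str, share: str) -> str:
    """Remote source a Linux client mounts with 'mount -t cifs'."""
    return f"//{account}.file.core.windows.net/{share}"

# Placeholder names for illustration only.
print(windows_unc_path("teamstore", "projectdata"))   # \\teamstore.file.core.windows.net\projectdata
print(linux_cifs_source("teamstore", "projectdata"))  # //teamstore.file.core.windows.net/projectdata
```

Generating these paths from one place keeps Windows and Linux team members pointed at the same share.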

SQL Server 2019 R and Python Services

R Services (In-Database) provides a platform for developing and deploying intelligent applications that can uncover new insights. You can use the rich and powerful R language, including the many packages provided by the R community, to create models and generate predictions from your SQL Server data. Because R Services (In-Database) integrates the R language with SQL Server, analytics are kept close to the data, which eliminates the costs and security risks associated with moving data.

R Services (In-Database) supports the open-source R language with a comprehensive set of SQL Server tools and technologies. They offer superior performance, security, reliability, and manageability. You can deploy R solutions using convenient and familiar tools. Your production applications can call the R runtime and retrieve predictions and visuals using Transact-SQL. You can also use the ScaleR libraries to improve the scale and performance of your R solutions. For more information, see SQL Server R Services.
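Calling the R runtime from Transact-SQL goes through the `sp_execute_external_script` stored procedure. The sketch below composes such a statement from Python; the table, column, and R-script contents are placeholders, and actually running the statement would additionally require a database connection (for example via pyodbc), which is omitted here:

```python
def r_scoring_statement(input_query: str, r_script: str) -> str:
    """Compose a T-SQL call to sp_execute_external_script that runs an R
    script over the rows returned by input_query (R Services, In-Database).

    InputDataSet / OutputDataSet are the default data frame names the
    procedure exposes to the R script.
    """
    return (
        "EXEC sp_execute_external_script\n"
        "    @language = N'R',\n"
        f"    @script = N'{r_script}',\n"
        f"    @input_data_1 = N'{input_query}';"
    )

# Placeholder query and R script: average a numeric column server-side.
tsql = r_scoring_statement(
    "SELECT amount FROM dbo.Sales",
    "OutputDataSet <- data.frame(avg_amount = mean(InputDataSet$amount))",
)
print(tsql)
```

Because the data stays inside SQL Server, only the short T-SQL text crosses the wire, which is the point of the in-database design.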

The TDSP team from Microsoft has published two end-to-end walkthroughs that show how to build data science solutions in SQL Server 2016 R Services: one for R programmers and one for SQL developers. For R programmers, see Data Science End-to-End Walkthrough. For SQL developers, see In-Database Advanced Analytics for SQL Developers (Tutorial).

Appendix: Tools to set up data science projects

Install Git Credential Manager on Windows

If you are following the TDSP on Windows, you need to install the Git Credential Manager (GCM) to communicate with Git repositories. To install GCM, you first need to install Chocolatey. To install Chocolatey and GCM, run the following commands in Windows PowerShell as an administrator:

iwr https://chocolatey.org/install.ps1 -UseBasicParsing | iex
choco install git-credential-manager-for-windows -y

Install Git on Linux (CentOS) machines

Run the following bash command to install Git on Linux (CentOS) machines:

sudo yum install git

Generate a public SSH key on Linux (CentOS) machines

If you are using Linux (CentOS) machines to run the git commands, you need to add the public SSH key of your machine to Azure DevOps Services so that the machine is recognized by Azure DevOps Services. First, generate a public SSH key, and then add the key to SSH public keys on your Azure DevOps Services security settings page.

  1. To generate the SSH key, run the following two commands:

    ssh-keygen
    cat .ssh/id_rsa.pub


  2. Copy the entire SSH key, including ssh-rsa.

  3. Log in to your Azure DevOps Services.

  4. Click <Your Name> at the top-right corner of the page, and then click Security.


  5. Click SSH public keys, and then click + Add.


  6. Paste the copied SSH key into the text box and save.

Next steps

Full end-to-end walkthroughs that demonstrate all the steps in the process for specific scenarios are also provided. They are listed and linked, with thumbnail descriptions, in the Example walkthroughs topic. They illustrate how to combine cloud and on-premises tools and services into a workflow or pipeline to create an intelligent application.

For examples that show how to execute steps in the Team Data Science Process by using Azure Machine Learning Studio (classic), see the With Azure ML learning path.