HDInsight Spark data science walkthroughs using PySpark and Scala on Azure

These walkthroughs use PySpark and Scala on an Azure Spark cluster to perform predictive analytics. They follow the steps outlined in the Team Data Science Process. For an overview of the Team Data Science Process, see Data Science Process. For an overview of Spark on HDInsight, see Introduction to Spark on HDInsight.

Additional data science walkthroughs that execute the Team Data Science Process are grouped by the platform that they use. For a list of these examples, see Walkthroughs executing the Team Data Science Process.

Predict taxi tips using PySpark on Azure Spark

The Use Spark on Azure HDInsight walkthrough uses data from New York taxis to predict whether a tip is paid and the range of amounts expected to be paid. It applies the Team Data Science Process in a scenario that uses an Azure HDInsight Spark cluster to store, explore, and feature engineer data from the publicly available NYC taxi trip and fare dataset. This overview topic sets you up with an HDInsight Spark cluster and the Jupyter PySpark notebooks used in the rest of the walkthrough. These notebooks show you how to explore your data and then how to create and consume models. The advanced data exploration and modeling notebook shows how to include cross-validation, hyperparameter sweeping, and model evaluation.
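The two prediction targets named above (whether a tip is paid, and which range the amount falls into) can be derived from a raw tip amount. The following is a minimal sketch in plain Python, not the notebook code from the walkthrough. The function names `tipped` and `tip_class` and the dollar bin edges in `TIP_BIN_EDGES` are assumptions chosen for illustration; the overview does not state the actual ranges.

```python
from typing import List

# Hypothetical bin edges in dollars. These values are an assumption for
# illustration; this overview does not specify the ranges.
TIP_BIN_EDGES: List[float] = [0.0, 5.0, 10.0, 20.0]


def tipped(tip_amount: float) -> int:
    """Binary label: 1 if any tip was paid, otherwise 0."""
    return 1 if tip_amount > 0 else 0


def tip_class(tip_amount: float, edges: List[float] = TIP_BIN_EDGES) -> int:
    """Multiclass label: index of the range the tip amount falls into.

    Class 0 means no tip. Class k (k >= 1) means
    edges[k - 1] < tip_amount <= edges[k]. The last class covers every
    amount above the final edge.
    """
    if tip_amount <= edges[0]:
        return 0
    for k in range(1, len(edges)):
        if tip_amount <= edges[k]:
            return k
    return len(edges)
```

A binary classifier would be trained on the output of `tipped`, while a multiclass classifier or a regression model would target the tip range or the raw amount.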

Data exploration and modeling with Spark

To explore the dataset and to create, score, and evaluate the machine learning models, work through the Create binary classification and regression models for data with the Spark MLlib toolkit topic.

Model consumption

To learn how to score the classification and regression models created in this topic, see Score and evaluate Spark-built machine learning models.

Cross-validation and hyperparameter sweeping

To learn how to train models using cross-validation and hyperparameter sweeping, see Advanced data exploration and modeling with Spark.
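As a rough illustration of what cross-validation combined with a hyperparameter sweep does, the sketch below evaluates each candidate value with k-fold validation and keeps the one with the lowest mean error. It is plain Python and does not use the Spark API (Spark ML offers `CrossValidator` and `ParamGridBuilder` in `pyspark.ml.tuning` for this purpose). The one-dimensional ridge model, the helper names, and the toy data are all assumptions made for the example, not code from the linked notebook.

```python
from typing import Dict, List, Sequence, Tuple


def k_fold_indices(n: int, k: int) -> List[Tuple[List[int], List[int]]]:
    """Split indices 0..n-1 into k (train, validation) pairs, without shuffling."""
    folds = [list(range(i, n, k)) for i in range(k)]
    return [
        ([j for f, fold in enumerate(folds) if f != i for j in fold], folds[i])
        for i in range(k)
    ]


def fit_ridge_1d(xs: Sequence[float], ys: Sequence[float], reg: float) -> float:
    """Closed-form ridge fit of y ~ w * x with no intercept."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + reg)


def mse(w: float, xs: Sequence[float], ys: Sequence[float]) -> float:
    """Mean squared error of the prediction w * x."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def sweep(
    xs: Sequence[float], ys: Sequence[float], grid: Sequence[float], k: int = 3
) -> Tuple[float, Dict[float, float]]:
    """Return the best regularization value and the mean validation MSE per value."""
    scores: Dict[float, float] = {}
    for reg in grid:
        fold_errors = []
        for train, valid in k_fold_indices(len(xs), k):
            w = fit_ridge_1d([xs[i] for i in train], [ys[i] for i in train], reg)
            fold_errors.append(mse(w, [xs[i] for i in valid], [ys[i] for i in valid]))
        scores[reg] = sum(fold_errors) / k
    best = min(scores, key=scores.get)
    return best, scores


# Toy data (hypothetical): y is exactly 2 * x, so no regularization fits best.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.0 * x for x in xs]
best_reg, cv_scores = sweep(xs, ys, grid=[0.0, 1.0, 10.0], k=3)
```

Every candidate is scored only on data held out from its own training folds, which is what keeps the selected hyperparameter from simply reflecting the training fit.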

Predict taxi tips using Scala on Azure Spark

The Use Scala with Spark on Azure walkthrough uses data from New York taxis to predict whether a tip is paid and the range of amounts expected to be paid. It shows how to use Scala for supervised machine learning tasks with the Spark machine learning library (MLlib) and SparkML packages on an Azure HDInsight Spark cluster. It walks you through the tasks that constitute the Data Science Process: data ingestion and exploration, visualization, feature engineering, modeling, and model consumption. The models built include logistic regression, linear regression, random forests, and gradient boosted trees.

Next steps

For a discussion of the key components that make up the Team Data Science Process, see Team Data Science Process overview.

For a discussion of the Team Data Science Process lifecycle that you can use to structure your data science projects, see Team Data Science Process lifecycle. The lifecycle outlines the steps that projects usually follow from start to finish.