Overview of data science using Spark on Azure HDInsight

This suite of topics shows how to use HDInsight Spark to complete common data science tasks such as data ingestion, feature engineering, modeling, and model evaluation. The data used is a sample of the 2013 NYC taxi trip and fare dataset. The models built include logistic and linear regression, random forests, and gradient boosted trees. The topics also show how to store these models in Azure Blob storage (WASB) and how to score and evaluate their predictive performance. More advanced topics cover how models can be trained using cross-validation and hyper-parameter sweeping. This overview topic also references the topics that describe how to set up the Spark cluster that you need to complete the steps in the walkthroughs provided.

Spark and MLlib

Spark is an open-source parallel processing framework that supports in-memory processing to boost the performance of big-data analytic applications. The Spark processing engine is built for speed, ease of use, and sophisticated analytics. Spark's in-memory distributed computation capabilities make it a good choice for the iterative algorithms used in machine learning and graph computations. MLlib is Spark's scalable machine learning library that brings algorithmic modeling capabilities to this distributed environment.
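As a minimal illustration of the kind of MLlib code these topics walk through, the following sketch assembles a feature vector and fits a logistic regression model with the pyspark.ml pipeline API. The input DataFrame df and the column names used here are hypothetical placeholders, not code taken from the linked notebooks.

    # Minimal sketch (pyspark.ml): assemble features and train a logistic regression model.
    # The DataFrame `df` and the column names used here are hypothetical placeholders.
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.classification import LogisticRegression

    assembler = VectorAssembler(
        inputCols=["trip_distance", "passenger_count", "trip_time_in_secs"],
        outputCol="features")
    train_df = assembler.transform(df).select("features", "tipped")

    lr = LogisticRegression(featuresCol="features", labelCol="tipped", maxIter=10)
    lr_model = lr.fit(train_df)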

HDInsight Spark

HDInsight Spark is the Azure-hosted offering of open-source Spark. It also includes support for Jupyter PySpark notebooks on the Spark cluster that can run Spark SQL interactive queries for transforming, filtering, and visualizing data stored in Azure Blobs (WASB). PySpark is the Python API for Spark. The code snippets that provide the solutions and show the relevant plots to visualize the data run in Jupyter notebooks installed on the Spark clusters. The modeling steps in these topics contain code that shows how to train, evaluate, save, and consume each type of model.
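For example, here is a hedged sketch of the kind of interactive query such a notebook might run against data in blob storage. It assumes a Spark 2.0 session named spark (predefined in the HDInsight Jupyter kernels) and a hypothetical file path in the cluster's default WASB container.

    # Sketch: read a CSV file from the default WASB container and query it with Spark SQL.
    # The file path below is a hypothetical placeholder.
    taxi_df = spark.read.csv("wasb:///example/data/nyctaxi_sample.csv",
                             header=True, inferSchema=True)
    taxi_df.createOrReplaceTempView("trip")

    # An interactive Spark SQL query that filters and aggregates the data.
    spark.sql("""
        SELECT passenger_count, AVG(trip_distance) AS avg_distance
        FROM trip
        WHERE trip_distance > 0
        GROUP BY passenger_count
    """).show()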

Setup: Spark clusters and Jupyter notebooks

Setup steps and code are provided in this walkthrough for using HDInsight Spark 1.6. But Jupyter notebooks are provided for both HDInsight Spark 1.6 and Spark 2.0 clusters. A description of the notebooks and links to them are provided in the Readme.md for the GitHub repository containing them. Moreover, the code here and in the linked notebooks is generic and should work on any Spark cluster. If you are not using HDInsight Spark, the cluster setup and management steps may be slightly different from what is shown here. For convenience, here are the links to the Jupyter notebooks for Spark 1.6 (to be run in the pySpark kernel of the Jupyter Notebook server) and Spark 2.0 (to be run in the pySpark3 kernel of the Jupyter Notebook server):

Spark 1.6 notebooks

These notebooks are to be run in the pySpark kernel of the Jupyter Notebook server.

Spark 2.0 notebooks

These notebooks are to be run in the pySpark3 kernel of the Jupyter Notebook server.

  • Spark2.0-pySpark3-machine-learning-data-science-spark-advanced-data-exploration-modeling.ipynb: This file provides information on how to perform data exploration, modeling, and scoring in Spark 2.0 clusters using the NYC Taxi trip and fare dataset described here. This notebook may be a good starting point for quickly exploring the code we have provided for Spark 2.0. For a more detailed notebook that analyzes the NYC Taxi data, see the next notebook in this list. See the notes following this list that compare these notebooks.
  • Spark2.0-pySpark3_NYC_Taxi_Tip_Regression.ipynb: This file shows how to perform data wrangling (Spark SQL and dataframe operations), exploration, modeling, and scoring using the NYC Taxi trip and fare dataset described here.
  • Spark2.0-pySpark3_Airline_Departure_Delay_Classification.ipynb: This file shows how to perform data wrangling (Spark SQL and dataframe operations), exploration, modeling, and scoring using the well-known Airline On-time departure dataset from 2011 and 2012. We integrated the airline dataset with the airport weather data (for example, wind speed, temperature, altitude) prior to modeling, so these weather features can be included in the model.

Note

The airline dataset was added to the Spark 2.0 notebooks to better illustrate the use of classification algorithms. See the following links for information about the airline on-time departure dataset and the weather dataset:

Note

The Spark 2.0 notebooks on the NYC taxi and airline flight delay datasets can take 10 minutes or more to run (depending on the size of your HDI cluster). The first notebook in the above list, Spark2.0-pySpark3-machine-learning-data-science-spark-advanced-data-exploration-modeling.ipynb, shows many aspects of the data exploration, visualization, and ML model training using a down-sampled NYC dataset in which the taxi and fare files have been pre-joined. This notebook takes a much shorter time to finish (2-3 minutes) and may be a good starting point for quickly exploring the code we have provided for Spark 2.0.

For guidance on the operationalization of a Spark 2.0 model and model consumption for scoring, see the Spark 1.6 document on consumption for an example outlining the steps required. To use this on Spark 2.0, replace the Python code file with this file.
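As a rough sketch of what saving, reloading, and scoring a model looks like with the pyspark.ml API (the WASB path, model object, and DataFrames below are hypothetical placeholders rather than the code from the linked documents):

    # Sketch: persist a trained pyspark.ml model to blob storage, reload it, and score new data.
    # The path and the DataFrames are hypothetical placeholders.
    from pyspark.ml.classification import LogisticRegressionModel

    model_path = "wasb:///models/nyctaxi_tip_lr_model"

    # Save the fitted model (lr_model comes from an earlier training step).
    lr_model.write().overwrite().save(model_path)

    # Later, load the model and score a DataFrame that has the same 'features' column.
    loaded_model = LogisticRegressionModel.load(model_path)
    scored_df = loaded_model.transform(test_df)
    scored_df.select("prediction", "probability").show(5)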

Prerequisites

The following procedures are related to Spark 1.6. For the Spark 2.0 version, use the notebooks described and linked to previously.

  1. You must have an Azure subscription.

  2. You need a Spark 1.6 cluster to complete this walkthrough. To create one, see the instructions provided in Get started: create Apache Spark on Azure HDInsight. The cluster type and version are specified from the Select Cluster Type menu.

(Screenshot: Configure cluster)

Note

For a topic that shows how to use Scala rather than Python to complete tasks for an end-to-end data science process, see Data Science using Scala with Spark on Azure.

Warning

Billing for HDInsight clusters is prorated per minute, whether you use them or not. Be sure to delete your cluster after you finish using it. See how to delete an HDInsight cluster.

The NYC 2013 Taxi data

The NYC Taxi Trip data is about 20 GB of compressed comma-separated values (CSV) files (~48 GB uncompressed), comprising more than 173 million individual trips and the fares paid for each trip. Each trip record includes the pickup and dropoff location and time, anonymized hack (driver's) license number, and medallion (taxi's unique ID) number. The data covers all trips in the year 2013 and is provided in the following two datasets for each month:

  1. The 'trip_data' CSV files contain trip details, such as number of passengers, pickup and dropoff points, trip duration, and trip length. Here are a few sample records:

     medallion,hack_license,vendor_id,rate_code,store_and_fwd_flag,pickup_datetime,dropoff_datetime,passenger_count,trip_time_in_secs,trip_distance,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude
     89D227B655E5C82AECF13C3F540D4CF4,BA96DE419E711691B9445D6A6307C170,CMT,1,N,2013-01-01 15:11:48,2013-01-01 15:18:10,4,382,1.00,-73.978165,40.757977,-73.989838,40.751171
     0BD7C8F5BA12B88E0B67BED28BEA73D8,9FD8F69F0804BDB5549F40E9DA1BE472,CMT,1,N,2013-01-06 00:18:35,2013-01-06 00:22:54,1,259,1.50,-74.006683,40.731781,-73.994499,40.75066
     0BD7C8F5BA12B88E0B67BED28BEA73D8,9FD8F69F0804BDB5549F40E9DA1BE472,CMT,1,N,2013-01-05 18:49:41,2013-01-05 18:54:23,1,282,1.10,-74.004707,40.73777,-74.009834,40.726002
     DFD2202EE08F7A8DC9A57B02ACB81FE2,51EE87E3205C985EF8431D850C786310,CMT,1,N,2013-01-07 23:54:15,2013-01-07 23:58:20,2,244,.70,-73.974602,40.759945,-73.984734,40.759388
     DFD2202EE08F7A8DC9A57B02ACB81FE2,51EE87E3205C985EF8431D850C786310,CMT,1,N,2013-01-07 23:25:03,2013-01-07 23:34:24,1,560,2.10,-73.97625,40.748528,-74.002586,40.747868
    
  2. The 'trip_fare' CSV files contain details of the fare paid for each trip, such as payment type, fare amount, surcharge and taxes, tips and tolls, and the total amount paid. Here are a few sample records:

     medallion, hack_license, vendor_id, pickup_datetime, payment_type, fare_amount, surcharge, mta_tax, tip_amount, tolls_amount, total_amount
     89D227B655E5C82AECF13C3F540D4CF4,BA96DE419E711691B9445D6A6307C170,CMT,2013-01-01 15:11:48,CSH,6.5,0,0.5,0,0,7
     0BD7C8F5BA12B88E0B67BED28BEA73D8,9FD8F69F0804BDB5549F40E9DA1BE472,CMT,2013-01-06 00:18:35,CSH,6,0.5,0.5,0,0,7
     0BD7C8F5BA12B88E0B67BED28BEA73D8,9FD8F69F0804BDB5549F40E9DA1BE472,CMT,2013-01-05 18:49:41,CSH,5.5,1,0.5,0,0,7
     DFD2202EE08F7A8DC9A57B02ACB81FE2,51EE87E3205C985EF8431D850C786310,CMT,2013-01-07 23:54:15,CSH,5,0.5,0.5,0,0,6
     DFD2202EE08F7A8DC9A57B02ACB81FE2,51EE87E3205C985EF8431D850C786310,CMT,2013-01-07 23:25:03,CSH,9.5,0.5,0.5,0,0,10.5
    

We have taken a 0.1% sample of these files and joined the trip_data and trip_fare CSV files into a single dataset to use as the input dataset for this walkthrough. The unique key to join trip_data and trip_fare is composed of the fields medallion, hack_license, and pickup_datetime (a minimal sketch of this join and sampling step follows the field table below). Each record of the dataset contains the following attributes representing a NYC Taxi trip:

Field                 Brief Description
medallion             Anonymized taxi medallion (unique taxi ID)
hack_license          Anonymized Hackney Carriage License number
vendor_id             Taxi vendor ID
rate_code             NYC taxi rate of fare
store_and_fwd_flag    Store and forward flag
pickup_datetime       Pickup date and time
dropoff_datetime      Dropoff date and time
pickup_hour           Pickup hour
pickup_week           Pickup week of the year
weekday               Weekday (range 1-7)
passenger_count       Number of passengers in a taxi trip
trip_time_in_secs     Trip time in seconds
trip_distance         Trip distance traveled in miles
pickup_longitude      Pickup longitude
pickup_latitude       Pickup latitude
dropoff_longitude     Dropoff longitude
dropoff_latitude      Dropoff latitude
direct_distance       Direct distance between pickup and dropoff locations
payment_type          Payment type (cash, credit card, and so on)
fare_amount           Fare amount
surcharge             Surcharge
mta_tax               MTA tax
tip_amount            Tip amount
tolls_amount          Tolls amount
total_amount          Total amount
tipped                Tipped (0/1 for no or yes)
tip_class             Tip class (0: $0, 1: $0-5, 2: $6-10, 3: $11-20, 4: > $20)
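The following hedged sketch shows how a join and down-sampling of this kind could be expressed in PySpark. The file paths are hypothetical placeholders; the key columns are the ones listed above, and in the raw trip_fare files the header names include leading spaces that may need trimming first.

    # Sketch: join the trip and fare files on the composite key and take a ~0.1% sample.
    # File paths are hypothetical placeholders.
    trip = spark.read.csv("wasb:///nyctaxi/trip_data.csv", header=True, inferSchema=True)
    fare = spark.read.csv("wasb:///nyctaxi/trip_fare.csv", header=True, inferSchema=True)

    # Join on the composite key described above.
    joined = trip.join(fare, on=["medallion", "hack_license", "pickup_datetime"], how="inner")

    # Down-sample to roughly 0.1% of the rows, with a fixed seed for repeatability.
    sampled = joined.sample(withReplacement=False, fraction=0.001, seed=123)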

Execute code from a Jupyter notebook on the Spark cluster

You can launch the Jupyter Notebook from the Azure portal. Find your Spark cluster on your dashboard and click it to enter the management page for your cluster. To open the notebook associated with the Spark cluster, click Cluster Dashboards -> Jupyter Notebook.

(Screenshot: Cluster dashboards)

You can also browse to https://CLUSTERNAME.azurehdinsight.cn/jupyter to access the Jupyter Notebooks. Replace the CLUSTERNAME part of this URL with the name of your own cluster. You need the password for your admin account to access the notebooks.

(Screenshot: Browse Jupyter notebooks)

Select PySpark to see a directory that contains a few examples of pre-packaged notebooks that use the PySpark API. The notebooks that contain the code samples for this suite of Spark topics are available on GitHub.

You can upload the notebooks directly from GitHub to the Jupyter notebook server on your Spark cluster. On the home page of Jupyter, click the Upload button on the right part of the screen. It opens a file explorer. Here you can paste the GitHub (raw content) URL of the notebook and click Open.

The file name then appears in your Jupyter file list with an Upload button next to it. Click this Upload button. Now you have imported the notebook. Repeat these steps to upload the other notebooks from this walkthrough.

Tip

You can right-click the links in your browser and select Copy Link to get the GitHub raw content URL. You can paste this URL into the Jupyter Upload file explorer dialog box.

Now you can:

  • See the code by clicking the notebook.
  • Execute each cell by pressing SHIFT-ENTER.
  • Run the entire notebook by clicking Cell -> Run.
  • Use the automatic visualization of queries.

Tip

The PySpark kernel automatically visualizes the output of SQL (HiveQL) queries. You are given the option to select among several different types of visualizations (Table, Pie, Line, Area, or Bar) by using the Type menu buttons in the notebook:

(Screenshot: logistic regression ROC curve produced with the generic approach)
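For instance, a cell like the following sketch uses the %%sql magic that the HDInsight Jupyter Spark kernels provide; the table name trip is a hypothetical registered temp view or Hive table, and the query results can be rendered with any of the visualization types listed above.

    %%sql
    SELECT passenger_count, COUNT(*) AS trip_count
    FROM trip
    GROUP BY passenger_count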

What's next?

Now that you are set up with an HDInsight Spark cluster and have uploaded the Jupyter notebooks, you are ready to work through the topics that correspond to the three PySpark notebooks. They show how to explore your data and then how to create and consume models. The advanced data exploration and modeling notebook shows how to include cross-validation, hyper-parameter sweeping, and model evaluation.

Data exploration and modeling with Spark: Explore the dataset and create, score, and evaluate the machine learning models by working through the Create binary classification and regression models for data with the Spark MLlib toolkit topic.

Model consumption: To learn how to score the classification and regression models created in this topic, see Score and evaluate Spark-built machine learning models.

Cross-validation and hyperparameter sweeping: See Advanced data exploration and modeling with Spark to learn how models can be trained using cross-validation and hyper-parameter sweeping; a minimal sketch of the pattern follows.
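As a hedged illustration of the cross-validation and hyper-parameter sweeping pattern in pyspark.ml (the estimator, grid values, and the training DataFrame train_df with features/tipped columns are illustrative assumptions, not the exact settings used in the linked topic):

    # Sketch: 3-fold cross-validation over a small hyper-parameter grid for logistic regression.
    # The training DataFrame `train_df` and its columns are hypothetical placeholders.
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.evaluation import BinaryClassificationEvaluator
    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

    lr = LogisticRegression(featuresCol="features", labelCol="tipped")

    grid = (ParamGridBuilder()
            .addGrid(lr.regParam, [0.01, 0.1])
            .addGrid(lr.elasticNetParam, [0.0, 0.5])
            .build())

    evaluator = BinaryClassificationEvaluator(labelCol="tipped", metricName="areaUnderROC")

    cv = CrossValidator(estimator=lr, estimatorParamMaps=grid,
                        evaluator=evaluator, numFolds=3)
    cv_model = cv.fit(train_df)
    best_model = cv_model.bestModel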