Tutorial: Load data and run queries on an Apache Spark cluster in Azure HDInsight

In this tutorial, you learn how to create a dataframe from a CSV file, and how to run interactive Spark SQL queries against an Apache Spark cluster in Azure HDInsight. In Spark, a dataframe is a distributed collection of data organized into named columns. A dataframe is conceptually equivalent to a table in a relational database or a data frame in R/Python.

In this tutorial, you learn how to:

  • Create a dataframe from a CSV file
  • Run queries on the dataframe

Prerequisites

An Apache Spark cluster on HDInsight. See Create an Apache Spark cluster.

Create a Jupyter notebook

Jupyter Notebook is an interactive notebook environment that supports various programming languages. The notebook allows you to interact with your data, combine code with markdown text, and perform simple visualizations.

  1. Edit the URL https://SPARKCLUSTER.azurehdinsight.cn/jupyter by replacing SPARKCLUSTER with the name of your Spark cluster. Then enter the edited URL in a web browser. If prompted, enter the cluster login credentials for the cluster.

  2. From the Jupyter web page, select New > PySpark to create a notebook.

    Create a Jupyter Notebook to run an interactive Spark SQL query

    A new notebook is created and opened with the name Untitled (Untitled.ipynb).

    Note

    When you use the PySpark kernel to create a notebook, a Spark session is automatically created for you when you run the first code cell. You do not need to create the session explicitly.

Create a dataframe from a CSV file

Applications can create dataframes directly from files or folders on remote storage such as Azure Storage or Azure Data Lake Storage; from a Hive table; or from other data sources supported by Spark, such as Cosmos DB, Azure SQL DB, DW, and so on. The following screenshot shows a snapshot of the HVAC.csv file used in this tutorial. The CSV file comes with all HDInsight Spark clusters. The data captures the temperature variations of some buildings.

Snapshot of data for interactive Spark SQL query

  1. Paste the following code in an empty cell of the Jupyter notebook, and then press SHIFT + ENTER to run the code. The code imports the types required for this scenario:

    from pyspark.sql import *
    from pyspark.sql.types import *
    

    When you run an interactive query in Jupyter, the web browser window or tab caption shows a (Busy) status along with the notebook title. You also see a solid circle next to the PySpark text in the top-right corner. After the job is completed, it changes to a hollow circle.

    Status of interactive Spark SQL query

  2. Note the session ID returned. From the picture above, the session ID is 0. If desired, you can retrieve the session details by navigating to https://CLUSTERNAME.azurehdinsight.cn/livy/sessions/ID/statements, where CLUSTERNAME is the name of your Spark cluster and ID is your session ID number.

  3. Run the following code to create a dataframe and a table (hvac):

    # Create a dataframe and table from sample data
    csvFile = spark.read.csv('/HdiSamples/HdiSamples/SensorSampleData/hvac/HVAC.csv', header=True, inferSchema=True)
    csvFile.write.saveAsTable("hvac")
    

Run queries on the dataframe

Once the table is created, you can run an interactive query on the data.

  1. Run the following code in an empty cell of the notebook:

    %%sql
    SELECT buildingID, (targettemp - actualtemp) AS temp_diff, date FROM hvac WHERE date = "6/1/13"
    

    The following tabular output is displayed.

    Table output of interactive Spark query result

  2. You can also see the results in other visualizations. To see an area graph for the same output, select Area, and then set the other values as shown.

    Area graph of interactive Spark query result

  3. From the notebook menu bar, navigate to File > Save and Checkpoint.

  4. If you're starting the next tutorial now, leave the notebook open. If not, shut down the notebook to release the cluster resources: from the notebook menu bar, navigate to File > Close and Halt.

Clean up resources

With HDInsight, your data and Jupyter notebooks are stored in Azure Storage or Azure Data Lake Storage, so you can safely delete a cluster when it is not in use. You are also charged for an HDInsight cluster, even when it is not in use. Since the charges for the cluster are many times more than the charges for storage, it makes economic sense to delete clusters when they are not in use. If you plan to work on the next tutorial immediately, you might want to keep the cluster.

Open the cluster in the Azure portal, and select Delete.

Delete HDInsight cluster

You can also select the resource group name to open the resource group page, and then select Delete resource group. By deleting the resource group, you delete both the HDInsight Spark cluster and the default storage account.

Next steps

In this tutorial, you learned how to create a dataframe from a CSV file, and how to run interactive Spark SQL queries against an Apache Spark cluster in Azure HDInsight. Advance to the next article to see how the data you registered in Apache Spark can be pulled into a BI analytics tool such as Power BI.