DataFrames tutorial

The Apache Spark DataFrame API provides a rich set of functions (select columns, filter, join, aggregate, and so on) that allow you to solve common data analysis problems efficiently. DataFrames also allow you to intermix operations seamlessly with custom Python, R, Scala, and SQL code. In this tutorial module, you will learn how to:

- Load sample data
- View the DataFrame
- Run SQL queries
- Visualize the DataFrame

We also provide a sample notebook that you can import to access and run all of the code examples included in the module.

Load sample data

The easiest way to start working with DataFrames is to use an example Azure Databricks dataset available in the /databricks-datasets folder accessible within the Azure Databricks workspace. To access the file that compares city population versus median sale prices of homes, load the file /databricks-datasets/samples/population-vs-price/data_geo.csv.

Because the sample notebook is a SQL notebook, the next few commands use the %python magic command.

%python
# Use the Spark CSV datasource with options specifying:
# - First line of file is a header
# - Automatically infer the schema of the data
data = spark.read.csv("/databricks-datasets/samples/population-vs-price/data_geo.csv", header="true", inferSchema="true")
data.cache() # Cache data for faster reuse
data = data.dropna() # Drop rows with missing values

View the DataFrame

Now that you have created the data DataFrame, you can quickly access the data using standard Spark commands such as take(). For example, you can use the command data.take(10) to view the first ten rows of the data DataFrame.

%python
data.take(10)

DataFrame take

To view this data in a tabular format, you can use the Azure Databricks display() command instead of exporting the data to a third-party tool.

%python
display(data)

Display DataFrame

Run SQL queries

Before you can issue SQL queries, you must save your data DataFrame as a table or temporary view:

%python
# Register the DataFrame as a temporary view so it is accessible via SQL
data.createOrReplaceTempView("data_geo")

Then, in a new cell, specify a SQL query to list the 2015 median sales price by state:

select `State Code`, `2015 median sales price` from data_geo

SQL Query 1

Or, query for the population estimate in the state of Washington:

select City, `2014 Population estimate` from data_geo where `State Code` = 'WA';

SQL Query 2

Visualize the DataFrame

An additional benefit of using the Azure Databricks display() command is that you can quickly view this data with a number of embedded visualizations. Click the down arrow next to the chart button to display a list of visualization types:

Display DataFrame as bar chart

Then, select the Map icon to create a map visualization of the sale price SQL query from the previous section:

Display DataFrame as map

Notebook

To run these code examples, visualizations, and more, import the following notebook. For more DataFrame examples, see DataFrames and Datasets.

Apache Spark DataFrames notebook

Get notebook