Get started with Apache Spark

This tutorial module helps you get started quickly with Apache Spark. We discuss key concepts briefly, so you can get right down to writing your first Apache Spark application. In the other tutorial modules in this guide, you will have the opportunity to go deeper into the topic of your choice.

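In this tutorial module, you will learn:

  • The key Apache Spark interfaces
  • How to write your first Apache Spark application
  • How to access the Azure Databricks datasets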

We also provide sample notebooks that you can import to access and run all of the code examples included in the module.


Complete Quickstart: Run a Spark job on Azure Databricks using the Azure portal.

Spark interfaces

There are three key Apache Spark interfaces that you should know about: Resilient Distributed Dataset, DataFrame, and Dataset.

  • Resilient Distributed Dataset: The first Apache Spark abstraction was the Resilient Distributed Dataset (RDD). It is an interface to a sequence of data objects of one or more types, distributed across a collection of machines (a cluster). RDDs can be created in a variety of ways and are the “lowest level” API available. While this is the original data structure for Apache Spark, you should focus on the DataFrame API, which is a superset of the RDD functionality. The RDD API is available in the Java, Python, and Scala languages.
  • DataFrame: These are similar in concept to the DataFrame you may be familiar with in the pandas Python library and the R language. The DataFrame API is available in the Java, Python, R, and Scala languages.
  • Dataset: A combination of DataFrame and RDD. It provides the typed interface that is available in RDDs while providing the convenience of the DataFrame. The Dataset API is available in the Java and Scala languages.

In many scenarios, especially with the performance optimizations embedded in DataFrames and Datasets, it will not be necessary to work with RDDs. But it is important to understand the RDD abstraction because:

  • The RDD is the underlying infrastructure that allows Spark to run so fast and provide data lineage.
  • If you are diving into more advanced components of Spark, it may be necessary to use RDDs.
  • The visualizations within the Spark UI reference RDDs.

When you develop Spark applications, you typically use DataFrames and Datasets, which are covered in the DataFrames tutorial and the Datasets tutorial.

Write your first Apache Spark application

To write your first Apache Spark application, you add code to the cells of an Azure Databricks notebook. This example uses Python. For more information, see the Apache Spark Quick Start Guide.

This first command lists the contents of a folder in the Databricks File System:

# Take a look at the file system
display(dbutils.fs.ls("/databricks-datasets/samples/docs/"))

[Figure: Folder contents]

The next command uses spark, the SparkSession available in every notebook, to read the README.md text file and create a DataFrame named textFile:

textFile = spark.read.text("/databricks-datasets/samples/docs/README.md")

To count the lines of the text file, apply the count action to the DataFrame:

textFile.count()

[Figure: Text file line count]

One thing you may notice is that the second command, reading the text file, does not generate any output, while the third command, performing the count, does. The reason is that the second command is a transformation, while the third is an action. Transformations are lazy and run only when an action is run. This allows Spark to optimize for performance (for example, by running a filter before a join) instead of running commands serially. For a complete list of transformations and actions, refer to the Transformations and Actions sections of the Apache Spark Programming Guide.
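
The difference between lazy transformations and actions can be illustrated without Spark at all. The following is a plain-Python analogy, not Spark's implementation: the `LazyDataset` class, its methods, and the sample lines are hypothetical names invented for this sketch. A transformation only records a step, and nothing runs until an action such as `count` is called.

```python
class LazyDataset:
    """Toy stand-in for a lazily evaluated dataset (hypothetical, not a Spark API)."""

    def __init__(self, source, steps=()):
        self._source = source  # the original data
        self._steps = steps    # recorded transformations, in order

    def filter(self, predicate):
        # Transformation: record the step and return a new dataset; no data is read.
        return LazyDataset(self._source, self._steps + (("filter", predicate),))

    def map(self, func):
        # Transformation: also only recorded.
        return LazyDataset(self._source, self._steps + (("map", func),))

    def _run(self):
        # Replay the recorded steps over the source, one element at a time.
        for item in self._source:
            keep = True
            for kind, fn in self._steps:
                if kind == "filter":
                    if not fn(item):
                        keep = False
                        break
                else:
                    item = fn(item)
            if keep:
                yield item

    def count(self):
        # Action: triggers execution of every recorded step.
        return sum(1 for _ in self._run())


calls = []  # records each time the predicate actually runs


def mentions_spark(line):
    calls.append(line)
    return "Spark" in line


lines = LazyDataset(["Apache Spark", "Hello", "Spark SQL"])
with_spark = lines.filter(mentions_spark)  # transformation: nothing runs yet
calls_before_action = len(calls)
matches = with_spark.count()               # action: the filter runs now
calls_after_action = len(calls)

print(calls_before_action, matches, calls_after_action)  # prints: 0 2 3
```

Because each dataset keeps the full list of recorded steps, the whole chain is known before any work starts. That is the property that gives a real engine such as Spark room to reorder or combine steps before running them.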

Azure Databricks datasets

Azure Databricks includes a variety of datasets within the workspace that you can use to learn Spark or test out algorithms. You’ll see these throughout the getting started guide. The datasets are available in the /databricks-datasets folder.

Notebooks

To access these code examples and more, import one of the following notebooks.

Apache Spark Quick Start Python notebook


Apache Spark Quick Start Scala notebook
