Visualizations

Azure Databricks supports various types of visualizations out of the box using the display and displayHTML functions.

Azure Databricks also natively supports visualization libraries in Python and R and lets you install and use third-party libraries.
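For example, assuming a Databricks Runtime that supports notebook-scoped libraries, a third-party plotting library can be installed directly from a notebook cell:

%pip install plotly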

display function

The display function supports several data and visualization types.


Data types

DataFrames

The easiest way to create a DataFrame visualization in Azure Databricks is to call display(<dataframe-name>). For example, if you have a Spark DataFrame diamonds_df of a diamonds dataset, grouped by diamond color and computing the average price, and you call:

from pyspark.sql.functions import avg
diamonds_df = spark.read.csv("/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv", header="true", inferSchema="true")

display(diamonds_df.select("color","price").groupBy("color").agg(avg("price")))

A table of diamond color versus average price is displayed:

Diamond color versus average price

Tip

If you see OK with no rendering after calling the display function, most likely the DataFrame or collection you passed in is empty.

display() supports pandas DataFrames. If you reference a pandas or Koalas DataFrame without display, the table is rendered as it would be in Jupyter.
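As a minimal sketch (the numbers are made up for illustration):

import pandas as pd

pdf = pd.DataFrame({"color": ["D", "E", "F"], "avg_price": [3170.0, 3077.0, 3725.0]})
display(pdf)  # renders an interactive table; without display, the notebook falls back to Jupyter-style rendering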

DataFrame display method

Note

Available in Databricks Runtime 7.1 and above.

PySpark, pandas, and Koalas DataFrames have a display method that calls the Azure Databricks display function. You can call it after a simple DataFrame operation, for example:

diamonds_df = spark.read.csv("/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv", header="true", inferSchema="true")
diamonds_df.select("color","price").display()

or at the end of a series of chained DataFrame operations, for example:

from pyspark.sql.functions import avg
diamonds_df = spark.read.csv("/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv", header="true", inferSchema="true")

diamonds_df.select("color","price").groupBy("color").agg(avg("price")).display()

Images

display renders columns containing image data types as rich HTML. display attempts to render image thumbnails for DataFrame columns matching the Spark ImageSchema. Thumbnail rendering works for any images successfully read in through the readImages function. For image values generated through other means, Azure Databricks supports the rendering of 1, 3, or 4 channel images (where each channel consists of a single byte), with the following constraints (a minimal construction sketch follows the list):

  • 单通道图像mode 字段必须等于 0。One-channel images : mode field must be equal to 0. heightwidthnChannels 字段必须准确描述 data 字段中的二进制图像数据height, width, and nChannels fields must accurately describe the binary image data in the data field
  • 三通道图像mode 字段必须等于 16。Three-channel images : mode field must be equal to 16. heightwidthnChannels 字段必须准确描述 data 字段中的二进制图像数据。height, width, and nChannels fields must accurately describe the binary image data in the data field. data 字段必须包含三字节区块形式的像素数据,每个像素的通道顺序为 (blue, green, red)The data field must contain pixel data in three-byte chunks, with the channel ordering (blue, green, red) for each pixel.
  • 四通道图像mode 字段必须等于 24。Four-channel images : mode field must be equal to 24. heightwidthnChannels 字段必须准确描述 data 字段中的二进制图像数据。height, width, and nChannels fields must accurately describe the binary image data in the data field. data 字段必须包含四字节区块形式的像素数据,每个像素的通道顺序为 (blue, green, red, alpha)The data field must contain pixel data in four-byte chunks, with the channel ordering (blue, green, red, alpha) for each pixel.
Example

Suppose you have a folder containing some images:

Folder of image data

If you read the images into a DataFrame with ImageSchema.readImages and then display the DataFrame, display renders thumbnails of the images:

from pyspark.ml.image import ImageSchema
image_df = ImageSchema.readImages(sample_img_dir)
display(image_df)

Display image DataFrame

Structured Streaming DataFrames

To visualize the result of a streaming query in real time, you can display a Structured Streaming DataFrame in Scala and Python.

Python
streaming_df = spark.readStream.format("rate").load()
display(streaming_df.groupBy().count())
Scala
val streaming_df = spark.readStream.format("rate").load()
display(streaming_df.groupBy().count())

display supports the following optional parameters:

  • streamName: the streaming query name.
  • trigger (Scala) and processingTime (Python): defines how often the streaming query runs. If not specified, the system checks for availability of new data as soon as the previous processing has completed. To reduce cost in production, we recommend that you always set a trigger interval.
  • checkpointLocation: the location where the system writes all the checkpoint information. If it is not specified, the system automatically generates a temporary checkpoint location on DBFS. For your stream to continue processing data from where it left off, you must provide a checkpoint location. We recommend that in production you always specify the checkpointLocation option.
Python
streaming_df = spark.readStream.format("rate").load()
display(streaming_df.groupBy().count(), processingTime = "5 seconds", checkpointLocation = "dbfs:/<checkpoint-path>")
Scala
import org.apache.spark.sql.streaming.Trigger

val streaming_df = spark.readStream.format("rate").load()
display(streaming_df.groupBy().count(), trigger = Trigger.ProcessingTime("5 seconds"), checkpointLocation = "dbfs:/<checkpoint-path>")

For more information about these parameters, see Starting Streaming Queries.

Plot types

The display function supports a rich set of plot types:

Chart types

Choose and configure a chart type

To choose a bar chart, click the bar chart icon:

Bar chart icon

To choose another plot type, click the down arrow button to the right of the bar chart icon and choose the plot type.

Chart toolbar

Both line and bar charts have a built-in toolbar that supports a rich set of client-side interactions.

Chart toolbar

To configure a chart, click Plot Options….

Plot options

The line chart has a few custom chart options: setting a Y-axis range, showing and hiding points, and displaying the Y-axis with a log scale.


Color consistency across charts

Azure Databricks supports two kinds of color consistency across charts: series set and global.

Series set color consistency assigns the same color to the same value if you have series with the same values but in different orders (for example, A = ["Apple", "Orange", "Banana"] and B = ["Orange", "Banana", "Apple"]). The values are sorted before plotting, so both legends are sorted the same way (["Apple", "Banana", "Orange"]), and the same values are given the same colors. However, a series C = ["Orange", "Banana"] would not be color consistent with set A, because the sets differ: the sorting algorithm assigns the first color to "Banana" in set C but the second color to "Banana" in set A. If you want these series to be color consistent, you can specify that charts should have global color consistency.

In global color consistency, each value is always mapped to the same color no matter what values the series have. To enable this for each chart, select the Global color consistency checkbox.

Global color consistency

Note

To achieve this consistency, Azure Databricks hashes directly from values to colors. To avoid collisions (where two values map to exactly the same color), the hash is into a large set of colors, which has the side effect that nice-looking or easily distinguishable colors cannot be guaranteed; with many colors, some are bound to look very similar.
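As a toy illustration only (not the actual Azure Databricks implementation), hashing each value into a large palette looks roughly like this:

import hashlib

PALETTE_SIZE = 1000  # a large palette reduces collisions but allows similar-looking colors

def color_index(value: str) -> int:
    # The same value always hashes to the same palette slot, regardless of which series it is in.
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16) % PALETTE_SIZE

for fruit in ["Apple", "Banana", "Orange"]:
    print(fruit, color_index(fruit))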

Machine learning visualizations

In addition to the standard chart types, the display function supports visualizations of the following machine learning training parameters and results:

Residuals

For linear and logistic regressions, display supports rendering a fitted versus residuals plot. To obtain this plot, supply the model and the DataFrame.

The following example runs a linear regression on city population versus house sale price data and then displays the residuals versus the fitted data.

from pyspark.sql.functions import col

# Load data
pop_df = spark.read.csv("/databricks-datasets/samples/population-vs-price/data_geo.csv", header="true", inferSchema="true")

# Drop rows with missing values and rename the feature and label columns, replacing spaces with _
pop_df = pop_df.dropna()
exprs = [col(column).alias(column.replace(' ', '_')) for column in pop_df.columns]

# Register a UDF to convert the feature (2014_Population_estimate) column to a VectorUDT type and apply it to the column.
from pyspark.ml.linalg import Vectors, VectorUDT

spark.udf.register("oneElementVec", lambda d: Vectors.dense([d]), returnType=VectorUDT())
tdata = pop_df.select(*exprs).selectExpr("oneElementVec(`2014_Population_estimate`) as features", "`2015_median_sales_price` as label")

# Run a linear regression
from pyspark.ml.regression import LinearRegression

lr = LinearRegression()
modelA = lr.fit(tdata, {lr.regParam:0.0})

# Plot residuals versus fitted data
display(modelA, tdata)

Display residuals

ROC curves

For logistic regressions, display supports rendering an ROC curve. To obtain this plot, supply the model, the prepped data that is input to the fit method, and the parameter "ROC".

The following example develops a classifier that predicts if an individual earns <=50K or >50K a year from various attributes of the individual. The Adult dataset derives from census data and consists of information about 48842 individuals and their annual income.

CREATE TABLE adult (
  age DOUBLE,
  workclass STRING,
  fnlwgt DOUBLE,
  education STRING,
  education_num DOUBLE,
  marital_status STRING,
  occupation STRING,
  relationship STRING,
  race STRING,
  sex STRING,
  capital_gain DOUBLE,
  capital_loss DOUBLE,
  hours_per_week DOUBLE,
  native_country STRING,
  income STRING)
USING CSV
OPTIONS (path "/databricks-datasets/adult/adult.data", header "true")
Then, in Python, read the table into a DataFrame:

dataset = spark.table("adult")

# Use One-Hot Encoding to convert all categorical variables into binary vectors.

from pyspark.ml import Pipeline
from pyspark.ml.feature import OneHotEncoderEstimator, StringIndexer, VectorAssembler
categoricalColumns = ["workclass", "education", "marital_status", "occupation", "relationship", "race", "sex", "native_country"]

stages = [] # stages in our Pipeline
for categoricalCol in categoricalColumns:
    # Category Indexing with StringIndexer
    stringIndexer = StringIndexer(inputCol=categoricalCol, outputCol=categoricalCol + "Index")
    # Use OneHotEncoder to convert categorical variables into binary SparseVectors
    encoder = OneHotEncoderEstimator(inputCols=[stringIndexer.getOutputCol()], outputCols=[categoricalCol + "classVec"])
    # Add stages.  These are not run here, but will run all at once later on.
    stages += [stringIndexer, encoder]

# Convert label into label indices using the StringIndexer
label_stringIdx = StringIndexer(inputCol="income", outputCol="label")
stages += [label_stringIdx]

# Transform all features into a vector using VectorAssembler
numericCols = ["age", "fnlwgt", "education_num", "capital_gain", "capital_loss", "hours_per_week"]
assemblerInputs = [c + "classVec" for c in categoricalColumns] + numericCols
assembler = VectorAssembler(inputCols=assemblerInputs, outputCol="features")
stages += [assembler]

# Run the stages as a Pipeline. This puts the data through all of the feature transformations in a single call.

partialPipeline = Pipeline().setStages(stages)
pipelineModel = partialPipeline.fit(dataset)
preppedDataDF = pipelineModel.transform(dataset)

# Fit logistic regression model

from pyspark.ml.classification import LogisticRegression
lrModel = LogisticRegression().fit(preppedDataDF)

# ROC for data
display(lrModel, preppedDataDF, "ROC")

Display ROC

To display the residuals, omit the "ROC" parameter:

display(lrModel, preppedDataDF)

Display residuals

Decision trees

The display function supports rendering a decision tree.

To obtain this visualization, supply the decision tree model.

The following example trains a tree to recognize digits (0 - 9) from the MNIST dataset of handwritten digit images and then displays the tree.

Python
trainingDF = spark.read.format("libsvm").load("/databricks-datasets/mnist-digits/data-001/mnist-digits-train.txt").cache()
testDF = spark.read.format("libsvm").load("/databricks-datasets/mnist-digits/data-001/mnist-digits-test.txt").cache()

from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.feature import StringIndexer
from pyspark.ml import Pipeline

indexer = StringIndexer().setInputCol("label").setOutputCol("indexedLabel")

dtc = DecisionTreeClassifier().setLabelCol("indexedLabel")

# Chain indexer + dtc together into a single ML Pipeline.
pipeline = Pipeline().setStages([indexer, dtc])

model = pipeline.fit(trainingDF)
display(model.stages[-1])
Scala
val trainingDF = spark.read.format("libsvm").load("/databricks-datasets/mnist-digits/data-001/mnist-digits-train.txt").cache
val testDF = spark.read.format("libsvm").load("/databricks-datasets/mnist-digits/data-001/mnist-digits-test.txt").cache

import org.apache.spark.ml.classification.{DecisionTreeClassifier, DecisionTreeClassificationModel}
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.ml.Pipeline

val indexer = new StringIndexer().setInputCol("label").setOutputCol("indexedLabel")
val dtc = new DecisionTreeClassifier().setLabelCol("indexedLabel")
val pipeline = new Pipeline().setStages(Array(indexer, dtc))

val model = pipeline.fit(trainingDF)
val tree = model.stages.last.asInstanceOf[DecisionTreeClassificationModel]

display(tree)

Display decision tree

displayHTML function

Azure Databricks programming language notebooks (Python, R, and Scala) support HTML graphics using the displayHTML function; you can pass the function any HTML, CSS, or JavaScript code. This function supports interactive graphics using JavaScript libraries such as D3.
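For instance, a minimal sketch that passes an inline SVG and some styled HTML to displayHTML:

displayHTML("""
<svg width="120" height="120">
  <circle cx="60" cy="60" r="50" fill="steelblue"/>
</svg>
<p style="font-family:sans-serif">Rendered inline by displayHTML</p>
""")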


Note

The displayHTML iframe is served from the domain databricksusercontent.com, and the iframe sandbox includes the allow-same-origin attribute. databricksusercontent.com must be accessible from your browser. If it is currently blocked by your corporate network, IT must add it to an allow list.

Visualizations by language


Visualizations in Python

To plot data in Python, use the display function as follows:

diamonds_df = spark.read.csv("/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv", header="true", inferSchema="true")

display(diamonds_df.groupBy("color").avg("price").orderBy("color"))

Python bar chart


Deep dive Python notebook

For a deep dive into Python visualizations using display, see the deep dive Python notebook.

Seaborn

You can also use other Python libraries to generate plots. Databricks Runtime includes the seaborn visualization library. To create a seaborn plot, import the library, create a plot, and then pass the plot to the display function.

import seaborn as sns
sns.set(style="white")

df = sns.load_dataset("iris")
g = sns.PairGrid(df, diag_sharey=False)
g.map_lower(sns.kdeplot)
g.map_diag(sns.kdeplot, lw=3)

g.map_upper(sns.regplot)

display(g.fig)

Seaborn plot


Visualizations in R

To plot data in R, use the display function as follows:

library(SparkR)
diamonds_df <- read.df("/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv", source = "csv", header="true", inferSchema = "true")

display(arrange(agg(groupBy(diamonds_df, "color"), "price" = "avg"), "color"))

You can use the default R plot function.

fit <- lm(Petal.Length ~., data = iris)
layout(matrix(c(1,2,3,4),2,2)) # optional 4 graphs/page
plot(fit)

R default plot

You can also use any R visualization package. The R notebook captures the resulting plot as a .png and displays it inline.


Lattice

The Lattice package supports trellis graphs: graphs that display a variable or the relationship between variables, conditioned on one or more other variables.

library(lattice)
xyplot(price ~ carat | cut, diamonds, scales = list(log = TRUE), type = c("p", "g", "smooth"), ylab = "Log price")

R Lattice plot

DandEFA

The DandEFA package supports dandelion plots.

install.packages("DandEFA", repos = "https://cran.us.r-project.org")
library(DandEFA)
data(timss2011)
timss2011 <- na.omit(timss2011)
dandpal <- rev(rainbow(100, start = 0, end = 0.2))
facl <- factload(timss2011,nfac=5,method="prax",cormeth="spearman")
dandelion(facl,bound=0,mcex=c(1,1.2),palet=dandpal)
facl <- factload(timss2011,nfac=8,method="mle",cormeth="pearson")
dandelion(facl,bound=0,mcex=c(1,1.2),palet=dandpal)

R DandEFA plot

Plotly

The Plotly R package relies on htmlwidgets for R. For installation instructions and a notebook, see htmlwidgets.


Visualizations in Scala

To plot data in Scala, use the display function as follows:

val diamonds_df = spark.read.format("csv").option("header","true").option("inferSchema","true").load("/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv")

display(diamonds_df.groupBy("color").avg("price").orderBy("color"))

Scala bar chart

Deep dive Scala notebook

For a deep dive into Scala visualizations using display, see the deep dive Scala notebook.

Visualizations in SQL

When you run a SQL query, Azure Databricks automatically extracts some of the data and displays it as a table.

SELECT color, avg(price) AS price FROM diamonds GROUP BY color ORDER BY color

SQL table

From there you can select different chart types.

SQL bar chart