clusterBy （DataStreamWriter）

按给定列对输出进行聚类分析。对聚类分析列具有类似值的记录在同一文件中组合在一起。聚类分析允许对聚类分析列使用谓词的查询来跳过不必要的数据，从而提高查询效率。与分区不同，聚类分析可用于高基数列。

Syntax

clusterBy(*cols)

参数

参数	类型	Description
`*cols`	str 或 list	要分类的列的名称。

退货

DataStreamWriter

示例

df = spark.readStream.format("rate").load()
df.writeStream.clusterBy("value")
# <...streaming.readwriter.DataStreamWriter object ...>

按时间戳对 Rate 源流进行群集，并写入 Parquet：

import tempfile
import time
with tempfile.TemporaryDirectory(prefix="clusterBy1") as d:
    with tempfile.TemporaryDirectory(prefix="clusterBy2") as cp:
        df = spark.readStream.format("rate").option("rowsPerSecond", 10).load()
        q = df.writeStream.clusterBy(
            "timestamp").format("parquet").option("checkpointLocation", cp).start(d)
        time.sleep(5)
        q.stop()
        spark.read.schema(df.schema).parquet(d).show()

Last updated on 2026-06-08

clusterBy （DataStreamWriter）

Syntax

参数

退货

示例

其他资源