partitionBy (DataStreamWriter)

Partitions the output by the given columns on the file system. The output layout is similar to Hive's partitioning scheme.
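To make the Hive-style layout concrete, here is a minimal sketch in plain Python of how a row maps to a `column=value` directory. The helper `hive_partition_path`, the base path `/data/out`, and the sample row are hypothetical and are not part of the Spark API. The sketch also ignores details that Spark handles itself, such as escaping special characters in partition values.

```python
from pathlib import PurePosixPath


def hive_partition_path(base, row, partition_cols):
    """Return the Hive-style directory for `row` (hypothetical helper, not Spark API)."""
    path = PurePosixPath(base)
    # Each partition column adds one nested "name=value" directory level,
    # in the order the columns were given.
    for col in partition_cols:
        path = path / f"{col}={row[col]}"
    return str(path)


# Assumed sample row; the rate source produces "timestamp" and "value" columns.
row = {"timestamp": "2024-01-01", "value": 7}
print(hive_partition_path("/data/out", row, ["value"]))
# /data/out/value=7
```

With several partition columns the directories nest in the order the columns are passed, so the first column becomes the outermost directory level.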

Syntax

partitionBy(*cols)

Parameters

Parameter    Type           Description
*cols        str or list    Names of the columns to partition by.
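Because `*cols` is documented as `str` or `list`, both `partitionBy("a", "b")` and `partitionBy(["a", "b"])` are valid calls. The following is a sketch of how a varargs signature can accept both forms. The function `normalize_cols` is a hypothetical stand-in written for illustration; it is not the library's code.

```python
def normalize_cols(*cols):
    """Hypothetical helper: accept column names as varargs or as one list/tuple."""
    # A single list or tuple argument is unpacked into individual names.
    if len(cols) == 1 and isinstance(cols[0], (list, tuple)):
        cols = tuple(cols[0])
    return list(cols)


print(normalize_cols("year", "month"))    # ['year', 'month']
print(normalize_cols(["year", "month"]))  # ['year', 'month']
```

Either calling style yields the same ordered list of column names, so the choice between them is a matter of convenience.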

Returns

DataStreamWriter

Examples

df = spark.readStream.format("rate").load()
df.writeStream.partitionBy("value")
# <...streaming.readwriter.DataStreamWriter object ...>

Partition a Rate source stream by timestamp and write it to Parquet:

import tempfile
import time

# Temporary directories for the output (d) and the checkpoint (cp).
with tempfile.TemporaryDirectory(prefix="partitionBy1") as d:
    with tempfile.TemporaryDirectory(prefix="partitionBy2") as cp:
        df = spark.readStream.format("rate").option("rowsPerSecond", 10).load()
        q = (
            df.writeStream.partitionBy("timestamp")
            .format("parquet")
            .option("checkpointLocation", cp)
            .start(d)
        )
        time.sleep(5)  # let the query run briefly
        q.stop()
        # Read the partitioned Parquet output back as a batch DataFrame.
        spark.read.schema(df.schema).parquet(d).show()