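Note
Databricks recommends using transformWithState to build custom stateful applications. See Build a custom stateful application.
This article contains information about features that support mapGroupsWithState and flatMapGroupsWithState. For more details about these operators, see the linked reference.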
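Specify initial state for mapGroupsWithState
You can use flatMapGroupsWithState or mapGroupsWithState to specify a user-defined initial state for Structured Streaming stateful processing. This lets you avoid reprocessing data when you start a stateful stream without a valid checkpoint.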
def mapGroupsWithState[S: Encoder, U: Encoder](
    timeoutConf: GroupStateTimeout,
    initialState: KeyValueGroupedDataset[K, S])(
    func: (K, Iterator[V], GroupState[S]) => U): Dataset[U]

def flatMapGroupsWithState[S: Encoder, U: Encoder](
    outputMode: OutputMode,
    timeoutConf: GroupStateTimeout,
    initialState: KeyValueGroupedDataset[K, S])(
    func: (K, Iterator[V], GroupState[S]) => Iterator[U]): Dataset[U]
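To make the effect of an initial state concrete, the following is a minimal plain-Scala sketch that models per-key counting with and without a seeded state. It does not use Spark at all. The names InitialStateModel and processBatch, and the RunningCount case class definition, are assumptions made for illustration only.

```scala
// Plain-Scala model of seeding per-key state; this is not Spark code.
// RunningCount is an assumed definition with a single `count` field.
final case class RunningCount(count: Long)

object InitialStateModel {
  // Fold one micro-batch of fruit names into the per-key state.
  def processBatch(
      state: Map[String, RunningCount],
      batch: Seq[String]): Map[String, RunningCount] =
    batch.groupBy(identity).foldLeft(state) { case (acc, (key, values)) =>
      val previous = acc.get(key).map(_.count).getOrElse(0L)
      acc.updated(key, RunningCount(previous + values.size))
    }

  // Hypothetical initial state, mirroring the fruit counts used in this article.
  val initialState: Map[String, RunningCount] = Map(
    "apple" -> RunningCount(1),
    "orange" -> RunningCount(2),
    "mango" -> RunningCount(5))

  def main(args: Array[String]): Unit = {
    val batch = Seq("apple", "apple", "kiwi")
    // With a seeded state, counts continue from the initial values.
    println(processBatch(initialState, batch)("apple").count) // 3
    // Without one, counting starts from zero.
    println(processBatch(Map.empty, batch)("apple").count) // 2
  }
}
```

In this model the seeded run continues from the stored counts, which is the behavior the initialState parameter is meant to provide when no valid checkpoint exists.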
The following example use case specifies an initial state for the flatMapGroupsWithState operator:
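// Assumes `import spark.implicits._`, `case class RunningCount(count: Long)`,
// and a streaming `fruitStream: Dataset[String]` are in scope (not shown in the original).
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}

val fruitCountFunc = (key: String, values: Iterator[String], state: GroupState[RunningCount]) => {
  // Add the number of new values for this key to the previously stored count.
  val count = state.getOption.map(_.count).getOrElse(0L) + values.size
  state.update(new RunningCount(count))
  Iterator((key, count.toString))
}

val fruitCountInitialDS: Dataset[(String, RunningCount)] = Seq(
  ("apple", new RunningCount(1)),
  ("orange", new RunningCount(2)),
  ("mango", new RunningCount(5))
).toDS()

// Key the initial state by fruit name and keep only the RunningCount as the value.
val fruitCountInitial = fruitCountInitialDS.groupByKey(x => x._1).mapValues(_._2)

fruitStream
  .groupByKey(x => x)
  .flatMapGroupsWithState(OutputMode.Update, GroupStateTimeout.NoTimeout, fruitCountInitial)(fruitCountFunc)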
The following example use case specifies an initial state for the mapGroupsWithState operator:
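// Assumes `import spark.implicits._`, `case class RunningCount(count: Long)`,
// and a streaming `fruitStream: Dataset[String]` are in scope (not shown in the original).
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout}

val fruitCountFunc = (key: String, values: Iterator[String], state: GroupState[RunningCount]) => {
  // Add the number of new values for this key to the previously stored count.
  val count = state.getOption.map(_.count).getOrElse(0L) + values.size
  state.update(new RunningCount(count))
  (key, count.toString)
}

val fruitCountInitialDS: Dataset[(String, RunningCount)] = Seq(
  ("apple", new RunningCount(1)),
  ("orange", new RunningCount(2)),
  ("mango", new RunningCount(5))
).toDS()

// Key the initial state by fruit name and keep only the RunningCount as the value.
val fruitCountInitial = fruitCountInitialDS.groupByKey(x => x._1).mapValues(_._2)

fruitStream
  .groupByKey(x => x)
  .mapGroupsWithState(GroupStateTimeout.NoTimeout, fruitCountInitial)(fruitCountFunc)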
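Test the mapGroupsWithState update function
The TestGroupState API lets you test the state update function used for Dataset.groupByKey(...).mapGroupsWithState(...) and Dataset.groupByKey(...).flatMapGroupsWithState(...).
The state update function takes the previous state as input through an object of type GroupState. See the Apache Spark GroupState reference documentation. For example: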
import org.apache.spark.api.java.Optional
import org.apache.spark.sql.streaming._

test("flatMapGroupsWithState's state update function") {
  // Build the previous state passed to the update function under test.
  val prevState = TestGroupState.create[UserStatus](
    optionalState = Optional.empty[UserStatus],
    timeoutConf = GroupStateTimeout.EventTimeTimeout,
    batchProcessingTimeMs = 1L,
    eventTimeWatermarkMs = Optional.of(1L),
    hasTimedOut = false)

  val userId: String = ...
  val actions: Iterator[UserAction] = ...

  // The state starts without updates.
  assert(!prevState.isUpdated)

  updateState(userId, actions, prevState)

  // TestGroupState exposes isUpdated (and isRemoved) for assertions.
  assert(prevState.isUpdated)
}
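The test above leaves UserStatus, UserAction, and updateState undefined. The following is a minimal sketch of how those pieces might fit together, assuming Spark 3.2 or above on the classpath. The two case classes, their fields, and the counting logic in updateState are hypothetical and chosen only for illustration. The sketch uses NoTimeout rather than EventTimeTimeout to keep it short, and plain assert calls instead of a test framework.

```scala
import org.apache.spark.api.java.Optional
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, TestGroupState}

// Hypothetical domain types; the article does not define them.
final case class UserAction(action: String)
final case class UserStatus(userId: String, actionCount: Long)

object UpdateStateSketch {
  // Hypothetical state update function: counts the actions seen per user.
  def updateState(
      userId: String,
      actions: Iterator[UserAction],
      state: GroupState[UserStatus]): UserStatus = {
    val previous = state.getOption.map(_.actionCount).getOrElse(0L)
    val updated = UserStatus(userId, previous + actions.size)
    state.update(updated)
    updated
  }

  def main(args: Array[String]): Unit = {
    // An empty previous state, as the engine would supply for a new key.
    val prevState = TestGroupState.create[UserStatus](
      optionalState = Optional.empty[UserStatus],
      timeoutConf = GroupStateTimeout.NoTimeout,
      batchProcessingTimeMs = 1L,
      eventTimeWatermarkMs = Optional.of(1L),
      hasTimedOut = false)

    assert(!prevState.isUpdated)

    val actions = Iterator(UserAction("login"), UserAction("click"))
    val result = updateState("user-1", actions, prevState)

    // The function should have stored a new state without removing it.
    assert(prevState.isUpdated)
    assert(!prevState.isRemoved)
    assert(prevState.get == UserStatus("user-1", 2L))
    assert(result.actionCount == 2L)
  }
}
```

Because TestGroupState.create builds the state object directly, a check like this needs no SparkSession or running stream, which keeps unit tests of the update logic fast.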