diffpatterns_text plugin

Compares two data sets of string values and finds text patterns that characterize the differences between them.

T | evaluate diffpatterns_text(TextColumn, BooleanCondition)

diffpatterns_text returns a set of text patterns that capture different portions of the data in the two sets (i.e. a pattern that captures a large percentage of the rows when the condition is true and a low percentage of the rows when the condition is false). The patterns are built from consecutive tokens (separated by white space), where each token is either taken from the text column or is a * representing a wildcard. Each pattern is represented by a row in the results.

Syntax

T | evaluate diffpatterns_text(TextColumn, BooleanCondition [, MinTokens, Threshold, MaxTokens])

Arguments

Required arguments

  • TextColumn - column_name

    The text column to analyze. It must be of type string.

  • BooleanCondition - Boolean expression

    Defines how to split the input table into the two record subsets to compare. The algorithm splits the query result into two data sets, "True" and "False", according to the condition, then analyzes the (text) differences between them.

Optional arguments

All other arguments are optional, but they must be given in the order below.

  • MinTokens - 0 < int < 200 [default: 1]

    Sets the minimum number of non-wildcard tokens per result pattern.

  • Threshold - 0.015 < double < 1 [default: 0.05]

    Sets the minimum pattern (ratio) difference between the two sets (see diffpatterns).

  • MaxTokens - 0 < int [default: 20]

    Sets the maximum number of tokens (from the beginning) per result pattern. Specifying a lower limit decreases the query runtime.
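
As a sketch of the ordering rule, the call below passes all three optional arguments positionally. The values 2, 0.1, and 10 are illustrative assumptions chosen only to fall inside the documented ranges, and StormEvents is the sample table used in the example on this page.

```kusto
// Illustrative values only: MinTokens = 2, Threshold = 0.1, MaxTokens = 10.
StormEvents
| where EventType == "Drought" or EventType == "Extreme Cold/Wind Chill"
| evaluate diffpatterns_text(EpisodeNarrative, EventType == "Extreme Cold/Wind Chill", 2, 0.1, 10)
```

Because the arguments are positional, setting only MaxTokens presumably still requires supplying MinTokens and Threshold first, for example with their defaults (1 and 0.05).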

Returns

diffpatterns_text returns the following columns:

  • Count_of_True: The number of rows matching the pattern when the condition is true.
  • Count_of_False: The number of rows matching the pattern when the condition is false.
  • Percent_of_True: The percentage of the rows for which the condition is true that match the pattern.
  • Percent_of_False: The percentage of the rows for which the condition is false that match the pattern.
  • Pattern: The text pattern containing tokens from the text string and '*' for wildcards.
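
The result is a table, so it can presumably be piped into further query operators. A minimal sketch, assuming the column names listed above and the StormEvents sample table, that keeps only the patterns more common in the "True" set:

```kusto
StormEvents
| where EventType == "Drought" or EventType == "Extreme Cold/Wind Chill"
| evaluate diffpatterns_text(EpisodeNarrative, EventType == "Extreme Cold/Wind Chill", 2)
// Keep patterns that characterize the rows where the condition is true.
| where Percent_of_True > Percent_of_False
| project Pattern, Count_of_True, Percent_of_True
| sort by Percent_of_True desc
```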

Note

The patterns aren't necessarily distinct and may not provide full coverage of the data set. The patterns may overlap, and some rows may not match any pattern.

Example

StormEvents     
| where EventNarrative != "" and monthofyear(StartTime) > 1 and monthofyear(StartTime) < 9
| where EventType == "Drought" or EventType == "Extreme Cold/Wind Chill"
| evaluate diffpatterns_text(EpisodeNarrative, EventType == "Extreme Cold/Wind Chill", 2)

Count_of_True  Count_of_False  Percent_of_True  Percent_of_False  Pattern
11             0               6.29             0                 Winds shifting northwest in * wake * a surface trough brought heavy lake effect snowfall downwind * Lake Superior from
9              0               5.14             0                 Canadian high pressure settled * * region * produced the coldest temperatures since February * 2006. Durations * freezing temperatures
0              34              0                6.24              * * * * * * * * * * * * * * * * * * West Tennessee,
0              42              0                7.71              * * * * * * caused * * * * * * * * across western Colorado. *
0              45              0                8.26              * * below normal *
0              110             0                20.18             Below normal *