Note
Access to this page requires authorization. You can try signing in or changing directories.
Access to this page requires authorization. You can try changing directories.
Switch services using the Version drop-down list. Learn more about navigation.
Applies to: ✅ Azure Data Explorer ✅ Azure Monitor ✅ Microsoft Sentinel
It's often useful to join between two large datasets on some high-cardinality key, such as an operation ID or a session ID, and further limit the right-hand-side ($right) records that need to match up with each left-hand-side ($left) record by adding a restriction on the "time-distance" between datetime columns on the left and on the right.
The above operation differs from the usual join operation, since for the equi-join part of matching the high-cardinality key between the left and right datasets, the system can also apply a distance function and use it to considerably speed up the join.
Note
A distance function doesn't behave like equality (that is, when both dist(x,y) and dist(y,z) are true it doesn't follow that dist(x,z) is also true.) This is sometimes referred to as a "diagonal join".
Example to identify event sequences without time window
To identify event sequences within a relatively small time window, this example uses a table T with the following schema:
SessionId: A column of typestringwith correlation IDs.EventType: A column of typestringthat identifies the event type of the record.Timestamp: A column of typedatetimeindicates when the event described by the record happened.
| SessionId | EventType | Timestamp |
|---|---|---|
| 0 | A | 2017-10-01T00:00:00Z |
| 0 | B | 2017-10-01T00:01:00Z |
| 1 | B | 2017-10-01T00:02:00Z |
| 1 | A | 2017-10-01T00:03:00Z |
| 3 | A | 2017-10-01T00:04:00Z |
| 3 | B | 2017-10-01T00:10:00Z |
The following query creates the dataset and then identifies all the session IDs in which event type A was followed by an event type B within a 1min time window.
let T = datatable(SessionId:string, EventType:string, Timestamp:datetime)
[
'0', 'A', datetime(2017-10-01 00:00:00),
'0', 'B', datetime(2017-10-01 00:01:00),
'1', 'B', datetime(2017-10-01 00:02:00),
'1', 'A', datetime(2017-10-01 00:03:00),
'3', 'A', datetime(2017-10-01 00:04:00),
'3', 'B', datetime(2017-10-01 00:10:00),
];
T
| where EventType == 'A'
| project SessionId, Start=Timestamp
| join kind=inner
(
T
| where EventType == 'B'
| project SessionId, End=Timestamp
) on SessionId
| where (End - Start) between (0min .. 1min)
| project SessionId, Start, End
Output
| SessionId | Start | End |
|---|---|---|
| 0 | 2017-10-01 00:00:00.0000000 | 2017-10-01 00:01:00.0000000 |
Example optimized with time window
To optimize this query, we can rewrite it to account for the time window. THe time window is expressed as a join key. Rewrite the query so that the datetime values are "discretized" into buckets whose size is half the size of the time window. Use equi-join to compare the bucket IDs.
The query finds pairs of events within the same session (SessionId) where an 'A' event is followed by a 'B' event within 1 minute. It projects the session ID, the start time of the 'A' event, and the end time of the 'B' event.
let T = datatable(SessionId:string, EventType:string, Timestamp:datetime)
[
'0', 'A', datetime(2017-10-01 00:00:00),
'0', 'B', datetime(2017-10-01 00:01:00),
'1', 'B', datetime(2017-10-01 00:02:00),
'1', 'A', datetime(2017-10-01 00:03:00),
'3', 'A', datetime(2017-10-01 00:04:00),
'3', 'B', datetime(2017-10-01 00:10:00),
];
let lookupWindow = 1min;
let lookupBin = lookupWindow / 2.0;
T
| where EventType == 'A'
| project SessionId, Start=Timestamp, TimeKey = bin(Timestamp, lookupBin)
| join kind=inner
(
T
| where EventType == 'B'
| project SessionId, End=Timestamp,
TimeKey = range(bin(Timestamp-lookupWindow, lookupBin),
bin(Timestamp, lookupBin),
lookupBin)
| mv-expand TimeKey to typeof(datetime)
) on SessionId, TimeKey
| where (End - Start) between (0min .. lookupWindow)
| project SessionId, Start, End
Output
| SessionId | Start | End |
|---|---|---|
| 0 | 2017-10-01 00:00:00.0000000 | 2017-10-01 00:01:00.0000000 |
5 million data query
The next query emulates an extensive dataset of 5M records and approximately 1M Session IDs and runs the query with the time window technique.
let T = range x from 1 to 5000000 step 1
| extend SessionId = rand(1000000), EventType = rand(3), Time=datetime(2017-01-01)+(x * 10ms)
| extend EventType = case(EventType < 1, "A",
EventType < 2, "B",
"C");
let lookupWindow = 1min;
let lookupBin = lookupWindow / 2.0;
T
| where EventType == 'A'
| project SessionId, Start=Time, TimeKey = bin(Time, lookupBin)
| join kind=inner
(
T
| where EventType == 'B'
| project SessionId, End=Time,
TimeKey = range(bin(Time-lookupWindow, lookupBin),
bin(Time, lookupBin),
lookupBin)
| mv-expand TimeKey to typeof(datetime)
) on SessionId, TimeKey
| where (End - Start) between (0min .. lookupWindow)
| project SessionId, Start, End
| count
Output
| Count |
|---|
| 3344 |