填充时间间隙和插补缺失值Filling time gaps and imputing missing values

处理时序数据时,时序数据可能经常会缺少属性值。When dealing with time series data, it's often possible that the time series data has missing values for the attributes. 由于数据的本性,或者由于数据收集过程中出现中断情况,数据集中可能会存在时间间隙。It's also possible that, because of the nature of the data, or because of interruptions in data collection, there are time gaps in the dataset.

例如,收集智能设备的能源使用情况统计信息时,如果设备不运行,则使用情况统计信息中会出现间隙。For example, when collecting energy usage statistics for a smart device, whenever the device isn't operational there will be gaps in the usage statistics. 同样,在计算机遥测数据收集方案中,可能会将不同的传感器配置为以不同的频率发出数据,从而导致传感器缺少值。Similarly, in a machine telemetry data collection scenario, it's possible that the different sensors are configured to emit data at different frequencies, resulting in missing values for the sensors. 例如,如果有两个传感器(电压和压力),分别配置为 100 Hz 和 10 Hz 频率,则电压传感器将每隔 1/100 秒发出一次数据,而压力传感器将每隔 1/10 秒发出一次数据。For example, if there are two sensors, voltage and pressure, configured at 100 Hz and 10-Hz frequency respectively, the voltage sensor will emit data every one-hundredth of a second, while the pressure sensor will only emit data every one-tenth of a second.

下表描述了按一秒的时间间隔收集的计算机遥测数据集。The following table describes a machine telemetry dataset, which was collected at a one-second interval.

timestamp               VoltageReading  PressureReading
----------------------- --------------- ----------------
2020-09-07 06:14:41.000 164.990400      97.223600
2020-09-07 06:14:42.000 162.241300      93.992800
2020-09-07 06:14:43.000 163.271200      NULL
2020-09-07 06:14:44.000 161.368100      93.403700
2020-09-07 06:14:45.000 NULL            NULL
2020-09-07 06:14:46.000 NULL            98.364800
2020-09-07 06:14:49.000 NULL            94.098300
2020-09-07 06:14:51.000 157.695700      103.359100
2020-09-07 06:14:52.000 157.019200      NULL
2020-09-07 06:14:54.000 NULL            95.352000
2020-09-07 06:14:56.000 159.183500      100.748200

前面的数据集有两个重要特征。There are two important characteristics of the preceding dataset.

  • 数据集未包含与多个时间戳(2020-09-07 06:14:47.0002020-09-07 06:14:48.0002020-09-07 06:14:50.0002020-09-07 06:14:53.0002020-09-07 06:14:55.000)相关的任何数据点。The dataset doesn't contain any data points related to several timestamps 2020-09-07 06:14:47.000, 2020-09-07 06:14:48.000, 2020-09-07 06:14:50.000, 2020-09-07 06:14:53.000, and 2020-09-07 06:14:55.000. 这些时间戳是数据集中的间隙。These timestamps are gaps in the dataset.
  • 对于电压和压力读数,存在缺失值(表示为 null)。There are missing values, represented as null, for the Voltage and pressure readings.

间隙填充Gap filling

间隙填充是一种技术,有助于创建连续的、有序的时间戳集来简化时序数据分析。Gap filling is a technique that helps create contiguous, ordered set of timestamps to ease the analysis of time series data. 在 Azure SQL Edge 中,若要填充时序数据集中的间隙,最简单的方法是定义一个包含所需时间分布的临时表,然后对数据集表执行 LEFT OUTER JOINRIGHT OUTER JOIN 操作。In Azure SQL Edge, the easiest way to fill gaps in the time series dataset is to define a temporary table with the desired time distribution and then do a LEFT OUTER JOIN or a RIGHT OUTER JOIN operation on the dataset table.

以上面提供的 MachineTelemetry 数据为例,可以使用以下查询来生成连续的、有序的时间戳集,用于进行分析。Taking the MachineTelemetry data represented above as an example, the following query can be used to generate contiguous, ordered set of timestamps for analysis.

备注

下面的查询将生成缺失的行,其中包含属性的时间戳值和 null 值。The query below generates the missing rows, with the timestamp values and null values for the attributes.

Create Table #SeriesGenerate(dt datetime Primary key Clustered)
GO

Declare @startdate datetime = '2020-09-07 06:14:41.000', @endtime datetime = '2020-09-07 06:14:56.000'
While (@startdate <= @endtime)
BEGIN
Insert into #SeriesGenerate values (@startdate)
set @startdate = DATEADD(SECOND, 1, @startdate)
END

Select a.dt as timestamp, b.VoltageReading, b.PressureReading 
From 
#SeriesGenerate a LEFT OUTER JOIN MachineTelemetry b 
    on a.dt = b.[timestamp]

以上查询生成以下输出,其中包含指定范围中的所有“一秒”时间戳。The above query produces the following output containing all one-second timestamps in the specified range.

下面是结果集Here is the Result Set


timestamp               VoltageReading    PressureReading
----------------------- ----------------- ----------------
2020-09-07 06:14:41.000 164.990400        97.223600
2020-09-07 06:14:42.000 162.241300        93.992800
2020-09-07 06:14:43.000 163.271200        NULL
2020-09-07 06:14:44.000 161.368100        93.403700
2020-09-07 06:14:45.000 NULL              NULL
2020-09-07 06:14:46.000 NULL              98.364800
2020-09-07 06:14:47.000 NULL              NULL
2020-09-07 06:14:48.000 NULL              NULL
2020-09-07 06:14:49.000 NULL              94.098300
2020-09-07 06:14:50.000 NULL              NULL
2020-09-07 06:14:51.000 157.695700        103.359100
2020-09-07 06:14:52.000 157.019200        NULL
2020-09-07 06:14:53.000 NULL              NULL
2020-09-07 06:14:54.000 NULL              95.352000
2020-09-07 06:14:55.000 NULL              NULL
2020-09-07 06:14:56.000 159.183500        100.748200

输入缺失值Imputing missing values

上面的查询为数据分析生成了缺失的时间戳,但它未替换 voltagepressure 读数的任何缺失值(表示为 null)。The preceding query generated the missing timestamps for data analysis, however it did not replace any of the missing values (represented as null) for voltage and pressure readings. 在 Azure SQL Edge 中,T-SQL LAST_VALUE()FIRST_VALUE() 函数中添加了一个新的语法,这些函数提供了相关机制,可以根据数据集中前面或后面的值来插补缺失值。In Azure SQL Edge, a new syntax was added to the T-SQL LAST_VALUE() and FIRST_VALUE() functions, which provide mechanisms to impute missing values, based on the preceding or following values in the dataset.

新语法在 LAST_VALUE()FIRST_VALUE() 函数中添加了 IGNORE NULLSRESPECT NULLS 子句。The new syntax adds IGNORE NULLS and RESPECT NULLS clause to the LAST_VALUE() and FIRST_VALUE() functions. 以下针对 MachineTelemetry 数据集的查询使用 last_value 函数来计算缺失值,将缺失值替换为数据集中最后观察到的值。A following query on the MachineTelemetry dataset computes the missing values using the last_value function, where missing values are replaced with the last observed value in the dataset.

Select 
    timestamp,
    VoltageReading As OriginalVoltageValues,
    LAST_VALUE(VoltageReading) IGNORE NULLS OVER (ORDER BY timestamp) As ImputedUsingLastValue, 
    PressureReading As OriginalPressureValues,
    LAST_VALUE(PressureReading) IGNORE NULLS OVER (ORDER BY timestamp) As ImputedUsingLastValue
From 
MachineTelemetry 
order by timestamp 

下面是结果集Here is the Result Set


timestamp               OrigVoltageVals  ImputedVoltage OrigPressureVals  ImputedPressure
----------------------- ---------------- -------------- ----------------- ----------------
2020-09-07 06:14:41.000 164.990400       164.990400     97.223600         97.223600
2020-09-07 06:14:42.000 162.241300       162.241300     93.992800         93.992800
2020-09-07 06:14:43.000 163.271200       163.271200     NULL              93.992800
2020-09-07 06:14:44.000 161.368100       161.368100     93.403700         93.403700
2020-09-07 06:14:45.000 NULL             161.368100     NULL              93.403700
2020-09-07 06:14:46.000 NULL             161.368100     98.364800         98.364800
2020-09-07 06:14:49.000 NULL             161.368100     94.098300         94.098300
2020-09-07 06:14:51.000 157.695700       157.695700     103.359100        103.359100
2020-09-07 06:14:52.000 157.019200       157.019200     NULL              103.359100
2020-09-07 06:14:54.000 NULL             157.019200     95.352000         95.352000
2020-09-07 06:14:56.000 159.183500       159.183500     100.748200        100.748200

以下查询使用 LAST_VALUE()FIRST_VALUE 函数来插补缺失值。The following query imputes the missing values using both the LAST_VALUE() and the FIRST_VALUE function. 对于输出列 ImputedVoltage,缺失值将替换为最后观察到的值,而对于输出列 ImputedPressure,缺失值将替换为数据集中下一个观察到的值。For, the output column ImputedVoltage the missing values are replaced by the last observed value, while for the output column ImputedPressure the missing values are replaced by the next observed value in the dataset.

Select 
    dt as timestamp, 
    VoltageReading As OrigVoltageVals,
    LAST_VALUE(VoltageReading) IGNORE NULLS OVER (ORDER BY dt) As ImputedVoltage, 
    PressureReading As OrigPressureVals,
    First_VALUE(PressureReading) IGNORE NULLS OVER (ORDER BY dt ROWS 
                    BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) As ImputedPressure
From 
(Select a.dt, b.VoltageReading,b.PressureReading  from 
    #SeriesGenerate a 
        LEFT OUTER JOIN 
    MachineTelemetry b 
        on a.dt = b.[timestamp]) A
order by timestamp

下面是结果集Here is the Result Set


timestamp               OrigVoltageVals  ImputedVoltage  OrigPressureVals  ImputedPressure
----------------------- ---------------- --------------- ----------------- ---------------
2020-09-07 06:14:41.000 164.990400       164.990400      97.223600         97.223600
2020-09-07 06:14:42.000 162.241300       162.241300      93.992800         93.992800
2020-09-07 06:14:43.000 163.271200       163.271200      NULL              93.403700
2020-09-07 06:14:44.000 161.368100       161.368100      93.403700         93.403700
2020-09-07 06:14:45.000 NULL             161.368100      NULL              98.364800
2020-09-07 06:14:46.000 NULL             161.368100      98.364800         98.364800
2020-09-07 06:14:47.000 NULL             161.368100      NULL              94.098300
2020-09-07 06:14:48.000 NULL             161.368100      NULL              94.098300
2020-09-07 06:14:49.000 NULL             161.368100      94.098300         94.098300
2020-09-07 06:14:50.000 NULL             161.368100      NULL              103.359100
2020-09-07 06:14:51.000 157.695700       157.695700      103.359100        103.359100
2020-09-07 06:14:52.000 157.019200       157.019200      NULL              95.352000
2020-09-07 06:14:53.000 NULL             157.019200      NULL              95.352000
2020-09-07 06:14:54.000 NULL             157.019200      95.352000         95.352000
2020-09-07 06:14:55.000 NULL             157.019200      NULL              100.748200
2020-09-07 06:14:56.000 159.183500       159.183500      100.748200        100.748200

备注

上面的查询使用 FIRST_VALUE() 函数将缺失值替换为下一个观察到的值。The above query uses the FIRST_VALUE() function to replace missing values with the next observed value. 可以通过将 LAST_VALUE() 函数与 ORDER BY <ordering_column> DESC 子句结合使用来获得相同的结果。The same result can be achieved by using the LAST_VALUE() function with a ORDER BY <ordering_column> DESC clause.

后续步骤Next steps