Auto-train a time-series forecast model

APPLIES TO: Basic edition, Enterprise edition

In this article, you learn how to configure and train a time-series forecasting regression model using automated machine learning in the Azure Machine Learning Python SDK.

For a low-code experience, see the Tutorial: Forecast demand with automated machine learning for a time-series forecasting example using automated machine learning in the Azure Machine Learning studio.

Configuring a forecasting model is similar to setting up a standard regression model using automated machine learning, but certain configuration options and pre-processing steps exist for working with time-series data.

For example, you can configure how far into the future the forecast should extend (the forecast horizon), as well as lags and more. Automated ML learns a single, but often internally branched, model for all items in the dataset and prediction horizons. More data is thus available to estimate model parameters, and generalization to unseen series becomes possible.

The following examples show you how to:

  • Prepare data for time-series modeling
  • Configure specific time-series parameters in an AutoMLConfig object
  • Run predictions with time-series data

Unlike classical time-series methods, in automated ML past time-series values are "pivoted" to become additional dimensions for the regressor together with other predictors. This approach incorporates multiple contextual variables and their relationship to one another during training. Since multiple factors can influence a forecast, this method aligns well with real-world forecasting scenarios. For example, when forecasting sales, interactions of historical trends, exchange rate, and price all jointly drive the sales outcome.
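The "pivoting" described above can be pictured as turning earlier target values into extra columns of each training row. The following is a minimal sketch in plain Python, not the featurization automated ML actually performs; the helper name add_lag_features and the sample values are hypothetical.

```python
def add_lag_features(values, lags):
    """Return one row per time step: lagged targets (or None) plus the target."""
    rows = []
    for t, target in enumerate(values):
        # A lag of k looks k steps back; early rows have no history yet.
        row = {f"lag_{k}": (values[t - k] if t - k >= 0 else None) for k in lags}
        row["target"] = target
        rows.append(row)
    return rows


sales = [2000, 2300, 2100, 2400, 2450]  # hypothetical daily sales
rows = add_lag_features(sales, lags=[1, 2])
for row in rows:
    print(row)
```

Each row now carries its own recent history, so a regressor can use those values alongside other predictors such as price or exchange rate.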

Features extracted from the training data play a critical role. Automated ML also performs standard pre-processing steps and generates additional time-series features to capture seasonal effects and maximize predictive accuracy.

Time-series and deep learning models

Automated ML's deep learning allows for forecasting univariate and multivariate time-series data.

Deep learning models have three intrinsic capabilities:

  1. They can learn arbitrary mappings from inputs to outputs
  2. They support multiple inputs and outputs
  3. They can automatically extract patterns in input data that spans long sequences

Given larger data, deep learning models, such as Microsoft's ForecastTCN, can improve the scores of the resulting model. Learn how to configure your experiment for deep learning.

Automated ML provides users with both native time-series and deep learning models as part of the recommendation system.

Models | Description | Benefits
Prophet (Preview) | Prophet works best with time series that have strong seasonal effects and several seasons of historical data. To use this model, install it locally with pip install fbprophet. | Accurate and fast; robust to outliers, missing data, and dramatic changes in your time series.
Auto-ARIMA (Preview) | AutoRegressive Integrated Moving Average (ARIMA) performs best when the data is stationary. This means that its statistical properties, such as the mean and variance, are constant over the entire set. For example, if you flip a coin, the probability of getting heads is 50%, regardless of whether you flip today, tomorrow, or next year. | Great for univariate series, since the past values are used to predict the future values.
ForecastTCN (Preview) | ForecastTCN is a neural network model designed to tackle the most demanding forecasting tasks, capturing nonlinear local and global trends in your data as well as relationships between time series. | Capable of leveraging complex trends in your data and readily scales to the largest of datasets.

Prerequisites

  • An Azure Machine Learning workspace. To create the workspace, see Create an Azure Machine Learning workspace.
  • This article assumes basic familiarity with setting up an automated machine learning experiment. Follow the tutorial or how-to to see the basic automated machine learning experiment design patterns.

Preparing data

The most important difference between a forecasting regression task type and a regression task type within automated machine learning is that the forecasting data must include a feature that represents a valid time series. A regular time series has a well-defined and consistent frequency and has a value at every sample point in a continuous time span. Consider the following snapshot of a file sample.csv.

day_datetime,store,sales_quantity,week_of_year
9/3/2018,A,2000,36
9/3/2018,B,600,36
9/4/2018,A,2300,36
9/4/2018,B,550,36
9/5/2018,A,2100,36
9/5/2018,B,650,36
9/6/2018,A,2400,36
9/6/2018,B,700,36
9/7/2018,A,2450,36
9/7/2018,B,650,36

This data set is a simple example of daily sales data for a company that has two different stores, A and B. Additionally, there is a feature for week_of_year that allows the model to detect weekly seasonality. The field day_datetime represents a clean time series with daily frequency, and the field sales_quantity is the target column for running predictions. Read the data into a Pandas dataframe, then use the to_datetime function to ensure the time series is a datetime type.

import pandas as pd
data = pd.read_csv("sample.csv")
data["day_datetime"] = pd.to_datetime(data["day_datetime"])
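Because a regular time series needs a value at every sample point, it can help to check each series for gaps before training. The sketch below uses only the Python standard library to stay self-contained; find_gaps is a hypothetical helper, and one date is deliberately left out of the sample to show what a gap looks like.

```python
from datetime import datetime, timedelta


def find_gaps(timestamps, step):
    """Return the timestamps missing from an otherwise regular series."""
    present = set(timestamps)
    missing = []
    current, end = min(timestamps), max(timestamps)
    while current <= end:
        if current not in present:
            missing.append(current)
        current += step
    return missing


# Daily timestamps for one store, with 9/5/2018 left out on purpose.
days = [datetime(2018, 9, d) for d in (3, 4, 6, 7)]
gaps = find_gaps(days, timedelta(days=1))
print(gaps)
```

With several series in one file, such as stores A and B, the check would be run once per series rather than over all rows together.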

In this case, the data is already sorted in ascending order by the time field day_datetime. However, when setting up an experiment, ensure the desired time column is sorted in ascending order to build a valid time series. Assume the data contains 1,000 records, and make a deterministic split in the data to create training and test data sets. Identify the label column name and assign it to the variable label. In this example, the label is sales_quantity. Then separate the label field from test_data to form the test_labels set.

train_data = data.iloc[:950]
test_data = data.iloc[-50:].copy()  # copy so pop() does not act on a view of data

label = "sales_quantity"

test_labels = test_data.pop(label).values
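The positional split above relies on the rows being sorted by time. An alternative is to split on a cutoff date, so that every series contributes its most recent points to the test set. This is a sketch of that idea in plain Python, not part of the SDK; split_by_cutoff, the row layout, and the cutoff are assumptions chosen for illustration.

```python
from datetime import date


def split_by_cutoff(rows, cutoff):
    """Rows dated on or before `cutoff` go to train, later rows to test."""
    train = [r for r in rows if r["day"] <= cutoff]
    test = [r for r in rows if r["day"] > cutoff]
    return train, test


rows = [
    {"day": date(2018, 9, d), "store": s, "sales_quantity": q}
    for d, s, q in [(3, "A", 2000), (3, "B", 600), (4, "A", 2300),
                    (4, "B", 550), (5, "A", 2100), (5, "B", 650)]
]
train_rows, test_rows = split_by_cutoff(rows, cutoff=date(2018, 9, 4))
print(len(train_rows), len(test_rows))
```

Splitting by date keeps all training rows strictly earlier than all test rows, whatever order the rows arrive in.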

Note

When training a model for forecasting future values, ensure all the features used in training can be used when running predictions for your intended horizon. For example, when creating a demand forecast, including a feature for current stock price could massively increase training accuracy. However, if you intend to forecast with a long horizon, you may not be able to accurately predict future stock values corresponding to future time-series points, and model accuracy could suffer.

Train and validation data

You can specify separate train and validation sets directly in the AutoMLConfig constructor.

Rolling Origin Cross Validation

For time-series forecasting, Rolling Origin Cross Validation (ROCV) is used to split time series in a temporally consistent way. ROCV divides the series into training and validation data using an origin time point. Sliding the origin in time generates the cross-validation folds.


This strategy preserves the time-series data integrity and eliminates the risk of data leakage. ROCV is automatically used for forecasting tasks when you pass the training and validation data together and set the number of cross-validation folds with n_cross_validations. Learn more about how automated ML applies cross validation to prevent over-fitting models.

automl_config = AutoMLConfig(task='forecasting',
                             n_cross_validations=3,
                             ...
                             **time_series_settings)
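To make the sliding origin concrete, the following sketch generates index-based folds in which each validation window starts at the origin and the training data is everything before it. The exact fold layout automated ML uses (window length, step, overlap) is not stated here, so the non-overlapping windows and the helper rolling_origin_folds are assumptions.

```python
def rolling_origin_folds(n_points, n_folds, horizon):
    """Return (train_indices, validation_indices) pairs with a sliding origin."""
    folds = []
    for k in range(n_folds, 0, -1):
        origin = n_points - k * horizon  # first validation index of this fold
        if origin <= 0:
            raise ValueError("not enough data for the requested folds")
        train = list(range(0, origin))
        valid = list(range(origin, origin + horizon))
        folds.append((train, valid))
    return folds


folds = rolling_origin_folds(n_points=10, n_folds=3, horizon=2)
for train, valid in folds:
    print("train:", train, "validate:", valid)
```

In every fold the training indices precede the validation indices, which is what keeps future information out of training.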

Learn more about the AutoMLConfig.

Configure and run experiment

For forecasting tasks, automated machine learning uses pre-processing and estimation steps that are specific to time-series data. The following pre-processing steps are executed:

  • Detect the time-series sample frequency (for example, hourly, daily, weekly) and create new records for absent time points to make the series continuous.
  • Impute missing values in the target (via forward-fill) and feature columns (using median column values).
  • Create features based on time-series identifiers to enable fixed effects across different series.
  • Create time-based features to assist in learning seasonal patterns.
  • Encode categorical variables to numeric quantities.
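The first two steps in the list above can be sketched in a few lines of plain Python. This only illustrates the idea (insert absent dates, forward-fill the target, fill a feature with its median) and is not the code automated ML runs; the regularize helper and the data layout are hypothetical.

```python
from datetime import date, timedelta
from statistics import median


def regularize(series, step=timedelta(days=1)):
    """Insert rows for absent dates, forward-fill target, median-fill feature.

    `series` maps date -> (target, feature); either value may be None.
    """
    feature_median = median(f for _, f in series.values() if f is not None)
    filled = {}
    last_target = None
    current, end = min(series), max(series)
    while current <= end:
        target, feature = series.get(current, (None, None))
        if target is None:
            target = last_target      # forward-fill the target
        if feature is None:
            feature = feature_median  # median-impute the feature
        filled[current] = (target, feature)
        last_target = target
        current += step
    return filled


raw = {
    date(2018, 9, 3): (2000, 1.0),
    date(2018, 9, 4): (2300, None),  # feature value missing
    date(2018, 9, 6): (2400, 3.0),   # 9/5 is absent entirely
}
clean = regularize(raw)
for day, values in clean.items():
    print(day, values)
```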

The AutoMLConfig object defines the settings and data necessary for an automated machine learning task. Similar to a regression problem, you define standard training parameters like task type, number of iterations, training data, and number of cross-validations. For forecasting tasks, additional parameters must be set that affect the experiment. The following table explains each parameter and its usage.

Parameter name | Description | Required
time_column_name | Used to specify the datetime column in the input data used for building the time series and inferring its frequency. |
time_series_id_column_names | The column name(s) used to uniquely identify the time series in data that has multiple rows with the same timestamp. If time-series identifiers are not defined, the data set is assumed to be one time series. |
forecast_horizon | Defines how many periods forward you would like to forecast. The horizon is in units of the time-series frequency. Units are based on the time interval of your training data (for example, monthly or weekly) that the forecaster should predict out. |
target_lags | Number of rows to lag the target values based on the frequency of the data. The lag is represented as a list or single integer. Lag should be used when the relationship between the independent variables and the dependent variable doesn't match up or correlate by default. For example, when trying to forecast demand for a product, the demand in any month may depend on the price of specific commodities 3 months prior. In this example, you may want to lag the target (demand) negatively by 3 months so that the model trains on the correct relationship. |
target_rolling_window_size | n historical periods to use to generate forecasted values, <= training set size. If omitted, n is the full training set size. Specify this parameter when you only want to consider a certain amount of history when training the model. |
enable_dnn | Enable Forecasting DNNs. |

See the reference documentation for more information.

Create the time-series settings as a dictionary object. Set time_column_name to the day_datetime field in the data set. Define the time_series_id_column_names parameter so that two separate time-series groups are created for the data, one each for stores A and B. You could set the remaining values explicitly: forecast_horizon to 50 to predict for the entire test set, target_rolling_window_size to 10 periods, and target_lags to 2 to lag the target values by two periods. However, it is recommended to set forecast_horizon, target_rolling_window_size, and target_lags to "auto", which detects these values automatically. The example below uses the "auto" setting for all three parameters.

time_series_settings = {
    "time_column_name": "day_datetime",
    "time_series_id_column_names": ["store"],
    "forecast_horizon": "auto",
    "target_lags": "auto",
    "target_rolling_window_size": "auto",
    "preprocess": True,
}
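For comparison, a settings dictionary with explicit values instead of "auto" might look like the sketch below. The values 50, 10, and 2 are the ones mentioned in the text above; the variable name explicit_time_series_settings is hypothetical, and whether these values suit your data is something to verify rather than assume.

```python
# Explicit alternative to the "auto" settings, using the values from the text.
explicit_time_series_settings = {
    "time_column_name": "day_datetime",
    "time_series_id_column_names": ["store"],
    "forecast_horizon": 50,            # predict across the entire test set
    "target_rolling_window_size": 10,  # rolling window of 10 periods
    "target_lags": 2,                  # lag the target values by two periods
}
print(explicit_time_series_settings)
```

The dictionary would be unpacked into the AutoMLConfig constructor in the same way as time_series_settings.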

Note

Automated machine learning pre-processing steps (feature normalization, handling missing data, converting text to numeric, and so on) become part of the underlying model. When using the model for predictions, the same pre-processing steps applied during training are applied to your input data automatically.

By defining time_series_id_column_names in the code snippet above, AutoML creates two separate time-series groups, also known as multiple time series. If no time-series identifiers are defined, AutoML assumes that the dataset is a single time series. To learn more about single time series, see the energy_demand_notebook.

Now create a standard AutoMLConfig object, specifying the forecasting task type, and submit the experiment. After the model finishes, retrieve the best run iteration.

from azureml.core.workspace import Workspace
from azureml.core.experiment import Experiment
from azureml.train.automl import AutoMLConfig
import logging

automl_config = AutoMLConfig(task='forecasting',
                             primary_metric='normalized_root_mean_squared_error',
                             experiment_timeout_minutes=15,
                             enable_early_stopping=True,
                             training_data=train_data,
                             label_column_name=label,
                             n_cross_validations=5,
                             enable_ensembling=False,
                             verbosity=logging.INFO,
                             **time_series_settings)

ws = Workspace.from_config()
experiment = Experiment(ws, "forecasting_example")
local_run = experiment.submit(automl_config, show_output=True)
best_run, fitted_model = local_run.get_output()

See the forecasting sample notebooks for detailed code examples of advanced forecasting configuration.

Configure a DNN-enabled forecasting experiment

Note

DNN support for forecasting in automated machine learning is in Preview and is not supported for local runs.

To leverage DNNs for forecasting, set the enable_dnn parameter in the AutoMLConfig to True.

automl_config = AutoMLConfig(task='forecasting',
                             enable_dnn=True,
                             ...
                             **time_series_settings)

Learn more about the AutoMLConfig.

Alternatively, you can select the Enable deep learning option in the studio.

We recommend using an AML Compute cluster with GPU SKUs and at least two nodes as the compute target. To allow sufficient time for the DNN training to complete, we recommend setting the experiment timeout to a minimum of a couple of hours. For more information on AML compute and VM sizes that include GPUs, see the AML Compute documentation and the GPU optimized virtual machine sizes documentation.

View the Beverage Production Forecasting notebook for a detailed code example leveraging DNNs.

Customize featurization

You can customize your featurization settings to ensure that the data and features that are used to train your ML model result in relevant predictions.

To customize featurizations, specify "featurization": FeaturizationConfig in your AutoMLConfig object. If you're using the Azure Machine Learning studio for your experiment, see the how-to article.

Supported customizations include:

Customization | Definition
Column purpose update | Override the auto-detected feature type for the specified column.
Transformer parameter update | Update the parameters for the specified transformer. Currently supports Imputer (fill_value and median).
Drop columns | Specifies columns to drop from being featurized.

Create the FeaturizationConfig object by defining your featurization configurations:

from azureml.automl.core.featurization import FeaturizationConfig

featurization_config = FeaturizationConfig()
# `logQuantity` is a leaky feature, so we remove it.
featurization_config.drop_columns = ['logQuantity']
# Force the CPWVOL5 feature to be of numeric type.
featurization_config.add_column_purpose('CPWVOL5', 'Numeric')
# Fill missing values in the target column, Quantity, with zeroes.
featurization_config.add_transformer_params('Imputer', ['Quantity'], {"strategy": "constant", "fill_value": 0})
# Fill missing values in the `INCOME` column with the median value.
featurization_config.add_transformer_params('Imputer', ['INCOME'], {"strategy": "median"})

Target Rolling Window Aggregation

Often the best information a forecaster can have is the recent value of the target. Creating cumulative statistics of the target may increase the accuracy of your predictions. Target rolling window aggregations allow you to add a rolling aggregation of data values as features. To enable target rolling windows, set target_rolling_window_size to your desired integer window size.

An example of this can be seen when predicting energy demand. You might add a rolling window feature of three days to account for thermal changes of heated spaces. In the example below, we've created this window of size three by setting target_rolling_window_size=3 in the AutoMLConfig constructor. The table shows the feature engineering that occurs when window aggregation is applied. Columns for minimum, maximum, and sum are generated on a sliding window of three based on the defined settings. Each row has a new calculated feature: for the timestamp September 8, 2017 4:00 AM, the maximum, minimum, and sum values are calculated using the demand values for September 8, 2017 1:00 AM to 3:00 AM. This window of three shifts along to populate data for the remaining rows.
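The window arithmetic described above can be reproduced with a short sketch. For each time step it aggregates the three values that precede it, mirroring the 4:00 AM row being computed from the 1:00 AM to 3:00 AM values. The demand numbers and the helper rolling_window_features are hypothetical, and this is not the SDK's implementation.

```python
def rolling_window_features(values, window):
    """For each step, aggregate the `window` values that precede it."""
    features = []
    for t in range(len(values)):
        if t < window:
            features.append(None)  # not enough history yet
            continue
        history = values[t - window:t]
        features.append(
            {"min": min(history), "max": max(history), "sum": sum(history)}
        )
    return features


# Hypothetical hourly demand for 1:00 AM through 5:00 AM.
demand = [3100, 3050, 3000, 3200, 3400]
feats = rolling_window_features(demand, window=3)
for hour, feat in zip(range(1, 6), feats):
    print(f"{hour}:00 AM", feat)
```

The first three rows have no complete window, which is why they carry no aggregate values in this sketch.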


Generating and using these additional features as extra contextual data helps with the accuracy of the trained model.

View a Python code example leveraging the target rolling window aggregate feature.

View feature engineering summary

For time-series task types in automated machine learning, you can view details from the feature engineering process. The following code shows each raw feature along with the following attributes:

  • Raw feature name
  • Number of engineered features formed out of this raw feature
  • Type detected
  • Whether the feature was dropped
  • List of feature transformations for the raw feature

fitted_model.named_steps['timeseriestransformer'].get_featurization_summary()

Forecasting with best model

Use the best model iteration to forecast values for the test data set.

Use the forecast() function instead of predict(), because it lets you specify when predictions should start. In the following example, you first replace all values in y_pred (label_query in the code) with NaN. The forecast origin is then at the end of the training data, as it would normally be when using predict(). However, if you replaced only the second half of y_pred with NaN, the function would leave the numerical values in the first half unmodified and forecast the NaN values in the second half. The function returns both the forecasted values and the aligned features.

You can also use the forecast_destination parameter in the forecast() function to forecast values up until a specified date.

import numpy as np

# A query filled entirely with NaN asks the model to forecast every value.
label_query = test_labels.copy().astype(float)
label_query.fill(np.nan)
label_fcst, data_trans = fitted_model.forecast(
    test_data, label_query, forecast_destination=pd.Timestamp(2019, 1, 8))

Calculate RMSE (root mean squared error) between the actual values in actual_labels and the forecasted values in predict_labels.

from sklearn.metrics import mean_squared_error
from math import sqrt

rmse = sqrt(mean_squared_error(actual_labels, predict_labels))
rmse
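As a worked illustration of what the RMSE calculation does, the sketch below computes it by hand on three hypothetical actual and forecasted values, using only the standard library. The numbers are made up for the arithmetic and are not results from the model.

```python
from math import sqrt


def root_mean_squared_error(actual, predicted):
    """Square each error, average the squares, then take the square root."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(squared_errors) / len(squared_errors))


actual = [2000, 2300, 2100]     # hypothetical true sales
predicted = [2100, 2200, 2100]  # hypothetical forecasts
rmse_value = root_mean_squared_error(actual, predicted)
print(round(rmse_value, 2))
```

The errors are -100, 100, and 0, so the mean squared error is 20000 / 3 and the RMSE is about 81.65, in the same units as the target.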

Now that the overall model accuracy has been determined, the most realistic next step is to use the model to forecast unknown future values. Supply a data set in the same format as the test set test_data but with future datetimes, and the resulting prediction set contains the forecasted values for each time-series step. Assume the last time-series records in the data set were for 12/31/2018. To forecast demand for the next day (or as many periods as you need to forecast, up to forecast_horizon), create a single time-series record for each store for 01/01/2019.

day_datetime,store,week_of_year
01/01/2019,A,1
01/01/2019,B,1

Repeat the necessary steps to load this future data into a dataframe, and then run fitted_model.forecast(test_data) to predict future values.
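Building the future records by hand gets tedious for more than one day. The following sketch generates one record per store for each future day and refuses to go past the horizon the model was trained with. The helper future_records is hypothetical, and computing week_of_year as the ISO week number is an assumption about how that column was originally derived.

```python
from datetime import date, timedelta


def future_records(last_day, stores, periods, forecast_horizon):
    """Build one record per store for each of the next `periods` days."""
    if periods > forecast_horizon:
        raise ValueError("cannot forecast beyond the trained forecast_horizon")
    records = []
    for offset in range(1, periods + 1):
        day = last_day + timedelta(days=offset)
        for store in stores:
            records.append({
                "day_datetime": day.strftime("%m/%d/%Y"),
                "store": store,
                "week_of_year": day.isocalendar()[1],  # ISO week (assumption)
            })
    return records


records = future_records(date(2018, 12, 31), ["A", "B"],
                         periods=1, forecast_horizon=50)
for record in records:
    print(record)
```

The resulting list of dictionaries can be passed to pd.DataFrame to obtain a dataframe in the same column layout as the sample above.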

Note

Values cannot be predicted for a number of periods greater than forecast_horizon. The model must be re-trained with a larger horizon to predict future values beyond the current horizon.

Next steps