Auto-train a time-series forecast model

In this article, you learn how to configure and train a time-series forecasting regression model using automated machine learning (AutoML) in the Azure Machine Learning Python SDK.

To do so, you:

  • Prepare data for time series modeling.
  • Configure specific time-series parameters in an AutoMLConfig object.
  • Run predictions with time-series data.

For a low-code experience, see Tutorial: Forecast demand with automated machine learning for a time-series forecasting example that uses automated machine learning in the Azure Machine Learning studio.

Unlike classical time series methods, in automated ML, past time-series values are "pivoted" to become additional dimensions for the regressor together with other predictors. This approach incorporates multiple contextual variables and their relationship to one another during training. Since multiple factors can influence a forecast, this method aligns itself well with real-world forecasting scenarios. For example, when forecasting sales, interactions of historical trends, exchange rate, and price all jointly drive the sales outcome.
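To make the "pivoting" idea concrete, here is a minimal pandas sketch that turns past target values into extra feature columns. It only illustrates the concept and is not how automated ML implements it internally; the toy data and the sales_lag_* column names are assumptions made for this example.

```python
import pandas as pd

# Toy daily sales for two stores (illustrative values only).
df = pd.DataFrame({
    "day_datetime": pd.to_datetime(
        ["2018-09-03", "2018-09-03", "2018-09-04",
         "2018-09-04", "2018-09-05", "2018-09-05"]),
    "store": ["A", "B", "A", "B", "A", "B"],
    "sales_quantity": [2000, 600, 2300, 550, 2100, 650],
})

# Sort within each series so that shifting moves values forward in time.
df = df.sort_values(["store", "day_datetime"]).reset_index(drop=True)

# "Pivot" past target values into new columns: one column per lag.
# Hypothetical column names; the first rows of each series have no history.
for lag in (1, 2):
    df[f"sales_lag_{lag}"] = df.groupby("store")["sales_quantity"].shift(lag)

print(df)
```

A regressor trained on this frame sees yesterday's and the day before yesterday's sales alongside any other predictors in the same row.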

Prerequisites

For this article, you need:

  • An Azure Machine Learning workspace. To create the workspace, see Create an Azure Machine Learning workspace.

  • Some familiarity with setting up an automated machine learning experiment. Follow the tutorial or how-to to see the main automated machine learning experiment design patterns.

Preparing data

The most important difference between a forecasting regression task type and a regression task type within AutoML is that the forecasting data must include a feature that represents a valid time series. A regular time series has a well-defined and consistent frequency and has a value at every sample point in a continuous time span.

Consider the following snapshot of a file sample.csv. This data set is of daily sales data for a company that has two different stores, A and B.

Additionally, there are features for:

  • week_of_year: allows the model to detect weekly seasonality.
  • day_datetime: represents a clean time series with daily frequency.
  • sales_quantity: the target column for running predictions.
day_datetime,store,sales_quantity,week_of_year
9/3/2018,A,2000,36
9/3/2018,B,600,36
9/4/2018,A,2300,36
9/4/2018,B,550,36
9/5/2018,A,2100,36
9/5/2018,B,650,36
9/6/2018,A,2400,36
9/6/2018,B,700,36
9/7/2018,A,2450,36
9/7/2018,B,650,36

Read the data into a Pandas dataframe, then use the to_datetime function to ensure the time series is a datetime type.

import pandas as pd
data = pd.read_csv("sample.csv")
data["day_datetime"] = pd.to_datetime(data["day_datetime"])

In this case, the data is already sorted ascending by the time field day_datetime. However, when setting up an experiment, ensure the desired time column is sorted in ascending order to build a valid time series.
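Before you submit an experiment, you can check that each series is sorted and has no gaps. The following is a minimal sketch using pandas; check_time_index is a hypothetical helper written for this example, not part of the Azure Machine Learning SDK.

```python
import pandas as pd

def check_time_index(frame, time_col, id_col, freq):
    """Return {series id: problem} for series that are unsorted or have gaps."""
    problems = {}
    for series_id, group in frame.groupby(id_col):
        times = group[time_col]
        if not times.is_monotonic_increasing:
            problems[series_id] = "not sorted ascending"
            continue
        # Every time point a regular series should contain between its ends.
        expected = pd.date_range(times.iloc[0], times.iloc[-1], freq=freq)
        missing = expected.difference(pd.DatetimeIndex(times))
        if len(missing) > 0:
            problems[series_id] = f"{len(missing)} missing time point(s)"
    return problems

# Store B has no record for 2018-09-04.
frame = pd.DataFrame({
    "day_datetime": pd.to_datetime(
        ["2018-09-03", "2018-09-03", "2018-09-04", "2018-09-05", "2018-09-05"]),
    "store": ["A", "B", "A", "A", "B"],
})
problems = check_time_index(frame, "day_datetime", "store", "D")
print(problems)
```

An empty result means every series is sorted ascending and regular at the given frequency.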

The following code:

  • Assumes the data contains 1,000 records, and makes a deterministic split in the data to create training and test data sets.
  • Identifies the label column as sales_quantity.
  • Separates the label field from test_data to form the test_labels set.
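# Deterministic split: the first 950 rows for training, the last 50 for testing.
train_data = data.iloc[:950]
# Copy the slice so that pop() below does not act on a view of the original frame.
test_data = data.iloc[-50:].copy()

label = "sales_quantity"

# Remove the label column from test_data and keep its values as the test labels.
test_labels = test_data.pop(label).values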
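When the data holds several series, a row-count split can cut the series at different dates. As an alternative sketch (an assumption of this example, not a step from the original workflow), you can split on a timestamp so that every store contributes the same final periods to the test set. split_by_time is a hypothetical helper.

```python
import io
import pandas as pd

SAMPLE = """day_datetime,store,sales_quantity,week_of_year
9/3/2018,A,2000,36
9/3/2018,B,600,36
9/4/2018,A,2300,36
9/4/2018,B,550,36
9/5/2018,A,2100,36
9/5/2018,B,650,36
9/6/2018,A,2400,36
9/6/2018,B,700,36
9/7/2018,A,2450,36
9/7/2018,B,650,36
"""

def split_by_time(frame, time_col, n_test_periods):
    """Hold out the last n_test_periods distinct timestamps as the test set."""
    unique_times = frame[time_col].drop_duplicates().sort_values()
    cutoff = unique_times.iloc[-n_test_periods]
    train = frame[frame[time_col] < cutoff].copy()
    test = frame[frame[time_col] >= cutoff].copy()
    return train, test

data = pd.read_csv(io.StringIO(SAMPLE))
data["day_datetime"] = pd.to_datetime(data["day_datetime"], format="%m/%d/%Y")

# Keep the last two days of both stores for testing.
train, test = split_by_time(data, "day_datetime", n_test_periods=2)
print(len(train), len(test))
```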

Important

When training a model for forecasting future values, ensure all the features used in training can be used when running predictions for your intended horizon. For example, when creating a demand forecast, including a feature for current stock price could massively increase training accuracy. However, if you intend to forecast with a long horizon, you may not be able to accurately predict future stock values corresponding to future time-series points, and model accuracy could suffer.

Training and validation data

You can specify separate training and validation sets directly in the AutoMLConfig object. Learn more about the AutoMLConfig.

For time series forecasting, only Rolling Origin Cross Validation (ROCV) is used for validation by default. Pass the training and validation data together, and set the number of cross-validation folds with the n_cross_validations parameter in your AutoMLConfig. ROCV divides the series into training and validation data using an origin time point. Sliding the origin in time generates the cross-validation folds. This strategy preserves the time series data integrity and eliminates the risk of data leakage.
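The idea of sliding an origin can be sketched in a few lines of plain Python. This is a conceptual illustration only: rolling_origin_folds is a hypothetical helper, and the step between origins (here equal to the horizon) is an assumption, so the folds that automated ML actually builds may be laid out differently.

```python
def rolling_origin_folds(n_points, horizon, n_folds, step=None):
    """Return (train_indices, validation_indices) pairs, oldest origin first.

    Each validation block holds the `horizon` points that follow the origin;
    training data is everything before the origin, so no future data leaks in.
    """
    step = horizon if step is None else step
    folds = []
    for k in range(n_folds):
        val_end = n_points - k * step
        val_start = val_end - horizon  # the origin of this fold
        if val_start <= 0:
            raise ValueError("not enough data for the requested folds")
        folds.append((range(0, val_start), range(val_start, val_end)))
    return folds[::-1]

folds = rolling_origin_folds(n_points=10, horizon=2, n_folds=3)
for train_idx, val_idx in folds:
    print(list(train_idx), "->", list(val_idx))
```

In every fold, the largest training index is smaller than the smallest validation index, which is the property that keeps the temporal order intact.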


You can also bring your own validation data. Learn more in Configure data splits and cross-validation in AutoML.

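# forecasting_parameters is a ForecastingParameters object, defined later in this article.
automl_config = AutoMLConfig(task='forecasting',
                             n_cross_validations=3,
                             ...
                             forecasting_parameters=forecasting_parameters)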

Learn more about how AutoML applies cross validation to prevent over-fitting models.

Configure experiment

The AutoMLConfig object defines the settings and data necessary for an automated machine learning task. Configuration for a forecasting model is similar to the setup of a standard regression model, but certain models, configuration options, and featurization steps exist specifically for time-series data.

Supported models

Automated machine learning automatically tries different models and algorithms as part of the model creation and tuning process. As a user, there is no need for you to specify the algorithm. For forecasting experiments, both native time-series and deep learning models are part of the recommendation system. The following table summarizes this subset of models.

Tip

Traditional regression models are also tested as part of the recommendation system for forecasting experiments. See the supported model table for the full list of models.

| Models | Description | Benefits |
|---|---|---|
| Prophet (Preview) | Prophet works best with time series that have strong seasonal effects and several seasons of historical data. To use this model, install it locally using pip install fbprophet. | Accurate and fast; robust to outliers, missing data, and dramatic changes in your time series. |
| Auto-ARIMA (Preview) | Auto-Regressive Integrated Moving Average (ARIMA) performs best when the data is stationary. This means that its statistical properties, like the mean and variance, are constant over the entire set. For example, if you flip a coin, the probability of getting heads is 50%, regardless of whether you flip today, tomorrow, or next year. | Great for univariate series, since the past values are used to predict the future values. |
| ForecastTCN (Preview) | ForecastTCN is a neural network model designed to tackle the most demanding forecasting tasks, capturing nonlinear local and global trends in your data as well as relationships between time series. | Capable of using complex trends in your data, and readily scales to the largest of datasets. |

Configuration settings

Similar to a regression problem, you define standard training parameters like task type, number of iterations, training data, and number of cross-validations. For forecasting tasks, there are additional parameters that must be set that affect the experiment.

The following table summarizes these additional parameters. See the ForecastingParameters class reference documentation for syntax design patterns.

| Parameter name | Description | Required |
|---|---|---|
| time_column_name | Specifies the datetime column in the input data that is used to build the time series and infer its frequency. | Yes |
| forecast_horizon | Defines how many periods forward you would like to forecast. The horizon is in units of the time series frequency. Units are based on the time interval of your training data, for example, monthly or weekly, that the forecaster should predict out. | Yes |
| enable_dnn | Enable forecasting DNNs. | |
| time_series_id_column_names | The column name(s) used to uniquely identify the time series in data that has multiple rows with the same timestamp. If time series identifiers are not defined, the data set is assumed to be one time series. To learn more about single time series, see the energy_demand_notebook. | |
| freq | The time series dataset frequency. This parameter represents the period with which events are expected to occur, such as daily, weekly, or yearly. The frequency must be a pandas offset alias. | |
| target_lags | Number of rows to lag the target values based on the frequency of the data. The lag is represented as a list or a single integer. Use a lag when the relationship between the independent variables and the dependent variable doesn't match up or correlate by default. | |
| feature_lags | When target_lags is set and feature_lags is set to auto, automated ML decides automatically which features to lag. Enabling feature lags may help to improve accuracy. Feature lags are disabled by default. | |
| target_rolling_window_size | n historical periods to use to generate forecasted values, <= training set size. If omitted, n is the full training set size. Specify this parameter when you only want to consider a certain amount of history when training the model. Learn more about target rolling window aggregation. | |
| short_series_handling_configuration | Enables short time series handling to avoid failing during training due to insufficient data. Short series handling is set to auto by default. Learn more about short series handling. | |
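Because freq must be a pandas offset alias, you can sanity-check the alias you plan to pass against the data with pandas itself. The following is a small sketch; matches_declared_freq is a hypothetical helper for this example, and for data with several series you would run it once per series.

```python
import pandas as pd
from pandas.tseries.frequencies import to_offset

def matches_declared_freq(timestamps, declared_freq):
    """Return True if the frequency inferred from the timestamps equals the alias.

    pd.infer_freq needs at least three timestamps and returns None when the
    spacing is irregular.
    """
    inferred = pd.infer_freq(pd.DatetimeIndex(timestamps))
    if inferred is None:
        return False
    return to_offset(inferred) == to_offset(declared_freq)

daily = pd.to_datetime(
    ["2018-09-03", "2018-09-04", "2018-09-05", "2018-09-06", "2018-09-07"])
print(matches_declared_freq(daily, "D"))
print(matches_declared_freq(daily, "W"))
```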

The following code:

  • Uses the ForecastingParameters class to define the forecasting parameters for your experiment training.
  • Sets the time_column_name to the day_datetime field in the data set.
  • Sets the time_series_id_column_names parameter to "store". This ensures that two separate time-series groups are created for the data: one for store A and one for store B.
  • Sets the forecast_horizon to 50 in order to predict for the entire test set.
  • Sets a forecast window of 10 periods with target_rolling_window_size.
  • Sets target_lags to the recommended "auto" setting, which automatically detects the lag value for you. To specify the lag yourself, pass an integer or a list instead; for example, target_lags=2 lags the target values by two periods.
from azureml.automl.core.forecasting_parameters import ForecastingParameters

forecasting_parameters = ForecastingParameters(time_column_name='day_datetime',
                                               forecast_horizon=50,
                                               time_series_id_column_names=["store"],
                                               freq='D',  # the sample data is daily
                                               target_lags='auto',
                                               target_rolling_window_size=10)

These forecasting_parameters are then passed into your standard AutoMLConfig object along with the forecasting task type, primary metric, exit criteria, and training data.

from azureml.core.workspace import Workspace
from azureml.core.experiment import Experiment
from azureml.train.automl import AutoMLConfig
import logging

automl_config = AutoMLConfig(task='forecasting',
                             primary_metric='normalized_root_mean_squared_error',
                             experiment_timeout_minutes=15,
                             enable_early_stopping=True,
                             training_data=train_data,
                             label_column_name=label,
                             n_cross_validations=5,
                             enable_ensembling=False,
                             verbosity=logging.INFO,
                             forecasting_parameters=forecasting_parameters)

The amount of data required to successfully train a forecasting model with automated ML is influenced by the forecast_horizon, n_cross_validations, and target_lags or target_rolling_window_size values specified when you configure your AutoMLConfig.

The following formula calculates the amount of historic data that is needed to construct time series features.

Minimum historic data required: (2 x forecast_horizon) + n_cross_validations + max(max(target_lags), target_rolling_window_size)

An Error exception is raised for any series in the dataset that does not meet the required amount of historic data for the relevant settings specified.
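The formula is easy to evaluate before you submit an experiment. The following sketch transcribes it into plain Python; min_history_required is a hypothetical helper, it treats an unset lag or window as 0 (an assumption), and it does not handle target_lags='auto' because the detected lag isn't known in advance.

```python
def min_history_required(forecast_horizon, n_cross_validations,
                         target_lags=None, target_rolling_window_size=None):
    """(2 x forecast_horizon) + n_cross_validations + max(max(target_lags), window)."""
    if target_lags is None:
        max_lag = 0
    elif isinstance(target_lags, int):
        max_lag = target_lags
    else:
        max_lag = max(target_lags)
    window = target_rolling_window_size or 0
    return 2 * forecast_horizon + n_cross_validations + max(max_lag, window)

# Horizon of 50, 5 folds, lags of 1 and 2 periods, rolling window of 10.
required = min_history_required(50, 5, target_lags=[1, 2],
                                target_rolling_window_size=10)
print(required)  # 2*50 + 5 + max(2, 10) = 115
```

Comparing this number with the length of each series tells you which series would be treated as too short for the chosen settings.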

Featurization steps

In every automated machine learning experiment, automatic scaling and normalization techniques are applied to your data by default. These techniques are types of featurization that help certain algorithms that are sensitive to features on different scales. Learn more about default featurization steps in Featurization in AutoML.

However, the following steps are performed only for forecasting task types:

  • Detect the time-series sample frequency (for example, hourly, daily, weekly) and create new records for absent time points to make the series continuous.
  • Impute missing values in the target (via forward-fill) and in feature columns (using median column values).
  • Create features based on time series identifiers to enable fixed effects across different series.
  • Create time-based features to assist in learning seasonal patterns.
  • Encode categorical variables to numeric quantities.

To get a summary of what features are created as a result of these steps, see Featurization transparency.
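The first steps in the list above can be approximated with pandas to see what they do to a small series. This is a rough sketch under assumed data (the price column and all values are made up for illustration); automated ML performs these steps for you, and its internal implementation may differ.

```python
import numpy as np
import pandas as pd

# One series with a missing day (2018-09-05) and missing values.
raw = pd.DataFrame({
    "day_datetime": pd.to_datetime(
        ["2018-09-03", "2018-09-04", "2018-09-06", "2018-09-07"]),
    "sales_quantity": [2000.0, np.nan, 2400.0, 2450.0],
    "price": [9.5, 10.0, np.nan, 11.0],
})

# Create a record for the absent time point so the series is continuous.
regular = raw.set_index("day_datetime").asfreq("D")

# Impute the target by forward-fill and the feature by its median.
regular["sales_quantity"] = regular["sales_quantity"].ffill()
regular["price"] = regular["price"].fillna(regular["price"].median())

# Add a simple time-based feature (Monday = 0) to help learn seasonality.
regular["day_of_week"] = regular.index.dayofweek

print(regular)
```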

Note

Automated machine learning featurization steps (feature normalization, handling missing data, converting text to numeric, and so on) become part of the underlying model. When using the model for predictions, the same featurization steps applied during training are applied to your input data automatically.

Customize featurization

You also have the option to customize your featurization settings to ensure that the data and features that are used to train your ML model result in relevant predictions.

Supported customizations for forecasting tasks include:

| Customization | Definition |
|---|---|
| Column purpose update | Override the auto-detected feature type for the specified column. |
| Transformer parameter update | Update the parameters for the specified transformer. Currently supports Imputer (fill_value and median). |
| Drop columns | Specifies columns to drop from being featurized. |

To customize featurizations with the SDK, specify "featurization": FeaturizationConfig in your AutoMLConfig object. Learn more about custom featurizations.

Note

The drop columns functionality is deprecated as of SDK version 1.19. Drop columns from your dataset as part of data cleansing, prior to consuming it in your automated ML experiment.

from azureml.automl.core.featurization import FeaturizationConfig

featurization_config = FeaturizationConfig()
# `logQuantity` is a leaky feature, so we remove it.
# (drop_columns is deprecated as of SDK version 1.19; see the preceding note.)
featurization_config.drop_columns = ['logQuantity']
# Force the CPWVOL5 feature to be of numeric type.
featurization_config.add_column_purpose('CPWVOL5', 'Numeric')
# Fill missing values in the target column, Quantity, with zeroes.
featurization_config.add_transformer_params('Imputer', ['Quantity'], {"strategy": "constant", "fill_value": 0})
# Fill missing values in the `INCOME` column with the median value.
featurization_config.add_transformer_params('Imputer', ['INCOME'], {"strategy": "median"})

If you're using the Azure Machine Learning studio for your experiment, see how to customize featurization in the studio.

Optional configurations

Additional optional configurations are available for forecasting tasks, such as enabling deep learning and specifying a target rolling window aggregation.

Enable deep learning

Note

DNN support for forecasting in automated machine learning is in preview and not supported for local runs.

You can also use deep learning with deep neural networks (DNNs) to improve the scores of your model. Automated ML's deep learning allows for forecasting univariate and multivariate time series data.

Deep learning models have three intrinsic capabilities:

  1. They can learn from arbitrary mappings from inputs to outputs.
  2. They support multiple inputs and outputs.
  3. They can automatically extract patterns in input data that spans over long sequences.

To enable deep learning, set enable_dnn=True in the AutoMLConfig object.

automl_config = AutoMLConfig(task='forecasting',
                             enable_dnn=True,
                             ...
                             forecasting_parameters=forecasting_parameters)

Warning

When you enable DNN for experiments created with the SDK, best model explanations are disabled.

To enable DNN for an AutoML experiment created in the Azure Machine Learning studio, see the task type settings in the studio how-to.

View the Beverage Production Forecasting notebook for a detailed code example that uses DNNs.

Target rolling window aggregation

Often the best information a forecaster can have is the recent value of the target. Target rolling window aggregations allow you to add a rolling aggregation of data values as features. Generating and using these additional features as extra contextual data helps with the accuracy of the trained model.

For example, say you want to predict energy demand. You might want to add a rolling window feature of three days to account for thermal changes of heated spaces. In this example, create this window by setting target_rolling_window_size=3 in the ForecastingParameters constructor.

The table shows the feature engineering that results when window aggregation is applied. Columns for minimum, maximum, and sum are generated on a sliding window of three based on the defined settings. Each row has a new calculated feature. For example, for the timestamp September 8, 2017 4:00 AM, the maximum, minimum, and sum values are calculated using the demand values for September 8, 2017 1:00 AM - 3:00 AM. This window of three shifts along to populate data for the remaining rows.

[Figure: target rolling window aggregation table]
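The same kind of feature table can be reproduced with pandas to see how a window of three behaves. This is an illustrative sketch only: the demand values and the demand_*_window3 column names are made up, and the features that automated ML generates may be named and computed differently.

```python
import pandas as pd

demand = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2017-09-08 01:00", "2017-09-08 02:00", "2017-09-08 03:00",
        "2017-09-08 04:00", "2017-09-08 05:00"]),
    "demand": [3400, 3300, 3500, 3600, 3550],  # made-up values
})

# Shift by one period so a row only sees values from earlier timestamps,
# then aggregate over the three most recent of those values.
history = demand["demand"].shift(1)
window = history.rolling(window=3)
demand["demand_min_window3"] = window.min()
demand["demand_max_window3"] = window.max()
demand["demand_sum_window3"] = window.sum()

print(demand)
```

The 4:00 AM row is the first one with a complete window: its features are computed from the 1:00 AM through 3:00 AM values, and earlier rows stay empty because they lack three periods of history.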

View a Python code example that uses the target rolling window aggregate feature.

Short series handling

Automated ML considers a time series a short series if there are not enough data points to conduct the train and validation phases of model development. The number of data points varies for each experiment, and depends on the forecast_horizon, the number of cross validation splits, and the length of the model lookback, that is, the maximum amount of history that's needed to construct the time-series features. For the exact calculation, see the short_series_handling_configuration reference documentation.

Automated ML offers short series handling by default with the short_series_handling_configuration parameter in the ForecastingParameters object.

To enable short series handling, the freq parameter must also be defined. To define an hourly frequency, set freq='H'. For the available frequency strings, see the pandas offset aliases. To change the default behavior, short_series_handling_configuration = 'auto', update the short_series_handling_configuration parameter in your ForecastingParameters object.

from azureml.automl.core.forecasting_parameters import ForecastingParameters

forecast_parameters = ForecastingParameters(time_column_name='day_datetime', 
                                            forecast_horizon=50,
                                            short_series_handling_configuration='auto',
                                            freq = 'H',
                                            target_lags='auto')

The following table summarizes the available settings for short_series_handling_configuration.

| Setting | Description |
|---|---|
| auto | The default behavior for short series handling: if all series are short, pad the data; if not all series are short, drop the short series. |
| pad | If short_series_handling_configuration = pad, automated ML adds random values to each short series found. The column types are padded as follows: object columns with NaNs; numeric columns with 0; Boolean/logic columns with False; the target column with random values with a mean of zero and a standard deviation of 1. |
| drop | If short_series_handling_configuration = drop, automated ML drops the short series, and it is not used for training or prediction. Predictions for these series return NaNs. |
| None | No series is padded or dropped. |

Warning

Padding may impact the accuracy of the resulting model, since we are introducing artificial data just to get past training without failures.

If many of the series are short, you may also see some impact in explainability results.
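The decision rules in the table can be written down as a small function, which is handy for predicting what will happen to your data before training. This is a sketch of the documented rules only; resolve_short_series and its arguments are hypothetical, and min_required stands for the minimum history computed for your settings.

```python
def resolve_short_series(series_lengths, min_required, setting="auto"):
    """Return which series ids would be padded or dropped for a given setting."""
    short = {sid for sid, n in series_lengths.items() if n < min_required}
    if setting is None or not short:
        return {"pad": set(), "drop": set()}
    if setting == "auto":
        # Pad only when every series is short; otherwise drop the short ones.
        setting = "pad" if len(short) == len(series_lengths) else "drop"
    if setting == "pad":
        return {"pad": short, "drop": set()}
    if setting == "drop":
        return {"pad": set(), "drop": short}
    raise ValueError(f"unknown setting: {setting!r}")

# Store B is short, store A is not: 'auto' drops store B.
print(resolve_short_series({"A": 200, "B": 20}, min_required=115))
# Both stores are short: 'auto' pads both.
print(resolve_short_series({"A": 30, "B": 20}, min_required=115))
```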

Run the experiment

When you have your AutoMLConfig object ready, you can submit the experiment. After the model finishes, retrieve the best run iteration.

ws = Workspace.from_config()
experiment = Experiment(ws, "Tutorial-automl-forecasting")
local_run = experiment.submit(automl_config, show_output=True)
best_run, fitted_model = local_run.get_output()
    

Forecasting with best model

Use the best model iteration to forecast values for the test data set.

The forecast() function lets you specify when predictions should start, unlike predict(), which is typically used for classification and regression tasks.

In the following example, you first replace all values in y_pred (the label_query array in the code) with NaN. The forecast origin will be at the end of training data in this case. However, if you replaced only the second half of y_pred with NaN, the function would leave the numerical values in the first half unmodified, but forecast the NaN values in the second half. The function returns both the forecasted values and the aligned features.

You can also use the forecast_destination parameter in the forecast() function to forecast values up until a specified date.

import numpy as np

# Build a query array of NaNs with the same shape as the test labels.
label_query = test_labels.copy().astype(float)
label_query.fill(np.nan)
label_fcst, data_trans = fitted_model.forecast(
    test_data, label_query, forecast_destination=pd.Timestamp(2019, 1, 8))
    

Calculate the root mean squared error (RMSE) between the actual values in test_labels and the forecasted values in label_fcst.

from sklearn.metrics import mean_squared_error
from math import sqrt

rmse = sqrt(mean_squared_error(test_labels, label_fcst))
rmse
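Forecasts for dropped short series come back as NaN, and scikit-learn's mean_squared_error raises an error on NaN input. As one possible workaround (an assumption of this example, not part of the original workflow), you can compute the RMSE with NumPy over only the rows where both values are present. rmse_ignoring_nan is a hypothetical helper.

```python
import numpy as np

def rmse_ignoring_nan(actual, forecast):
    """RMSE over the positions where both actual and forecast are not NaN."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    mask = ~np.isnan(actual) & ~np.isnan(forecast)
    if not mask.any():
        raise ValueError("no rows with both an actual and a forecast value")
    errors = actual[mask] - forecast[mask]
    return float(np.sqrt(np.mean(errors ** 2)))

# The second forecast is missing, so only two rows enter the metric.
score = rmse_ignoring_nan([2000, 600, 2300], [2100, np.nan, 2200])
print(score)
```

Report how many rows were excluded alongside the score, because a metric computed on a subset isn't directly comparable to one computed on every row.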
    

Now that the overall model accuracy has been determined, the most realistic next step is to use the model to forecast unknown future values.

Supply a data set in the same format as the test set test_data but with future datetimes, and the resulting prediction set is the forecasted values for each time-series step. Assume the last time-series records in the data set were for 12/31/2018. To forecast demand for the next day (or as many periods as you need to forecast, <= forecast_horizon), create a single time series record for each store for 01/01/2019.

day_datetime,store,week_of_year
01/01/2019,A,1
01/01/2019,B,1
    

Repeat the necessary steps to load this future data into a dataframe, and then run fitted_model.forecast(test_data) to predict future values.
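Instead of writing the future records by hand, you can generate them. The following sketch builds one row per store and future period with pandas; make_future_frame is a hypothetical helper, and it assumes that week_of_year is the ISO week number, which matches the sample rows shown in this article.

```python
import pandas as pd

def make_future_frame(last_timestamp, periods, stores, freq="D"):
    """Build one row per store for each of the next `periods` time points."""
    # Generate one extra point and drop the first, which is the last known one.
    future_times = pd.date_range(
        start=last_timestamp, periods=periods + 1, freq=freq)[1:]
    frame = pd.DataFrame(
        [(t, s) for t in future_times for s in stores],
        columns=["day_datetime", "store"])
    # Assumption: week_of_year is the ISO week number.
    frame["week_of_year"] = frame["day_datetime"].dt.isocalendar().week.astype(int)
    return frame

future = make_future_frame(pd.Timestamp("2018-12-31"), periods=1, stores=["A", "B"])
print(future)
```

Keep periods at or below the forecast_horizon the model was trained with.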

Note

In-sample predictions are not supported for forecasting with automated ML when target_lags and/or target_rolling_window_size are enabled.

Example notebooks

See the forecasting sample notebooks for detailed code examples of advanced forecasting configuration.

Next steps