Tutorial: Forecast demand with automated machine learning

In this tutorial, you use automated machine learning (automated ML) in Azure Machine Learning studio to create a time-series forecasting model that predicts rental demand for a bike sharing service.

For a classification model example, see Tutorial: Create a classification model with automated ML in Azure Machine Learning.

In this tutorial, you learn how to do the following tasks:

  • Create and load a dataset.
  • Configure and run an automated ML experiment.
  • Specify forecasting settings.
  • Explore the experiment results.
  • Deploy the best model.

Prerequisites

Get started in Azure Machine Learning studio

For this tutorial, you create your automated ML experiment run in Azure Machine Learning studio, a consolidated web interface that includes machine learning tools for data science practitioners of all skill levels. The studio is not supported on Internet Explorer browsers.

  1. Sign in to Azure Machine Learning studio.

  2. Select your subscription and the workspace you created.

  3. Select Get started.

  4. In the left pane, select Automated ML under the Author section.

  5. Select +New automated ML run.

Create and load dataset

Before you configure your experiment, upload your data file to your workspace in the form of an Azure Machine Learning dataset. Doing so lets you ensure that your data is formatted appropriately for your experiment.

  1. On the Select dataset form, select From local files from the +Create dataset drop-down.

    1. On the Basic info form, give your dataset a name and provide an optional description. The dataset type should default to Tabular, since automated ML in Azure Machine Learning studio currently supports only tabular datasets.

    2. Select Next on the bottom left.

    3. On the Datastore and file selection form, select the default datastore that was automatically set up during your workspace creation, workspaceblobstore (Azure Blob Storage). This is the storage location where you'll upload your data file.

    4. Select Browse.

    5. Choose the bike-no.csv file on your local computer. This is the file you downloaded as a prerequisite.

    6. Select Next.

      When the upload is complete, the Settings and preview form is pre-populated based on the file type.

    7. Verify that the Settings and preview form is populated as follows and select Next.

      Field | Description | Value for tutorial
      File format | Defines the layout and type of data stored in a file. | Delimited
      Delimiter | One or more characters for specifying the boundary between separate, independent regions in plain text or other data streams. | Comma
      Encoding | Identifies what bit to character schema table to use to read your dataset. | UTF-8
      Column headers | Indicates how the headers of the dataset, if any, will be treated. | Use headers from the first file
      Skip rows | Indicates how many, if any, rows are skipped in the dataset. | None
    8. The Schema form allows for further configuration of your data for this experiment.

      1. For this example, choose to ignore the casual and registered columns. These columns are a breakdown of the cnt column, so we don't include them.

      2. Also for this example, leave the defaults for the Properties and Type.

      3. Select Next.

    9. On the Confirm details form, verify that the information matches what was previously populated on the Basic info and Settings and preview forms.

    10. Select Create to complete the creation of your dataset.

    11. Select your dataset once it appears in the list.

    12. Select Next.
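
As a rough local analogue of the Settings and preview and Schema choices above, the following sketch parses a comma-delimited, UTF-8 file with a header row and drops the casual and registered columns. It is only an illustration, not what the studio runs internally. The sample rows and the temp column are made up; only the date, casual, registered, and cnt column names come from this tutorial.

```python
import csv
import io

# Tiny stand-in for bike-no.csv. The rows and the "temp" column are
# invented for illustration; they are not taken from the real file.
csv_text = (
    "date,temp,casual,registered,cnt\n"
    "2011-01-01,0.34,331,654,985\n"
    "2011-01-02,0.36,131,670,801\n"
    "2011-01-03,0.20,120,1229,1349\n"
)
raw_bytes = csv_text.encode("utf-8")

# Encoding: UTF-8. Delimiter: comma. Column headers: taken from the first row.
# Skip rows: none.
reader = csv.DictReader(io.StringIO(raw_bytes.decode("utf-8")), delimiter=",")

ignored_columns = {"casual", "registered"}
rows = []
for record in reader:
    # casual and registered are a breakdown of cnt, so they add up to it.
    assert int(record["casual"]) + int(record["registered"]) == int(record["cnt"])
    # Keep every column except the ignored ones.
    rows.append({k: v for k, v in record.items() if k not in ignored_columns})

columns = list(rows[0].keys())
print(columns)
```

Dropping the two breakdown columns matters because together they reproduce the target exactly, so a model trained with them would have nothing left to forecast.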

Configure experiment run

After you load and configure your data, set up your remote compute target and select which column in your data you want to predict.

  1. Populate the Configure run form as follows:
    1. Enter an experiment name: automl-bikeshare

    2. Select cnt as the target column, which is what you want to predict. This column indicates the total number of bike share rentals.

    3. Select Create a new compute and configure your compute target. Automated ML only supports Azure Machine Learning compute.

      Field | Description | Value for tutorial
      Compute name | A unique name that identifies your compute context. | bike-compute
      Virtual machine type | Select the virtual machine type for your compute. | CPU (Central Processing Unit)
      Virtual machine size | Select the virtual machine size for your compute. | Standard_DS12_V2
      Min / Max nodes | To profile data, you must specify 1 or more nodes. | Min nodes: 1; Max nodes: 6
      Idle seconds before scale down | Idle time before the cluster is automatically scaled down to the minimum node count. | 120 (default)
      Advanced settings | Settings to configure and authorize a virtual network for your experiment. | None
      1. Select Create to get the compute target.

        This takes a couple of minutes to complete.

      2. After creation, select your new compute target from the drop-down list.

    4. Select Next.

Select forecast settings

Complete the setup for your automated ML experiment by specifying the machine learning task type and configuration settings.

  1. On the Task type and settings form, select Time series forecasting as the machine learning task type.

  2. Select date as your Time column and leave Time series identifiers blank.

  3. The forecast horizon is the length of time into the future you want to predict. Deselect Autodetect and type 14 in the field.

  4. Select View additional configuration settings and populate the fields as follows. These settings give you finer control over the training job and let you specify settings for your forecast. Otherwise, defaults are applied based on experiment selection and data.

    Additional configurations | Description | Value for tutorial
    Primary metric | Evaluation metric that the machine learning algorithm will be measured by. | Normalized root mean squared error
    Explain best model | Automatically shows explainability on the best model created by automated ML. | Enable
    Blocked algorithms | Algorithms you want to exclude from the training job. | Extreme Random Trees
    Additional forecasting settings | These settings help improve the accuracy of your model. Forecast target lags: how far back you want to construct the lags of the target variable. Target rolling window: specifies the size of the rolling window over which features, such as the max, min and sum, will be generated. | Forecast target lags: None; Target rolling window size: None
    Exit criterion | If a criterion is met, the training job is stopped. | Training job time (hours): 3; Metric score threshold: None
    Validation | Choose a cross-validation type and number of tests. | Validation type: k-fold cross-validation; Number of validations: 5
    Concurrency | The maximum number of parallel iterations executed per iteration. | Max concurrent iterations: 6

    Select Save.
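
This tutorial leaves Forecast target lags and Target rolling window size set to None, but it can help to see what those two settings describe. The sketch below builds a lag feature and rolling max, min, and sum features by hand. The function name, the sample values, and the choice to use only values strictly before the current time step are assumptions made for illustration; automated ML's own feature generation may differ in detail.

```python
def add_lag_and_rolling_features(values, lag, window):
    """Return one feature dict per time step, built only from earlier values.

    Hypothetical helper for illustration; not part of any Azure SDK.
    """
    features = []
    for t in range(len(values)):
        # Lag feature: the target value `lag` steps back, if it exists.
        lagged = values[t - lag] if t >= lag else None
        # Rolling window: the `window` values immediately before step t.
        past = values[t - window:t] if t >= window else None
        features.append({
            "target": values[t],
            f"lag_{lag}": lagged,
            "rolling_max": max(past) if past else None,
            "rolling_min": min(past) if past else None,
            "rolling_sum": sum(past) if past else None,
        })
    return features


# Made-up daily rental counts standing in for the cnt column.
cnt = [985, 801, 1349, 1562, 1600]
feats = add_lag_and_rolling_features(cnt, lag=1, window=3)
for row in feats:
    print(row)
```

The first few rows have no lag or rolling values because there is not enough history yet, which is one reason larger lags and windows reduce the amount of usable training data.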

Run experiment

To run your experiment, select Finish. The Run details screen opens with the Run status at the top next to the run number. This status updates as the experiment progresses.

Important

It takes 10-15 minutes to prepare the experiment run. Once running, each iteration takes 2-3 more minutes.

In production, you'd likely walk away for a bit as this process takes time. While you wait, we suggest you start exploring the tested algorithms on the Models tab as they complete.

Explore models

Navigate to the Models tab to see the algorithms (models) tested. By default, the models are ordered by metric score as they complete. For this tutorial, the model that scores best on the chosen Normalized root mean squared error metric is at the top of the list. Because this metric measures error, a lower value is better.
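
To make the ranking concrete, here is a minimal sketch of a normalized root mean squared error and of ordering candidate models by it. It assumes the normalization divides the RMSE by the range (max minus min) of the actual values, which is one common convention. The function name, model names, and numbers are hypothetical.

```python
import math


def normalized_rmse(actual, predicted):
    """RMSE divided by the range of the actual values (assumed normalization)."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("actual and predicted must be non-empty and equal in length")
    value_range = max(actual) - min(actual)
    if value_range == 0:
        raise ValueError("actual values have zero range; cannot normalize")
    mse = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    return math.sqrt(mse) / value_range


# Made-up actual rentals and predictions from two hypothetical models.
actual = [100, 200, 300]
candidates = {
    "model_a": [110, 190, 300],
    "model_b": [150, 150, 350],
}

scores = {name: normalized_rmse(actual, preds) for name, preds in candidates.items()}

# Lower error is better, so sort ascending to put the best model first.
ranked = sorted(scores, key=scores.get)
print(ranked, scores)
```

Normalizing by the range makes scores comparable across targets with different scales, which is useful when the same metric is reused for datasets with very different rental volumes.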

While you wait for all of the experiment models to finish, select the Algorithm name of a completed model to explore its performance details.

The following example navigates through the Details and the Metrics tabs to view the selected model's properties, metrics, and performance charts.

Figure: Run details

Deploy the model

Automated machine learning in Azure Machine Learning studio allows you to deploy the best model as a web service in a few steps. Deployment is the integration of the model so it can predict on new data and identify potential areas of opportunity.

For this experiment, deployment to a web service means that the bike share company now has an iterative and scalable web solution for forecasting bike share rental demand.

Once the run is complete, navigate back to the parent run page by selecting Run 1 at the top of your screen.

In the Best model summary section, StackEnsemble is considered the best model in the context of this experiment, based on the Normalized root mean squared error metric.

We deploy this model, but be advised that deployment takes about 20 minutes to complete. The deployment process entails several steps, including registering the model, generating resources, and configuring them for the web service.

  1. Select StackEnsemble to open the model-specific page.

  2. Select the Deploy button located in the top-left area of the screen.

  3. Populate the Deploy a model pane as follows:

    Field | Value
    Deployment name | bikeshare-deploy
    Deployment description | bike share demand deployment
    Compute type | Select Azure Container Instance (ACI)
    Enable authentication | Disable.
    Use custom deployment assets | Disable. Disabling allows for the default driver file (scoring script) and environment file to be autogenerated.

    For this example, we use the defaults provided in the Advanced menu.

  4. Select Deploy.

    A green success message appears at the top of the Run screen stating that the deployment was started successfully. The progress of the deployment can be found in the Model summary pane under Deploy status.

Once deployment succeeds, you have an operational web service to generate predictions.
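
As a hedged sketch of what calling such a web service could look like, the code below builds (but does not send) an HTTP POST request with a JSON body. The scoring URI, the "data" wrapper, and the row fields are all assumptions for illustration; the real URI and input schema come from your deployed endpoint's details. No authentication header is set because authentication is disabled in this tutorial's deployment settings.

```python
import json
import urllib.request

# Hypothetical placeholder; replace with the scoring URI shown for your endpoint.
SCORING_URI = "http://example.invalid/score"


def build_scoring_request(rows, scoring_uri=SCORING_URI):
    """Build a JSON POST request for a scoring endpoint.

    The {"data": [...]} payload shape is an assumption, not a confirmed schema.
    """
    body = json.dumps({"data": rows}).encode("utf-8")
    return urllib.request.Request(
        scoring_uri,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# One made-up future date to forecast; field names depend on your dataset.
request = build_scoring_request([{"date": "2012-09-01"}])
print(request.get_method(), request.full_url)

# To actually call a live endpoint, you would send the request, for example:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```

Keeping request construction separate from sending makes the payload easy to inspect before any network call is made.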

Proceed to the Next steps to learn more about how to consume your new web service and test your predictions using Power BI's built-in Azure Machine Learning support.

Clean up resources

Deployment files are larger than data and experiment files, so they cost more to store. To minimize costs to your account while keeping your workspace and experiment files, delete only the deployment files. If you don't plan to use any of the files, delete the entire resource group.

Delete the deployment instance

If you want to keep the resource group and workspace for other tutorials and exploration, delete just the deployment instance from Azure Machine Learning studio.

  1. Go to Azure Machine Learning studio. Navigate to your workspace and, on the left under the Assets pane, select Endpoints.

  2. Select the deployment you want to delete and select Delete.

  3. Select Proceed.

Delete the resource group

Important

The resources you created can be used as prerequisites to other Azure Machine Learning tutorials and how-to articles.

If you don't plan to use the resources you created, delete them so you don't incur any charges:

  1. In the Azure portal, select Resource groups on the far left.

    Figure: Delete in the Azure portal

  2. From the list, select the resource group you created.

  3. Select Delete resource group.

  4. Enter the resource group name. Then select Delete.

Next steps

In this tutorial, you used automated ML in Azure Machine Learning studio to create and deploy a time series forecasting model that predicts bike share rental demand.

See this article for steps on how to create a Power BI supported schema to facilitate consumption of your newly deployed web service:

Note

This bike share dataset has been modified for this tutorial. This dataset was made available as part of a Kaggle competition and was originally available via Capital Bikeshare. It can also be found within the UCI Machine Learning Database.

Source: Fanaee-T, Hadi, and Gama, Joao, Event labeling combining ensemble detectors and background knowledge, Progress in Artificial Intelligence (2013): pp. 1-15, Springer Berlin Heidelberg.