Create, review, and deploy automated machine learning models with Azure Machine Learning

APPLIES TO: Enterprise edition only, not Basic edition (Upgrade to Enterprise)

In this article, you learn how to create, explore, and deploy automated machine learning models in Azure Machine Learning studio without writing a single line of code.

Important

The automated ML experience in Azure Machine Learning studio is in preview. Certain features may not be supported or may have limited capabilities.

Automated machine learning is a process in which the best machine learning algorithm for your specific data is selected for you. This process lets you generate machine learning models quickly. Learn more about automated machine learning.

For an end-to-end example, try the tutorial for creating a classification model with Azure Machine Learning's automated ML interface.

For a Python code-based experience, configure your automated machine learning experiments with the Azure Machine Learning SDK.

Prerequisites

Get started

  1. Sign in to Azure Machine Learning at https://studio.ml.azure.cn.

  2. Select your subscription and workspace.

  3. Navigate to the left pane. Select Automated ML under the Author section.

Azure Machine Learning studio navigation pane

If this is your first experiment, you'll see an empty list and links to documentation.

Otherwise, you'll see a list of your recent automated machine learning experiments, including those created with the SDK.

Create and run experiment

  1. Select + New automated ML run and populate the form.

  2. Select a dataset from your storage container, or create a new dataset. Datasets can be created from local files, web URLs, datastores, or Azure Open Datasets. Learn more about dataset creation.

    Important

    Requirements for training data:

    • Data must be in tabular form.
    • The value you want to predict (target column) must be present in the data.

    1. To create a new dataset from a file on your local computer, select +Create dataset and then select From local file.

    2. In the Basic info form, give your dataset a unique name and provide an optional description.

    3. Select Next to open the Datastore and file selection form. On this form, you select where to upload your dataset: either the default storage container that's automatically created with your workspace, or a storage container that you choose for the experiment.

      1. If your data is behind a virtual network, you need to enable the skip validation function to ensure that the workspace can access your data. Learn more about network isolation and privacy.

    4. Select Browse to upload the data file for your dataset.

    5. Review the Settings and preview form for accuracy. The form is intelligently populated based on the file type.

      Field | Description
      File format | Defines the layout and type of data stored in a file.
      Delimiter | One or more characters for specifying the boundary between separate, independent regions in plain text or other data streams.
      Encoding | Identifies which bit-to-character schema table to use to read your dataset.
      Column headers | Indicates how the headers of the dataset, if any, will be treated.
      Skip rows | Indicates how many rows, if any, are skipped in the dataset.

      Select Next.

    6. The Schema form is intelligently populated based on the selections in the Settings and preview form. Here, configure the data type for each column, review the column names, and select which columns to Not include for your experiment.

      Select Next.

    7. The Confirm details form is a summary of the information previously populated in the Basic info and Settings and preview forms. You also have the option to create a data profile for your dataset using a profiling-enabled compute. Learn more about data profiling.

      Select Next.

  3. Select your newly created dataset once it appears. You can also view a preview of the dataset and sample statistics.

  4. On the Configure run form, enter a unique experiment name.

  5. Select a target column; this is the column that you want to predict.

  6. Select a compute for the data profiling and training job. A list of your existing computes is available in the dropdown. To create a new compute, follow the instructions in step 7.

  7. Select Create a new compute to configure your compute context for this experiment.

    Field | Description
    Compute name | Enter a unique name that identifies your compute context.
    Virtual machine priority | Low priority virtual machines are cheaper but don't guarantee the compute nodes.
    Virtual machine type | Select CPU or GPU for virtual machine type.
    Virtual machine size | Select the virtual machine size for your compute.
    Min / Max nodes | To profile data, you must specify 1 or more nodes. Enter the maximum number of nodes for your compute. The default is 6 nodes for an AML Compute.
    Advanced settings | These settings allow you to configure a user account and existing virtual network for your experiment.

    Select Create. Creation of a new compute can take a few minutes.

    Note

    Your compute name indicates whether the compute you select or create is profiling enabled. (See the data profiling section for more details.)

    Select Next.

  8. On the Task type and settings form, select the task type: classification, regression, or forecasting. See supported task types for more information.

    1. For classification, you can also enable deep learning, which is used for text featurization.

    2. For forecasting, you can:

      1. Enable deep learning.

      2. Select time column: This column contains the time data to be used.

      3. Select forecast horizon: Indicate how many time units (minutes/hours/days/weeks/months/years) into the future the model will be able to predict. The further the model is required to predict into the future, the less accurate it becomes. Learn more about forecasting and forecast horizon.

  9. (Optional) View additional configuration settings: additional settings you can use to better control the training job. Otherwise, defaults are applied based on experiment selection and data.

    Additional configurations | Description
    Primary metric | Main metric used for scoring your model. Learn more about model metrics.
    Explain best model | Select to enable or disable, in order to show explainability of the recommended best model.
    Blocked algorithm | Select algorithms you want to exclude from the training job.
    Exit criterion | When any of these criteria are met, the training job is stopped.
      Training job time (hours): How long to allow the training job to run.
      Metric score threshold: Minimum metric score for all pipelines. This ensures that if you have a defined target metric you want to reach, you do not spend more time on the training job than necessary.
    Validation | Select one of the cross validation options to use in the training job. Learn more about cross validation.
    Concurrency | Max concurrent iterations: Maximum number of pipelines (iterations) to test in the training job. The job will not run more than the specified number of iterations.

  10. (Optional) View featurization settings: if you choose to enable Automatic featurization in the Additional configuration settings form, default featurization techniques are applied. In View featurization settings, you can change these defaults and customize accordingly. Learn how to customize featurizations.

    Azure Machine Learning studio task type form
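The two exit criteria in step 9 (training job time and metric score threshold) can be pictured with a small simulation. This is only an illustrative sketch, not how the service is implemented: the names `ExitCriteria` and `run_until_exit` are hypothetical, the pipeline results are made-up numbers, and the sketch assumes a metric where a higher score is better.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass
class ExitCriteria:
    """Hypothetical container mirroring the two exit criteria in the form."""
    max_hours: float                          # Training job time (hours)
    score_threshold: Optional[float] = None   # Metric score threshold


def run_until_exit(pipelines: Iterable[Tuple[str, float, float]],
                   criteria: ExitCriteria) -> Tuple[Optional[str], float, str]:
    """Walk through (name, hours_taken, score) results and stop as soon as
    any exit criterion is met. Returns (best_name, best_score, reason)."""
    elapsed = 0.0
    best_name, best_score = None, float("-inf")
    for name, hours, score in pipelines:
        # Criterion 1: stop before a pipeline that would exceed the time budget.
        if elapsed + hours > criteria.max_hours:
            return best_name, best_score, "time limit reached"
        elapsed += hours
        if score > best_score:
            best_name, best_score = name, score
        # Criterion 2: stop once the best score reaches the threshold.
        if criteria.score_threshold is not None and best_score >= criteria.score_threshold:
            return best_name, best_score, "score threshold reached"
    return best_name, best_score, "all pipelines finished"


# Made-up results: (pipeline name, hours taken, metric score).
results = [("LogisticRegression", 0.2, 0.81),
           ("LightGBM", 0.3, 0.93),
           ("RandomForest", 0.4, 0.90)]

outcome = run_until_exit(results, ExitCriteria(max_hours=1.0, score_threshold=0.92))
print(outcome)  # ('LightGBM', 0.93, 'score threshold reached')
```

With a threshold set, the run ends as soon as a pipeline reaches the target score, which is why setting one can save training time.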

Data profiling & summary stats

You can get a wide variety of summary statistics across your dataset to verify whether it is ML-ready. For non-numeric columns, they include only basic statistics like min, max, and error count. For numeric columns, you can also review their statistical moments and estimated quantiles. Specifically, the data profile includes:

Note

Blank entries appear for features with irrelevant types.

Statistic | Description
Feature | Name of the column that is being summarized.
Profile | In-line visualization based on the type inferred. For example, strings, booleans, and dates have value counts, while decimals (numerics) have approximated histograms. This allows you to gain a quick understanding of the distribution of the data.
Type distribution | In-line value count of types within a column. Nulls are their own type, so this visualization is useful for detecting odd or missing values.
Type | Inferred type of the column. Possible values include: strings, booleans, dates, and decimals.
Min | Minimum value of the column. Blank entries appear for features whose type does not have an inherent ordering (e.g. booleans).
Max | Maximum value of the column.
Count | Total number of missing and non-missing entries in the column.
Not missing count | Number of entries in the column that are not missing. Empty strings and errors are treated as values, so they will not contribute to the "not missing count."
Quantiles | Approximated values at each quantile to provide a sense of the distribution of the data.
Mean | Arithmetic mean or average of the column.
Standard deviation | Measure of the amount of dispersion or variation of this column's data.
Variance | Measure of how far spread out this column's data is from its average value.
Skewness | Measure of how different this column's data is from a normal distribution.
Kurtosis | Measure of how heavily tailed this column's data is compared to a normal distribution.
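To make the numeric statistics in the table concrete, here is a minimal sketch that computes several of them for one column using only the Python standard library. It is an illustration, not the studio's profiling code: the function name `profile_numeric` is hypothetical, `None` is assumed to mark a missing entry, and the sketch uses population moments and excess kurtosis, which may differ from the exact formulas the service applies.

```python
import math
from typing import Dict, List, Optional


def profile_numeric(column: List[Optional[float]]) -> Dict[str, float]:
    """Compute a few profile statistics for one numeric column.
    None marks a missing entry (an assumption of this sketch).
    The column needs at least two distinct present values, otherwise the
    variance is zero and skewness/kurtosis are undefined."""
    present = [x for x in column if x is not None]
    n = len(present)
    mean = sum(present) / n
    # Population central moments about the mean.
    m2 = sum((x - mean) ** 2 for x in present) / n
    m3 = sum((x - mean) ** 3 for x in present) / n
    m4 = sum((x - mean) ** 4 for x in present) / n
    return {
        "count": len(column),            # missing and non-missing entries
        "not_missing_count": n,
        "min": min(present),
        "max": max(present),
        "mean": mean,
        "variance": m2,
        "std": math.sqrt(m2),
        "skewness": m3 / m2 ** 1.5,      # 0 for a symmetric distribution
        "kurtosis": m4 / m2 ** 2 - 3.0,  # excess kurtosis; 0 for a normal distribution
    }


stats = profile_numeric([2.0, 4.0, None, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
print(stats["count"], stats["not_missing_count"])       # 9 8
print(stats["mean"], stats["variance"], stats["std"])   # 5.0 4.0 2.0
```

A positive skewness, as in this sample, indicates a longer tail toward larger values.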

Customize featurization

In the Featurization form, you can enable or disable automatic featurization and customize the automatic featurization settings for your experiment. To open this form, see step 10 in the Create and run experiment section.

The following table summarizes the customizations currently available via the studio.

Column | Customization
Included | Specifies which columns to include for training.
Feature type | Change the value type for the selected column.
Impute with | Select what value to impute missing values with in your data.

Azure Machine Learning studio task type form
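The "Impute with" customization replaces missing values in a column with a chosen value. The sketch below shows the general idea in plain Python. The function `impute` and its options ("mean", "median", or a constant) are hypothetical choices for illustration; the options actually offered in the studio form may differ, and `None` is assumed to mark a missing entry.

```python
from statistics import mean, median
from typing import List, Optional, Union


def impute(column: List[Optional[float]],
           fill: Union[str, float] = "mean") -> List[float]:
    """Return a copy of the column with missing entries (None) replaced.
    fill is "mean", "median", or a constant number."""
    present = [x for x in column if x is not None]
    if fill == "mean":
        value = mean(present)
    elif fill == "median":
        value = median(present)
    elif isinstance(fill, (int, float)):
        value = float(fill)
    else:
        raise ValueError(f"unsupported fill option: {fill!r}")
    return [value if x is None else x for x in column]


ages = [31.0, None, 45.0, 38.0, None]
print(impute(ages))        # [31.0, 38.0, 45.0, 38.0, 38.0]
print(impute(ages, 0.0))   # [31.0, 0.0, 45.0, 38.0, 0.0]
```

Imputing with a statistic such as the mean keeps the column's central value unchanged, while a constant makes the imputed rows easy to spot later.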

Run experiment and view results

Select Finish to run your experiment. The experiment preparation process can take up to 10 minutes. Training jobs can take an additional 2-3 minutes for each pipeline to finish running.

View experiment details

The Run Detail screen opens to the Details tab. This screen shows you a summary of the experiment run, including a status bar at the top next to the run number.

The Models tab contains a list of the models created, ordered by the metric score. By default, the model that scores the highest based on the chosen metric is at the top of the list. As the training job tries out more models, they are added to the list. Use this to get a quick comparison of the metrics for the models produced so far.

Run details dashboard

View training run details

Drill down on any of the completed models to see training run details, like a model summary on the Model tab or performance metric charts on the Metrics tab. Learn more about charts.

Iteration details

Deploy your model

Once you have the best model at hand, it is time to deploy it as a web service to predict on new data.

Automated ML helps you deploy the model without writing code:

  1. You have a couple of options for deployment.

    • Option 1: Deploy the best model, according to the metric criteria you defined.

      1. After the experiment is complete, navigate to the parent run page by selecting Run 1 at the top of the screen.
      2. Select the model listed in the Best model summary section.
      3. Select Deploy on the top left of the window.

    • Option 2: Deploy a specific model iteration from this experiment.

      1. Select the desired model from the Models tab.
      2. Select Deploy on the top left of the window.

  2. Populate the Deploy model pane.

    Field | Value
    Name | Enter a unique name for your deployment.
    Description | Enter a description to better identify what this deployment is for.
    Compute type | Select the type of endpoint you want to deploy: Azure Kubernetes Service (AKS) or Azure Container Instance (ACI).
    Compute name | Applies to AKS only: Select the name of the AKS cluster you wish to deploy to.
    Enable authentication | Select to allow for token-based or key-based authentication.
    Use custom deployment assets | Enable this feature if you want to upload your own scoring script and environment file. Learn more about scoring scripts.

    Important

    File names must be under 32 characters and must begin and end with alphanumerics. They may include dashes, underscores, dots, and alphanumerics in between. Spaces are not allowed.

    The Advanced menu offers default deployment features such as data collection and resource utilization settings. If you wish to override these defaults, do so in this menu.

  3. Select Deploy. Deployment can take about 20 minutes to complete. Once deployment begins, the Model summary tab appears. See the deployment progress under the Deploy status section.
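The file name rule stated in the Important note of step 2 can be checked before you upload custom deployment assets. The following is a minimal sketch of such a check. The function name `is_valid_file_name` is hypothetical, "alphanumerics" is assumed to mean ASCII letters and digits, and "under 32 characters" is read as at most 31 characters; the studio's own validation may differ in these details.

```python
import re

# First and last character must be alphanumeric; dashes, underscores,
# dots, and alphanumerics are allowed in between. No spaces.
_NAME_PATTERN = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9._-]*[A-Za-z0-9])?")


def is_valid_file_name(name: str) -> bool:
    """Return True if the name satisfies the rule described above."""
    return len(name) < 32 and _NAME_PATTERN.fullmatch(name) is not None


for candidate in ["score.py", "my env.yml", "-score.py", "conda_env-v2.yml"]:
    print(candidate, is_valid_file_name(candidate))
```

In this sketch, "score.py" and "conda_env-v2.yml" pass, while "my env.yml" fails because of the space and "-score.py" fails because it starts with a dash.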

Now you have an operational web service to generate predictions! You can test the predictions by querying the service from Power BI's built-in Azure Machine Learning support.
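Besides Power BI, a deployed web service can generally be queried over HTTP. The sketch below only builds such a request with the Python standard library and does not send it. Several parts are assumptions for illustration: the function name `build_scoring_request`, the placeholder URI, the feature columns, the `{"data": [...]}` payload shape, and the `Bearer` authorization header. Check your deployment's scoring script and endpoint details for the actual input format and authentication scheme.

```python
import json
import urllib.request
from typing import Any, Dict, List, Optional


def build_scoring_request(scoring_uri: str,
                          rows: List[Dict[str, Any]],
                          key: Optional[str] = None) -> urllib.request.Request:
    """Build (but do not send) a POST request for a deployed web service."""
    # Payload shape is an assumption; it depends on the scoring script.
    body = json.dumps({"data": rows}).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    if key is not None:
        # Only needed when authentication is enabled on the deployment.
        headers["Authorization"] = f"Bearer {key}"
    return urllib.request.Request(scoring_uri, data=body, headers=headers, method="POST")


request = build_scoring_request(
    "https://example.invalid/score",       # placeholder, not a real endpoint
    [{"age": 42, "job": "technician"}],    # hypothetical feature columns
    key="<your-key-or-token>",
)
print(request.get_method(), request.full_url)

# To actually send it against a real endpoint:
# with urllib.request.urlopen(request) as response:
#     print(json.loads(response.read()))
```

Keeping the request construction separate from sending it makes the payload easy to inspect before any call reaches the service.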

Next steps