Known issues and troubleshooting in Azure Machine Learning

This article helps you troubleshoot known issues you may encounter when using Azure Machine Learning.

For more information on troubleshooting, see Next steps at the end of this article.

Tip

Errors or other issues might be the result of resource quotas you encounter when working with Azure Machine Learning.

Access diagnostic logs

Diagnostic information can be helpful when you ask for help. To see some logs:

  1. Visit Azure Machine Learning studio.
  2. On the left-hand side, select Experiments.
  3. Select an experiment.
  4. Select a run.
  5. On the top, select Outputs + logs.

Note

Azure Machine Learning logs information from a variety of sources during training, such as AutoML or the Docker container that runs the training job. Many of these logs are not documented. If you encounter problems and contact Microsoft support, they may be able to use these logs during troubleshooting.

Installation and import

  • Pip installation: Dependencies are not guaranteed to be consistent with single-line installation:

    This is a known limitation of pip: when several packages are installed in a single line, it does not have a functioning dependency resolver. It only looks at the first unique dependency.

    In the following code, azureml-datadrift and azureml-train-automl are both installed using a single-line pip install.

      pip install azureml-datadrift azureml-train-automl
    

    For this example, let's say azureml-datadrift requires version > 1.0 and azureml-train-automl requires version < 1.2. If the latest version of azureml-datadrift is 1.3, then both packages get upgraded to 1.3, regardless of the azureml-train-automl package requirement for an older version.

    To ensure the appropriate versions are installed for your packages, install using multiple lines, as in the following code. Order isn't an issue here, because pip explicitly downgrades as part of the next line's call, so the appropriate version dependencies are applied.

       pip install azureml-datadrift
       pip install azureml-train-automl 
    
  • Explanation package not guaranteed to be installed when installing azureml-train-automl-client:

    When you run a remote AutoML run with model explanation enabled, you see the error message "Please install azureml-explain-model package for model explanations." This is a known issue. As a workaround, follow one of the steps below:

    1. Install azureml-explain-model locally.
        pip install azureml-explain-model
    
    2. Disable the explainability feature entirely by passing model_explainability=False in the AutoML configuration.
        automl_config = AutoMLConfig(task = 'classification',
                               path = '.',
                               debug_log = 'automated_ml_errors.log',
                               compute_target = compute_target,
                               run_configuration = aml_run_config,
                               featurization = 'auto',
                               model_explainability=False,
                               training_data = prepped_data,
                               label_column_name = 'Survived',
                               **automl_settings)
    
  • pandas errors: Typically seen during an AutoML experiment:

    When manually setting up your environment using pip, you may notice attribute errors (especially from pandas) due to unsupported package versions being installed. To prevent such errors, install the AutoML SDK using automl_setup.cmd:

    1. Open an Anaconda prompt and clone the GitHub repository for a set of sample notebooks.
    git clone https://github.com/Azure/MachineLearningNotebooks.git
    
    2. cd to the how-to-use-azureml/automated-machine-learning folder where the sample notebooks were extracted, and then run:
    automl_setup
    
  • KeyError: 'brand' when running AutoML on local compute

    If a new environment was created after June 10, 2020, by using SDK 1.7.0 or earlier, training might fail with this error due to an update in the py-cpuinfo package. (Environments created on or before June 10, 2020, are unaffected, as are experiments run on remote compute because cached training images are used.) To work around this issue, take either of the following two steps:

    • Update the SDK version to 1.8.0 or later (this also downgrades py-cpuinfo to 5.0.0):

      pip install --upgrade azureml-sdk[automl]
      
    • Downgrade the installed version of py-cpuinfo to 5.0.0:

      pip install py-cpuinfo==5.0.0
      
  • Error message: Cannot uninstall 'PyYAML'

    Azure Machine Learning SDK for Python: PyYAML is a distutils installed project. Therefore, pip cannot accurately determine which files belong to it if there is a partial uninstall. To continue installing the SDK while ignoring this error, use:

    pip install --upgrade azureml-sdk[notebooks,automl] --ignore-installed PyYAML
    
  • Azure Machine Learning SDK installation failing with an exception: ModuleNotFoundError: No module named 'ruamel' or 'ImportError: No module named ruamel.yaml'

    This issue occurs when installing any released version of the Azure Machine Learning SDK for Python with the latest pip (>20.1.1) in the conda base environment. Refer to the following workarounds:

    • Avoid installing the Python SDK in the conda base environment. Instead, create your own conda environment and install the SDK in that newly created user environment. The latest pip should work in that new conda environment.

    • For creating images in Docker, where you cannot switch away from the conda base environment, pin pip<=20.1.1 in the Dockerfile.

    conda install -c r -y conda python=3.6.2 pip=20.1.1
    
  • Databricks failure when installing packages

    Azure Machine Learning SDK installation fails on Azure Databricks when more packages are installed. Some packages, such as psutil, can cause conflicts. To avoid installation errors, install packages by freezing the library version. This issue is related to Databricks and not to the Azure Machine Learning SDK. You might experience this issue with other libraries, too. Example:

    psutil cryptography==1.5 pyopenssl==16.0.0 ipython==2.2.0
    

    Alternatively, you can use init scripts if you keep facing install issues with Python libraries. This approach isn't officially supported. For more information, see Cluster-scoped init scripts.

  • Databricks import error: cannot import name Timedelta from pandas._libs.tslibs: If you see this error when you use automated machine learning, run the following two lines in your notebook:

    %sh rm -rf /databricks/python/lib/python3.7/site-packages/pandas-0.23.4.dist-info /databricks/python/lib/python3.7/site-packages/pandas
    %sh /databricks/python/bin/pip install pandas==0.23.4
    
  • Databricks import error: No module named 'pandas.core.indexes': If you see this error when you use automated machine learning:

    1. Install these two package versions in your Azure Databricks cluster:

      scikit-learn==0.19.1
      pandas==0.22.0
      
    2. Detach and then reattach the cluster to your notebook.

    If these steps don't solve the issue, try restarting the cluster.

  • Databricks FailToSendFeather: If you see a FailToSendFeather error when reading data on an Azure Databricks cluster, refer to the following solutions:

    • Upgrade the azureml-sdk[automl] package to the latest version.
    • Add azureml-dataprep version 1.1.8 or above.
    • Add pyarrow version 0.11 or above.

Create and manage workspaces

Warning

Moving your Azure Machine Learning workspace to a different subscription, or moving the owning subscription to a new tenant, is not supported. Doing so may cause errors.

  • Azure portal:

    • If you go directly to your workspace from a share link from the SDK or the Azure portal, you can't view the standard Overview page that has subscription information in the extension. In this scenario, you also can't switch to another workspace. To view another workspace, go directly to Azure Machine Learning studio and search for the workspace name.
    • All assets (datasets, experiments, computes, and so on) are available only in Azure Machine Learning studio. They're not available from the Azure portal.
  • Supported browsers in the Azure Machine Learning studio web portal: We recommend that you use the most up-to-date browser that's compatible with your operating system. The following browsers are supported:

    • Microsoft Edge (the new Microsoft Edge, latest version; not Microsoft Edge legacy)
    • Safari (latest version, Mac only)
    • Chrome (latest version)
    • Firefox (latest version)
Set up your environment

  • Trouble creating AmlCompute: There is a rare chance that some users who created their Azure Machine Learning workspace from the Azure portal before the GA release might not be able to create AmlCompute in that workspace. You can either raise a support request against the service or create a new workspace through the portal or the SDK to unblock yourself immediately.

  • Azure Container Registry doesn't currently support Unicode characters in resource group names: ACR requests may fail because the resource group name contains Unicode characters. To mitigate this issue, we recommend creating an ACR in a differently named resource group.

Work with data

Overloaded AzureFile storage

If you receive the error Unable to upload project files to working directory in AzureFile because the storage is overloaded, apply the following workarounds.

If you are using the file share for other workloads, such as data transfer, we recommend using blobs so that the file share is free to be used for submitting runs. You may also split the workload between two different workspaces.

Passing data as input

  • TypeError: FileNotFound: No such file or directory: This error occurs if the file path you provide isn't where the file is located. Make sure the way you refer to the file is consistent with where you mounted your dataset on your compute target. To ensure a deterministic state, we recommend using the absolute path when mounting a dataset to a compute target. For example, in the following code we mount the dataset under /tmp, directly below the root of the compute target's filesystem.

    # Note the leading / in '/tmp/dataset'
    script_params = {
        '--data-folder': dset.as_named_input('dogscats_train').as_mount('/tmp/dataset'),
    } 
    

    If you don't include the leading forward slash, '/', you'll need to prefix the working directory on the compute target (for example, /mnt/batch/.../tmp/dataset) to indicate where you want the dataset to be mounted.
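
The difference between the two forms is ordinary POSIX path resolution. The sketch below illustrates it with the standard-library posixpath module. It is an illustration under stated assumptions, not the SDK's mounting code: resolve_mount_path is a hypothetical helper, and the working directory is a made-up value because the real one depends on the compute target.

```python
import posixpath


def resolve_mount_path(mount_path, working_dir):
    """Return where a mount path lands relative to a working directory.

    An absolute path (leading '/') is returned unchanged; a relative path
    is joined onto the working directory.
    """
    return posixpath.normpath(posixpath.join(working_dir, mount_path))


# Hypothetical working directory, used only for illustration.
working_dir = "/hypothetical/working/dir"

# Leading slash: the location does not depend on the working directory.
print(resolve_mount_path("/tmp/dataset", working_dir))

# No leading slash: the location is relative to the working directory.
print(resolve_mount_path("tmp/dataset", working_dir))
```

Because the relative form depends on a directory that can differ from run to run, the path with the leading slash is the more predictable choice.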

Mount dataset

  • Dataset initialization failed: Waiting for mount point to be ready has timed out: Retry logic has been added in azureml-sdk >=1.12.0 to mitigate the issue. If you are on a previous azureml-sdk version, upgrade to the latest version. If you are already on azureml-sdk>=1.12.0, recreate your environment so that you have the latest patch with the fix.

Data labeling projects

Issue | Resolution
Only datasets created on blob datastores can be used. | This is a known limitation of the current release.
After creation, the project shows "Initializing" for a long time. | Manually refresh the page. Initialization should proceed at roughly 20 datapoints per second. The lack of autorefresh is a known issue.
When reviewing images, newly labeled images are not shown. | To load all labeled images, choose the First button. The First button takes you back to the front of the list, but loads all labeled data.
Pressing the Esc key while labeling for object detection creates a zero-size label in the top-left corner. Submitting labels in this state fails. | Delete the label by clicking the cross mark next to it.

Data drift monitors

Limitations and known issues for data drift monitors:

  • The time range when analyzing historical data is limited to 31 intervals of the monitor's frequency setting.

  • Features are limited to 200, unless no feature list is specified (in which case all features are used).

  • Compute size must be large enough to handle the data.

  • Ensure your dataset has data within the start and end date for a given monitor run.

  • Dataset monitors only work on datasets that contain 50 rows or more.

  • Columns, or features, in the dataset are classified as categorical or numeric based on the conditions in the following table. If the feature does not meet these conditions (for instance, a column of type string with >100 unique values), the feature is dropped from the data drift algorithm, but is still profiled.

    Feature type | Data type | Condition | Limitations
    Categorical | string, bool, int, float | The number of unique values in the feature is less than 100 and less than 5% of the number of rows. | Null is treated as its own category.
    Numerical | int, float | The values in the feature are of a numerical data type and do not meet the condition for a categorical feature. | Feature dropped if >15% of values are null.
  • When you have created a data drift monitor but cannot see data on the Dataset monitors page in Azure Machine Learning studio, try the following.

    1. Check if you have selected the right date range at the top of the page.
    2. On the Dataset Monitors tab, select the experiment link to check run status. This link is on the far right of the table.
    3. If the run completed successfully, check the driver logs to see how many metrics have been generated or whether there are any warning messages. Find the driver logs in the Outputs + logs tab after you select an experiment.
  • If the SDK backfill() function does not generate the expected output, it may be due to an authentication issue. When you create the compute to pass into this function, do not use Run.get_context().experiment.workspace.compute_targets. Instead, use ServicePrincipalAuthentication such as the following to create the compute that you pass into that backfill() function:

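    from azureml.core import Workspace
    from azureml.core.authentication import ServicePrincipalAuthentication

    # Authenticate with a service principal rather than the run context.
    auth = ServicePrincipalAuthentication(
            tenant_id=tenant_id,
            service_principal_id=app_id,
            service_principal_password=client_secret
            )
    # Replace each "xxx" placeholder with your own values.
    ws = Workspace.get("xxx", auth=auth, subscription_id="xxx", resource_group="xxx")
    compute = ws.compute_targets.get("xxx")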
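
Returning to the feature classification conditions listed earlier in this section, they can be read as a small decision rule. The sketch below is one interpretation of those conditions in plain Python, not the service's actual implementation. The function name classify_feature and the use of None to represent null are assumptions made for illustration.

```python
def classify_feature(values):
    """Classify a column as 'categorical', 'numeric', or 'dropped'.

    `values` is a list of column values; None stands for null.
    """
    rows = len(values)
    if rows == 0:
        return "dropped"

    non_null = [v for v in values if v is not None]
    # Null counts as its own category, so None stays in the set of unique values.
    unique_count = len(set(values))

    # Categorical: allowed data type, fewer than 100 unique values,
    # and fewer unique values than 5% of the number of rows.
    allowed = (str, bool, int, float)
    if all(isinstance(v, allowed) for v in non_null):
        if unique_count < 100 and unique_count < 0.05 * rows:
            return "categorical"

    # Numeric: int or float values that did not qualify as categorical.
    # bool is a subclass of int in Python, so exclude it explicitly.
    is_numeric = bool(non_null) and all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in non_null
    )
    if is_numeric:
        null_fraction = (rows - len(non_null)) / rows
        return "dropped" if null_fraction > 0.15 else "numeric"

    # Anything else, such as a string column with many unique values.
    return "dropped"
```

For example, a string column with 1,000 distinct values in 1,000 rows fails the categorical condition and is not numeric, so it would be dropped from drift computation under this reading.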
    

Azure Machine Learning designer

Dataset visualization in the designer

After you register a dataset on the Datasets asset page or by using the SDK, you can find it under the Datasets category in the list to the left of the designer canvas.

However, when you drag the dataset to the canvas and try to visualize it, visualization may fail for one of the following reasons:

  • Currently you can only visualize tabular datasets in the designer. If you register a file dataset outside the designer, you cannot visualize it in the designer canvas.
  • Your dataset is stored in a virtual network (VNet). To visualize it, you need to enable the workspace managed identity of the datastore.
    1. Go to the related datastore and select Update credentials.
    2. Select Yes to enable the workspace managed identity.

Long compute preparation time

It may take a few minutes or even longer when you first connect to or create a compute target.

From the Model Data Collector, it can take up to (but usually less than) 10 minutes for data to arrive in your blob storage account. Wait 10 minutes to ensure that the cells below will run.

import time
time.sleep(600)
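
Because the data usually arrives in less than 10 minutes, polling for it can finish sooner than a fixed wait. The following is a minimal sketch of that alternative, not something the source prescribes. wait_until is a hypothetical helper, and the readiness check you pass in (for example, a function that tests whether the expected blob exists) is something you would need to supply yourself.

```python
import time


def wait_until(is_ready, timeout_seconds=600, poll_seconds=30):
    """Poll is_ready() until it returns True or the timeout elapses.

    Returns True if the condition was met, False on timeout.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        if is_ready():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_seconds)
```

The default timeout of 600 seconds matches the 10-minute upper bound stated above; the 30-second polling interval is an arbitrary choice.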

Log for real-time endpoints

Logs of real-time endpoints are customer data. See more details about monitoring web service endpoints in this article.

For real-time endpoint troubleshooting, you can use the following code to enable logs.

from azureml.core import Workspace
from azureml.core.webservice import Webservice

ws = Workspace.from_config()
service = Webservice(name="service-name", workspace=ws)
logs = service.get_logs()

If you have multiple tenants, you may need to add the following authentication code before ws = Workspace.from_config():

from azureml.core.authentication import InteractiveLoginAuthentication
interactive_auth = InteractiveLoginAuthentication(tenant_id="the tenant_id in which your workspace resides")

Train models

  • ModuleErrors (No module named): If you run into ModuleErrors while submitting experiments in Azure ML, the training script expects a package to be installed, but the package hasn't been added. Once you provide the package name, Azure ML installs the package in the environment used for your training run.

    If you are using Estimators to submit experiments, you can specify a package name via the pip_packages or conda_packages parameter of the estimator, depending on the source from which you want to install the package. You can also specify a yml file with all your dependencies using conda_dependencies_file, or list all your pip requirements in a txt file using the pip_requirements_file parameter. If you have your own Azure ML Environment object and want it to override the default image used by the estimator, you can specify that environment via the environment parameter of the estimator constructor.

    Azure ML also provides framework-specific estimators for TensorFlow, PyTorch, Chainer, and SKLearn. Using these estimators ensures that the core framework dependencies are installed on your behalf in the environment used for training. You have the option to specify extra dependencies as described above.

    Azure ML maintained Docker images and their contents can be seen in AzureML Containers. Framework-specific dependencies are listed in the respective framework documentation: Chainer, PyTorch, TensorFlow, SKLearn.

    Note

    If you think a particular package is common enough to be added to Azure ML maintained images and environments, raise a GitHub issue in AzureML Containers.

  • NameError (Name not defined), AttributeError (Object has no attribute): This exception should come from your training scripts. You can look at the log files from the Azure portal to get more information about the specific name not defined or attribute error. From the SDK, you can use run.get_details() to look at the error message. This also lists all the log files generated for your run. Make sure to review your training script and fix the error before resubmitting your run.

  • Horovod has been shut down: In most cases, if you encounter "AbortedError: Horovod has been shut down", there was an underlying exception in one of the processes that caused Horovod to shut down. Each rank in the MPI job gets its own dedicated log file in Azure ML. These logs are named 70_driver_logs. In the case of distributed training, the log names are suffixed with _rank to make it easier to differentiate the logs. To find the exact error that caused Horovod to shut down, go through all the log files and look for Traceback at the end of the driver_log files. One of these files will give you the actual underlying exception.

  • Run or experiment deletion: Experiments can be archived by using the Experiment.archive method, or from the Experiment tab view in the Azure Machine Learning studio client via the "Archive experiment" button. This action hides the experiment from list queries and views, but does not delete it.

    Permanent deletion of individual experiments or runs is not currently supported. For more information on deleting workspace assets, see Export or delete your Machine Learning service workspace data.

  • Metric Document is too large: Azure Machine Learning has internal limits on the size of metric objects that can be logged at once from a training run. If you encounter a "Metric Document is too large" error when logging a list-valued metric, try splitting the list into smaller chunks, for example:

    run.log_list("my metric name", my_metric[:N])
    run.log_list("my metric name", my_metric[N:])
    

    Internally, Azure ML concatenates the blocks with the same metric name into a contiguous list.
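
The two-call split shown for oversized metrics generalizes to any number of chunks. The following is a minimal sketch of that idea: chunked and log_list_in_chunks are hypothetical helpers, the default chunk size of 250 is an arbitrary value rather than a documented limit, and RecordingRun is a stand-in used only so the example runs without a workspace. With a real run object, the only call made is run.log_list(name, chunk), as in the snippet above.

```python
def chunked(values, chunk_size):
    """Yield consecutive slices of `values`, each at most `chunk_size` long."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    for start in range(0, len(values), chunk_size):
        yield values[start:start + chunk_size]


def log_list_in_chunks(run, name, values, chunk_size=250):
    """Log a long list under one metric name, one chunk per log_list call."""
    for chunk in chunked(values, chunk_size):
        run.log_list(name, chunk)


class RecordingRun:
    """Stand-in for a run object; it only records the calls it receives."""

    def __init__(self):
        self.calls = []

    def log_list(self, name, value):
        self.calls.append((name, list(value)))


demo_run = RecordingRun()
log_list_in_chunks(demo_run, "my metric name", list(range(5)), chunk_size=2)
print(demo_run.calls)
```

Because every chunk is logged under the same metric name, the chunks end up as one contiguous list, as described above. If the error persists, try a smaller chunk size.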

Automated machine learning

  • Recent upgrade of AutoML dependencies to newer versions breaks compatibility: As of version 1.13.0 of the SDK, models won't load in older SDKs because the older versions pinned in previous packages are incompatible with the newer versions pinned now. You will see errors such as:

    • Module not found: for example, No module named 'sklearn.decomposition._truncated_svd'
    • Import errors: for example, ImportError: cannot import name 'RollingOriginValidator'
    • Attribute errors: for example, AttributeError: 'SimpleImputer' object has no attribute 'add_indicator'

    To work around this issue, take either of the following two steps depending on your AutoML SDK training version:

    1. If your AutoML SDK training version is greater than 1.13.0, you need pandas==0.25.1 and scikit-learn==0.22.1. If there is a version mismatch, upgrade scikit-learn and/or pandas to the correct version as shown below:
       pip install --upgrade pandas==0.25.1
       pip install --upgrade scikit-learn==0.22.1
    
    2. If your AutoML SDK training version is less than or equal to 1.12.0, you need pandas==0.23.4 and scikit-learn==0.20.3. If there is a version mismatch, downgrade scikit-learn and/or pandas to the correct version as shown below:
      pip install --upgrade pandas==0.23.4
      pip install --upgrade scikit-learn==0.20.3
    
  • Forecasting R2 score is always zero: This issue arises if the training data provided has a time series that contains the same value for the last n_cv_splits + forecasting_horizon data points. If this pattern is expected in your time series, you can switch your primary metric to normalized root mean squared error.

  • TensorFlow: As of version 1.5.0 of the SDK, automated machine learning does not install TensorFlow models by default. To install TensorFlow and use it with your automated ML experiments, install tensorflow==1.12.0 via CondaDependencies.

    from azureml.core.runconfig import RunConfiguration
    from azureml.core.conda_dependencies import CondaDependencies
    run_config = RunConfiguration()
    run_config.environment.python.conda_dependencies = CondaDependencies.create(conda_packages=['tensorflow==1.12.0'])
    
  • Experiment charts: Binary classification charts (precision-recall, ROC, gain curve, and so on) shown in automated ML experiment iterations have not rendered correctly in the user interface since April 12. Chart plots currently show inverse results, where better-performing models are shown with lower results. A resolution is under investigation.

  • Databricks cancel an automated machine learning run: When you use automated machine learning capabilities on Azure Databricks, to cancel a run and start a new experiment run, restart your Azure Databricks cluster.

  • Databricks >10 iterations for automated machine learning: In automated machine learning settings, if you have more than 10 iterations, set show_output to False when you submit the run.

  • Databricks widget for the Azure Machine Learning SDK and automated machine learning: The Azure Machine Learning SDK widget isn't supported in a Databricks notebook because the notebooks can't parse HTML widgets. You can view the widget in the portal by using this Python code in your Azure Databricks notebook cell:

    displayHTML("<a href={} target='_blank'>Azure Portal: {}</a>".format(local_run.get_portal_url(), local_run.id))
    
  • automl_setup fails:

    • On Windows, run automl_setup from an Anaconda Prompt. Use this link to install Miniconda.
    • Ensure that 64-bit conda is installed, rather than 32-bit, by running the conda info command. The platform should be win-64 for Windows or osx-64 for Mac.
    • Ensure that conda 4.4.10 or later is installed. You can check the version with the command conda -V. If you have an earlier version installed, update it with the command conda update conda.
    • Linux - gcc: error trying to exec 'cc1plus'
      • If you encounter the error gcc: error trying to exec 'cc1plus': execvp: No such file or directory, install the build essentials with the command sudo apt-get install build-essential.
      • Pass a new name as the first parameter to automl_setup to create a new conda environment. View existing conda environments with conda env list and remove them with conda env remove -n <environmentname>.
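The conda checks above can be scripted. The following is a minimal sketch, not part of automl_setup: the helper names are hypothetical, and the input strings are assumed to be the text printed by conda -V and the platform value shown by conda info. The linux-64 entry is an assumption added for Linux users; the text only names win-64 and osx-64.

```python
import re

MIN_CONDA_VERSION = (4, 4, 10)  # minimum version named in the text
# win-64 and osx-64 come from the text; linux-64 is an added assumption.
SUPPORTED_PLATFORMS = {"win-64", "osx-64", "linux-64"}


def parse_conda_version(version_output):
    """Parse the output of `conda -V`, for example 'conda 4.4.10'."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", version_output)
    if match is None:
        raise ValueError("no version found in: %r" % version_output)
    return tuple(int(part) for part in match.groups())


def conda_is_recent_enough(version_output):
    """Return True if the reported conda version is 4.4.10 or later."""
    return parse_conda_version(version_output) >= MIN_CONDA_VERSION


def platform_is_64_bit(platform):
    """Return True if the platform value from `conda info` is a 64-bit one."""
    return platform.strip() in SUPPORTED_PLATFORMS
```

Comparing version tuples rather than strings avoids ranking 4.10.x below 4.4.x.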
  • automl_setup_linux.sh fails: If automl_setup_linux.sh fails on Ubuntu Linux with the error unable to execute 'gcc': No such file or directory:

    1. Make sure that outbound ports 53 and 80 are enabled. On an Azure VM, you can do this from the Azure portal by selecting the VM and then selecting Networking.
    2. Run the command sudo apt-get update.
    3. Run the command sudo apt-get install build-essential --fix-missing.
    4. Run automl_setup_linux.sh again.
  • configuration.ipynb fails:

    • For local conda, first ensure that automl_setup has run successfully.
    • Ensure that the subscription_id is correct. Find the subscription_id in the Azure portal by selecting All services and then Subscriptions. Do not include the characters "<" and ">" in the subscription_id value. For example, subscription_id = "12345678-90ab-1234-5678-1234567890ab" has the valid format.
    • Ensure that you have Contributor or Owner access to the subscription.
    • Ensure that you have access to the region by using the Azure portal.
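A quick format check can catch a malformed subscription_id before the notebook runs. This is a minimal sketch with a hypothetical helper name; it only verifies the GUID shape (8-4-4-4-12 hexadecimal characters) and cannot tell whether the subscription exists or whether you have access to it.

```python
import uuid


def is_valid_subscription_id(value):
    """Return True if value is a GUID in canonical 8-4-4-4-12 form."""
    try:
        parsed = uuid.UUID(value)
    except (AttributeError, TypeError, ValueError):
        return False
    # uuid.UUID also accepts braces and un-hyphenated input, so compare
    # against the canonical string to reject those variants.
    return str(parsed) == value.lower()
```

A value that still carries the "<" and ">" placeholder characters fails this check.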
  • import AutoMLConfig fails: Automated machine learning version 1.0.76 included package changes that require the previous version to be uninstalled before you update to the new version. If you encounter ImportError: cannot import name AutoMLConfig after upgrading from an SDK version before v1.0.76 to v1.0.76 or later, resolve the error by running pip uninstall azureml-train-automl and then pip install azureml-train-automl. The automl_setup.cmd script does this automatically.

  • workspace.from_config fails: If the call ws = Workspace.from_config() fails:

    1. Ensure that the configuration.ipynb notebook has run successfully.
    2. If the notebook is run from a folder that is not under the folder where configuration.ipynb was run, copy the folder aml_config and the config.json file that it contains to the new folder. Workspace.from_config reads the config.json for the notebook folder or its parent folder.
    3. If you use a new subscription, resource group, workspace, or region, make sure that you run the configuration.ipynb notebook again. Changing config.json directly works only if the workspace already exists in the specified resource group under the specified subscription.
    4. If you want to change the region, change the workspace, resource group, or subscription. Workspace.create will not create or update a workspace that already exists, even if the specified region is different.
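To see which config.json a notebook folder would pick up, you can walk up the folder tree yourself. The sketch below only approximates the lookup described in step 2; it is not the SDK's implementation. The function name is hypothetical, and the exact set of folders the SDK searches is an assumption (here: each folder and its aml_config subfolder).

```python
from pathlib import Path


def find_workspace_config(start_dir, file_name="config.json"):
    """Return the first config.json found at or above start_dir, or None."""
    start = Path(start_dir).resolve()
    for folder in [start, *start.parents]:
        # Check the folder itself and its aml_config subfolder.
        for candidate in (folder / file_name, folder / "aml_config" / file_name):
            if candidate.is_file():
                return candidate
    return None
```

If this returns None for the notebook folder, copying aml_config as described in step 2 is the likely fix.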
  • Sample notebook fails: If a sample notebook fails with an error that a property, method, or library does not exist:

    • Ensure that the correct kernel is selected in the Jupyter notebook. The kernel is displayed in the top right of the notebook page. The default is azure_automl. The kernel is saved as part of the notebook, so if you switch to a new conda environment, you have to select the new kernel in the notebook.
      • For Azure Notebooks, it should be Python 3.6.
      • For local conda environments, it should be the conda environment name that you specified in automl_setup.
    • Ensure that the notebook is for the SDK version that you are using. You can check the SDK version by executing azureml.core.VERSION in a Jupyter notebook cell. You can download previous versions of the sample notebooks from GitHub by clicking the Branch button, selecting the Tags tab, and then selecting the version.
  • Numpy import fails in Windows: Some Windows environments see an error when loading numpy with the latest Python version, 3.6.8. If you see this issue, try Python version 3.6.7.

  • Numpy import fails: Check the TensorFlow version in the automated ML conda environment. Supported versions are < 1.13. If the version is >= 1.13, uninstall TensorFlow from the environment. You can check the TensorFlow version and uninstall it as follows:

    1. Start a command shell and activate the conda environment where the automated ML packages are installed.
    2. Enter pip freeze and look for tensorflow. If it is found, the version listed should be < 1.13.
    3. If the listed version is not a supported version, run pip uninstall tensorflow in the command shell and enter y to confirm.
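The TensorFlow check in the steps above can be sketched as a small parser over the text that pip freeze prints. The helper names are hypothetical, and the sketch assumes the package is listed in the usual name==version form. It only looks at the plain tensorflow package, not variants such as tensorflow-estimator.

```python
import re

MAX_EXCLUSIVE = (1, 13)  # the text lists supported versions as < 1.13


def tensorflow_version(pip_freeze_output):
    """Return (major, minor) for tensorflow from `pip freeze` text, or None."""
    for line in pip_freeze_output.splitlines():
        match = re.match(r"tensorflow==(\d+)\.(\d+)", line.strip(), re.IGNORECASE)
        if match:
            return int(match.group(1)), int(match.group(2))
    return None


def needs_uninstall(pip_freeze_output):
    """Return True if tensorflow is installed at version 1.13 or later."""
    version = tensorflow_version(pip_freeze_output)
    return version is not None and version >= MAX_EXCLUSIVE
```

When needs_uninstall returns True, step 3 applies.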

Deploy & serve models

Take these actions for the following errors:

| Error | Resolution |
| --- | --- |
| Image building failure when deploying web service | Add "pynacl==1.2.1" as a pip dependency to the Conda file for the image configuration. |
| ['DaskOnBatch:context_managers.DaskOnBatch', 'setup.py']' died with <Signals.SIGKILL: 9> | Change the SKU for the VMs used in your deployment to one that has more memory. |
| FPGA failure | You cannot deploy models on FPGAs until you have requested and been approved for FPGA quota. To request access, fill out the quota request form: https://aka.ms/aml-real-time-ai |

Updating Azure Machine Learning components in AKS cluster

Updates to Azure Machine Learning components installed in an Azure Kubernetes Service cluster must be applied manually.

You can apply these updates by detaching the cluster from the Azure Machine Learning workspace and then reattaching the cluster to the workspace. If TLS is enabled in the cluster, you need to supply the TLS/SSL certificate and private key when you reattach the cluster.

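from azureml.core.compute import AksCompute, ComputeTarget

# ws, clusterWorkspaceName, resourceGroup, kubernetesClusterName, and sslCname
# are assumed to be defined earlier for your workspace and cluster.

# Detach the cluster from the workspace.
compute_target = ComputeTarget(workspace=ws, name=clusterWorkspaceName)
compute_target.detach()
compute_target.wait_for_completion(show_output=True)

# Build the configuration used to reattach the same AKS cluster.
attach_config = AksCompute.attach_configuration(resource_group=resourceGroup, cluster_name=kubernetesClusterName)

# If TLS/SSL is enabled, supply the certificate, private key, and CNAME.
attach_config.enable_ssl(
    ssl_cert_pem_file="cert.pem",
    ssl_key_pem_file="key.pem",
    ssl_cname=sslCname)

attach_config.validate_configuration()

# Reattach the cluster under the same compute target name.
compute_target = ComputeTarget.attach(workspace=ws, name=clusterWorkspaceName, attach_configuration=attach_config)
compute_target.wait_for_completion(show_output=True)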

If you no longer have the TLS/SSL certificate and private key, or you are using a certificate generated by Azure Machine Learning, you can retrieve the files before detaching the cluster by connecting to the cluster with kubectl and retrieving the secret azuremlfessl.

kubectl get secret/azuremlfessl -o yaml

Note

Kubernetes stores the secrets in base-64 encoded format. You need to base-64 decode the cert.pem and key.pem components of the secrets before providing them to attach_config.enable_ssl.
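The decoding step can be done with the Python standard library. This is a minimal sketch under two assumptions: the secret was exported as JSON (for example with kubectl get secret/azuremlfessl -o json, so that no YAML parser is needed), and its data section uses the literal keys cert.pem and key.pem. The function names are hypothetical.

```python
import base64
import json
import os


def decode_tls_secret(secret_json):
    """Return the decoded cert.pem and key.pem bytes from a secret's JSON."""
    # Assumes the components live under "data" with these exact key names.
    data = json.loads(secret_json)["data"]
    return {name: base64.b64decode(data[name]) for name in ("cert.pem", "key.pem")}


def write_tls_files(secret_json, out_dir="."):
    """Write the decoded components to cert.pem and key.pem in out_dir."""
    for name, content in decode_tls_secret(secret_json).items():
        with open(os.path.join(out_dir, name), "wb") as handle:
            handle.write(content)
```

The files written by write_tls_files can then be passed as ssl_cert_pem_file and ssl_key_pem_file. Treat key.pem as sensitive and remove it once the cluster is reattached.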

Detaching Azure Kubernetes Service

Using the Azure Machine Learning studio, SDK, or the Azure CLI extension for machine learning to detach an AKS cluster does not delete the AKS cluster. To delete the cluster, see Use the Azure CLI with AKS.

Webservices in Azure Kubernetes Service failures

Many webservice failures in Azure Kubernetes Service can be debugged by connecting to the cluster with kubectl. You can get the kubeconfig.json for an Azure Kubernetes Service cluster by running:

az aks get-credentials -g <rg> -n <aks cluster name>

Authentication errors

If you perform a management operation on a compute target from a remote job, you receive one of the following errors:

{"code":"Unauthorized","statusCode":401,"message":"Unauthorized","details":[{"code":"InvalidOrExpiredToken","message":"The request token was either invalid or expired. Please try again with a valid token."}]}
{"error":{"code":"AuthenticationFailed","message":"Authentication failed."}}

For example, you receive an error if you try to create or attach a compute target from an ML pipeline that is submitted for remote execution.
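When triaging job logs, it can help to recognize these two payloads programmatically. The sketch below is illustrative only: the function name is hypothetical, and the set of codes is taken solely from the two messages shown above, so other authentication failures may use codes it does not know about.

```python
import json

# Error codes that appear in the two payloads shown above.
AUTH_ERROR_CODES = {"Unauthorized", "InvalidOrExpiredToken", "AuthenticationFailed"}


def is_authentication_error(payload):
    """Return True if a JSON error payload carries one of the codes above."""
    try:
        body = json.loads(payload)
    except (TypeError, ValueError):
        return False
    if not isinstance(body, dict):
        return False
    # The second payload shape nests its fields under an "error" key.
    error = body.get("error", body)
    if not isinstance(error, dict):
        return False
    codes = {error.get("code")}
    for detail in error.get("details") or []:
        if isinstance(detail, dict):
            codes.add(detail.get("code"))
    return bool(codes & AUTH_ERROR_CODES)
```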

Missing user interface items in studio

Azure role-based access control can be used to restrict the actions that you can perform with Azure Machine Learning. These restrictions can prevent user interface items from appearing in the Azure Machine Learning studio. For example, if you are assigned a role that cannot create a compute instance, the option to create a compute instance does not appear in the studio.

For more information, see Manage users and roles.

Compute cluster won't resize

If your Azure Machine Learning compute cluster appears stuck at resizing (0 -> 0) for the node state, Azure resource locks may be the cause.

Azure allows you to place locks on resources so that they cannot be deleted or are read-only. Locking a resource can lead to unexpected results. Some operations that don't seem to modify the resource actually require actions that are blocked by the lock.

For example, applying a delete lock to the resource group for your workspace prevents scaling operations for Azure ML compute clusters.

For more information on locking resources, see Lock resources to prevent unexpected changes.
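To check whether a lock might be involved, you can list the locks on the workspace resource group and filter them. This is a minimal sketch with a hypothetical function name. It assumes the input is JSON exported by the Azure CLI (for example az lock list --resource-group <rg> --output json) and that each entry has name and level fields with the values CanNotDelete or ReadOnly.

```python
import json

# Lock levels assumed to be able to block scaling operations.
BLOCKING_LEVELS = {"CanNotDelete", "ReadOnly"}


def blocking_locks(lock_list_json):
    """Return (name, level) pairs for delete and read-only locks."""
    locks = json.loads(lock_list_json)
    return [
        (lock.get("name"), lock.get("level"))
        for lock in locks
        if lock.get("level") in BLOCKING_LEVELS
    ]
```

An empty result suggests that resource locks on that group are not the cause. Locks inherited from the subscription scope would need a separate check.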

Next steps

See more troubleshooting articles for Azure Machine Learning: