Known issues and troubleshooting in Azure Machine Learning

This article helps you troubleshoot known issues you may encounter when using Azure Machine Learning.

For more information on troubleshooting, see Next steps at the end of this article.

Tip

Errors or other issues might be the result of resource quotas you encounter when working with Azure Machine Learning.

Access diagnostic logs

Sometimes it can be helpful if you can provide diagnostic information when asking for help. To see some logs in the studio (a programmatic alternative is sketched after these steps):

  1. Visit Azure Machine Learning studio.
  2. On the left-hand side, select Experiments.
  3. Select an experiment.
  4. Select a run.
  5. On the top, select Outputs + logs.
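
If you prefer to pull the same logs with the SDK, the following is a minimal sketch; the experiment name and run ID are placeholders, and it assumes a workspace config.json is available locally.

from azureml.core import Workspace, Experiment, Run

ws = Workspace.from_config()                                  # reads config.json for the workspace
experiment = Experiment(workspace=ws, name="my-experiment")   # placeholder experiment name
run = Run(experiment=experiment, run_id="my-run-id")          # placeholder run ID

print(run.get_details())                     # status, error information, and the list of log files
run.get_all_logs(destination="./run_logs")   # download all log files for offline inspection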

Note

Azure Machine Learning logs information from a variety of sources during training, such as AutoML or the Docker container that runs the training job. Many of these logs are not documented. If you encounter problems and contact Microsoft support, they may be able to use these logs during troubleshooting.

Installation and import

  • Pip installation: Dependencies are not guaranteed to be consistent with single-line installation:

    This is a known limitation of pip, as it does not have a functioning dependency resolver when you install as a single line. The first unique dependency is the only one it looks at.

    In the following code, azureml-datadrift and azureml-train-automl are both installed using a single-line pip install.

      pip install azureml-datadrift azureml-train-automl
    

    For this example, let's say azureml-datadrift requires version > 1.0 and azureml-train-automl requires version < 1.2. If the latest version of azureml-datadrift is 1.3, then both packages get upgraded to 1.3, regardless of the azureml-train-automl package requirement for an older version.

    To ensure the appropriate versions are installed for your packages, install using multiple lines, as in the following code. Order isn't an issue here, since pip explicitly downgrades as part of the next line call, so the appropriate version dependencies are applied.

       pip install azureml-datadrift
       pip install azureml-train-automl 
    
  • Explanation package not guaranteed to be installed when installing azureml-train-automl-client:

    When running a remote AutoML run with model explanation enabled, you will see the error message "Please install azureml-explain-model package for model explanations." This is a known issue. As a workaround, follow one of the steps below:

    1. Install azureml-explain-model locally.
        pip install azureml-explain-model
    
    2. Disable the explainability feature entirely by passing model_explainability=False in the AutoML configuration.
        automl_config = AutoMLConfig(task = 'classification',
                                     path = '.',
                                     debug_log = 'automated_ml_errors.log',
                                     compute_target = compute_target,
                                     run_configuration = aml_run_config,
                                     featurization = 'auto',
                                     model_explainability=False,
                                     training_data = prepped_data,
                                     label_column_name = 'Survived',
                                     **automl_settings)
    
  • Pandas errors: Typically seen during an AutoML experiment:

    When manually setting up your environment using pip, you may notice attribute errors (especially from pandas) due to unsupported package versions being installed. To prevent such errors, install the AutoML SDK using automl_setup.cmd:

    1. Open an Anaconda prompt and clone the GitHub repository for a set of sample notebooks.
    git clone https://github.com/Azure/MachineLearningNotebooks.git
    
    2. cd to the how-to-use-azureml/automated-machine-learning folder where the sample notebooks were extracted, and then run:
    automl_setup
    
  • Error message: Cannot uninstall 'PyYAML'

    Azure Machine Learning SDK for Python: PyYAML is a distutils-installed project. Therefore, we cannot accurately determine which files belong to it if there is a partial uninstall. To continue installing the SDK while ignoring this error, use:

    pip install --upgrade azureml-sdk[notebooks,automl] --ignore-installed PyYAML
    
    
    

Create and manage workspaces

Warning

Moving your Azure Machine Learning workspace to a different subscription, or moving the owning subscription to a new tenant, is not supported. Doing so may cause errors.

  • Azure portal: If you go directly to view your workspace from a share link from the SDK or the portal, you will not be able to view the normal Overview page with subscription information in the extension. You will also not be able to switch into another workspace. If you need to view another workspace, go directly to Azure Machine Learning studio and search for the workspace name.

  • Supported browsers in the Azure Machine Learning studio web portal: We recommend that you use the most up-to-date browser that's compatible with your operating system. The following browsers are supported:

    • Microsoft Edge (the new Microsoft Edge, latest version; not Microsoft Edge legacy)
    • Safari (latest version, Mac only)
    • Chrome (latest version)
    • Firefox (latest version)

Set up your environment

  • Trouble creating AmlCompute: There is a rare chance that some users who created their Azure Machine Learning workspace from the Azure portal before the GA release might not be able to create AmlCompute in that workspace. You can either raise a support request against the service, or create a new workspace through the portal or the SDK to unblock yourself immediately (a minimal SDK sketch follows).
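
If you take the new-workspace route with the SDK, a minimal sketch is shown below; the workspace name, resource group, subscription ID, and region are placeholders you would replace with your own values.

from azureml.core import Workspace

# Placeholder names and region; substitute your own values.
ws = Workspace.create(name="my-new-workspace",
                      subscription_id="<subscription-id>",
                      resource_group="my-resource-group",
                      create_resource_group=True,
                      location="eastus")
ws.write_config()   # save config.json so later scripts can call Workspace.from_config()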

Work with data

Overloaded AzureFile storage

If you receive the error Unable to upload project files to working directory in AzureFile because the storage is overloaded, apply the following workarounds.

If you are using the file share for other workloads, such as data transfer, the recommendation is to use blobs so that the file share is free to be used for submitting runs; a sketch of pointing run submission at the blob store follows. You may also split the workload between two different workspaces.
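
One way to keep run submissions off the file share is to stage the project files in the workspace blob store through the run configuration. The following is a minimal sketch, assuming the azureml-core SDK of this era and that the source_directory_data_store setting and the default workspaceblobstore datastore are available in your workspace; the script and experiment names are placeholders.

from azureml.core import Workspace, Experiment, ScriptRunConfig
from azureml.core.runconfig import RunConfiguration

ws = Workspace.from_config()

run_config = RunConfiguration()
# Stage the project (source directory) in the workspace blob store instead of the AzureFile share.
run_config.source_directory_data_store = "workspaceblobstore"

src = ScriptRunConfig(source_directory=".", script="train.py", run_config=run_config)
run = Experiment(ws, "file-share-workaround").submit(src)   # placeholder experiment name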

Passing data as input

  • TypeError: FileNotFound: No such file or directory: This error occurs if the file path you provide isn't where the file is located. You need to make sure the way you refer to the file is consistent with where you mounted your dataset on your compute target. To ensure a deterministic state, we recommend using the absolute path when mounting a dataset to a compute target. For example, in the following code we mount the dataset under the root of the filesystem of the compute target, /tmp.

    # Note the leading / in '/tmp/dataset'
    script_params = {
        '--data-folder': dset.as_named_input('dogscats_train').as_mount('/tmp/dataset'),
    } 
    

    If you don't include the leading forward slash, '/', you'll need to prefix the working directory, for example /mnt/batch/.../tmp/dataset, on the compute target to indicate where you want the dataset to be mounted.
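
    Inside the training script, the mounted path arrives as an ordinary script argument, so reading it back is plain argument parsing. The following is a minimal sketch, assuming a script that receives the --data-folder argument shown above.

    import argparse
    import os

    parser = argparse.ArgumentParser()
    parser.add_argument('--data-folder', type=str, dest='data_folder',
                        help='Mount point of the input dataset')
    args = parser.parse_args()

    # With as_mount('/tmp/dataset'), the files appear under this absolute path on the compute target.
    print("Dataset mounted at:", args.data_folder)
    print("Contents:", os.listdir(args.data_folder))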

Data labeling projects

Issue: Only datasets created on blob datastores can be used.
Resolution: This is a known limitation of the current release.

Issue: After creation, the project shows "Initializing" for a long time.
Resolution: Manually refresh the page. Initialization should proceed at roughly 20 datapoints per second. The lack of autorefresh is a known issue.

Issue: When reviewing images, newly labeled images are not shown.
Resolution: To load all labeled images, choose the First button. The First button takes you back to the front of the list, but loads all labeled data.

Issue: Pressing the Esc key while labeling for object detection creates a zero-size label in the top-left corner. Submitting labels in this state fails.
Resolution: Delete the label by clicking the cross mark next to it.

Data drift monitors

Limitations and known issues for data drift monitors:

  • The time range when analyzing historical data is limited to 31 intervals of the monitor's frequency setting.

  • Limitation of 200 features, unless a feature list is not specified (that is, all features are used).

  • Compute size must be large enough to handle the data.

  • Ensure your dataset has data within the start and end date for a given monitor run.

  • Dataset monitors will only work on datasets that contain 50 rows or more.

  • Columns, or features, in the dataset are classified as categorical or numeric based on the conditions in the following table. If the feature does not meet these conditions - for instance, a column of type string with more than 100 unique values - the feature is dropped from our data drift algorithm, but is still profiled.

    Feature type: Categorical. Data type: string, bool, int, float. Condition: the number of unique values in the feature is less than 100 and less than 5% of the number of rows. Limitation: null is treated as its own category.
    Feature type: Numerical. Data type: int, float. Condition: the values in the feature are of a numerical data type and do not meet the condition for a categorical feature. Limitation: the feature is dropped if more than 15% of its values are null.
  • If you have created a data drift monitor but cannot see data on the Dataset monitors page in Azure Machine Learning studio, try the following:

    1. Check if you have selected the right date range at the top of the page.
    2. On the Dataset monitors tab, select the experiment link to check the run status. This link is on the far right of the table.
    3. If the run completed successfully, check the driver logs to see how many metrics were generated or whether there are any warning messages. Find the driver logs in the Outputs + logs tab after you select an experiment.
  • If the SDK backfill() function does not generate the expected output, it may be due to an authentication issue. When you create the compute to pass into this function, do not use Run.get_context().experiment.workspace.compute_targets. Instead, use ServicePrincipalAuthentication, as in the following code, to create the compute that you pass into that backfill() function:

    from azureml.core import Workspace
    from azureml.core.authentication import ServicePrincipalAuthentication

    auth = ServicePrincipalAuthentication(
            tenant_id=tenant_id,
            service_principal_id=app_id,
            service_principal_password=client_secret
            )
    ws = Workspace.get("xxx", auth=auth, subscription_id="xxx", resource_group="xxx")
    compute = ws.compute_targets.get("xxx")
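
    With that compute object in hand, a minimal backfill sketch might look like the following; the monitor name and date range are placeholders, ws and compute come from the snippet above, and it assumes backfill() returns a run object you can wait on.

    from datetime import datetime
    from azureml.datadrift import DataDriftDetector

    # Placeholder monitor name and date range.
    monitor = DataDriftDetector.get_by_name(ws, name="my-datadrift-monitor")
    backfill_run = monitor.backfill(datetime(2020, 1, 1), datetime(2020, 1, 31), compute)
    backfill_run.wait_for_completion(show_output=True)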
    

Azure Machine Learning designer

  • Long compute preparation time:

It may take a few minutes or even longer when you first connect to or create a compute target.

With the Model Data Collector, it can take up to (but usually less than) 10 minutes for data to arrive in your blob storage account. Wait 10 minutes to ensure the cells below will run.

import time
time.sleep(600)
  • Logs for real-time endpoints:

Logs of real-time endpoints are customer data. For real-time endpoint troubleshooting, you can use the following code to enable logs.

See more details about monitoring web service endpoints in this article.

from azureml.core import Workspace
from azureml.core.webservice import Webservice

ws = Workspace.from_config()
service = Webservice(name="service-name", workspace=ws)
logs = service.get_logs()

If you have multiple tenants, you may need to add the following authentication code before ws = Workspace.from_config():

from azureml.core.authentication import InteractiveLoginAuthentication
interactive_auth = InteractiveLoginAuthentication(tenant_id="the tenant_id in which your workspace resides")
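
The resulting auth object is then passed in when you load the workspace, for example:

ws = Workspace.from_config(auth=interactive_auth)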

Train models

  • ModuleErrors (No module named): If you are running into ModuleErrors while submitting experiments in Azure ML, it means that the training script expects a package to be installed but it isn't added. Once you provide the package name, Azure ML installs the package in the environment used for your training run.

    If you are using Estimators to submit experiments, you can specify a package name via the pip_packages or conda_packages parameter in the estimator, based on which source you want to install the package from. You can also specify a yml file with all your dependencies using conda_dependencies_file, or list all your pip requirements in a txt file using the pip_requirements_file parameter. If you have your own Azure ML Environment object with which you want to override the default image used by the estimator, you can specify that environment via the environment parameter of the estimator constructor. (A minimal estimator sketch appears after this list.)

    Azure ML also provides framework-specific estimators for TensorFlow, PyTorch, Chainer, and SKLearn. Using these estimators ensures that the core framework dependencies are installed on your behalf in the environment used for training. You have the option to specify extra dependencies as described above.

    Azure ML maintained Docker images and their contents can be seen in AzureML Containers. Framework-specific dependencies are listed in the respective framework documentation - Chainer, PyTorch, TensorFlow, SKLearn.

    Note

    If you think a particular package is common enough to be added to Azure ML maintained images and environments, please raise a GitHub issue in AzureML Containers.

  • NameError (Name not defined), AttributeError (Object has no attribute): These exceptions typically come from your training scripts. You can look at the log files from the Azure portal to get more information about the specific name not defined or attribute error. From the SDK, you can use run.get_details() to look at the error message. This will also list all the log files generated for your run. Please make sure to look at your training script and fix the error before resubmitting your run.

  • Horovod has been shut down: In most cases, if you encounter "AbortedError: Horovod has been shut down", this exception means there was an underlying exception in one of the processes that caused Horovod to shut down. Each rank in the MPI job gets its own dedicated log file in Azure ML. These logs are named 70_driver_logs. In the case of distributed training, the log names are suffixed with _rank to make it easier to differentiate the logs. To find the exact error that caused Horovod to shut down, go through all the log files and look for Traceback at the end of the driver_log files. One of these files will give you the actual underlying exception.

  • Run or experiment deletion: Experiments can be archived by using the Experiment.archive method, or from the Experiments tab view in the Azure Machine Learning studio client via the "Archive experiment" button. This action hides the experiment from list queries and views, but does not delete it.

    Permanent deletion of individual experiments or runs is not currently supported. For more information on deleting Workspace assets, see Export or delete your Machine Learning service workspace data.

  • Metric document is too large: Azure Machine Learning has internal limits on the size of metric objects that can be logged at once from a training run. If you encounter a "Metric Document is too large" error when logging a list-valued metric, try splitting the list into smaller chunks, for example:

    run.log_list("my metric name", my_metric[:N])
    run.log_list("my metric name", my_metric[N:])
    

    Internally, Azure ML concatenates the blocks with the same metric name into a contiguous list.
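
Returning to the ModuleErrors item at the top of this list, the following is a minimal estimator sketch with explicit package dependencies; the script name, compute target, packages, and experiment name are placeholders, not values from this article.

from azureml.core import Workspace, Experiment
from azureml.train.estimator import Estimator

ws = Workspace.from_config()

# Placeholder script, compute target, and packages.
est = Estimator(source_directory='.',
                entry_script='train.py',
                compute_target='cpu-cluster',
                pip_packages=['pandas', 'scikit-learn'],
                conda_packages=['numpy'])

run = Experiment(ws, 'module-error-repro').submit(est)
run.wait_for_completion(show_output=True)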

Automated machine learning

  • TensorFlow: As of version 1.5.0 of the SDK, automated machine learning does not install TensorFlow models by default. To install TensorFlow and use it with your automated ML experiments, install tensorflow==1.12.0 via CondaDependencies (a usage note follows this list).

    from azureml.core.runconfig import RunConfiguration
    from azureml.core.conda_dependencies import CondaDependencies
    run_config = RunConfiguration()
    run_config.environment.python.conda_dependencies = CondaDependencies.create(conda_packages=['tensorflow==1.12.0'])
    
  • Experiment charts: Binary classification charts (precision-recall, ROC, gain curve, and so on) shown in automated ML experiment iterations have not been rendering correctly in the user interface since 4/12. Chart plots currently show inverse results, where better-performing models are shown with lower results. A resolution is under investigation.
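
As a follow-up to the TensorFlow item above, the RunConfiguration built there can be passed to the AutoML configuration through its run_configuration parameter, mirroring the AutoMLConfig example earlier in this article; prepped_data and 'Survived' are placeholders reused from that example.

from azureml.train.automl import AutoMLConfig

automl_config = AutoMLConfig(task='classification',
                             training_data=prepped_data,      # placeholder dataset from the earlier example
                             label_column_name='Survived',    # placeholder label column
                             run_configuration=run_config)    # the RunConfiguration with tensorflow==1.12.0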

Deploy and serve models

Take these actions for the following errors:

Error: Image building failure when deploying web service.
Resolution: Add "pynacl==1.2.1" as a pip dependency to the Conda file for the image configuration (see the sketch following this table).

Error: ['DaskOnBatch:context_managers.DaskOnBatch', 'setup.py']' died with <Signals.SIGKILL: 9>
Resolution: Change the SKU for the VMs used in your deployment to one that has more memory.

Error: FPGA failure.
Resolution: You will not be able to deploy models on FPGAs until you have requested and been approved for FPGA quota. To request access, fill out the quota request form: https://aka.ms/aml-real-time-ai
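
For the image build failure, one way to add the pynacl pin is through CondaDependencies on an environment used for deployment. This is a minimal sketch, not necessarily the exact image-configuration flow the resolution above refers to; the environment name is a placeholder.

from azureml.core import Environment
from azureml.core.conda_dependencies import CondaDependencies

env = Environment(name="deploy-env")                # placeholder environment name
conda_dep = CondaDependencies()
conda_dep.add_pip_package("pynacl==1.2.1")          # workaround for the image build failure
env.python.conda_dependencies = conda_dep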

Updating Azure Machine Learning components in an AKS cluster

Updates to Azure Machine Learning components installed in an Azure Kubernetes Service cluster must be manually applied.

You can apply these updates by detaching the cluster from the Azure Machine Learning workspace, and then reattaching the cluster to the workspace. If TLS is enabled in the cluster, you will need to supply the TLS/SSL certificate and private key when reattaching the cluster.

from azureml.core.compute import AksCompute, ComputeTarget

# Detach the existing AKS compute target from the workspace.
compute_target = ComputeTarget(workspace=ws, name=clusterWorkspaceName)
compute_target.detach()
compute_target.wait_for_completion(show_output=True)

# Reattach the AKS cluster to the workspace.
attach_config = AksCompute.attach_configuration(resource_group=resourceGroup, cluster_name=kubernetesClusterName)

## If SSL is enabled.
attach_config.enable_ssl(
    ssl_cert_pem_file="cert.pem",
    ssl_key_pem_file="key.pem",
    ssl_cname=sslCname)

attach_config.validate_configuration()

compute_target = ComputeTarget.attach(workspace=ws, name=clusterWorkspaceName, attach_configuration=attach_config)
compute_target.wait_for_completion(show_output=True)

If you no longer have the TLS/SSL certificate and private key, or you are using a certificate generated by Azure Machine Learning, you can retrieve the files prior to detaching the cluster by connecting to the cluster using kubectl and retrieving the secret azuremlfessl.

kubectl get secret/azuremlfessl -o yaml

Note

Kubernetes stores secrets in base-64 encoded format. You will need to base-64 decode the cert.pem and key.pem components of the secret prior to providing them to attach_config.enable_ssl.
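
A minimal decoding sketch in Python, assuming you have copied the base-64 strings out of the kubectl output; the placeholder strings below are where those values go.

import base64

# Paste the base-64 strings from the azuremlfessl secret here (placeholders shown).
cert_b64 = "<cert.pem value from the secret>"
key_b64 = "<key.pem value from the secret>"

with open("cert.pem", "wb") as f:
    f.write(base64.b64decode(cert_b64))
with open("key.pem", "wb") as f:
    f.write(base64.b64decode(key_b64))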

Webservice failures in Azure Kubernetes Service

Many webservice failures in Azure Kubernetes Service can be debugged by connecting to the cluster using kubectl. You can get the kubeconfig.json for an Azure Kubernetes Service cluster by running:

az aks get-credentials -g <rg> -n <aks cluster name>

Authentication errors

If you perform a management operation on a compute target from a remote job, you will receive one of the following errors:

{"code":"Unauthorized","statusCode":401,"message":"Unauthorized","details":[{"code":"InvalidOrExpiredToken","message":"The request token was either invalid or expired. Please try again with a valid token."}]}
{"error":{"code":"AuthenticationFailed","message":"Authentication failed."}}

For example, you will receive an error if you try to create or attach a compute target from an ML pipeline that is submitted for remote execution.

Missing user interface items in studio

Azure role-based access control can be used to restrict the actions that you can perform with Azure Machine Learning. These restrictions can prevent user interface items from appearing in the Azure Machine Learning studio. For example, if you are assigned a role that cannot create a compute instance, the option to create a compute instance will not appear in the studio.

For more information, see Manage users and roles.

Next steps

See more troubleshooting articles for Azure Machine Learning: