Unexpected cluster termination

Sometimes a cluster is terminated unexpectedly, not as a result of a manual termination or a configured automatic termination. A cluster can be terminated for many reasons. Some terminations are initiated by Azure Databricks and others are initiated by the cloud provider. This article describes termination reasons and steps for remediation.

Azure Databricks initiated request limit exceeded

To defend against API abuse, ensure quality of service, and prevent you from accidentally creating too many large clusters, Azure Databricks throttles all cluster up-sizing requests, including cluster creation, starting, and resizing. The throttling uses the token bucket algorithm to limit the total number of nodes that anyone can launch over a defined interval across your Databricks deployment, while allowing burst requests of certain sizes. Requests from both the web UI and the APIs are subject to rate limiting. When cluster requests exceed the rate limit, the limit-exceeding request fails with a REQUEST_LIMIT_EXCEEDED error.
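To illustrate how a token bucket can cap the number of nodes launched over an interval while still allowing bursts, here is a minimal sketch. It is not the Azure Databricks implementation: the class name, the capacity, and the refill rate are hypothetical, and the actual limits are not stated in this article.

```python
import time


class TokenBucket:
    """Illustrative token bucket: one token stands for one node launch.

    Tokens refill at a steady rate up to a fixed capacity. The capacity
    bounds the burst size; the refill rate bounds the sustained rate.
    All numbers passed in are hypothetical, not real service limits.
    """

    def __init__(self, capacity, refill_per_second, clock=time.monotonic):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.clock = clock
        self.tokens = float(capacity)  # start full so an initial burst is allowed
        self.last_refill = clock()

    def _refill(self):
        # Add tokens for the time elapsed since the last refill, capped at capacity.
        now = self.clock()
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.last_refill = now

    def try_acquire(self, nodes):
        """Consume `nodes` tokens and return True, or return False if too few remain."""
        self._refill()
        if nodes <= self.tokens:
            self.tokens -= nodes
            return True
        return False  # the caller would see this as a rate-limit rejection
```

In this sketch, a request for more nodes than the bucket currently holds is rejected outright, which mirrors the behavior described above where the limit-exceeding request fails rather than being queued.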

Solution

If you hit the limit during a legitimate workflow, Databricks recommends that you do the following:

  • Retry your request a few minutes later.
  • Spread out your recurring workflow evenly in the planned time frame. For example, instead of scheduling all of your jobs to run at an hourly boundary, try distributing them at different intervals within the hour.
  • Consider using clusters with a larger node type and a smaller number of nodes.
  • Use autoscaling clusters.
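The first two recommendations can be sketched in code. The example below is an assumption-laden illustration, not part of any Databricks SDK: `RequestLimitExceeded` is a hypothetical exception standing in for a REQUEST_LIMIT_EXCEEDED failure, `send_request` is whatever callable submits your cluster request, and the delay values are placeholders.

```python
import time


class RequestLimitExceeded(Exception):
    """Hypothetical exception standing in for a REQUEST_LIMIT_EXCEEDED failure."""


def retry_when_throttled(send_request, max_attempts=5, first_delay_s=60.0, sleep=time.sleep):
    """Call send_request(); if throttled, wait and retry with a doubling delay."""
    delay = first_delay_s
    for attempt in range(1, max_attempts + 1):
        try:
            return send_request()
        except RequestLimitExceeded:
            if attempt == max_attempts:
                raise  # out of attempts: surface the error to the caller
            sleep(delay)
            delay *= 2


def staggered_start_minutes(job_count):
    """Spread hourly jobs evenly across the hour instead of starting all at minute 0."""
    return [(i * 60) // job_count for i in range(job_count)]
```

For example, `staggered_start_minutes(4)` yields minute offsets 0, 15, 30, and 45, which you could use as the minute field when scheduling four hourly jobs.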

If these options don't work for you, contact Azure Databricks Support to request a limit increase for the core instance.

For other Azure Databricks initiated termination reasons, see Termination Code.

Cloud provider initiated terminations

This section lists common cloud provider related termination reasons and remediation steps.

Launch failure

This termination reason occurs when Azure Databricks fails to acquire virtual machines. The error code and message from the API are propagated to help you troubleshoot the issue.

OperationNotAllowed

You have reached a quota limit, usually the number of cores, that your subscription can launch. Request a limit increase in the Azure portal. See Azure subscription and service limits, quotas, and constraints.

PublicIPCountLimitReached

You have reached the limit on the number of public IPs that you can have running. Request a limit increase in the Azure portal.

SkuNotAvailable

The resource SKU you have selected (such as VM size) is not available for the location you have selected. To resolve this, see Resolve errors for SKU not available.

ReadOnlyDisabledSubscription

Your subscription was disabled. To reactivate it, follow the steps in Why is my Azure subscription disabled and how do I reactivate it?

ResourceGroupBeingDeleted

This error can occur if someone cancels your Azure Databricks workspace in the Azure portal while you are trying to create a cluster. The cluster fails because the resource group is being deleted.

SubscriptionRequestsThrottled

Your subscription is hitting the Azure Resource Manager request limit (see Throttling Resource Manager requests). The typical cause is that another system (outside Azure Databricks) is making a lot of API calls to Azure. Contact Azure support to identify this system, and then reduce the number of API calls.
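If you triage launch failures in scripts, the error codes above can be kept in a simple lookup. The following is a sketch only: the function and dictionary names are hypothetical, the remediation strings paraphrase this article, and the code assumes you have already extracted the error code from the propagated message (the message format is not specified here).

```python
# Remediation hints paraphrased from this article, keyed by launch failure code.
LAUNCH_FAILURE_REMEDIATION = {
    "OperationNotAllowed": "Quota limit reached (usually cores); request a limit increase in the Azure portal.",
    "PublicIPCountLimitReached": "Public IP limit reached; request a limit increase in the Azure portal.",
    "SkuNotAvailable": "Selected SKU is not available in the selected location; see Resolve errors for SKU not available.",
    "ReadOnlyDisabledSubscription": "Subscription is disabled; reactivate the subscription.",
    "ResourceGroupBeingDeleted": "Resource group is being deleted, for example because the workspace was canceled.",
    "SubscriptionRequestsThrottled": "Azure Resource Manager request limit hit; identify the system making the calls and reduce them.",
}


def remediation_for(error_code):
    """Return the remediation hint for a launch failure code, or a generic fallback."""
    return LAUNCH_FAILURE_REMEDIATION.get(
        error_code,
        "No remediation listed in this article; check the propagated error message.",
    )
```

Keeping a fallback for unknown codes matters because this article lists only common causes, not every code the cloud provider can return.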

Communication lost

Azure Databricks was able to launch the cluster, but lost the connection to the instance hosting the Spark driver.

This is caused by the driver virtual machine going down or by a networking issue.