Cluster failed to launch

This article describes several scenarios in which a cluster fails to launch, and provides troubleshooting steps for each scenario based on error messages found in logs.

Cluster timeout

Error messages:

Driver failed to start in time

INTERNAL_ERROR: The Spark driver failed to start within 300 seconds

Cluster failed to be healthy within 200 seconds

Cause

The cluster can fail to launch if it has a connection to an external Hive metastore and it tries to download all of the Hive metastore libraries from a Maven repository. A cluster downloads almost 200 JAR files, including dependencies. If the Azure Databricks cluster manager cannot confirm that the driver is ready within 5 minutes, cluster launch fails. This can occur because the JAR download takes too long.

Solution

Store the Hive libraries in DBFS and access them locally from the DBFS location. See Spark Options.
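As a minimal sketch (the DBFS path and Hive version shown here are hypothetical placeholders for your environment), the idea is to copy the metastore JARs into DBFS once, then point the cluster at that local path in the cluster's Spark configuration instead of letting it resolve everything from Maven at startup:

```
# Cluster Spark config (example values — adjust to your metastore version and path)
spark.sql.hive.metastore.version 2.3.9
spark.sql.hive.metastore.jars /dbfs/hive_metastore_jars/*
```

Because the JARs are read from DBFS rather than downloaded over the network, the driver can come up well within the 5-minute confirmation window.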

Global or cluster-specific init scripts

Error message:

The cluster could not be started in 50 minutes. Cause: Timed out with exception after <xxx> attempts

Cause

Init scripts that run during the cluster spin-up stage send an RPC (remote procedure call) to each worker machine to run the scripts locally. All RPCs must return their status before the process continues. If any RPC hits an issue and doesn't respond (due to a transient networking issue, for example), the 1-hour timeout can be hit, causing the cluster setup job to fail.

Solution

Use a cluster-scoped init script instead of global or cluster-named init scripts. With cluster-scoped init scripts, Azure Databricks does not use synchronous blocking of RPCs to fetch init script execution status.
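A cluster-scoped init script is attached in the cluster definition itself rather than installed globally. As a sketch (the script path below is hypothetical), a cluster spec submitted through the Clusters API would carry the script reference like this:

```json
{
  "init_scripts": [
    {
      "dbfs": { "destination": "dbfs:/databricks/init-scripts/install-libs.sh" }
    }
  ]
}
```

The same setting is available in the cluster UI under Advanced Options > Init Scripts, so no global script directory is involved.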

Too many libraries installed in cluster UI

Error message:

Library installation timed out after 1800 seconds. Libraries that are not yet installed:

Cause

This is usually an intermittent problem caused by network issues.

Solution

Usually you can fix this problem by re-running the job or restarting the cluster.

The library installer is configured to time out after 3 minutes. While fetching and installing JARs, a timeout can occur due to network problems. To mitigate this issue, you can download the libraries from Maven to a DBFS location and install them from there.
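Maven Central lays artifacts out in a predictable path, so the download URL can be built directly from the library coordinate. A small sketch (the coordinate used in the example is arbitrary; the `/dbfs/FileStore/jars/` target path is a common but not mandatory convention):

```python
def maven_central_url(group_id: str, artifact_id: str, version: str) -> str:
    """Return the Maven Central download URL for a JAR coordinate.

    The repository layout is:
    <repo>/<group with dots as slashes>/<artifact>/<version>/<artifact>-<version>.jar
    """
    group_path = group_id.replace(".", "/")
    return (
        "https://repo1.maven.org/maven2/"
        f"{group_path}/{artifact_id}/{version}/{artifact_id}-{version}.jar"
    )


# Example coordinate org.postgresql:postgresql:42.2.5
url = maven_central_url("org.postgresql", "postgresql", "42.2.5")

# In a notebook you could then fetch the JAR and place it in DBFS through the
# local FUSE mount, e.g.:
#   import urllib.request, shutil
#   local_path, _ = urllib.request.urlretrieve(url)
#   shutil.copy(local_path, "/dbfs/FileStore/jars/postgresql-42.2.5.jar")
```

Once the JAR sits in DBFS, install it on the cluster as a DBFS library instead of a Maven library, which removes the network fetch from the install step.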

Cloud provider limit

Error message:

Cluster terminated. Reason: Cloud Provider Limit

Cause

This error is usually returned by the cloud provider.

Solution

See the cloud provider error information in Cluster unexpected termination.

Cloud provider shutdown

Error message:

Cluster terminated. Reason: Cloud Provider Shutdown

Cause

This error is usually returned by the cloud provider.

Solution

See the cloud provider error information in Cluster unexpected termination.

Instances unreachable

Error message:

Cluster terminated. Reason: Instances Unreachable

An unexpected error was encountered while setting up the cluster. Please retry and contact Azure Databricks if the problem persists. Internal error message: Timeout while placing node

Cause

This error is usually returned by the cloud provider. Typically, it occurs when an Azure Databricks workspace is deployed to your own virtual network (VNet), as opposed to the default VNet created when you launch a new workspace. If the VNet where the workspace is deployed is already peered or has an ExpressRoute connection to on-premises resources, the VNet cannot make an SSH connection to the cluster nodes when Azure Databricks attempts to create the cluster.

Solution

Add a user-defined route (UDR) to give the Azure Databricks control plane SSH access to the cluster instances, Blob storage instances, and artifact resources. This custom UDR allows outbound connections and does not interfere with cluster creation. For detailed UDR instructions, see Step 3: Create user-defined routes and associate them with your Azure Databricks virtual network subnets. For more VNet-related troubleshooting information, see Troubleshooting.
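As a sketch only (resource-group and route-table names here are hypothetical, and the actual control-plane IP for your region must be taken from the Azure Databricks UDR documentation linked above), one such route can be provisioned with the Azure CLI:

```shell
# Route Databricks control-plane traffic directly to the internet instead of
# through the on-premises gateway, so SSH to cluster nodes is not blackholed.
az network route-table route create \
  --resource-group my-databricks-rg \
  --route-table-name databricks-route-table \
  --name to-databricks-control-plane \
  --address-prefix <control-plane-IP>/32 \
  --next-hop-type Internet
```

The route table must then be associated with both Azure Databricks subnets for the route to take effect.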