Common errors using Azure Data Factory

Azure Data Factory is a managed service that lets you author data pipelines using Azure Databricks notebooks, JARs, and Python scripts. This article describes common issues and solutions.

Cluster could not be created

When you create a data pipeline in Azure Data Factory that uses an Azure Databricks-related activity such as Notebook Activity, you can ask for a new cluster to be created. In Azure, cluster creation can fail for a variety of reasons:

  • Your Azure subscription is limited in the number of virtual machines that can be provisioned.
  • Failed to create cluster because of Azure quota indicates that the subscription you are using does not have enough quota to create the needed resources. For example, if you request 500 cores but your quota is 50 cores, the request will fail. Contact Azure Support to request a quota increase.
  • Azure resource provider is currently under high load and requests are being throttled. This error indicates that your Azure subscription or perhaps even the region is being throttled. Simply retrying the data pipeline may not help. Learn more about this issue at Troubleshooting API throttling errors.
  • Could not launch cluster due to cloud provider failures indicates a generic failure to provision one or more virtual machines for the cluster. Wait and try again later.
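The quota check in the second bullet can be reasoned about ahead of time. This is a minimal sketch, assuming usage entries shaped like the (family name, current value, limit) tuples that Azure's Compute usage API reports per region; the family name `standardDSv3Family` is illustrative.

```python
# Hypothetical helper: decide whether a cluster request fits within the
# subscription's regional vCPU quota before asking Azure Data Factory to
# create the cluster. Usage entries are (family, current, limit) tuples.

def cores_available(usages, family="standardDSv3Family"):
    """Return how many more vCPUs can be provisioned for a VM family."""
    for name, current, limit in usages:
        if name == family:
            return limit - current
    return 0

def request_fits(usages, requested_cores, family="standardDSv3Family"):
    """True if the requested core count stays within the remaining quota."""
    return requested_cores <= cores_available(usages, family)

# Example: a 50-core quota with 10 cores already in use.
usages = [("standardDSv3Family", 10, 50)]
print(request_fits(usages, 32))   # True  -> cluster creation should succeed
print(request_fits(usages, 500))  # False -> "Failed to create cluster because of Azure quota"
```

Checking this before triggering the pipeline fails fast on your side instead of waiting for the cluster-creation error.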

Cluster ran into issues during data pipeline execution

Azure Databricks includes a variety of mechanisms that increase the resilience of your Apache Spark cluster. That said, it cannot recover from every failure, leading to errors like these:

  • Connection refused
  • RPC timed out
  • Exchange times out after X seconds
  • Cluster became unreachable during run
  • Too many execution contexts are open right now
  • Driver was restarted during run
  • Context ExecutionContextId is disconnected
  • Could not reach driver of cluster for X seconds

Most of the time, these errors do not indicate an issue with the underlying infrastructure of Azure. Instead, it is quite likely that the cluster has too many jobs running on it, which can overload the cluster and cause timeouts.

As a general rule, you should move heavier data pipelines to run on their own Azure Databricks clusters. Integrating with Azure Monitor and observing execution metrics with Grafana can provide insight into clusters that are getting overloaded.

Azure Databricks service is experiencing high load

You may notice that certain data pipelines fail with errors like these:

  • The service at {API} is temporarily unavailable
  • Jobs is not fully initialized yet. Please retry later
  • Failed or timeout processing HTTP request
  • No webapps are available to handle your request

These errors indicate that the Azure Databricks service is under heavy load. If this happens, try limiting the number of concurrent data pipelines that include an Azure Databricks activity. For example, if you are performing ETL with 1,000 tables from source to destination, instead of launching a data pipeline per table, either combine multiple tables in one data pipeline or stagger their execution so they don't all trigger at once.
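The "combine multiple tables in one data pipeline" idea above can be sketched as a simple batching step; the table names and batch size here are illustrative.

```python
# Sketch: instead of one pipeline run per table, group tables into batches
# so a single Azure Databricks activity processes many tables per run.

def batch_tables(tables, batch_size):
    """Split a table list into fixed-size batches, one per pipeline run."""
    return [tables[i:i + batch_size] for i in range(0, len(tables), batch_size)]

tables = [f"table_{n:04d}" for n in range(1000)]
batches = batch_tables(tables, 50)   # 20 pipeline runs instead of 1,000
print(len(batches))       # 20
print(batches[0][:3])     # ['table_0000', 'table_0001', 'table_0002']
```

Each batch would then be passed to one pipeline run (for example, as a notebook parameter), cutting the number of concurrent Databricks activities by the batch size.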

Important

Azure Databricks will not allow you to create more than 1,000 Jobs in a 3,600 second window. If you try to do so with Azure Data Factory, your data pipeline will fail.
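A client-side guard for this limit can be sketched as a sliding-window check; this is a minimal illustration, not part of any Databricks or Data Factory API, and the small numbers in the usage example are for demonstration only.

```python
# Sketch: a client-side sliding-window check to stay under the documented
# limit of 1,000 job creations per 3,600-second window. The deque holds
# timestamps of recent job-creation calls; call allow() before each one.

from collections import deque

class JobCreationLimiter:
    def __init__(self, max_jobs=1000, window_seconds=3600):
        self.max_jobs = max_jobs
        self.window = window_seconds
        self.timestamps = deque()

    def allow(self, now):
        """Return True and record the call if it fits in the current window."""
        # Drop timestamps that have aged out of the window.
        while self.timestamps and now - self.timestamps[0] >= self.window:
            self.timestamps.popleft()
        if len(self.timestamps) < self.max_jobs:
            self.timestamps.append(now)
            return True
        return False

# Small numbers for illustration: at most 3 jobs per 10-second window.
limiter = JobCreationLimiter(max_jobs=3, window_seconds=10)
print([limiter.allow(t) for t in (0, 1, 2, 3)])  # [True, True, True, False]
print(limiter.allow(11))  # True -- the window has slid past the first calls
```

When `allow` returns False, the pipeline should delay the job-creation call rather than submit it and fail.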

These errors can also appear if you poll the Databricks Jobs API for job run status too frequently (e.g. every 5 seconds). The remedy is to reduce the frequency of polling.

Library installation timeout

Azure Databricks includes robust support for installing third-party libraries. Unfortunately, you may see issues like this:

  • Failed or timed out installing libraries

This happens because every time you start a cluster with a library attached, Azure Databricks downloads the library from the appropriate repository (such as PyPI). This operation can time out, causing your cluster to fail to start.

There is no simple solution for this problem, other than limiting the number of libraries you attach to clusters.