Migrate production workloads to Azure Databricks

This guide explains how to move your production jobs from Apache Spark on other platforms to Apache Spark on Azure Databricks.

Concepts

Databricks job

A single unit of code that you can bundle and submit to Azure Databricks. An Azure Databricks job is equivalent to a Spark application with a single SparkContext. The entry point can be in a library (for example, JAR, egg, wheel) or a notebook. You can run Azure Databricks jobs on a schedule with sophisticated retry and alerting mechanisms. The primary interfaces for running jobs are the Jobs API and the UI.

Pool

A set of instances in your account that are managed by Azure Databricks but incur no Azure Databricks charges when they are idle. Submitting multiple jobs on a pool ensures your jobs start quickly. You can set guardrails (instance types, instance limits, and so on) and autoscaling policies for the pool of instances. A pool is equivalent to an autoscaling cluster on other Spark platforms.

Migration steps

This section provides the steps for moving your production jobs to Azure Databricks.

Step 1: Create a pool

Create an autoscaling pool. This is equivalent to creating an autoscaling cluster on other Spark platforms. On other platforms, if instances in the autoscaling cluster are idle for a few minutes or hours, you pay for them. Azure Databricks manages the instance pool for you for free: you don't pay Azure Databricks while these machines are not in use; you pay only the cloud provider. Azure Databricks charges apply only while jobs run on the instances. A sample pool specification is shown after the configuration list below.

Create pool

Key configurations:

  • Min Idle: Number of standby instances, not in use by jobs, that the pool maintains. You can set this to 0.
  • Max Capacity: This is an optional field. If you already have cloud provider instance limits set, you can leave this field empty. If you want to set an additional maximum, set a high value so that a large number of jobs can share the pool.
  • Idle Instance Auto Termination: Instances beyond the Min Idle count are released back to the cloud provider once they have been idle for the specified period. The higher the value, the longer instances stay ready and the faster your jobs start.
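
If you prefer to script this step, a pool with these settings can also be created through the Instance Pools API or the Databricks CLI. The following is a minimal sketch; the pool name, node type, and limits are placeholder values that you would replace with your own.

    databricks instance-pools create --json
    
    {
      "instance_pool_name": "jobs-pool",
      "node_type_id": "Standard_DS3_v2",
      "min_idle_instances": 0,
      "max_capacity": 100,
      "idle_instance_autotermination_minutes": 60
    }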

Step 2: Run a job on a pool

You can run a job on a pool using the Jobs API or the UI. You must provide a cluster spec for each job you run. When a job is about to start, Azure Databricks automatically creates a new cluster from the pool. The cluster is automatically terminated when the job finishes. You are charged only for the time your job actually runs. This is the most cost-effective way to run jobs on Azure Databricks. Each new cluster has:

  • One associated SparkContext, which is equivalent to a Spark application on other Spark platforms.
  • A driver node and a specified number of workers. For a single job, you can specify a worker range. Azure Databricks autoscales a single Spark job based on the resources needed for that job. Azure Databricks benchmarks show that this can save you up to 30% on cloud costs, depending on the nature of your job.

There are three ways to run jobs on a pool: the API/CLI, Airflow, and the UI.

API / CLI

  1. Download and configure the Databricks CLI.

  2. Run the following command to submit your code one time. The API returns a URL that you can use to track the progress of the job run.

    databricks runs submit --json
    
    {
      "run_name": "my spark job",
      "new_cluster": {
        "spark_version": "5.0.x-scala2.11",
        "instance_pool_id": "0313-121005-test123-pool-ABCD1234",
        "num_workers": 10
      },
      "libraries": [
        {
          "jar": "dbfs:/my-jar.jar"
        }
      ],
      "timeout_seconds": 3600,
      "spark_jar_task": {
        "main_class_name": "com.databricks.ComputeModels"
      }
    }
    
  3. To schedule a job, use the following example. Jobs created through this mechanism are displayed in the jobs list page. The return value is a job_id that you can use to look at the status of all the runs.

    databricks jobs create --json
    
    {
      "name": "Nightly model training",
      "new_cluster": {
        "spark_version": "5.0.x-scala2.11",
        ...
        "instance_pool_id": "0313-121005-test123-pool-ABCD1234",
        "num_workers": 10
      },
      "libraries": [
        {
          "jar": "dbfs:/my-jar.jar"
        }
      ],
      "email_notifications": {
        "on_start": ["john@foo.com"],
        "on_success": ["sally@foo.com"],
        "on_failure": ["bob@foo.com"]
      },
      "timeout_seconds": 3600,
      "max_retries": 2,
      "schedule": {
        "quartz_cron_expression": "0 15 22 ? * *",
        "timezone_id": "America/Los_Angeles"
      },
      "spark_jar_task": {
        "main_class_name": "com.databricks.ComputeModels"
      }
    }
    

If you use spark-submit to submit Spark jobs, the following table shows how spark-submit parameters map to arguments in the Jobs Create API.

| spark-submit parameter | How it applies on Azure Databricks |
| --- | --- |
| --class | Use the Spark JAR task to provide the main class name and the parameters. |
| --jars | Use the libraries argument to provide the list of dependencies. |
| --py-files | For Python jobs, use the Spark Python task. You can use the libraries argument to provide egg or wheel dependencies. |
| --master | In the cloud, you don't need to manage a long-running master node. All instances and jobs are managed by Azure Databricks services. Ignore this parameter. |
| --deploy-mode | Ignore this parameter on Azure Databricks. |
| --conf | In the NewCluster spec, use the spark_conf argument. |
| --num-executors | In the NewCluster spec, use the num_workers argument. You can also use the autoscale option to provide a range (recommended). |
| --driver-memory, --driver-cores | Based on the driver memory and cores you need, choose an appropriate instance type. You provide the instance type for the driver during pool creation. Ignore this parameter during job submission. |
| --executor-memory, --executor-cores | Based on the executor memory you need, choose an appropriate instance type. You provide the instance type for the workers during pool creation. Ignore this parameter during job submission. |
| --driver-class-path | Set spark.driver.extraClassPath to the appropriate value in the spark_conf argument. |
| --driver-java-options | Set spark.driver.extraJavaOptions to the appropriate value in the spark_conf argument. |
| --files | Set spark.files to the appropriate value in the spark_conf argument. |
| --name | In the Runs Submit request, use the run_name argument. In the Create Job request, use the name argument. |
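
To make the mapping concrete, here is a sketch of how a typical spark-submit command might translate into a Runs Submit request. The class name, file paths, and configuration values below are illustrative placeholders rather than values from this guide.

    # spark-submit on another platform (illustrative values)
    spark-submit \
      --class com.example.ComputeModels \
      --num-executors 10 \
      --conf spark.sql.shuffle.partitions=200 \
      my-jar.jar
    
    # Roughly equivalent Runs Submit payload on Azure Databricks
    {
      "run_name": "compute models",
      "new_cluster": {
        "spark_version": "5.0.x-scala2.11",
        "instance_pool_id": "<your-pool-id>",
        "num_workers": 10,
        "spark_conf": { "spark.sql.shuffle.partitions": "200" }
      },
      "libraries": [ { "jar": "dbfs:/my-jar.jar" } ],
      "spark_jar_task": { "main_class_name": "com.example.ComputeModels" }
    }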

Airflow

Azure Databricks offers an Airflow operator if you want to use Airflow to submit jobs to Azure Databricks. The Databricks Airflow operator calls the Jobs Run API to submit jobs to Azure Databricks. See Apache Airflow.
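
As a rough sketch, a DAG using that operator might look like the following. The DAG name, connection ID, and pool ID are placeholder assumptions, and the exact import path depends on your Airflow version and Databricks provider package.

    from datetime import datetime
    
    from airflow import DAG
    from airflow.providers.databricks.operators.databricks import DatabricksSubmitRunOperator
    
    with DAG(
        dag_id="nightly_model_training",
        start_date=datetime(2021, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        # Submits a one-time run against the pool; the json payload mirrors the Runs Submit example above.
        train = DatabricksSubmitRunOperator(
            task_id="train_models",
            databricks_conn_id="databricks_default",  # Airflow connection holding the workspace URL and token
            json={
                "run_name": "nightly model training",
                "new_cluster": {
                    "spark_version": "5.0.x-scala2.11",
                    "instance_pool_id": "<your-pool-id>",
                    "num_workers": 10,
                },
                "libraries": [{"jar": "dbfs:/my-jar.jar"}],
                "spark_jar_task": {"main_class_name": "com.databricks.ComputeModels"},
            },
        )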

UI

Azure Databricks provides a simple, intuitive UI for submitting and scheduling jobs. To create and submit jobs from the UI, follow the step-by-step guide.

Step 3: Troubleshoot jobs

Azure Databricks provides many tools to help you troubleshoot your jobs.

Access logs and Spark UI

Azure Databricks maintains a fully managed Spark history server that lets you access all the Spark logs and the Spark UI for each job run. You can access them from the job details page as well as the job run page:

Job run

Forward logs

You can also forward cluster logs to your cloud storage location. To send logs to your location of choice, use the cluster_log_conf parameter in the NewCluster spec.
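
For example, the NewCluster portion of a job request might include a cluster_log_conf block like the sketch below; the destination path is a placeholder.

    "new_cluster": {
      "spark_version": "5.0.x-scala2.11",
      "instance_pool_id": "<your-pool-id>",
      "num_workers": 10,
      "cluster_log_conf": {
        "dbfs": { "destination": "dbfs:/cluster-logs/nightly-model-training" }
      }
    }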

View metrics

While the job is running, you can go to the cluster page and look at the live Ganglia metrics on the Metrics tab. Azure Databricks also snapshots these metrics every 15 minutes and stores them, so you can look at these metrics even after your job completes. To send metrics to your own metrics server, you can install custom agents in the cluster. See Monitor performance.

Ganglia metrics

Set alerts

Use email_notifications in the Jobs Create API to get alerts on job failures. You can also forward these email alerts to PagerDuty, Slack, and other monitoring systems.

Frequently asked questions (FAQs)

Can I run jobs without a pool?

Yes. Pools are optional. You can run jobs directly on a new cluster. In that case, Azure Databricks creates the cluster by requesting the required instances from the cloud provider. With pools, cluster startup time is about 30 seconds if instances are available in the pool.
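
For instance, a cluster spec without a pool might specify a node type directly instead of an instance_pool_id; the node type below is a placeholder and should match the driver and worker sizes you need.

    "new_cluster": {
      "spark_version": "5.0.x-scala2.11",
      "node_type_id": "Standard_DS3_v2",
      "num_workers": 10
    }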

What is a notebook job?

Azure Databricks has different job types: JAR, Python, and notebook. A notebook job type runs code in the specified notebook. See Notebook job tips.
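
As a sketch, a Runs Submit payload for a notebook job replaces spark_jar_task with a notebook_task; the notebook path and parameters below are placeholders.

    databricks runs submit --json
    
    {
      "run_name": "my notebook job",
      "new_cluster": {
        "spark_version": "5.0.x-scala2.11",
        "instance_pool_id": "<your-pool-id>",
        "num_workers": 10
      },
      "notebook_task": {
        "notebook_path": "/Users/someone@example.com/ComputeModels",
        "base_parameters": { "date": "2021-01-01" }
      }
    }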

When should I use a notebook job compared to a JAR job?

A JAR job is equivalent to a spark-submit job. It executes the JAR, and you can then look at the logs and Spark UI for troubleshooting. A notebook job executes the specified notebook. You can import libraries in a notebook and also call your libraries from the notebook. The advantage of using a notebook job as the main entry point is that you can easily debug your production jobs' intermediate results in the notebook output area. See JAR job tips.

Can I connect to my own Hive metastore?

Yes, Azure Databricks supports external Hive metastores. See External Apache Hive metastore.