Regional disaster recovery for Azure Databricks clusters

This article describes a disaster recovery architecture that is useful for Azure Databricks clusters, and the steps to accomplish that design.

Azure Databricks architecture

At a high level, when you create an Azure Databricks workspace from the Azure portal, a managed appliance is deployed as an Azure resource in your subscription, in the chosen Azure region. This appliance is deployed in an Azure virtual network with a network security group and an Azure storage account, available in your subscription. The virtual network provides perimeter-level security to the Databricks workspace and is protected via the network security group. Within the workspace, you create Databricks clusters by providing the worker and driver VM types and the Databricks runtime version. The persisted data is available in your storage account, which can be Azure Blob storage or Azure Data Lake Storage. Once the cluster is created, you can run jobs via notebooks, REST APIs, or ODBC/JDBC endpoints by attaching them to a specific cluster.
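
For example, once a workspace exists, a cluster with a specific worker/driver VM type and runtime version can be created through the Databricks CLI (a thin wrapper over the REST API) that is used throughout this article. This is only an illustrative sketch; the node type and runtime version strings are placeholders, and you can list the values that are valid for your workspace with databricks clusters list-node-types and databricks clusters spark-versions.

    databricks clusters create --json '{
      "cluster_name": "example-cluster",
      "spark_version": "7.3.x-scala2.12",
      "node_type_id": "Standard_DS3_v2",
      "driver_node_type_id": "Standard_DS3_v2",
      "num_workers": 2,
      "autotermination_minutes": 60
    }'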

The Databricks control plane manages and monitors the Databricks workspace environment. Any management operation, such as creating a cluster, is initiated from the control plane. All metadata, such as scheduled jobs, is stored in an Azure database with geo-replication for fault tolerance.

Databricks architecture

One of the advantages of this architecture is that users can connect Azure Databricks to any storage resource in their account. A key benefit is that both compute (Azure Databricks) and storage can be scaled independently of each other.

How to create a regional disaster recovery topology

As you can see in the preceding architecture description, a number of components are used in a big data pipeline with Azure Databricks: Azure Storage, Azure Database, and other data sources. Azure Databricks is the compute for the big data pipeline. It is ephemeral in nature, meaning that while your data is still available in Azure Storage, the compute (the Azure Databricks cluster) can be terminated so that you don't pay for compute when you don't need it. The compute (Azure Databricks) and storage sources must be in the same region so that jobs don't experience high latency.

To create your own regional disaster recovery topology, follow these requirements:

  1. Provision multiple Azure Databricks workspaces in separate Azure regions. For example, create the primary Azure Databricks workspace in China East 2, and create the secondary disaster-recovery Azure Databricks workspace in a separate region.

  2. Use geo-redundant storage. The data associated with Azure Databricks is stored by default in Azure Storage. The results from Databricks jobs are also stored in Azure Blob storage, so that the processed data is durable and remains highly available after the cluster is terminated. Because the storage and the Databricks cluster are co-located, you must use geo-redundant storage so that the data can be accessed in the secondary region if the primary region is no longer accessible. (A brief Azure CLI sketch covering this and the previous requirement follows this list.)

  3. Once the secondary region is created, you must migrate the users, user folders, notebooks, cluster configuration, jobs configuration, libraries, storage, and init scripts, and reconfigure access control. Additional details are outlined in the following section.
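
For requirements 1 and 2, a minimal sketch with the Azure CLI might look like the following. The resource group, workspace, and storage account names, and the regions, are placeholders; the az databricks commands may require the Azure CLI databricks extension to be installed.

    # Secondary (disaster-recovery) Azure Databricks workspace in a separate region
    az databricks workspace create --name mydatabricks-dr --resource-group my-dr-rg --location chinanorth2 --sku premium

    # Geo-redundant storage account so data remains accessible from the secondary region
    az storage account create --name mydatastoragegrs --resource-group my-dr-rg --location chinaeast2 --sku Standard_GRS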

Detailed migration steps

  1. Set up the Databricks command-line interface on your computer

    This article shows a number of code examples that use the command-line interface for most of the automated steps, because it is an easy-to-use wrapper over the Azure Databricks REST API.

    Before performing any migration steps, install the databricks-cli on your desktop computer or on a virtual machine where you plan to do the work. For more information, see Install Databricks CLI.

    pip install databricks-cli
    

    Note

    The Python scripts provided in this article are expected to work with Python 2.7 or later, but earlier than 3.x.
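
    You can optionally confirm the installation and version from the command line:

    databricks --version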

  2. Configure two profiles.

    Configure one profile for the primary workspace and another for the secondary workspace:

    databricks configure --profile primary --token
    databricks configure --profile secondary --token
    

    The code blocks in this article switch between profiles in each subsequent step using the corresponding workspace command. Be sure that the names of the profiles you create are substituted into each code block.

    EXPORT_PROFILE = "primary"
    IMPORT_PROFILE = "secondary"
    

    If needed, you can manually switch between them at the command line:

    databricks workspace ls --profile primary
    databricks workspace ls --profile secondary
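
    Under the hood, databricks configure --token writes these profiles to the .databrickscfg file in your home directory, which looks roughly like the following sketch (the host URLs and tokens are placeholders):

    [primary]
    host = https://<primary-workspace-url>
    token = <personal-access-token-for-primary>

    [secondary]
    host = https://<secondary-workspace-url>
    token = <personal-access-token-for-secondary>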
    
  3. Migrate Azure Active Directory users

    Manually add to the secondary workspace the same Azure Active Directory users that exist in the primary workspace.
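
    If the SCIM (preview) REST API is enabled for your workspaces, one possible way to script the user inventory and creation is sketched below. The workspace URLs, tokens, and user name are placeholders, and this is an assumption rather than part of the documented procedure.

    # List users in the primary workspace
    curl -X GET -H "Authorization: Bearer $PRIMARY_TOKEN" \
      https://<primary-workspace-url>/api/2.0/preview/scim/v2/Users

    # Add one of those users to the secondary workspace
    curl -X POST -H "Authorization: Bearer $SECONDARY_TOKEN" -H "Content-Type: application/scim+json" \
      -d '{"schemas":["urn:ietf:params:scim:schemas:core:2.0:User"],"userName":"user@example.com"}' \
      https://<secondary-workspace-url>/api/2.0/preview/scim/v2/Users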

  4. Migrate the user folders and notebooks

    Use the following Python code to migrate the sandboxed user environments, which include the nested folder structure and notebooks per user.

    Note

    Libraries are not copied over in this step, because the underlying API doesn't support them.

    Copy the following Python script, save it to a file, and run it in your Databricks command line. For example, python scriptname.py.

    import sys
    import os
    import subprocess
    from subprocess import call, check_output
    
    EXPORT_PROFILE = "primary"
    IMPORT_PROFILE = "secondary"
    
    # Get a list of all users
    user_list_out = check_output(["databricks", "workspace", "ls", "/Users", "--profile", EXPORT_PROFILE])
    user_list = (user_list_out.decode(encoding="utf-8")).splitlines()
    
    print (user_list)
    
    # Export sandboxed environment(folders, notebooks) for each user and import into new workspace.
    #Libraries are not included with these APIs / commands.
    
    for user in user_list:
      #print("Trying to migrate workspace for user ".decode() + user)
      print (("Trying to migrate workspace for user ") + user)
    
      subprocess.call(str("mkdir -p ") + str(user), shell = True)
      export_exit_status = call("databricks workspace export_dir /Users/" + str(user) + " ./" + str(user) + " --profile " + EXPORT_PROFILE, shell = True)
    
      if export_exit_status==0:
        print ("Export Success")
        import_exit_status = call("databricks workspace import_dir ./" + str(user) + " /Users/" + str(user) + " --profile " + IMPORT_PROFILE, shell=True)
        if import_exit_status==0:
          print ("Import Success")
        else:
          print ("Import Failure")
      else:
        print ("Export Failure")
    print ("All done")
    
  5. Migrate the cluster configurations

    Once the notebooks have been migrated, you can optionally migrate the cluster configurations to the new workspace. It's an almost fully automated step using databricks-cli, unless you want to migrate selected cluster configurations rather than all of them.

    Note

    Unfortunately, there is no create-cluster-configuration endpoint, so this script tries to create each cluster right away. If there aren't enough cores available in your subscription, the cluster creation may fail. The failure can be ignored as long as the configuration is transferred successfully.

    The following script prints a mapping from old to new cluster IDs, which can be used later for the job migration (for jobs that are configured to use existing clusters).

    Copy the following Python script, save it to a file, and run it in your Databricks command line. For example, python scriptname.py.

    import sys
    import os
    import subprocess
    import json
    from subprocess import call, check_output
    
    EXPORT_PROFILE = "primary"
    IMPORT_PROFILE = "secondary"
    
    # Get all clusters info from old workspace
    clusters_out = check_output(["databricks", "clusters", "list", "--profile", EXPORT_PROFILE])
    clusters_info_list = str(clusters_out.decode(encoding="utf-8")).splitlines()
    print("Printing Cluster info List")
    print(clusters_info_list)
    
    # Create a list of all cluster ids
    clusters_list = []
    ##for cluster_info in clusters_info_list: clusters_list.append(cluster_info.split(None, 1)[0])
    
    for cluster_info in clusters_info_list:
       if cluster_info != '':
          clusters_list.append(cluster_info.split(None, 1)[0])
    
    # Optionally filter cluster ids out manually, so as to create only required ones in new workspace
    
    # Create a list of mandatory / optional create request elements
    cluster_req_elems = ["num_workers","autoscale","cluster_name","spark_version","spark_conf","node_type_id","driver_node_type_id","custom_tags","cluster_log_conf","spark_env_vars","autotermination_minutes","enable_elastic_disk"]
    print("Printing Cluster element List")
    print (cluster_req_elems)
    print(str(len(clusters_list)) + " clusters found in the primary site" )
    
    print ("---------------------------------------------------------")
    # Try creating all / selected clusters in new workspace with same config as in old one.
    cluster_old_new_mappings = {}
    i = 0
    for cluster in clusters_list:
       i += 1
       print("Checking cluster " + str(i) + "/" + str(len(clusters_list)) + " : " +str(cluster))
       cluster_get_out_f = check_output(["databricks", "clusters", "get", "--cluster-id", str(cluster), "--profile", EXPORT_PROFILE])
       cluster_get_out=str(cluster_get_out_f.decode(encoding="utf-8"))
       print ("Got cluster config from old workspace")
       print (cluster_get_out)
       # Remove extra content from the config, as we need to build the create request with allowed elements only
       cluster_req_json = json.loads(cluster_get_out)
       cluster_json_keys = list(cluster_req_json.keys())

       #Don't migrate Job clusters
       if cluster_req_json['cluster_source'] == u'JOB' :
          print ("Skipping this cluster as it is a Job cluster : " + cluster_req_json['cluster_id'] )
          print ("---------------------------------------------------------")
          continue

       # Keep only the elements that are allowed in a cluster create request
       for key in cluster_json_keys:
          if key not in cluster_req_elems:
             cluster_req_json.pop(key, None)
    
       # Create the cluster, and store the mapping from old to new cluster ids
    
       #Create a temp file to store the current cluster info as JSON
       strCurrentClusterFile = "tmp_cluster_info.json"
    
       #delete the temp file if exists
       if os.path.exists(strCurrentClusterFile) :
          os.remove(strCurrentClusterFile)
    
       fClusterJSONtmp = open(strCurrentClusterFile,"w+")
       fClusterJSONtmp.write(json.dumps(cluster_req_json))
       fClusterJSONtmp.close()
    
       #cluster_create_out = check_output(["databricks", "clusters", "create", "--json", json.dumps(cluster_req_json), "--profile", IMPORT_PROFILE])
       cluster_create_out = check_output(["databricks", "clusters", "create", "--json-file", strCurrentClusterFile , "--profile", IMPORT_PROFILE])
       cluster_create_out_json = json.loads(cluster_create_out)
       cluster_old_new_mappings[cluster] = cluster_create_out_json['cluster_id']
    
       print ("Cluster create request sent to secondary site workspace successfully")
       print ("---------------------------------------------------------")
    
       #delete the temp file if exists
       if os.path.exists(strCurrentClusterFile) :
          os.remove(strCurrentClusterFile)
    
    print ("Cluster mappings: " + json.dumps(cluster_old_new_mappings))
    print ("All done")
    print ("P.S. : Please note that all the new clusters in your secondary site are being started now!")
    print ("       If you won't use those new clusters at the moment, please don't forget terminating your new clusters to avoid charges")
    
  6. Migrate the jobs configuration

    If you migrated cluster configurations in the previous step, you can opt to migrate the job configurations to the new workspace. It is an almost fully automated step using databricks-cli, unless you want to migrate selected job configurations rather than all of them.

    Note

    The configuration for a scheduled job also contains the schedule information, so by default a migrated job would start running as per the configured timing as soon as it's migrated. Hence, the following code block removes any schedule information during the migration (to avoid duplicate runs across the old and new workspaces). Configure the schedules for such jobs once you're ready for the cutover.

    The job configuration requires settings for a new or an existing cluster. If a job uses an existing cluster, the script below attempts to replace the old cluster ID with the new cluster ID.

    Copy the following Python script and save it to a file. Replace the values for old_cluster_id and new_cluster_id with the output from the cluster migration done in the previous step. Run it in your databricks-cli command line, for example, python scriptname.py.

    import sys
    import os
    import subprocess
    import json
    from subprocess import call, check_output
    
    
    EXPORT_PROFILE = "primary"
    IMPORT_PROFILE = "secondary"
    
    # Please replace the old to new cluster id mappings from cluster migration output
    cluster_old_new_mappings = {"0227-120427-tryst214": "0229-032632-paper88"}
    
    # Get all jobs info from old workspace
    try:
      jobs_out = check_output(["databricks", "jobs", "list", "--profile", EXPORT_PROFILE])
      jobs_info_list = jobs_out.decode(encoding="utf-8").splitlines()
    except:
      print("No jobs to migrate")
      sys.exit(0)
    
    # Create a list of all job ids
    jobs_list = []
    for jobs_info in jobs_info_list:
      jobs_list.append(jobs_info.split(None, 1)[0])
    
    # Optionally filter job ids out manually, so as to create only required ones in new workspace
    
    # Create each job in the new workspace based on corresponding settings in the old workspace
    
    for job in jobs_list:
      print("Trying to migrate ") + job
    
      job_get_out = check_output(["databricks", "jobs", "get", "--job-id", job, "--profile", EXPORT_PROFILE])
      print("Got job config from old workspace")
    
      job_req_json = json.loads(job_get_out)  
      job_req_settings_json = job_req_json['settings']
    
      # Remove schedule information so job doesn't start before proper cutover
      job_req_settings_json.pop('schedule', None)
    
      # Replace old cluster id with new cluster id, if job configured to run against an existing cluster
      if 'existing_cluster_id' in job_req_settings_json:
        if job_req_settings_json['existing_cluster_id'] in cluster_old_new_mappings:
          job_req_settings_json['existing_cluster_id'] = cluster_old_new_mappings[job_req_settings_json['existing_cluster_id']]
        else:
          print("Mapping not available for old cluster id ") + job_req_settings_json['existing_cluster_id']
          continue
    
      call(["databricks", "jobs", "create", "--json", json.dumps(job_req_settings_json), "--profile", IMPORT_PROFILE])
      print("Sent job create request to new workspace successfully")
    
    print("All done")
    
  7. Migrate libraries

    There's currently no straightforward way to migrate libraries from one workspace to another. Instead, reinstall those libraries into the new workspace manually. You can automate this with a combination of the DBFS CLI (to upload the custom libraries to the workspace) and the Libraries CLI.
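
    As a rough sketch of that combination, you could upload a custom wheel to DBFS in the secondary workspace and then install it on a migrated cluster; the file name, DBFS path, and cluster ID below are placeholders:

    dbfs cp ./my-custom-lib.whl dbfs:/FileStore/jars/my-custom-lib.whl --profile secondary
    databricks libraries install --cluster-id <new-cluster-id> --whl dbfs:/FileStore/jars/my-custom-lib.whl --profile secondary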

  8. Migrate Azure Blob storage and Azure Data Lake Storage mounts

    Manually remount all Azure Blob storage and Azure Data Lake Storage (Gen 2) mount points using a notebook-based solution. The storage resources would have been mounted in the primary workspace, and that has to be repeated in the secondary workspace. There is no external API for mounts.
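
    As a minimal sketch, a notebook cell in the secondary workspace could remount an Azure Blob storage container as shown below. All names (container, storage account, mount point, secret scope, and key) are placeholders, and the endpoint shown assumes the Azure China Blob storage domain; adjust it for your environment.

    # Remount an Azure Blob storage container in the secondary workspace (notebook cell)
    dbutils.fs.mount(
      source = "wasbs://<container>@<storage-account>.blob.core.chinacloudapi.cn",
      mount_point = "/mnt/<mount-name>",
      extra_configs = {
        "fs.azure.account.key.<storage-account>.blob.core.chinacloudapi.cn":
          dbutils.secrets.get(scope = "<secret-scope>", key = "<secret-key>")
      })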

  9. Migrate cluster init scripts

    Any cluster initialization scripts can be migrated from the old workspace to the new one using the DBFS CLI. First, copy the needed scripts from dbfs:/databricks/init/.. to your local desktop or virtual machine. Next, copy those scripts into the new workspace at the same path.

    // Primary to local
    dbfs cp -r dbfs:/databricks/init ./old-ws-init-scripts --profile primary
    
    // Local to Secondary workspace
    dbfs cp -r old-ws-init-scripts dbfs:/databricks/init --profile secondary
    
  10. Manually reconfigure and reapply access control.

    If your existing primary workspace is configured to use the Premium tier (SKU), it's likely that you are also using the access control feature.

    If you do use the access control feature, manually reapply the access control to the resources (notebooks, clusters, jobs, tables).
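
    If you want to script parts of this reconciliation, the Permissions API (in preview) can read the access control settings of an object in the primary workspace so you can reapply them in the secondary one. This is only a sketch under that assumption; the workspace URL, token, and cluster ID are placeholders, and availability of the endpoint depends on your workspace.

    # Read the permissions applied to a cluster in the primary workspace
    curl -X GET -H "Authorization: Bearer $PRIMARY_TOKEN" \
      https://<primary-workspace-url>/api/2.0/preview/permissions/clusters/<cluster-id>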

Disaster recovery for your Azure ecosystem

If you use other Azure services, be sure to implement disaster recovery best practices for those services, too. For example, if you choose to use an external Hive metastore instance, you should consider disaster recovery for Azure SQL Database, Azure HDInsight, and/or Azure Database for MySQL. For general information about disaster recovery, see Disaster recovery for Azure applications.

Next steps

For more information, see Azure Databricks documentation.