Repos for Git integration

Important

This feature is in Public Preview.

You can sync your work in Azure Databricks with a remote Git repository, which makes it easier to implement development best practices. Azure Databricks supports integrations with GitHub, Bitbucket, GitLab, and Azure DevOps.

Repos are folders whose contents are co-versioned together by syncing them to a remote Git repository. Repos can contain only Azure Databricks notebooks and sub-folders. The linked Git repository can contain other files, but they won’t appear in the Azure Databricks workspace.

Requirements

  • Repos is supported on standard Azure deployments. HIPAA-compliant deployments are not supported.
  • GitHub, Bitbucket, GitLab, and Azure DevOps are supported as Git providers, provided your Git server is accessible from the Databricks control plane. Private Git servers, such as Git servers behind a VPN, are not supported.

Enable Repos

  1. Go to the Admin Console.
  2. Select the Advanced tab.
  3. Click the Enable button next to Repos.
  4. Click Confirm. You may need to refresh your browser to see the new icon.

When your workspace is enabled for Repos, you’ll see the Repos icon in your workspace’s sidebar.

[Figure: Repos icon]

Configure your Git integration with Azure Databricks

  1. Click the profile icon in your Azure Databricks workspace and select User Settings from the menu.
  2. On the User Settings page, go to the Git Integration tab.
  3. Follow the instructions for integration with GitHub, Bitbucket Cloud, or GitLab.
  4. If your organization has SAML SSO enabled in GitHub, ensure that you have authorized your personal access token for SSO.

You can clone an existing remote Git repository, or create a new repo in Azure Databricks and add the remote Git repository URL later.

  1. Click the Repos icon in the left navigation bar.

  2. Click Add Repo.

    [Figure: Add repo]

  3. In the Add Repo dialog, do one of the following:

    • To clone a remote Git repository, click Clone remote Git repo and enter the repository URL. Select your Git provider from the drop-down menu, and click Create.

      If the remote repository contains Azure Databricks notebook source files, they will be synced to the repo. All other files will be ignored.

      [Figure: Clone from repo]

    • To create a new repo not linked to a remote Git repository, click Add Git remote later. Enter a name for the repo and click Create.

      When you are ready to add the Git repository URL, click the down arrow next to the repo name in the workspace to open the Repo menu, and select Git… to open the Git dialog.

      [Figure: Repos menu]

      In the Git repo URL field, enter the URL for the remote repository and select your Git provider from the drop-down menu. Click Save.

      [Figure: Git dialog settings tab]

Work with repos

After you have created a repo, you can develop notebooks in the repo and sync with your remote Git repository.

Work with notebooks and folders in an Azure Databricks repo

To create a new notebook or folder in a repo, click the down arrow next to the repo name, and select Create > Notebook or Create > Folder from the menu.

[Figure: Repo create menu]

To move a notebook or folder in your workspace into a repo, navigate to the notebook or folder and select Move from the drop-down menu:

[Figure: Move object]

In the dialog, select the repo to which you want to move the object:

[Figure: Move repo]

Import SQL and Python files as Databricks notebooks

You can import SQL and Python files as single-cell Databricks notebooks.

  • Add the comment line -- Databricks notebook source at the top of a SQL file.
  • Add the comment line # Databricks notebook source at the top of a Python file.
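
The marker lines above can also be added programmatically before files are pushed to the remote repository. The following is a minimal sketch, not an official tool: the function name add_notebook_header is hypothetical, and it assumes the Python marker is the single-hash form (# Databricks notebook source) that exported notebooks carry.

```python
# Marker comments that identify a file as notebook source.
# Assumption: the Python marker uses a single "#".
NOTEBOOK_HEADERS = {
    ".py": "# Databricks notebook source",
    ".sql": "-- Databricks notebook source",
}


def add_notebook_header(filename: str, source: str) -> str:
    """Return `source` with the notebook marker as its first line."""
    for ext, header in NOTEBOOK_HEADERS.items():
        if filename.lower().endswith(ext):
            break
    else:
        raise ValueError(f"unsupported file type: {filename}")

    lines = source.splitlines()
    if lines and lines[0].strip() == header:
        return source  # already marked, leave untouched
    return header + "\n" + source


print(add_notebook_header("etl.py", "print('hello')\n"))
```

The function is idempotent, so running it twice on the same file does not add a second marker line.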

Sync a repo with Git

To sync with Git, use the Git dialog. The Git dialog lets you pull changes from your remote Git repository and commit and push changes. You can also change the branch you are working on or create a new branch.

Important

Performing Git operations using the dialog clears notebook comments and revision history. For more information, see Limitations and FAQ.

Open the Git dialog

You can access the Git dialog from a notebook or from the repos browser.

  • From a notebook, click the button at the top left of the notebook that identifies the current Git branch.

    [Figure: Git dialog button on notebook]

  • From the repos browser, you can click the button next to the repo name:

    [Figure: Git dialog button in repo browser]

    You can also click the down arrow next to the repo name, and select Git… from the menu.

    [Figure: Repos menu 2]

Pull changes from the remote Git repository

To pull changes from the remote Git repository, click Pull in the Git dialog. Notebooks are updated automatically to the latest version in your remote repository.

A message appears if there are merge conflicts. Databricks recommends that you resolve the merge conflict using your Git provider interface.

Commit and push changes to the remote Git repository

When you have added new notebooks or made changes to existing notebooks, the Git dialog indicates the files that have changed.

[Figure: Git dialog]

Add a required Summary of the changes, and click Commit & Push to push these changes to the remote Git repository.

If you don’t have permission to commit to the master branch, create a new branch and use your Git provider interface to create a pull request (PR) to merge it into the master branch.

Note

If there are merge conflicts, Databricks recommends that you create a new branch, commit and push your changes to that branch, work in your own branch, and resolve the merge conflict using your Git provider interface.

Create a new branch

You can create a new branch based on an existing branch from the Git dialog:

[Figure: Git dialog new branch]

Manage permissions

When you create a repo, you have Can Manage permission. This lets you perform Git operations or modify the remote repository. You can clone public remote repositories without Git credentials (personal access token and username). To modify a public remote repository, or to clone or modify a private remote repository, you must have a Git provider username and personal access token with read and write permissions for the remote repository.

Repos API

Important

This feature is in Private Preview. To try it, reach out to your Azure Databricks contact.

The Repos API update endpoint allows you to update a repo to the latest version of a specific Git branch. This enables you to update the repo before you run a job against a notebook in the repo. For more information about this private preview feature, contact your Azure Databricks representative.
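
As an illustration only, an update call could be built as shown below. This sketch assumes the endpoint has the shape of the later, publicly documented Repos API (PATCH /api/2.0/repos/{repo_id} with a branch field in the JSON body). The private preview version may differ, so confirm the details with your Azure Databricks contact. The workspace URL, token, and repo ID are hypothetical placeholders, and the request is built but not sent.

```python
import json
import urllib.request


def build_repo_update_request(host, token, repo_id, branch):
    """Build (but do not send) a request that moves a repo to the head of `branch`.

    Assumption: the endpoint path and body follow the public Repos API.
    """
    url = f"{host.rstrip('/')}/api/2.0/repos/{repo_id}"
    body = json.dumps({"branch": branch}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PATCH",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


# Hypothetical workspace URL, token, and repo ID, for illustration only.
req = build_repo_update_request(
    "https://example.azuredatabricks.net", "example-token", 42, "main"
)
print(req.get_method(), req.full_url)
# To actually send the request: urllib.request.urlopen(req)
```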

Best practices for integrating repos with CI/CD workflows

This section includes best practices for integrating Azure Databricks repos with your CI/CD workflow. The following figure shows an overview of the steps.

[Figure: Best practices overview]

Admin workflow

Repos have user-level folders and non-user top-level folders. User-level folders are automatically created when users first clone a remote repository. You can think of repos in user folders as “local checkouts” that are individual for each user and where users make changes to their code.

Set up top-level repo folders

Admins can create non-user top-level folders. The most common use case for these top-level folders is to create Dev, Staging, and Production folders that contain repos for the appropriate versions or branches for development, staging, and production. For example, if your company uses the main branch for production, the Production folder would contain repos configured to be at the main branch.

Typically, permissions on these top-level folders are read-only for all non-admin users within the workspace.

[Figure: Top-level repo folders]

Set up Git automation to update repos on merge

To ensure that repos are always at the latest version, you can set up Git automation to call the Repos API. In your Git provider, set up automation that, after every successful merge of a PR into the main branch, calls the Repos API endpoint on the appropriate repo in the Production folder to bring that repo to the latest version.

For example, on GitHub this can be achieved with GitHub Actions. For more information, see Repos API.
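
The decision such automation has to make is whether an event represents a successful merge into the production branch. The following sketch shows one way to express that check, assuming the automation receives a GitHub pull_request webhook payload (with its action, pull_request.merged, and pull_request.base.ref fields). The function name and the default branch name are hypothetical.

```python
def should_update_production(event: dict, production_branch: str = "main") -> bool:
    """Return True if a GitHub pull_request event is a merge into the production branch."""
    # A merged PR arrives as a "closed" event with merged set to true.
    if event.get("action") != "closed":
        return False
    pull_request = event.get("pull_request") or {}
    if not pull_request.get("merged"):
        return False  # closed without merging
    base = pull_request.get("base") or {}
    return base.get("ref") == production_branch


merged_into_main = {
    "action": "closed",
    "pull_request": {"merged": True, "base": {"ref": "main"}},
}
print(should_update_production(merged_into_main))
```

When the check returns True, the automation would go on to call the Repos API for the matching repo in the Production folder.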

User workflow

To start a workflow, clone your remote repository into a user folder. A best practice is to create a new feature branch, or select a previously created branch, for your work, instead of directly committing and pushing changes to the main branch. You can make changes, commit, and push changes in that branch. When you are ready to merge your code, create a pull request and follow the review and merge processes in Git.

Production job workflow

You can point jobs directly to notebooks in repos. When a job kicks off a run, it uses the current version of the code in the repo.

If the automation is set up as described in Admin workflow, every successful merge calls the Repos API to update the repo. As a result, jobs that are configured to run code from a repo always use the latest version available when the job run was created.

Limitations and FAQ

Incoming changes clear out comments and results for notebooks in the repo.

This happens because the notebook is re-imported, so all notebook cells in the workspace are deleted and re-imported.

Repos can contain only Azure Databricks notebooks and folders.

  • Libraries and MLflow experiments are not supported. You can use notebook experiments in repos.
  • Non-notebook files such as .txt, .csv, .md, or .yaml files are not supported.
  • The remote Git repository may contain other files, but they will not appear in Azure Databricks.

Why do I see .py files in my repo but can’t sync my own .py files?

  • Azure Databricks exports the source of notebooks as .py files for easier readability and diffing in your Git provider. However, those files have additional metadata to identify them as Azure Databricks notebook source files. Arbitrary .py files are not available or referenceable.
  • In Databricks Runtime 7.1 and above and Databricks Runtime 7.1 ML and above, %pip install support allows you to access private repositories to load Python libraries into notebooks.

How can I run non-Databricks notebook files in a repo? For example, a .py file?

You can use any of the following:

Can I create top-level folders that are not user folders?

Yes, admins can create top-level folders to a single depth. Repos does not support additional folder levels.

Does Repos support GPG signing of commits?

No.

How and where are the GitHub tokens stored in Azure Databricks? Who would have access from Azure Databricks?

  • The authentication tokens are stored in the Azure Databricks control plane, and an Azure Databricks employee can only gain access through a temporary credential that is audited.
  • Azure Databricks logs the creation and deletion of these tokens, but not their usage. Azure Databricks has logging that tracks Git operations that could be used to audit the usage of the tokens by the Azure Databricks application.
  • GitHub Enterprise audits token usage. Other Git services may also have Git server auditing.

Does Repos support on-premises or self-hosted Git servers?

No.

Does Repos support Git submodules?

No.

Can I pull the latest version of a repository from Git before running a job without relying on an external orchestration tool?

No. Typically, you can integrate this as a pre-commit on the Git server so that every push to a branch (main/prod) updates the Production repo.

Can I pull in ipynb files?

No.

Are there limits on the size of a repo or the number of files?

Repo size is limited to 100 MB. Working branches are limited to 30 MB.

Databricks recommends no more than 200 notebooks in a repo. You may receive timeout errors with a large number of notebook files. You may also receive a timeout error on the initial clone of the repo, but the operation might complete in the background.
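
To get a rough idea of whether a repository is near these limits before cloning it into Azure Databricks, you can inspect a local checkout. The sketch below is an approximation under stated assumptions: it treats 100 MB as 100 × 1024 × 1024 bytes, skips the .git directory, and counts a file as a notebook when its first line is the Python or SQL notebook marker. How Azure Databricks itself measures repo size is not specified here, and the function name summarize_checkout is hypothetical.

```python
import os
import tempfile

MAX_REPO_BYTES = 100 * 1024 * 1024  # 100 MB repo limit (byte interpretation is an assumption)
MAX_NOTEBOOKS = 200                 # recommended maximum number of notebooks
MARKERS = ("# Databricks notebook source", "-- Databricks notebook source")


def summarize_checkout(root):
    """Walk a local checkout and compare its size and notebook count to the limits."""
    total_bytes = 0
    notebooks = 0
    for dirpath, dirnames, filenames in os.walk(root):
        # Skip Git metadata (assumption: it does not count toward the limit).
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for name in filenames:
            path = os.path.join(dirpath, name)
            total_bytes += os.path.getsize(path)
            with open(path, "r", encoding="utf-8", errors="ignore") as handle:
                if handle.readline().strip() in MARKERS:
                    notebooks += 1
    return {
        "bytes": total_bytes,
        "notebooks": notebooks,
        "within_limits": total_bytes <= MAX_REPO_BYTES and notebooks <= MAX_NOTEBOOKS,
    }


# Small demonstration on a throwaway directory.
with tempfile.TemporaryDirectory() as demo_root:
    with open(os.path.join(demo_root, "etl.py"), "w", encoding="utf-8") as f:
        f.write("# Databricks notebook source\nprint('hello')\n")
    print(summarize_checkout(demo_root))
```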

Does Repos support branch merging?

No. Databricks recommends that you create a pull request and merge through your Git provider.

Are the contents of Azure Databricks repos encrypted?

The contents of repos are encrypted by Azure Databricks using a default key. Encryption using customer-managed keys for notebooks is not supported.

Where is Azure Databricks repo content stored?

The contents of a repo are temporarily cloned onto disk in the control plane. Azure Databricks notebook files are stored in the control plane database just like notebooks in the main workspace. Non-notebook files may be stored on disk for up to 30 days.

Does Repos support Azure Data Factory (ADF)?

Repos does not support ADF. Try the following workaround: create a “trigger notebook” outside Repos. You can start the notebook job inside the repo by calling the “trigger notebook” from ADF and providing parameters, such as which job to run, using Widgets.
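
A trigger notebook for this workaround might look like the following sketch. It assumes ADF passes the path of the notebook to run through a widget. The widget name notebook_path and the function name are hypothetical. The code relies on dbutils.widgets and dbutils.notebook.run, which are available inside Azure Databricks notebooks, and takes dbutils as a parameter so that the logic can be read and tested outside a notebook.

```python
def run_requested_notebook(dbutils, timeout_seconds=3600):
    """Run the notebook whose path was passed in through a widget."""
    # "notebook_path" is a hypothetical widget name that ADF would set as a parameter.
    dbutils.widgets.text("notebook_path", "")
    path = dbutils.widgets.get("notebook_path")
    if not path:
        raise ValueError("the notebook_path widget is empty")
    # Runs the target notebook and returns its exit value.
    return dbutils.notebook.run(path, timeout_seconds)


# Inside a Databricks notebook, dbutils is predefined; elsewhere this call is skipped.
if "dbutils" in globals():
    run_requested_notebook(dbutils)
```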

Troubleshooting

Error message: Invalid credentials

Try the following:

  • Confirm that the settings in the Git integration tab (User Settings > Git Integration) are correct.

    • You must enter both your Git provider username and token. Legacy Git integrations did not require a username, so you may need to add a username to work with repos.
  • Confirm that you have selected the correct Git provider in the Add Repo dialog.

  • Ensure your personal access token or app password has the correct repo access.

  • If SSO is enabled on your Git provider, authorize your tokens for SSO.

  • Test your token with command line Git. Both of these options should work:

    git clone https://<username>:<personal-access-token>@github.com/<org>/<repo-name>.git
    
    git clone -c http.sslVerify=false -c http.extraHeader='Authorization: Bearer <personal-access-token>' https://agile.act.org/
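
If your username or token contains characters such as @ or /, the first form of the URL needs those characters percent-encoded. The sketch below builds such a URL. The function name and the sample values are hypothetical, and the resulting string should be treated as a secret because it embeds the token.

```python
from urllib.parse import quote


def authenticated_clone_url(username, token, org, repo, host="github.com"):
    """Build an HTTPS clone URL with percent-encoded credentials."""
    user = quote(username, safe="")
    secret = quote(token, safe="")
    return f"https://{user}:{secret}@{host}/{org}/{repo}.git"


# Hypothetical values, for illustration only.
print(authenticated_clone_url("octo@example.com", "example-token", "my-org", "my-repo"))
```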
    

Error message: Secure connection could not be established because of SSL problems

<link>: Secure connection to <link> could not be established because of SSL problems

This error occurs if your Git server is not accessible from the Azure Databricks control plane. Private Git servers are not supported.