Best practices: Data governance on Azure Databricks

This document describes the need for data governance and shares best practices and strategies you can use to implement these techniques across your organization. It demonstrates a typical deployment workflow you can employ using Azure Databricks and cloud-native solutions to secure and monitor each layer, from the application down to storage.

Why is data governance important?

Data governance is an umbrella term for the policies and practices implemented to securely manage the data assets within an organization. As one of the key tenets of any successful data governance practice, data security is likely to be top of mind at any large organization. Key to data security is the ability of data teams to have superior visibility into, and auditability of, user data access patterns across their organization. Implementing an effective data governance solution helps companies protect their data from unauthorized access and ensures that they have rules in place to comply with regulatory requirements.

Governance challenges

Whether you're managing the data of a startup or a large corporation, security teams and platform owners face the singular challenge of ensuring that this data is secure and is managed according to the internal controls of the organization. Regulatory bodies the world over are changing the way we think about how data is captured and stored. These compliance risks only add further complexity to an already tough problem. How, then, do you open your data to those who can drive the use cases of the future? Ultimately, you should adopt data policies and practices that help the business realize value through the meaningful application of what can often be vast, ever-growing stores of data. We get solutions to the world's toughest problems when data teams have access to many disparate sources of data.

Typical challenges when considering the security and availability of your data in the cloud:

  • Do your current data and analytics tools support access controls on your data in the cloud? Do they provide robust logging of actions taken on the data as it moves through a given tool?
  • Will the security and monitoring solution you put in place now scale as demand on the data in your data lake grows? It can be easy enough to provision and monitor data access for a small number of users. What happens when you want to open up your data lake to hundreds of users? To thousands?
  • Is there anything you can do to proactively ensure that your data access policies are being observed? Monitoring alone is not enough; that just produces more data. You should have a solution in place to actively monitor and track access to this information across the organization.
  • What steps can you take to identify gaps in your existing data governance solution?

How Azure Databricks addresses these challenges

  • Access control: A rich suite of access controls reaching all the way down to the storage layer. Azure Databricks takes advantage of its cloud backbone by using state-of-the-art Azure security services right in the platform. Enable Azure Active Directory credential passthrough on your Spark clusters to control access to your data lake.
  • Cluster policies: Enable administrators to control access to compute resources.
  • API first: Automate provisioning and permission management with the Databricks REST API.
  • Audit logging: Robust audit logs on actions and operations taken across the workspace, delivered to your data lake. Azure Databricks can leverage the power of Azure to provide data access information across your deployment account and any others you configure. You can then use this information to power alerts that flag potential wrongdoing.

The following sections illustrate how to use these Azure Databricks features to implement a governance solution.

Set up access control

To set up access control, you secure access to storage and implement fine-grained control over individual tables.

Implement table access control

You can enable table access control on Azure Databricks to programmatically grant, deny, and revoke access to your data from the Spark SQL API. You can control access to securable objects like databases, tables, views, and functions. Consider a scenario where your company has a database that stores financial data. You might want your analysts to create financial reports using that data. However, another table in the database might contain sensitive information that analysts should not access. You can provide a user or group the privileges required to read data from one table while denying all privileges to access the second table.

In the following illustration, Alice is an admin who owns the shared_data and private_data tables in the Finance database. Alice then provides Oscar, an analyst, with the privileges required to read from shared_data but denies all privileges to private_data.

(Illustration: Alice owns shared_data and private_data; Oscar is granted access to shared_data only.)

Alice grants SELECT privileges to Oscar to read from shared_data.
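In a notebook on a cluster with table access control enabled, this grant can be issued with a statement like the following sketch; `oscar@example.com` is a placeholder for Oscar's actual workspace identity:

# Grant Oscar read access to the shared_data table (placeholder identity).
spark.sql("GRANT SELECT ON TABLE shared_data TO `oscar@example.com`")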


Alice denies Oscar all privileges to access private_data.
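Again as a sketch, with the same placeholder identity:

# Explicitly deny Oscar any access to the private_data table.
spark.sql("DENY ALL PRIVILEGES ON TABLE private_data TO `oscar@example.com`")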


You can take this one step further by defining fine-grained access controls on a subset of a table or by setting privileges on derived views of a table, as sketched below.
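For example, Alice could expose only aggregated rows of private_data through a view and grant SELECT on that view alone; the view name and the columns used here are illustrative:

# Create a view exposing only an aggregate of the sensitive table.
spark.sql("""
  CREATE VIEW aggregate_data AS
  SELECT department, SUM(amount) AS total_amount
  FROM private_data
  GROUP BY department
""")

# Oscar can query the aggregate view but still cannot touch private_data itself.
spark.sql("GRANT SELECT ON VIEW aggregate_data TO `oscar@example.com`")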


Secure access to Azure Data Lake Storage

You can access data in Azure Data Lake Storage from Azure Databricks clusters in a couple of ways. The methods discussed here correspond mainly to how the data being accessed will be used in the corresponding workflow. That is, will you be accessing your data in a more interactive, ad hoc way, perhaps developing an ML model or building an operational dashboard? In that case, we recommend that you use Azure Active Directory (Azure AD) credential passthrough. Will you be running automated, scheduled workloads that require one-off access to the containers in your data lake? Then using service principals to access Azure Data Lake Storage is preferred.

Credential passthrough

Credential passthrough provides user-scoped data access controls on any provisioned file stores, based on the user's role-based access controls. When you configure a cluster, select and expand Advanced Options to enable credential passthrough. Any user who attempts to access data on the cluster is governed by the access controls put in place on the corresponding file system resources, according to their Active Directory account.


This solution is suitable for many interactive use cases and offers a streamlined approach that requires you to manage permissions in just one place. In this way, you can allocate one cluster to multiple users without having to worry about provisioning specific access controls for each user. Process isolation on Azure Databricks clusters ensures that user credentials are not leaked or otherwise shared. This approach also has the added benefit of logging user-level entries in your Azure storage audit logs, which helps platform admins associate storage layer actions with specific users.

Some limitations of this method are:

  • It supports only Azure Data Lake Storage file systems.
  • It does not support Databricks REST API access.
  • Table access control: Azure Databricks does not recommend using credential passthrough with table access control. For more details on the limitations of combining these two features, see Limitations. For more information about using table access control, see Implement table access control.
  • It is not suitable for long-running jobs or queries, because of the limited time-to-live of a user's access token. For these types of workloads, we recommend that you use service principals to access your data.

Securely mount Azure Data Lake Storage using credential passthrough

You can mount an Azure Data Lake Storage account, or a folder inside it, to the Databricks File System (DBFS), providing an easy and secure way to access data in your data lake. The mount is a pointer to the data lake store, so the data is never synced locally. When you mount data using a cluster enabled with Azure Data Lake Storage credential passthrough, any read or write to the mount point uses your Azure AD credentials. This mount point is visible to other users, but the only users who have read and write access are those who:

  • Have access to the underlying Azure Data Lake Storage storage account
  • Are using a cluster enabled for Azure Data Lake Storage credential passthrough

To mount Azure Data Lake Storage using credential passthrough, follow the instructions in Mount Azure Data Lake Storage to DBFS using credential passthrough.
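As a sketch of what those instructions boil down to (the container, storage account, and mount names are placeholders), the mount uses the standard dbutils pattern with the passthrough token provider:

# Configure the mount to obtain tokens through credential passthrough.
configs = {
  "fs.azure.account.auth.type": "CustomAccessToken",
  "fs.azure.account.custom.token.provider.class":
    spark.conf.get("spark.databricks.passthrough.adls.gen2.tokenProviderClassName")
}

# Reads and writes through the mount point use the calling user's Azure AD credentials.
dbutils.fs.mount(
  source = "abfss://<file-system-name>@<storage-account-name>.dfs.core.chinacloudapi.cn/",
  mount_point = "/mnt/<mount-name>",
  extra_configs = configs)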

Service principals

How do you grant users or service accounts access for longer-running or more frequent workloads? What if you want to use a business intelligence tool, such as Power BI or Tableau, that needs access to the tables in Azure Databricks via ODBC/JDBC? In these cases, you should use service principals and OAuth. Service principals are identity accounts scoped to specific Azure resources. When building a job in a notebook, you can add the following lines to the job cluster's Spark configuration or run them directly in the notebook. This gives you access to the corresponding file store within the scope of the job.

spark.conf.set("fs.azure.account.auth.type.<storage-account-name>.dfs.core.chinacloudapi.cn", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type.<storage-account-name>.dfs.core.chinacloudapi.cn", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set("fs.azure.account.oauth2.client.id.<storage-account-name>.dfs.core.chinacloudapi.cn", "<application-id>")
spark.conf.set("fs.azure.account.oauth2.client.secret.<storage-account-name>.dfs.core.chinacloudapi.cn", dbutils.secrets.get(scope = "<scope-name>", key = "<key-name-for-service-credential>"))
spark.conf.set("fs.azure.account.oauth2.client.endpoint.<storage-account-name>.dfs.core.chinacloudapi.cn", "https://login.microsoftonline.com/<directory-id>/oauth2/token")

Similarly, rather than mounting your file store(s) with a service principal and an OAuth token, you can read directly from an Azure Data Lake Storage Gen2 URI. Once you've set the configuration above, you can access files in Azure Data Lake Storage directly using the URI:

"abfss://<file-system-name>@<storage-account-name>.dfs.core.chinacloudapi.cn/<directory-name>"

All users on a cluster where a file system has been registered in this way have access to the data in that file system.
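For example, to load a dataset from that location (the path and the Parquet format are illustrative):

# Read directly from the ADLS Gen2 URI; no mount point is required.
df = spark.read.parquet("abfss://<file-system-name>@<storage-account-name>.dfs.core.chinacloudapi.cn/<directory-name>")
df.show()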

Manage cluster configurations

Cluster policies allow Azure Databricks administrators to define the cluster attributes that are allowed on a cluster, such as instance types, number of nodes, custom tags, and more. When an admin creates a policy and assigns it to a user or a group, those users can only create clusters based on the policies they have access to. This gives administrators a much higher degree of control over the types of clusters that can be created.

You define policies in a JSON policy definition and then create cluster policies using the cluster policies UI or Cluster Policies API. A user can create a cluster only if they have the create_cluster permission or access to at least one cluster policy. Extending the requirements for the new analytics project team described below, administrators can create a cluster policy and assign it to one or more users within the project team, who can then create clusters for the team, limited to the rules specified in the cluster policy. The image below provides an example of a user with access to the Project Team Cluster Policy creating a cluster based on the policy definition.

(Screenshot: creating a cluster with the Project Team Cluster Policy selected.)
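As an illustrative sketch, such a policy definition might fix the credential passthrough setting, restrict instance types, and require a team tag; all attribute values below are assumptions, not prescriptions:

{
  "spark_conf.spark.databricks.passthrough.enabled": {
    "type": "fixed",
    "value": "true"
  },
  "node_type_id": {
    "type": "allowlist",
    "values": ["Standard_D14_v2"]
  },
  "autotermination_minutes": {
    "type": "range",
    "maxValue": 120,
    "defaultValue": 60
  },
  "custom_tags.team": {
    "type": "fixed",
    "value": "new-project-team"
  }
}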

Automatically provision clusters and grant permissions

With the addition of endpoints for both clusters and permissions, the Databricks REST API 2.0 makes it easy to provision cluster resources and grant permissions on them for users and groups at any scale. You can use the Clusters API to create and configure clusters for your specific use case.

You can then use the Permissions API to apply access controls to the cluster.

The following is an example of a configuration that might suit a new analytics project team.

The requirements are:

  • Support the interactive workloads of this team, who are mostly SQL and Python users.
  • Provision a data source in object storage, with credentials that give the team access to the data tied to their role.
  • Ensure that users get an equal share of the cluster's resources.
  • Provision larger, memory-optimized instance types.
  • Grant permissions on the cluster such that only this new project team has access to it.
  • Tag this cluster to make sure you can properly charge back any compute costs incurred.

Deployment script

You deploy this configuration by using the API endpoints in the Clusters and Permissions APIs; a scripted sketch follows the two payloads below.

Provision cluster

Endpoint - https://<databricks-instance>/api/2.0/clusters/create


{
  "autoscale": {
      "min_workers": 2,
      "max_workers": 20
  },
  "cluster_name": "project team interactive cluster",
  "spark_version": "latest-stable-scala2.11",
  "spark_conf": {
      "spark.Azure Databricks.cluster.profile": "serverless",
      "spark.Azure Databricks.repl.allowedLanguages": "python,sql",
      "spark.Azure Databricks.passthrough.enabled": "true",
      "spark.Azure Databricks.pyspark.enableProcessIsolation": "true"
  },
  "node_type_id": "Standard_D14_v2",
  "ssh_public_keys": [],
  "custom_tags": {
      "ResourceClass": "Serverless",
      "team": "new-project-team"
  },
  "spark_env_vars": {
      "PYSPARK_PYTHON": "/databricks/python3/bin/python3"
  },
  "autotermination_minutes": 60,
  "enable_elastic_disk": true,
  "init_scripts": []
}

Grant cluster permission

Endpoint - https://<databricks-instance>/api/2.0/permissions/clusters/<cluster_id>

{
  "access_control_list": [
    {
      "group_name": "project team",
      "permission_level": "CAN_MANAGE"
    }
  ]
}
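A minimal end-to-end sketch in Python, assuming a personal access token for authentication and the two payloads above saved as cluster_spec.json and acl.json (both names are placeholders):

import json
import requests

HOST = "https://<databricks-instance>"   # your workspace URL
TOKEN = "<personal-access-token>"        # assumed: an admin-scoped PAT
HEADERS = {"Authorization": f"Bearer {TOKEN}"}

# Create the cluster from the specification shown above.
with open("cluster_spec.json") as f:
    cluster_spec = json.load(f)
resp = requests.post(f"{HOST}/api/2.0/clusters/create", headers=HEADERS, json=cluster_spec)
resp.raise_for_status()
cluster_id = resp.json()["cluster_id"]

# Replace the cluster's access control list so only the project team can manage it.
with open("acl.json") as f:
    acl = json.load(f)
resp = requests.put(f"{HOST}/api/2.0/permissions/clusters/{cluster_id}", headers=HEADERS, json=acl)
resp.raise_for_status()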

Instantly you have a cluster that has been provisioned with secure access to critical data in the lake, locked down to all but the corresponding team, tagged for chargebacks, and configured to meet the requirements of the project. There are additional configuration steps within your host cloud provider account required to implement this solution, though these, too, can be automated to meet the requirements of scale.

Audit access

Configuring access control in Azure Databricks and controlling data access in storage is a great first step towards an efficient data governance solution. However, a complete solution also requires auditing access to data and providing alerting and monitoring capabilities. Azure Databricks provides a comprehensive set of audit events that log the activities of its users, allowing enterprises to monitor detailed usage patterns on the platform. To get a complete understanding of what users are doing on the platform and what data is being accessed, you should use both native Azure Databricks and cloud provider audit logging capabilities.

Make sure you have diagnostic logging enabled in Azure Databricks. Once logging is enabled for your account, Azure Databricks automatically starts sending diagnostic logs to the delivery location you specified. You also have the option to Send to Log Analytics, which forwards diagnostic data to Azure Monitor. Here is an example of the kind of query you can enter into the Log search box to list all users who have logged into the Azure Databricks workspace, along with their locations.
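A minimal sketch of such a Kusto query, assuming the diagnostic logs land in the standard DatabricksAccounts table; the table and column names may vary with your diagnostic settings:

DatabricksAccounts
| where ActionName == "login"
| project TimeGenerated, Identity, SourceIPAddress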


In a few steps, you can set up Azure monitoring services and create real-time alerts. The Azure Activity Log provides visibility into the actions taken on your storage accounts and the containers within them. Alert rules can be configured there as well.


Learn more

Here are some resources to help you build a comprehensive data governance solution that meets your organization's needs: