Secure access to Azure Data Lake Storage using Azure Active Directory credential passthrough

You can authenticate automatically to Azure Data Lake Storage Gen2 from Azure Databricks clusters using the same Azure Active Directory (Azure AD) identity that you use to log into Azure Databricks. When you enable your cluster for Azure Data Lake Storage credential passthrough, commands that you run on that cluster can read and write data in Azure Data Lake Storage without requiring you to configure service principal credentials for access to storage.

Requirements

Important

You cannot authenticate to Azure Data Lake Storage with your Azure Active Directory credentials if you are behind a firewall that has not been configured to allow traffic to Azure Active Directory. Azure Firewall blocks Active Directory access by default. To allow access, configure the AzureActiveDirectory service tag. You can find equivalent information for network virtual appliances under the AzureActiveDirectory tag in the Azure IP Ranges and Service Tags JSON file. For more information, see Azure Firewall service tags and Azure IP Addresses for Public Cloud.

Cluster requirements

  • Databricks Runtime 6.0 or above for R support on standard clusters.
  • Azure Data Lake Storage credentials cannot have been set for the cluster (by providing your service principal credentials, for example).
  • Clusters enabled for credential passthrough do not support jobs, and they do not support data object privileges.

Logging recommendations

Since your identity is passed through to Azure Data Lake Storage, it can appear in the Azure Storage diagnostic logs. This allows ADLS requests to be tied to individual users from Azure Databricks clusters. To start receiving these logs, turn on diagnostic logging on your storage account.

  • Azure Data Lake Storage Gen2: Configure using PowerShell with the Set-AzStorageServiceLoggingProperty command. Specify 2.0 as the version, because log entry format 2.0 includes the user principal name in the request.
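
For example, a minimal PowerShell sketch, assuming the Az.Storage module is installed and you are signed in with Connect-AzAccount; the resource group and storage account names are placeholders:

# Look up the storage account to obtain a storage context.
$account = Get-AzStorageAccount -ResourceGroupName "<my-resource-group>" -Name "<my-storage-account-name>"

# Turn on version 2.0 logging for blob reads, writes, and deletes so that
# log entries include the user principal name.
Set-AzStorageServiceLoggingProperty -ServiceType Blob -LoggingOperations Read,Write,Delete -Version 2.0 -RetentionDays 7 -Context $account.Context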

Enable Azure Data Lake Storage credential passthrough for a high-concurrency cluster

High-concurrency clusters can be shared by multiple users. They support only Python, SQL, and R.

  1. When you create a cluster, set the Cluster Mode to High Concurrency.
  2. Under Advanced Options, select Enable credential passthrough and only allow Python and SQL commands.

[Screenshot: Enable credential passthrough]
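
If you automate cluster creation through the Clusters API instead of the UI, the sketch below shows one way this checkbox might be expressed. The Spark configuration keys used here (spark.databricks.cluster.profile, spark.databricks.repl.allowedLanguages, and spark.databricks.passthrough.enabled) are assumptions not taken from this article; verify them against the Clusters API documentation for your workspace before relying on them.

# A hedged sketch of creating a high-concurrency passthrough cluster via the
# Clusters API; <workspace-url> and <personal-access-token> are placeholders.
import requests

resp = requests.post(
    "https://<workspace-url>/api/2.0/clusters/create",
    headers={"Authorization": "Bearer <personal-access-token>"},
    json={
        "cluster_name": "passthrough-high-concurrency",
        "spark_version": "6.1.x-scala2.11",
        "node_type_id": "Standard_DS3_v2",
        "num_workers": 2,
        "spark_conf": {
            # High-concurrency cluster profile (assumed key).
            "spark.databricks.cluster.profile": "serverless",
            # Mirror the UI checkbox: passthrough on, Python and SQL only
            # (assumed keys).
            "spark.databricks.repl.allowedLanguages": "python,sql",
            "spark.databricks.passthrough.enabled": "true",
        },
    },
)
resp.raise_for_status()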

Enable Azure Data Lake Storage credential passthrough for a standard cluster

Standard clusters with credential passthrough are limited to a single user. Standard clusters support Python, SQL, and Scala. On Databricks Runtime 6.0 and above, they also support SparkR.

You must assign a user at cluster creation, but the cluster can be edited at any time by a user with Can Manage permissions to replace the original user.

Important

The user assigned to the cluster must have at least Can Attach To permission for the cluster in order to run commands on it. Admins and the cluster creator have Can Manage permissions, but cannot run commands on the cluster unless they are the designated cluster user.

  1. When you create a cluster, set the Cluster Mode to Standard.
  2. Under Advanced Options, select Enable credential passthrough for user-level access and select the user name from the Single User Access drop-down.

[Screenshot: Enable credential passthrough]

Read and write Azure Data Lake Storage using credential passthrough

Azure Data Lake Storage credential passthrough supports only Azure Data Lake Storage Gen2. Access data directly in Azure Data Lake Storage Gen2 using an abfss:// path. For example:

Azure Data Lake Storage Gen2

spark.read.csv("abfss://<my-file-system-name>@<my-storage-account-name>.dfs.core.chinacloudapi.cn/MyData.csv").collect()
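
Writes work the same way. A minimal sketch, using placeholder paths:

# Read a CSV with your passed-through Azure AD identity, then write it back
# out as Parquet; both paths are illustrative placeholders.
df = spark.read.csv("abfss://<my-file-system-name>@<my-storage-account-name>.dfs.core.chinacloudapi.cn/MyData.csv")
df.write.mode("overwrite").parquet("abfss://<my-file-system-name>@<my-storage-account-name>.dfs.core.chinacloudapi.cn/MyData.parquet")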

Mount Azure Data Lake Storage to DBFS using credential passthrough

You can mount an Azure Data Lake Storage account, or a folder inside it, to Databricks File System (DBFS). The mount is a pointer to a data lake store, so the data is never synced locally.

When you mount data using a cluster enabled with Azure Data Lake Storage credential passthrough, any read or write to the mount point uses your Azure AD credentials. This mount point will be visible to other users, but the only users that will have read and write access are those who:

  • Have access to the underlying Azure Data Lake Storage storage account
  • Are using a cluster enabled for Azure Data Lake Storage credential passthrough

To mount an Azure Data Lake Storage account or a folder inside it, use the Python commands described in Mount Azure Data Lake Storage Gen2 filesystem, replacing the configs with:

Azure Data Lake Storage Gen2

configs = {
  "fs.azure.account.auth.type": "CustomAccessToken",
  "fs.azure.account.custom.token.provider.class":   spark.conf.get("spark.databricks.passthrough.adls.gen2.tokenProviderClassName")
}
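
Putting it together, a minimal sketch of the mount call, assuming the file system, storage account, and mount-point names are replaced with your own:

configs = {
  "fs.azure.account.auth.type": "CustomAccessToken",
  "fs.azure.account.custom.token.provider.class": spark.conf.get("spark.databricks.passthrough.adls.gen2.tokenProviderClassName")
}

# Mount the file system; reads and writes through /mnt/<mount-name> will use
# the Azure AD identity of whoever runs the command.
dbutils.fs.mount(
  source = "abfss://<my-file-system-name>@<my-storage-account-name>.dfs.core.chinacloudapi.cn/",
  mount_point = "/mnt/<mount-name>",
  extra_configs = configs)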

Warning

Do not provide your storage account access keys or service principal credentials to authenticate to the mount point. That would give other users access to the filesystem using those credentials. The purpose of Azure Data Lake Storage credential passthrough is to prevent you from having to use those credentials and to ensure that access to the filesystem is restricted to users who have access to the underlying Azure Data Lake Storage account.

Security

It is safe to share Azure Data Lake Storage credential passthrough clusters with other users. You will be isolated from each other and will not be able to read or use each other's credentials.

Supported features

Feature | Minimum Databricks Runtime Version | Notes
------- | ---------------------------------- | -----
Python and SQL | 5.1 |
%run | 5.1 |
DBFS | 5.3 | Credentials are passed through only if the DBFS path resolves to a location in Azure Data Lake Storage Gen2. For DBFS paths that resolve to other storage systems, use a different method to specify your credentials.
Azure Data Lake Storage Gen2 | 5.3 |
Delta caching | 5.4 |
PySpark ML API | 5.4 | Some ML classes are unsupported; see the list below this table.
Broadcast variables | 5.5 | Within PySpark, there is a limit on the size of the Python UDFs you can construct, since large UDFs are sent as broadcast variables.
Notebook-scoped libraries | 5.5 |
Scala | 5.5 |
SparkR | 6.0 |
Notebook workflows | 6.1 |
PySpark ML API | 6.1 | All PySpark ML classes are supported.
Ganglia UI | 6.1 |

The PySpark ML classes that are not supported before Databricks Runtime 6.1:

  • org/apache/spark/ml/classification/RandomForestClassifier
  • org/apache/spark/ml/clustering/BisectingKMeans
  • org/apache/spark/ml/clustering/GaussianMixture
  • org/apache/spark/ml/clustering/KMeans
  • org/apache/spark/ml/clustering/LDA
  • org/apache/spark/ml/evaluation/ClusteringEvaluator
  • org/apache/spark/ml/feature/HashingTF
  • org/apache/spark/ml/feature/OneHotEncoder
  • org/apache/spark/ml/feature/StopWordsRemover
  • org/apache/spark/ml/feature/VectorIndexer
  • org/apache/spark/ml/feature/VectorSizeHint
  • org/apache/spark/ml/regression/IsotonicRegression
  • org/apache/spark/ml/regression/RandomForestRegressor
  • org/apache/spark/ml/util/DatasetUtils

Limitations

The following features are not supported with Azure Data Lake Storage credential passthrough:

  • %fs (use the equivalent dbutils.fs command instead; a short sketch follows this list).
  • Jobs.
  • The REST API.
  • Table access control. The powers granted by Azure Data Lake Storage credential passthrough could be used to bypass the fine-grained permissions of table ACLs, while the extra restrictions of table ACLs will constrain some of the power you get from credential passthrough. In particular:
    • If you have Azure AD permission to access the data files that underlie a particular table, you will have full permissions on that table via the RDD API, regardless of the restrictions placed on it via table ACLs.
    • You will be constrained by table ACL permissions only when using the DataFrame API. If you try to read files directly with the DataFrame API, you will see warnings that you do not have SELECT permission on any file, even though you could read those files directly via the RDD API.
    • You will be unable to read from tables backed by filesystems other than Azure Data Lake Storage, even if you have table ACL permission to read those tables.
  • The following methods on the SparkContext (sc) and SparkSession (spark) objects:
    • Deprecated methods.
    • Methods such as addFile() and addJar() that would allow non-admin users to call Scala code.
    • Any method that accesses a filesystem other than Azure Data Lake Storage Gen2. (To access other filesystems on a cluster with Azure Data Lake Storage credential passthrough enabled, use a different method to specify your credentials, and see the section on trusted filesystems under Troubleshooting.)
    • The old Hadoop APIs (hadoopFile() and hadoopRDD()).
    • Streaming APIs, since the passed-through credentials would expire while the stream was still running.
  • The FUSE mount (/dbfs).
  • Azure Data Factory.
  • Databricks Connect on high-concurrency clusters.
  • MLflow on high-concurrency clusters.
  • The azureml-sdk[databricks] Python package on high-concurrency clusters.
  • Extending the lifetime of Azure Active Directory passthrough tokens using Azure Active Directory token lifetime policies. As a consequence, if you send the cluster a command that takes longer than an hour, the command will fail if it accesses an Azure Data Lake Storage resource after the one-hour mark.
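
As referenced in the first item above, dbutils.fs covers the same operations as the unsupported %fs magic. A minimal sketch, using a placeholder abfss:// path:

# %fs ls <path> is not available on credential passthrough clusters;
# dbutils.fs.ls is the equivalent and returns a list of FileInfo objects.
files = dbutils.fs.ls("abfss://<my-file-system-name>@<my-storage-account-name>.dfs.core.chinacloudapi.cn/")
for f in files:
    print(f.path)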

Example notebooks

The following notebooks demonstrate Azure Data Lake Storage credential passthrough for Azure Data Lake Storage Gen2.

Azure Data Lake Storage Gen2 passthrough notebook

Get notebook

Troubleshooting

py4j.security.Py4JSecurityException: … is not whitelisted

This exception is thrown when you access a method that Azure Databricks has not explicitly marked as safe for Azure Data Lake Storage credential passthrough clusters. In most cases, it means that the method could allow a user on an Azure Data Lake Storage credential passthrough cluster to access another user's credentials.

org.apache.spark.api.python.PythonSecurityException: Path … uses an untrusted filesystem

This exception is thrown when you try to access a filesystem that is not known by the Azure Data Lake Storage credential passthrough cluster to be safe. Using an untrusted filesystem might allow a user on an Azure Data Lake Storage credential passthrough cluster to access another user's credentials, so all filesystems that are not known with confidence to be used safely are disallowed.

To configure the set of trusted filesystems on an Azure Data Lake Storage credential passthrough cluster, set the Spark conf key spark.databricks.pyspark.trustedFilesystems on that cluster to a comma-separated list of class names that are trusted implementations of org.apache.hadoop.fs.FileSystem.
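
For example, the cluster's Spark config would contain a line like the following. The class names here are hypothetical placeholders for illustration, not filesystems that ship with Databricks; list only FileSystem implementations you have vetted yourself:

spark.databricks.pyspark.trustedFilesystems com.example.fs.TrustedFileSystemA,com.example.fs.TrustedFileSystemB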