Unable to read files and list directories in a WASB filesystem

Problem

When you try reading a file on WASB with Spark, you get the following exception:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 19, 10.139.64.5, executor 0): shaded.databricks.org.apache.hadoop.fs.azure.AzureException: com.microsoft.azure.storage.StorageException: Incorrect Blob type, please use the correct Blob type to access a blob on the server. Expected BLOCK_BLOB, actual APPEND_BLOB.

When you try listing files in WASB using dbutils.fs.ls or the Hadoop API, you get the following exception:

java.io.FileNotFoundException: File /<some-directory> does not exist.

Cause

The WASB filesystem supports three types of blobs: block, page, and append.

  • Block blobs are optimized for uploading large blocks of data, and are the default in Hadoop.
  • Page blobs are optimized for random read and write operations.
  • Append blobs are optimized for append operations.

See Understanding block blobs, append blobs, and page blobs for details.

The errors described above occur if you try to read an append blob or list a directory that contains only append blobs. The Azure Databricks and Hadoop Azure WASB implementations do not support reading append blobs. Similarly, when listing a directory, append blobs are ignored.

There is no workaround to enable reading append blobs or listing a directory that contains only append blobs. However, you can use either the Azure CLI or the Azure Storage SDK for Python to identify whether a directory contains append blobs or whether a file is an append blob.

You can verify whether a directory contains append blobs by running the following Azure CLI command:

az storage blob list \
  --auth-mode key \
  --account-name <account-name> \
  --container-name <container-name> \
  --prefix <path>

The result is returned as a JSON document, in which you can easily find the blob type of each file.

If the directory is large, you can limit the number of results with the --num-results <num> flag.
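For example, here is a minimal sketch that runs the command above from Python and picks out the append blobs. It assumes the az CLI is installed and authenticated; the properties.blobType field name matches current CLI output but may vary across versions:

import json
import subprocess

# Run the az storage blob list command shown above and parse its JSON output.
result = subprocess.check_output([
  "az", "storage", "blob", "list",
  "--auth-mode", "key",
  "--account-name", "<account-name>",
  "--container-name", "<container-name>",
  "--prefix", "<path>",
])

# Keep only the append blobs.
blobs = json.loads(result)
append_blobs = [b["name"] for b in blobs if b["properties"]["blobType"] == "AppendBlob"]
print(append_blobs)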

You can also use the Azure Storage SDK for Python to list and explore files in a WASB filesystem. The snippet below uses the legacy azure-storage SDK (v2.x), with placeholder account credentials:

from azure.storage.blob import BlockBlobService  # legacy azure-storage SDK (v2.x)

service = BlockBlobService(account_name="<account-name>", account_key="<account-key>")
for blob in service.list_blobs("<container-name>"):
  if blob.properties.blob_type == "AppendBlob":
    print("\t Blob name: %s, %s" % (blob.name, blob.properties.blob_type))

Azure Databricks does support accessing append blobs using the Hadoop API, but only when appending to a file.
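For completeness, here is a minimal sketch of such an append from a Databricks notebook, going through the Hadoop FileSystem API via PySpark's internal JVM gateway (spark._jvm and spark._jsc). The wasbs:// URL is a placeholder, the target file must already exist as an append blob, and depending on your Hadoop Azure version the fs.azure.enable.append.support setting may need to be enabled:

# Build a Hadoop Path for the append blob (the URL is a placeholder).
path = spark._jvm.org.apache.hadoop.fs.Path(
  "wasbs://<container-name>@<account-name>.blob.core.windows.net/<path>")
fs = path.getFileSystem(spark._jsc.hadoopConfiguration())

# Open the existing append blob for appending and write a line to it.
out = fs.append(path)
out.write(bytearray("appended line\n", "utf-8"))
out.close()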

Solution

There is no workaround for this issue.

Use the Azure CLI or the Azure Storage SDK for Python to identify whether a directory contains append blobs or whether an object is an append blob.

You can implement either a Spark SQL UDF or a custom function using the RDD API to load, read, or convert blobs using the Azure Storage SDK for Python.
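For example, here is a hedged sketch of such a UDF using the v12 azure-storage-blob SDK. The connection string, container, and blob names are placeholders, and the SDK must be installed on the cluster:

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
from azure.storage.blob import BlobClient

def read_append_blob(name):
  # Download one append blob's full contents as text.
  client = BlobClient.from_connection_string(
    "<connection-string>", "<container-name>", name)
  return client.download_blob().readall().decode("utf-8")

read_append_blob_udf = udf(read_append_blob, StringType())

df = spark.createDataFrame([("<blob-name>",)], ["blob_name"])
df.withColumn("contents", read_append_blob_udf("blob_name")).show()

Note that this sketch creates a new BlobClient per row; for large numbers of blobs, a mapPartitions-based variant that reuses one client per partition will be considerably faster.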