Databricks File System (DBFS)

Databricks File System (DBFS) is a distributed file system mounted into an Azure Databricks workspace and available on Azure Databricks clusters. DBFS is an abstraction on top of scalable object storage and offers the following benefits:

  • Allows you to mount storage objects so that you can seamlessly access data without requiring credentials.
  • Allows you to interact with object storage using directory and file semantics instead of storage URLs.
  • Persists files to object storage, so you won’t lose data after you terminate a cluster.

DBFS root

The default storage location in DBFS is known as the DBFS root. Several types of data are stored in the following DBFS root locations:

  • /FileStore: Imported data files, generated plots, and uploaded libraries. See FileStore.
  • /databricks-datasets: Sample public datasets.
  • /databricks-results: Files generated by downloading the full results of a query.
  • /databricks/init: Global and cluster-named (deprecated) init scripts.
  • /user/hive/warehouse: Data and metadata for non-external Hive tables.

In a new workspace, the DBFS root has the following default folders:

DBFS root default folders

The DBFS root also contains data that is not visible and cannot be accessed directly, including mount point metadata, credentials, and certain types of logs.

Important

Data written to mount point paths (/mnt) is stored outside of the DBFS root. Even though the DBFS root is writeable, we recommend that you store data in mounted object storage rather than in the DBFS root.

Mount object storage to DBFS

Mounting object storage to DBFS allows you to access objects in object storage as if they were on the local file system.

Important

  • All users have read and write access to the objects in object storage mounted to DBFS.

  • Nested mounts are not supported. For example, the following structure is not supported:

    • storage1 mounted as /mnt/storage1
    • storage2 mounted as /mnt/storage1/storage2

    We recommend creating separate mount entries for each storage object:

    • storage1 mounted as /mnt/storage1
    • storage2 mounted as /mnt/storage2

For information on how to mount and unmount Azure Blob storage containers and Azure Data Lake Storage accounts, see Mount Azure Blob storage containers to DBFS, Mount Azure Data Lake Storage to DBFS using credential passthrough, and Mount an Azure Data Lake Storage Gen2 account using a service principal and OAuth 2.0.
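
As a quick orientation before those articles, the sketch below shows the general shape of a mount operation with dbutils.fs.mount for an Azure Blob storage container, using an account key kept in a secret scope. The storage account, container, scope, and key names are placeholders, not values from this article.

# Mount an Azure Blob storage container at /mnt/storage1 (illustrative sketch; all angle-bracket values are placeholders).
dbutils.fs.mount(
  source = "wasbs://<container>@<storage-account>.blob.core.windows.net",
  mount_point = "/mnt/storage1",
  extra_configs = {
    "fs.azure.account.key.<storage-account>.blob.core.windows.net":
      dbutils.secrets.get(scope = "<scope-name>", key = "<key-name>")
  }
)

# The mounted container can then be browsed with file semantics.
display(dbutils.fs.ls("/mnt/storage1"))

# Unmount when the mount point is no longer needed.
dbutils.fs.unmount("/mnt/storage1")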

Access DBFS

You can upload data to DBFS using the file upload interface, and you can upload and access DBFS objects using the DBFS CLI, the DBFS API, Databricks file system utilities (dbutils.fs), Spark APIs, and local file APIs. In a Spark cluster, you access DBFS objects using Databricks file system utilities, Spark APIs, or local file APIs; on a local computer, you use the Databricks CLI or the DBFS API.

In this section:

File upload interface

If you have small data files on your local machine that you want to analyze with Azure Databricks, you can easily import them to Databricks File System (DBFS) using the file upload interface.

Note

Admin users can disable the file upload interface. See Manage data upload.

If you’d like to create a table using the UI, see Create a table using the UI.

If you’d like to upload data for use in a notebook, follow these steps.

  1. Create a new notebook or open an existing one, then click File > Upload Data.

    Upload data

  2. Select a target directory in DBFS to store the uploaded file. The target directory defaults to /shared_uploads/<your-email-address>/.

    Uploaded files are accessible by everyone who has access to the workspace.

  3. Either drag files onto the drop target or click Browse to locate files in your local filesystem.

    Select Files and Destination

  4. When you have finished uploading the files, click Next.

    If you’ve uploaded CSV, TSV, or JSON files, Azure Databricks generates code showing how to load the data into a DataFrame; a representative snippet follows these steps.

    View Files and Sample Code

    To save the text to your clipboard, click Copy.

  5. Click Done to return to the notebook.
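
The generated snippet for a CSV upload typically resembles the following sketch. The path and file name are placeholders; the actual values depend on the target directory you chose and the uploaded file.

# Load an uploaded CSV file into a DataFrame (illustrative sketch; adjust the path and options as needed).
df = spark.read.format("csv") \
  .option("header", "true") \
  .option("inferSchema", "true") \
  .load("dbfs:/shared_uploads/<your-email-address>/example.csv")

display(df)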

Databricks CLI

The DBFS command-line interface (CLI) uses the DBFS API to expose an easy-to-use command-line interface to DBFS. Using this client, you can interact with DBFS using commands similar to those you use on a Unix command line. For example:

# List files in DBFS
dbfs ls
# Put local file ./apple.txt to dbfs:/apple.txt
dbfs cp ./apple.txt dbfs:/apple.txt
# Get dbfs:/apple.txt and save to local file ./apple.txt
dbfs cp dbfs:/apple.txt ./apple.txt
# Recursively put local dir ./banana to dbfs:/banana
dbfs cp -r ./banana dbfs:/banana

For more information about the DBFS command-line interface, see Databricks CLI.

dbutils

dbutils.fs provides file-system-like commands to access files in DBFS. This section has several examples of how to write files to and read files from DBFS using dbutils.fs commands.

Tip

To access the help menu for DBFS, use the dbutils.fs.help() command.

  • Write files to and read files from the DBFS root as if it were a local filesystem.

    dbutils.fs.mkdirs("/foobar/")
    
    dbutils.fs.put("/foobar/baz.txt", "Hello, World!")
    
    dbutils.fs.head("/foobar/baz.txt")
    
    dbutils.fs.rm("/foobar/baz.txt")
    
  • Use dbfs:/ to access a DBFS path.

    display(dbutils.fs.ls("dbfs:/foobar"))
    
  • Notebooks support %fs magic commands as a shorthand for accessing the dbutils filesystem module. Most dbutils.fs commands are available using %fs magic commands.

    # List the DBFS root
    
    %fs ls
    
    # Recursively remove the files under foobar
    
    %fs rm -r foobar
    
    # Overwrite the file "/mnt/my-file" with the string "Hello world!"
    
    %fs put -f "/mnt/my-file" "Hello world!"
    

DBFS API

See DBFS API and Upload a big file into DBFS.
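
As a brief illustration, the DBFS REST API can be called from any HTTP client. The sketch below lists a DBFS directory; the workspace URL and personal access token are read from environment variables whose names are assumptions here, not part of the API.

# List a DBFS directory through the DBFS REST API (2.0).
# DATABRICKS_HOST and DATABRICKS_TOKEN are assumed environment variables.
import os
import requests

host = os.environ["DATABRICKS_HOST"]    # e.g. https://adb-<workspace-id>.<n>.azuredatabricks.net
token = os.environ["DATABRICKS_TOKEN"]

resp = requests.get(
  f"{host}/api/2.0/dbfs/list",
  headers={"Authorization": f"Bearer {token}"},
  params={"path": "/FileStore"},
)
resp.raise_for_status()

# Each entry reports its path, whether it is a directory, and its size in bytes.
for entry in resp.json().get("files", []):
  print(entry["path"], entry["is_dir"], entry["file_size"])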

Spark APIs

When you’re using Spark APIs, you reference files with "/mnt/training/file.csv" or "dbfs:/mnt/training/file.csv". The following example writes a DataFrame df as text to /tmp/foo.txt in DBFS.

df.write.text("/tmp/foo.txt")
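
Reading the data back uses the same path conventions, with or without the dbfs:/ scheme; a minimal sketch:

# Both forms refer to the same DBFS location written above.
df = spark.read.text("/tmp/foo.txt")
df = spark.read.text("dbfs:/tmp/foo.txt")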

Local file APIs

You can use local file APIs to read and write to DBFS paths. Azure Databricks configures each cluster node with a FUSE mount /dbfs that allows processes running on cluster nodes to read and write to the underlying distributed storage layer with local file APIs. When using local file APIs, you must provide the path under /dbfs. For example:

Python

# Write a file to DBFS using Python I/O APIs.
with open("/dbfs/tmp/test_dbfs.txt", 'w') as f:
  f.write("Apache Spark is awesome!\n")
  f.write("End of example!")

# Read the file back.
with open("/dbfs/tmp/test_dbfs.txt", "r") as f_read:
  for line in f_read:
    print(line)

Scala

import scala.io.Source

val filename = "/dbfs/tmp/test_dbfs.txt"
for (line <- Source.fromFile(filename).getLines()) {
  println(line)
}

Local file API limitations

The following lists the local file API limitations that apply to each Databricks Runtime version.

  • All: Does not support credential passthrough.

  • 6.0

    • Does not support random writes. For workloads that require random writes, perform the I/O on local disk first and then copy the result to /dbfs. For example:

      # python
      import xlsxwriter
      from shutil import copyfile
      
      workbook = xlsxwriter.Workbook('/local_disk0/tmp/excel.xlsx')
      worksheet = workbook.add_worksheet()
      worksheet.write(0, 0, "Key")
      worksheet.write(0, 1, "Value")
      workbook.close()
      
      copyfile('/local_disk0/tmp/excel.xlsx', '/dbfs/tmp/excel.xlsx')
      
    • Does not support sparse files. To copy sparse files, use cp --sparse=never:

      $ cp sparse.file /dbfs/sparse.file
      error writing '/dbfs/sparse.file': Operation not supported
      $ cp --sparse=never sparse.file /dbfs/sparse.file
      
  • 5.5

    • Supports only files less than 2 GB in size. If you use local file I/O APIs to read or write files larger than 2 GB, you might see corrupted files. Instead, access files larger than 2 GB using the DBFS CLI, dbutils.fs, or Spark APIs, or use the /dbfs/ml folder described in Local file APIs for deep learning.

    • If you write a file using the local file I/O APIs and then immediately try to access it using the DBFS CLI, dbutils.fs, or Spark APIs, you might encounter a FileNotFoundException, a file of size 0, or stale file contents. That is expected because the OS caches writes by default. To force those writes to be flushed to persistent storage (in our case DBFS), use the standard Unix system call sync. For example:

      // scala
      import scala.sys.process._
      
      // Write a file using the local file API (over the FUSE mount).
      dbutils.fs.put("file:/dbfs/tmp/test", "test-contents")
      
      // Flush to persistent storage.
      "sync /dbfs/tmp/test" !
      
      // Read the file using "dbfs:/" instead of the FUSE mount.
      dbutils.fs.head("dbfs:/tmp/test")
      

Local file APIs for deep learning

For distributed deep learning applications, which require DBFS access for loading, checkpointing, and logging data, Databricks Runtime 6.0 and above provide a high-performance /dbfs mount that’s optimized for deep learning workloads.

In Databricks Runtime 5.5 LTS, only /dbfs/ml is optimized. In this version, Databricks recommends saving data under /dbfs/ml, which maps to dbfs:/ml.
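
For example, a training job on Databricks Runtime 5.5 LTS could checkpoint through the optimized FUSE path using plain local file APIs. This is a minimal sketch; the directory and file names are placeholders.

import os

# /dbfs/ml maps to dbfs:/ml and is the optimized path on Databricks Runtime 5.5 LTS.
ckpt_dir = "/dbfs/ml/checkpoints"
os.makedirs(ckpt_dir, exist_ok=True)

# Write a checkpoint file through the FUSE mount with local file APIs.
with open(os.path.join(ckpt_dir, "epoch_1.ckpt"), "wb") as f:
  f.write(b"model state bytes")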