Reading large DBFS-mounted files using Python APIs

This article explains how to resolve an error that occurs when you read large DBFS-mounted files using local Python APIs.

Problem

If you mount a folder onto dbfs:// and use a Python API such as pandas to read a file larger than 2 GB, you see the following error:

/databricks/python/local/lib/python2.7/site-packages/pandas/parser.so in pandas.parser.TextReader.__cinit__ (pandas/parser.c:3427)()
/databricks/python/local/lib/python2.7/site-packages/pandas/parser.so in pandas.parser.TextReader._setup_parser_source (pandas/parser.c:6883)()
IOError: Initializing from file failed

Cause

The error occurs because one argument of the Python method that reads the file is a signed int. The length of the file is passed as an int, and for a file larger than 2 GB that length can exceed the maximum value a signed int can hold.
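
To make the size limit concrete, the following sketch compares a file length with the largest value a signed integer can hold. It assumes the signed int is 32 bits wide (the article only says "signed int"), and the names `INT32_MAX` and `fits_in_signed_int32` are illustrative, not part of pandas or any other library.

```python
# Assumption: the "signed int" is 32 bits wide, so its maximum is 2**31 - 1.
INT32_MAX = 2**31 - 1  # 2147483647 bytes, one byte short of 2 GiB


def fits_in_signed_int32(length_in_bytes):
    """Return True if a file length can be stored in a signed 32-bit int."""
    return 0 <= length_in_bytes <= INT32_MAX


two_gib = 2 * 1024**3  # 2147483648 bytes

print(fits_in_signed_int32(two_gib - 1))  # True
print(fits_in_signed_int32(two_gib))      # False
```

Under that assumption, any file of 2 GiB or more has a length that no longer fits, which matches the threshold described above.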

Solution

Move the file from dbfs:// to the local file system (file://), then read it using the Python API. For example:

  1. Copy the file from dbfs:// to file://:

    %fs cp dbfs:/mnt/large_file.csv file:/tmp/large_file.csv
    
  2. Read the file with the pandas API:

    import pandas as pd
    pd.read_csv('file:/tmp/large_file.csv').head()
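
The copy step is only needed for files above the size limit, so it can help to check a file's length first. The sketch below is one possible way to do that with the standard library. The helper name `needs_local_copy`, the constant `SIGNED_INT_MAX`, and the assumption that the limit is the largest signed 32-bit integer are illustrative and do not come from the original article.

```python
import os
import tempfile

# Assumption: the limit is the largest signed 32-bit integer (just under 2 GiB).
SIGNED_INT_MAX = 2**31 - 1


def needs_local_copy(path, limit=SIGNED_INT_MAX):
    """Return True if the file at `path` is longer than `limit` bytes."""
    return os.path.getsize(path) > limit


# Demonstrate the check on a small temporary file (8 bytes long).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"a,b\n1,2\n")
    demo_path = tmp.name

print(needs_local_copy(demo_path))           # False: 8 bytes is far below the limit
print(needs_local_copy(demo_path, limit=4))  # True: 8 bytes exceeds a 4-byte limit

os.remove(demo_path)
```

On a cluster, the same helper could be pointed at the mounted file, assuming the mount is also exposed through a local path such as /dbfs/mnt/large_file.csv. If it returns True, copy the file to file:/tmp as in step 1 before reading it.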