Amazon S3 Select

Amazon S3 Select enables retrieving only a subset of data from an object. The Databricks S3 Select connector provides an Apache Spark data source that leverages S3 Select. When you use an S3 Select data source, filter and column selection on a DataFrame are pushed down to S3 Select, saving S3 data bandwidth.

Limitations

Amazon S3 Select supports the following file formats:

  • CSV and JSON files
  • UTF-8 encoding
  • GZIP or no compression

The Databricks S3 Select connector has the following limitations:

  • Complex types (arrays and objects) cannot be used in JSON
  • Schema inference is not supported
  • File splitting is not supported; however, multiline records are supported
  • DBFS mount points are not supported

Important

Azure Databricks strongly encourages you to use the S3AFileSystem provided by Databricks, which is the default for the s3a://, s3://, and s3n:// file system schemes in Databricks Runtime. If you need assistance with migration to S3AFileSystem, contact Databricks support or your Azure Databricks account team.

Usage

Scala

spark.read.format("s3select").schema(...).options(...).load("s3://bucket/filename")
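
For example, the following sketch reads a CSV object with an explicit schema and shows a pushed-down filter and projection; the bucket, object key, and column names are illustrative assumptions, not values from your environment.

import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Hypothetical two-column schema; schema inference is not supported, so a schema must be supplied.
val schema = StructType(Seq(
  StructField("name", StringType),
  StructField("age", IntegerType)
))

val df = spark.read
  .format("s3select")
  .schema(schema)
  .option("Header", "true")            // skip the header line of the CSV
  .load("s3://my-bucket/people.csv")   // hypothetical bucket and object key

// The filter and column selection below are pushed down to S3 Select.
df.filter(df("age") > 21).select("name").show()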

SQL

CREATE TABLE name (...) USING S3SELECT LOCATION 's3://bucket/filename' [ OPTIONS (...) ]

If the filename extension is .csv or .json, the format is automatically detected; otherwise you must provide the FileFormat option.
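
As a sketch issued from Scala (the table name, columns, bucket, and object key below are hypothetical), you could register a table over a gzip-compressed CSV object whose key lacks a .csv extension, so the format must be given explicitly:

spark.sql("""
  CREATE TABLE sales (id INT, amount DOUBLE)
  USING S3SELECT
  LOCATION 's3://my-bucket/exports/sales.gz'
  OPTIONS (FileFormat 'csv', CompressionType 'gzip', Header 'true')
""")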

Options

This section describes options for all file types and options specific to CSV and JSON.

Generic options

Option name     | Default value | Description
FileFormat      | 'auto'        | Input file type ('auto', 'csv', or 'json')
CompressionType | 'none'        | Compression codec used by the input file ('none' or 'gzip')
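
For instance, when an object's key does not end in .csv or .json, a read might pass these options explicitly; the path and schema below are assumed for illustration:

spark.read
  .format("s3select")
  .schema("id INT, name STRING")      // DDL-string schema; schema inference is not supported
  .option("FileFormat", "json")       // required: the .gz extension does not reveal the format
  .option("CompressionType", "gzip")  // the object is gzip-compressed
  .load("s3://my-bucket/raw/records.gz")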

CSV specific options

Option name                | Default value | Description
NullValue                  | ''            | Character string representing null values in the input
Header                     | false         | Whether to skip the first line of the input (potential header contents are ignored)
Comment                    | '#'           | Lines starting with the value of this parameter are ignored
RecordDelimiter            | '\n'          | Character separating records in a file
Delimiter                  | ','           | Character separating fields within a record
Quote                      | '"'           | Character used to quote values containing reserved characters
Escape                     | '"'           | Character used to escape the quote character within a quoted value
AllowQuotedRecordDelimiter | false         | Whether values can contain quoted record delimiters
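
As an illustration, a tab-delimited file with a custom null marker might be read as follows; the delimiter, null marker, schema, and path are assumptions for the example:

spark.read
  .format("s3select")
  .schema("id INT, amount DOUBLE")   // DDL-string schema; schema inference is not supported
  .option("Delimiter", "\t")         // fields are tab-separated rather than comma-separated
  .option("NullValue", "NA")         // treat the string NA as null
  .option("Header", "true")          // skip the header line
  .load("s3://my-bucket/ledger.tsv")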

JSON specific options

Option name | Default value | Description
Type        | 'document'    | Type of input ('document' or 'lines')
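
For example, a newline-delimited JSON object can be read by setting Type to 'lines'; the schema and path are illustrative, and only scalar columns are used because complex types are not supported:

spark.read
  .format("s3select")
  .schema("id INT, name STRING")   // scalar columns only; arrays and objects are not supported
  .option("Type", "lines")         // one JSON record per line instead of a single document
  .load("s3://my-bucket/events.json")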

S3 authentication

  1. Default Credential Provider Chain (recommended option):

    AWS credentials are automatically retrieved through the DefaultAWSCredentialsProviderChain. If you use instance profiles to authenticate to S3, use this method. Other methods of providing credentials (methods 2 and 3) take precedence over this default.

  2. Set keys in Hadoop conf: Specify AWS keys in Hadoop configuration properties.

    Important

    • When using AWS keys to access S3, always set the configuration properties fs.s3n.awsAccessKeyId and fs.s3n.awsSecretAccessKey as shown in the following example; the properties fs.s3a.access.key and fs.s3a.secret.key are not supported.

    • To reference the s3a:// filesystem, set the fs.s3n.awsAccessKeyId and fs.s3n.awsSecretAccessKey properties in a Hadoop XML configuration file or call sc.hadoopConfiguration.set() to set Spark's global Hadoop configuration.

      sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "$AccessKey")
      sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "$SecretKey")
      
      sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", ACCESS_KEY)
      sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", SECRET_KEY)
      
  3. Encode keys in URI: For example, the URI s3a://$AccessKey:$SecretKey@bucket/path/to/dir encodes the key pair (AccessKey, SecretKey).