Auto Loader FAQ
Commonly asked questions about Databricks Auto Loader.
Does Auto Loader process the file again when the file gets appended or overwritten?
Files are processed exactly once unless cloudFiles.allowOverwrites is enabled. When a file is appended to or overwritten, Azure Databricks cannot guarantee which version of the file will be processed. You should also use caution when enabling cloudFiles.allowOverwrites in file notification mode, where Auto Loader might identify new files through both file notifications and directory listing. Due to the discrepancy between file notification event time and file modification time, Auto Loader might obtain two different timestamps and therefore ingest the same file twice, even when the file is only written once.
In general, Databricks recommends that you use Auto Loader to ingest only immutable files and avoid setting cloudFiles.allowOverwrites. If this does not meet your requirements, contact your Azure Databricks account team.
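If you do need to reprocess files that change after arrival, the option is set on the stream reader. A minimal sketch, assuming a PySpark notebook with an active `spark` session; the paths, file format, and table name are placeholders, not values from this FAQ:

```python
# Sketch: enabling cloudFiles.allowOverwrites on an Auto Loader stream.
# Paths and names below are illustrative assumptions.
df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    # Reprocess files when they are appended to or overwritten. Which
    # version of a changed file is read is not guaranteed, so prefer
    # ingesting immutable files where possible.
    .option("cloudFiles.allowOverwrites", "true")
    .load("/mnt/source/data")
)
(
    df.writeStream
    .option("checkpointLocation", "/mnt/checkpoints/ingest")
    .toTable("raw_events")
)
```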
If my data files do not arrive continuously, but in regular intervals, for example, once a day, should I still use this source and are there any benefits?
In this case, you can set up a Trigger.AvailableNow (available in Databricks Runtime 10.4 LTS and above) Structured Streaming job and schedule it to run after the anticipated file arrival time. Auto Loader works well with both infrequent and frequent updates. Even if the eventual updates are very large, Auto Loader scales well to the input size. Auto Loader's efficient file discovery techniques and schema evolution capabilities make it the recommended method for incremental data ingestion.
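A sketch of such a run, using `availableNow=True` (the Python spelling of Trigger.AvailableNow); the paths, format, and table name are placeholder assumptions:

```python
# Sketch: process every file that has arrived since the last run, then stop.
# Schedule this as a Databricks job after the expected daily arrival time.
df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/mnt/checkpoints/schema")
    .load("/mnt/source/daily-drop")
)
(
    df.writeStream
    .trigger(availableNow=True)  # consume the backlog, then shut down
    .option("checkpointLocation", "/mnt/checkpoints/ingest")
    .toTable("daily_events")
)
```

Because the checkpoint tracks which files have been ingested, each scheduled run picks up only the new arrivals.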
What happens if I change the checkpoint location when restarting the stream?
A checkpoint location maintains important identifying information of a stream. Changing the checkpoint location effectively means that you have abandoned the previous stream and started a new stream.
Do I need to create event notification services beforehand?
No. If you choose file notification mode and provide the required permissions, Auto Loader can create file notification services for you. See What is Auto Loader file notification mode?
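Opting into file notification mode is a single reader option. A sketch, assuming the required cloud permissions are in place; the path and format are placeholders:

```python
# Sketch: file notification mode. With the required permissions, Auto
# Loader provisions the notification services on your behalf.
df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.useNotifications", "true")
    .load("/mnt/source/data")
)
```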
How do I clean up the event notification resources created by Auto Loader?
You can use the cloud resource manager to list and tear down resources. You can also delete these resources manually using the cloud provider's UI or APIs.
Can I run multiple streaming queries from different input directories on the same bucket/container?
Yes, as long as they are not parent-child directories; for example, prod-logs/ and prod-logs/usage/ would not work because /usage is a child directory of /prod-logs.
Can I use this feature when there are existing file notifications on my bucket or container?
Yes, as long as your input directory does not conflict with the existing notification prefix (for example, the parent-child directories described in the previous question).
How does Auto Loader infer schema?
When the DataFrame is first defined, Auto Loader lists your source directory and chooses the most recent (by file modification time) 50 GB of data or 1000 files, and uses those to infer your data schema.
Auto Loader also infers partition columns by examining the source directory structure, looking for file paths that contain the /key=value/ structure. If the source directory has an inconsistent structure, for example:
base/path/partition=1/date=2020-12-31/file1.json
// inconsistent because date and partition directories are in different orders
base/path/date=2020-12-31/partition=2/file2.json
// inconsistent because the date directory is missing
base/path/partition=3/file3.json
Auto Loader infers the partition columns as empty. Use cloudFiles.partitionColumns to explicitly parse columns from the directory structure.
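The effect of an inconsistent layout can be illustrated in plain Python (this is a sketch of the /key=value/ convention, not Auto Loader's internals; the helper name is hypothetical):

```python
# Sketch: derive partition column names from /key=value/ path segments,
# the same directory convention Auto Loader looks for.
def partition_columns(path):
    """Return the ordered list of key=value column names in a file path."""
    return [seg.split("=", 1)[0] for seg in path.split("/") if "=" in seg]

paths = [
    "base/path/partition=1/date=2020-12-31/file1.json",
    "base/path/date=2020-12-31/partition=2/file2.json",
    "base/path/partition=3/file3.json",
]
schemes = {tuple(partition_columns(p)) for p in paths}
# The three files disagree on which partition directories exist and in what
# order, so no single consistent set of partition columns can be inferred.
consistent = schemes.pop() if len(schemes) == 1 else ()
print(list(consistent))  # → []
```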
How does Auto Loader behave when the source folder is empty?
If the source directory is empty, Auto Loader requires you to provide a schema, because there is no data from which to perform inference.
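A sketch of supplying that schema explicitly; the field names, format, and path are illustrative assumptions:

```python
# Sketch: providing an explicit schema so the stream can start against a
# directory that contains no data yet.
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

schema = StructType([
    StructField("event_id", StringType()),
    StructField("event_time", TimestampType()),
])
df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .schema(schema)  # no files yet, so inference is not possible
    .load("/mnt/source/empty-so-far")
)
```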
When does Auto Loader infer schema? Does it evolve automatically after every micro-batch?
The schema is inferred when the DataFrame is first defined in your code. During each micro-batch, schema changes are evaluated on the fly; therefore, you don't need to worry about performance hits. When the stream restarts, it picks up the evolved schema from the schema location and starts executing without any overhead from inference.
What's the performance impact on ingesting the data when using Auto Loader schema inference?
Initial schema inference can take a couple of minutes for very large source directories. You shouldn't observe significant performance hits otherwise during stream execution. If you run your code in an Azure Databricks notebook, you can see status updates that specify when Auto Loader is listing your directory for sampling and inferring your data schema.
Due to a bug, a bad file has changed my schema drastically. What should I do to roll back a schema change?
Contact Databricks support for help.