Trigger jobs when new files arrive

File arrival triggers start a run of your Azure Databricks job when new files arrive in an external location such as Amazon S3 or Azure storage. Use this feature when a scheduled job would be inefficient because new data arrives on an irregular schedule.

File arrival triggers make a best effort to check for new files every minute, although this can be affected by the performance of the underlying cloud storage. File arrival triggers do not incur additional costs other than cloud provider costs associated with listing files in the storage location.

A file arrival trigger can be configured to monitor the root of a Unity Catalog external location or volume, or a subpath of an external location or volume. For example, for the Unity Catalog root volume /Volumes/mycatalog/myschema/myvolume/, the following are valid paths for a file arrival trigger:

/Volumes/mycatalog/myschema/myvolume/
/Volumes/mycatalog/myschema/myvolume/mydirectory/

A file arrival trigger recursively checks for new files in all subdirectories of the configured location. For example, suppose you create a file arrival trigger for the location /Volumes/mycatalog/myschema/myvolume/mydirectory/ and this location has the following subdirectories:

/Volumes/mycatalog/myschema/myvolume/mydirectory/subdirA
/Volumes/mycatalog/myschema/myvolume/mydirectory/subdirB
/Volumes/mycatalog/myschema/myvolume/mydirectory/subdirC/subdirD

The trigger checks for new files in mydirectory, subdirA, subdirB, subdirC, and subdirC/subdirD.
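The recursive behavior can be pictured as a simple ancestor check on paths. The following is a minimal sketch for illustration only: the helper name is_monitored is hypothetical, and the code compares path strings locally rather than calling any Azure Databricks API.

```python
from pathlib import PurePosixPath


def is_monitored(file_path: str, trigger_location: str) -> bool:
    """Return True if file_path lies anywhere below trigger_location."""
    file_parts = PurePosixPath(file_path).parts
    root_parts = PurePosixPath(trigger_location).parts
    # Compare whole path components so that "mydirectory2" is not
    # mistaken for a child of "mydirectory".
    return (
        len(file_parts) > len(root_parts)
        and file_parts[: len(root_parts)] == root_parts
    )


location = "/Volumes/mycatalog/myschema/myvolume/mydirectory/"

# A file several levels down is still covered by the trigger.
print(is_monitored(location + "subdirC/subdirD/data.csv", location))
# A file in a sibling directory of the volume is not.
print(is_monitored("/Volumes/mycatalog/myschema/myvolume/other/data.csv", location))
```

The file names data.csv and other are made up for the example; only the directory layout comes from the text above.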

Requirements

To use file arrival triggers, the monitored storage location must be a Unity Catalog external location or a Unity Catalog volume.

Limitations

  • Only new files trigger runs. Overwriting an existing file with a file of the same name does not trigger a run.
  • A maximum of 50 jobs can be configured with a file arrival trigger in an Azure Databricks workspace.
  • A storage location configured for a file arrival trigger can contain at most 10,000 files. Locations with more files cannot be monitored for new file arrivals. If the configured storage location is a subpath of a Unity Catalog external location or volume, the 10,000 file limit applies to the subpath, not to the root of the storage location. For example, the root can contain more than 10,000 files across its subdirectories, as long as the configured subdirectory stays within the 10,000 file limit.
  • The path used for a file arrival trigger must not contain any external tables or managed locations of catalogs and schemas.
  • The path used for a file arrival trigger cannot contain wildcards such as * or ?.
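Two of these limitations, the wildcard rule and the 10,000 file limit, are easy to check before you configure a trigger. The sketch below is a hypothetical pre-flight helper, not part of any Azure Databricks library: the name check_trigger_path is invented, and the file count is assumed to come from your own listing of the location.

```python
from typing import List

MAX_FILES = 10_000  # per-location file limit described in the list above


def check_trigger_path(path: str, file_count: int) -> List[str]:
    """Return a list of problems found; an empty list means no problem was detected."""
    problems = []
    # Wildcards are not allowed in the configured path.
    if any(ch in path for ch in "*?"):
        problems.append("path contains a wildcard")
    # The limit applies to the configured location, not to the root above it.
    if file_count > MAX_FILES:
        problems.append(
            f"location holds {file_count} files, above the {MAX_FILES} file limit"
        )
    return problems


print(check_trigger_path("/Volumes/mycatalog/myschema/myvolume/mydirectory/", 500))
print(check_trigger_path("/Volumes/mycatalog/myschema/myvolume/*.csv", 12_000))
```

The helper does not cover the other limitations, such as the per-workspace job limit or the rule about external tables and managed locations, which depend on workspace state.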

Add a file arrival trigger

To add a file arrival trigger to a job:

  1. In the sidebar, click Workflows.
  2. In the Name column on the Jobs tab, click the job name.
  3. In the Job details panel on the right, click Add trigger.
  4. In Trigger type, select File arrival.
  5. In Storage location, enter the URL of the location to monitor: the root or a subpath of a Unity Catalog external location, or the root or a subpath of a Unity Catalog volume.
  6. (Optional) Configure advanced options:
    • Minimum time between triggers in seconds: The minimum time to wait before triggering a run after a previous run completes. Files that arrive in this period trigger a run only after the waiting time expires. Use this setting to control the frequency of run creation.
    • Wait after last change in seconds: The time to wait before triggering a run after a file arrives. Another file arrival in this period resets the timer. Use this setting when files arrive in batches and the whole batch must be processed after all files have arrived.
  7. To validate the configuration, click Test connection.
  8. Click Save.
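The same configuration can also be expressed as job settings for the Jobs REST API. The sketch below only builds the settings fragment and prints it; it makes no network call. The field names (trigger, file_arrival, url, min_time_between_triggers_seconds, wait_after_last_change_seconds) reflect our understanding of the Jobs API 2.1 and should be checked against the API reference for your workspace. The helper name and the values 300 and 120 are illustrative assumptions.

```python
import json
from typing import Optional


def file_arrival_trigger(
    url: str,
    min_time_between_triggers_seconds: Optional[int] = None,
    wait_after_last_change_seconds: Optional[int] = None,
) -> dict:
    """Build a job-settings fragment describing a file arrival trigger."""
    file_arrival = {"url": url}
    # Only include the advanced options that were actually set.
    if min_time_between_triggers_seconds is not None:
        file_arrival["min_time_between_triggers_seconds"] = min_time_between_triggers_seconds
    if wait_after_last_change_seconds is not None:
        file_arrival["wait_after_last_change_seconds"] = wait_after_last_change_seconds
    return {"trigger": {"file_arrival": file_arrival}}


settings = file_arrival_trigger(
    "/Volumes/mycatalog/myschema/myvolume/mydirectory/",
    min_time_between_triggers_seconds=300,  # assumed value: at most one run per 5 minutes
    wait_after_last_change_seconds=120,     # assumed value: wait 2 quiet minutes per batch
)
print(json.dumps(settings, indent=2))
```

A fragment like this would be merged into the job's settings when creating or updating the job through the API. Leaving out both optional arguments corresponds to skipping step 6.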

Receive notifications of failed file arrival triggers

To be notified if a file arrival trigger fails to evaluate, configure email or system destination notifications on job failure. See Add email and system notifications for job events.
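If you manage jobs through the Jobs REST API rather than the UI, failure notifications can be sketched as a settings fragment as well. The field names below (email_notifications, webhook_notifications, on_failure, id) reflect our understanding of the Jobs API 2.1 and should be verified against the API reference. The helper name, the email address, and the destination ID are placeholders invented for this example.

```python
import json
from typing import Iterable


def failure_notifications(
    emails: Iterable[str], destination_ids: Iterable[str] = ()
) -> dict:
    """Build a job-settings fragment that notifies on job failure."""
    settings = {"email_notifications": {"on_failure": list(emails)}}
    destination_ids = list(destination_ids)
    # System destinations are referenced by ID; include them only when given.
    if destination_ids:
        settings["webhook_notifications"] = {
            "on_failure": [{"id": dest_id} for dest_id in destination_ids]
        }
    return settings


notifications = failure_notifications(
    ["data-team@example.com"],               # placeholder address
    destination_ids=["<destination-id>"],    # placeholder system destination ID
)
print(json.dumps(notifications, indent=2))
```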