Using data flows in pipelines

When building complex pipelines with multiple data flows, your logical flow can have a big impact on timing and cost. This section covers the impact of different architecture strategies.

Executing data flows in parallel

If you execute multiple data flows in parallel, the service spins up a separate Spark cluster for each activity. This isolates each job and lets the jobs run in parallel, but it also means that multiple clusters run at the same time.

If your data flows execute in parallel, we recommend that you don't enable the Azure IR time to live (TTL) property, because it leads to multiple unused warm pools.
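As a rough illustration of what "in parallel" means at the pipeline level, the following Python sketch builds a pipeline definition as a plain dictionary in which none of the Execute Data Flow activities depends on another. The pipeline, activity, and data flow names are hypothetical, and the JSON shape is a simplified assumption that should be checked against the Execute Data Flow activity reference before use.

```python
import json


def execute_data_flow_activity(name, data_flow_name, depends_on=None):
    """Build a simplified Execute Data Flow activity definition as a dict."""
    return {
        "name": name,
        "type": "ExecuteDataFlow",
        # An empty dependsOn list means nothing has to finish first,
        # so the activity can start as soon as the pipeline run starts.
        "dependsOn": [
            {"activity": dep, "dependencyConditions": ["Succeeded"]}
            for dep in (depends_on or [])
        ],
        "typeProperties": {
            "dataflow": {
                "referenceName": data_flow_name,  # hypothetical data flow name
                "type": "DataFlowReference",
            }
        },
    }


# Three activities with no dependencies between them (hypothetical names).
parallel_pipeline = {
    "name": "ParallelDataFlows",
    "properties": {
        "activities": [
            execute_data_flow_activity("LoadCustomers", "CustomersFlow"),
            execute_data_flow_activity("LoadOrders", "OrdersFlow"),
            execute_data_flow_activity("LoadProducts", "ProductsFlow"),
        ]
    },
}

# Activities without dependencies are the ones that start together.
independent = [
    activity["name"]
    for activity in parallel_pipeline["properties"]["activities"]
    if not activity["dependsOn"]
]

print(json.dumps(parallel_pipeline, indent=2))
print(f"{len(independent)} activities can start together, one cluster each")
```

Because no activity waits on another, each one gets its own cluster, which is why a TTL on the Azure IR would leave several warm pools idle afterward.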

Tip

Instead of running the same data flow multiple times in a ForEach activity, stage your data in a data lake and use wildcard paths to process the data in a single data flow.
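To make the wildcard idea concrete, here is a small Python sketch that shows which staged files a single wildcard path would pick up. The folder layout and file names are hypothetical, and Python's `fnmatch` module is used only as a stand-in for illustration; the matching rules of the data flow source itself may differ in detail.

```python
from fnmatch import fnmatchcase

# Hypothetical files staged in a data lake folder structure.
staged_files = [
    "staging/sales/2024-01.csv",
    "staging/sales/2024-02.csv",
    "staging/sales/2024-03.csv",
    "staging/returns/2024-01.csv",
]

# One wildcard path replaces one data flow run per file.
wildcard_path = "staging/sales/*.csv"

# fnmatchcase gives case-sensitive, platform-independent matching.
matched = [path for path in staged_files if fnmatchcase(path, wildcard_path)]

for path in matched:
    print(path)
```

In this sketch, one data flow run reads all three sales files, instead of a ForEach loop starting the same data flow three times.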

Execute data flows sequentially

If you execute your data flow activities in sequence, it's recommended that you set a TTL in the Azure IR configuration. The service reuses the compute resources, resulting in a faster cluster start-up time. Each activity is still isolated and receives a new Spark context for each execution.
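The sequential pattern can be sketched the same way. The Python example below builds an Azure IR definition with a TTL and a chain of Execute Data Flow activities in which each activity depends on the previous one and references that IR. All names and the TTL and core count values are hypothetical, and the JSON shapes are simplified assumptions to verify against the integration runtime and activity references.

```python
import json

# Azure IR definition with a time to live (illustrative values only).
integration_runtime = {
    "name": "DataFlowIRWithTTL",  # hypothetical IR name
    "properties": {
        "type": "Managed",
        "typeProperties": {
            "computeProperties": {
                "location": "AutoResolve",
                "dataFlowProperties": {
                    "computeType": "General",
                    "coreCount": 8,
                    "timeToLive": 10,  # minutes; assumed value
                },
            }
        },
    },
}


def chain_data_flows(data_flow_names, ir_name):
    """Build activities that run one after another on the same Azure IR."""
    activities = []
    previous = None
    for flow in data_flow_names:
        activity = {
            "name": f"Run{flow}",
            "type": "ExecuteDataFlow",
            # Each activity waits for the previous one to succeed.
            "dependsOn": (
                [{"activity": previous, "dependencyConditions": ["Succeeded"]}]
                if previous
                else []
            ),
            "typeProperties": {
                "dataflow": {"referenceName": flow, "type": "DataFlowReference"},
                "integrationRuntime": {
                    "referenceName": ir_name,
                    "type": "IntegrationRuntimeReference",
                },
            },
        }
        activities.append(activity)
        previous = activity["name"]
    return activities


sequential_activities = chain_data_flows(
    ["CustomersFlow", "OrdersFlow", "ProductsFlow"],  # hypothetical data flows
    integration_runtime["name"],
)

print(json.dumps(sequential_activities, indent=2))
```

Because all three activities point at the same IR, the later ones are the ones that can benefit from the faster start-up that the TTL provides.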

Overloading a single data flow

If you put all of your logic inside of a single data flow, the service executes the entire job on a single Spark instance. While this might seem like a way to reduce costs, it mixes together different logical flows and can be difficult to monitor and debug. If one component fails, all other parts of the job fail as well. Organizing data flows by independent flows of business logic is recommended. If your data flow becomes too large, splitting it into separate components makes monitoring and debugging easier. While there's no hard limit on the number of transformations in a data flow, having too many makes the job complex.

Execute sinks in parallel

By default, data flow sinks execute sequentially, and the data flow fails when a sink encounters an error. All sinks also default to the same group unless you go into the data flow properties and set different priorities for the sinks.

Data flows allow you to group sinks together from the data flow properties tab in the UI designer. You can both set the order of execution of your sinks and group sinks together by giving them the same group number. To help manage groups, you can ask the service to run the sinks in the same group in parallel.
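The grouping behavior can be pictured with a small Python model. This is a conceptual sketch only, not how the service is implemented: the sink names and group numbers are hypothetical, and it assumes that groups execute in ascending order of group number.

```python
from collections import defaultdict

# Hypothetical sinks as (sink name, group number) pairs.
sinks = [
    ("SinkAudit", 2),
    ("SinkCustomers", 1),
    ("SinkOrders", 1),
    ("SinkArchive", 3),
]


def plan_sink_execution(sinks, run_in_parallel):
    """Return a list of steps; each step lists the sinks written together."""
    groups = defaultdict(list)
    for name, group in sinks:
        groups[group].append(name)

    plan = []
    for group in sorted(groups):  # assumption: lower group numbers run first
        if run_in_parallel:
            # All sinks in the group are written in a single step.
            plan.append(list(groups[group]))
        else:
            # Each sink gets its own step, one after another.
            plan.extend([name] for name in groups[group])
    return plan


serial_plan = plan_sink_execution(sinks, run_in_parallel=False)
parallel_plan = plan_sink_execution(sinks, run_in_parallel=True)

print("serial:  ", serial_plan)
print("parallel:", parallel_plan)
```

In this model, only the two sinks that share group 1 are written together; sinks in groups of their own still run one group after another.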

On the pipeline's Execute Data Flow activity, the "Sink Properties" section has an option to turn on parallel sink loading. When you enable "run in parallel", you instruct the data flow to write to connected sinks at the same time rather than sequentially. To use the parallel option, the sinks must be grouped together and connected to the same stream via a New Branch or Conditional Split.
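For pipelines authored as JSON rather than in the UI, the setting might look like the following Python sketch. The activity and data flow names are hypothetical, and the `runConcurrently` property name is an assumption about how the "run in parallel" option is represented in the activity definition; confirm it against the Execute Data Flow activity reference before relying on it.

```python
import json

# Simplified Execute Data Flow activity definition (hypothetical names).
parallel_sink_activity = {
    "name": "LoadWarehouse",
    "type": "ExecuteDataFlow",
    "typeProperties": {
        "dataflow": {
            "referenceName": "WarehouseFlow",
            "type": "DataFlowReference",
        },
        # Assumed JSON equivalent of the "run in parallel" sink option.
        "runConcurrently": True,
    },
}

activity_json = json.dumps(parallel_sink_activity, indent=2)
print(activity_json)
```

The flag only has an effect when the data flow itself meets the conditions above: the sinks share a group and branch from the same stream.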

See other Data Flow articles related to performance: