Manage inputs and outputs of component and pipeline

In this article you learn:

  • Overview of inputs and outputs in components and pipelines
  • How to promote component inputs/outputs to pipeline inputs/outputs
  • How to define optional inputs
  • How to customize the output path
  • How to download outputs
  • How to register outputs as a named asset

Overview of inputs & outputs

Azure Machine Learning pipelines support inputs and outputs at both the component and pipeline levels.

At the component level, the inputs and outputs define the interface of a component. The output from one component can be used as an input for another component in the same parent pipeline, allowing for data or models to be passed between components. This interconnectivity forms a graph, illustrating the data flow within the pipeline.

At the pipeline level, inputs and outputs are useful for submitting pipeline jobs with varying data inputs or parameters that control the training logic (for example learning_rate). They're especially useful when invoking the pipeline via a REST endpoint. These inputs and outputs enable you to assign different values to the pipeline input or access the output of pipeline jobs through the REST endpoint. To learn more, see Creating Jobs and Input Data for Batch Endpoint.
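
For example, with the Azure CLI you can submit the same pipeline with a different pipeline input value by overriding it at submission time. This is a minimal sketch that assumes a pipeline job YAML file named pipeline.yml with a pipeline level input named pipeline_job_training_learning_rate (as in the example later in this article):

# submit the pipeline job, overriding a pipeline level input value at submission time
az ml job create -f pipeline.yml --set inputs.pipeline_job_training_learning_rate=0.01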

Types of Inputs and Outputs

The following types are supported as inputs and outputs of a component or a pipeline.

Data types:

  • uri_file
  • uri_folder
  • mltable

Model types:

  • mlflow_model
  • custom_model

Using a data or model output essentially serializes the outputs and saves them as files in a storage location. In subsequent steps, this storage location can be mounted, or the files can be downloaded or uploaded to the compute target filesystem, enabling the next step to access the files during job execution.

This process requires the component's source code to serialize the desired output object - usually stored in memory - into files. For instance, you could serialize a pandas dataframe as a CSV file. Note that Azure Machine Learning doesn't define any standardized methods for object serialization. As a user, you have the flexibility to choose your preferred method to serialize objects into files. Following that, in the downstream component, you can independently deserialize and read these files. Here are a few examples for your reference:

  • In the nyc_taxi_data_regression example, the prep component has a uri_folder type output. The component source code reads the CSV files from the input folder, processes the files, and writes the processed CSV files to the output folder.
  • In the nyc_taxi_data_regression example, the train component has an mlflow_model type output. The component source code saves the trained model using the mlflow.sklearn.save_model method.

In addition to the data and model types above, pipeline or component inputs can also be of the following primitive types:

  • string
  • number
  • integer
  • boolean

In the nyc_taxi_data_regression example, the train component has a number input named test_split_ratio.
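
For illustration, here's a minimal sketch of how such a primitive input could be declared in a component YAML; the default value shown is hypothetical:

inputs:
  test_split_ratio:
    type: number
    default: 0.2  # hypothetical default value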

Note

Outputs of primitive types aren't supported.

Path and mode for data inputs/outputs

For data asset input/output, you must specify a path parameter that points to the data location. This table shows the data locations that Azure Machine Learning pipelines support, along with path parameter examples:

| Location | Examples | Input | Output |
| --- | --- | --- | --- |
| A path on your local computer | ./home/username/data/my_data | ✓ |  |
| A path on a public http(s) server | https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/titanic.csv | ✓ |  |
| A path on Azure Storage | wasbs://<container_name>@<account_name>.blob.core.chinacloudapi.cn/<path> or abfss://<file_system>@<account_name>.dfs.core.chinacloudapi.cn/<path> | Not suggested, because it may need extra identity configuration to read the data. |  |
| A path on an Azure Machine Learning Datastore | azureml://datastores/<data_store_name>/paths/<path> | ✓ | ✓ |
| A path to a Data Asset | azureml:<my_data>:<version> | ✓ | ✓ |

Note

For input/output on storage, we highly suggest using an Azure Machine Learning datastore path instead of a direct Azure Storage path. Datastore paths are supported across various job types in a pipeline.

For data input/output, you can choose from various modes (download, mount or upload) to define how the data is accessed in the compute target. This table shows the possible modes for different type/mode/input/output combinations.

| Type | Input/Output | upload | download | ro_mount | rw_mount | direct | eval_download | eval_mount |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| uri_folder | Input |  | ✓ | ✓ |  | ✓ |  |  |
| uri_file | Input |  | ✓ | ✓ |  | ✓ |  |  |
| mltable | Input |  | ✓ | ✓ |  | ✓ | ✓ | ✓ |
| uri_folder | Output | ✓ |  |  | ✓ |  |  |  |
| uri_file | Output | ✓ |  |  | ✓ |  |  |  |
| mltable | Output | ✓ |  |  | ✓ | ✓ |  |  |

Note

In most cases, we suggest using ro_mount or rw_mount mode. To learn more about modes, see data asset modes.
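
As a minimal sketch, a mode (and optionally a path) can be set per input or output in the pipeline job YAML; the datastore and folder names below are placeholders:

inputs:
  training_data:
    type: uri_folder
    path: azureml://datastores/workspaceblobstore/paths/<my_data_folder>
    mode: ro_mount
outputs:
  model_output:
    type: uri_folder
    mode: rw_mount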

Visual representation in Azure Machine Learning studio

The following screenshots provide an example of how inputs and outputs are displayed in a pipeline job in Azure Machine Learning studio. This particular job, named nyc-taxi-data-regression, can be found in azureml-examples.

In the pipeline job page of studio, the data/model type inputs and outputs of a component are shown as small circles on the corresponding component, known as Input/Output ports. These ports represent the data flow in a pipeline.

The pipeline level output is displayed as a purple box for easy identification.

Screenshot highlighting the pipeline input and output port.

When you hover the mouse over an input/output port, the type is displayed.

Screenshot highlighting the port type when hovering the mouse.

Primitive type inputs aren't displayed on the graph. They can be found in the Settings tab of the pipeline job overview panel (for pipeline level inputs) or the component panel (for component level inputs). The following screenshot shows the Settings tab of a pipeline job; it can be opened by selecting the Job Overview link.

If you want to check inputs for a component, double-click the component to open the component panel.

Screenshot highlighting the job overview setting panel.

Similarly, when editing a pipeline in the designer, you can find the pipeline inputs and outputs in the Pipeline interface panel, and the component inputs and outputs in the component's panel (opened by double-clicking the component).

Screenshot highlighting the pipeline interface in designer.

How to promote component inputs & outputs to pipeline level

Promoting a component's input/output to pipeline level allows you to overwrite the component's input/output when submitting a pipeline job. It's also useful if you want to trigger the pipeline through a REST endpoint.

The following example shows how to promote component inputs/outputs to pipeline level inputs/outputs.

$schema: https://azuremlschemas.azureedge.net/latest/pipelineJob.schema.json
type: pipeline
display_name: 1b_e2e_registered_components
description: E2E dummy train-score-eval pipeline with registered components

inputs:
  pipeline_job_training_max_epocs: 20
  pipeline_job_training_learning_rate: 1.8
  pipeline_job_learning_rate_schedule: 'time-based'

outputs: 
  pipeline_job_trained_model:
    mode: upload
  pipeline_job_scored_data:
    mode: upload
  pipeline_job_evaluation_report:
    mode: upload

settings:
 default_compute: azureml:cpu-cluster

jobs:
  train_job:
    type: command
    component: azureml:my_train@latest
    inputs:
      training_data: 
        type: uri_folder 
        path: ./data      
      max_epocs: ${{parent.inputs.pipeline_job_training_max_epocs}}
      learning_rate: ${{parent.inputs.pipeline_job_training_learning_rate}}
      learning_rate_schedule: ${{parent.inputs.pipeline_job_learning_rate_schedule}}
    outputs:
      model_output: ${{parent.outputs.pipeline_job_trained_model}}
    services:
      my_vscode:
        type: vs_code
      my_jupyter_lab:
        type: jupyter_lab
      my_tensorboard:
        type: tensor_board
        log_dir: "outputs/tblogs"
    #  my_ssh:
    #    type: ssh
    #    ssh_public_keys: <paste the entire pub key content>
    #    nodes: all # Use the `nodes` property to pick which node you want to enable interactive services on. If `nodes` are not selected, by default, interactive applications are only enabled on the head node.

  score_job:
    type: command
    component: azureml:my_score@latest
    inputs:
      model_input: ${{parent.jobs.train_job.outputs.model_output}}
      test_data: 
        type: uri_folder 
        path: ./data
    outputs:
      score_output: ${{parent.outputs.pipeline_job_scored_data}}

  evaluate_job:
    type: command
    component: azureml:my_eval@latest
    inputs:
      scoring_result: ${{parent.jobs.score_job.outputs.score_output}}
    outputs:
      eval_output: ${{parent.outputs.pipeline_job_evaluation_report}}

The full example can be found in train-score-eval pipeline with registered components. This pipeline promotes three inputs and three outputs to pipeline level. Let's take pipeline_job_training_max_epocs as an example. It's declared under the inputs section at the root level, which means it's a pipeline level input. Under the jobs -> train_job section, the input named max_epocs is referenced as ${{parent.inputs.pipeline_job_training_max_epocs}}, which indicates that the train_job's input max_epocs references the pipeline level input pipeline_job_training_max_epocs. Similarly, you can promote a pipeline output using the same schema.

Studio

You can promote a component's input to a pipeline level input on the designer authoring page. Go to the component's settings panel by double-clicking the component -> find the input you'd like to promote -> select the three dots on the right -> select Add to pipeline input.

Screenshot highlighting how to promote to pipeline input in designer.

Optional input

By default, all inputs are required and must be assigned a value (or a default value) each time you submit a pipeline job. However, there may be instances where you need optional inputs. In such cases, you have the flexibility to not assign a value to the input when submitting a pipeline job.

Optional inputs can be useful in the following two scenarios:

  • If you have an optional data/model type input and don't assign a value to it when submitting the pipeline job, there will be a component in the pipeline that lacks a preceding data dependency. In other words, the input port isn't linked to any component or data/model node. This causes the pipeline service to invoke this component directly, instead of waiting for the preceding dependency to be ready.

  • If you set continue_on_step_failure = True for the pipeline and have a second node (node2) that uses the output from the first node (node1) as an optional input, node2 is still executed even if node1 fails. However, if node2 uses a required input from node1, it isn't executed if node1 fails. The following screenshot illustrates this orchestration logic; see also the settings sketch after this list.

    Screenshot to show the orchestration logic of optional input and continue on failure.
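
As a sketch of the second scenario, the continue-on-failure behavior is a pipeline level setting; assuming it's placed under the settings section of the pipeline job YAML, it could look like this:

settings:
  default_compute: azureml:cpu-cluster
  continue_on_step_failure: true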

The following example shows how to define optional inputs.

$schema: https://azuremlschemas.azureedge.net/latest/commandComponent.schema.json
name: train_data_component_cli
display_name: train_data
description: A example train component
tags:
  author: azureml-sdk-team
version: 9
type: command
inputs:
  training_data: 
    type: uri_folder
  max_epocs:
    type: integer
    optional: true
  learning_rate: 
    type: number
    default: 0.01
    optional: true
  learning_rate_schedule: 
    type: string
    default: time-based
    optional: true
outputs:
  model_output:
    type: uri_folder
code: ./train_src
environment: azureml://registries/azureml/environments/sklearn-1.0/labels/latest
command: >-
  python train.py 
  --training_data ${{inputs.training_data}} 
  $[[--max_epocs ${{inputs.max_epocs}}]]
  $[[--learning_rate ${{inputs.learning_rate}}]]
  $[[--learning_rate_schedule ${{inputs.learning_rate_schedule}}]]
  --model_output ${{outputs.model_output}}

When an input is set as optional = true, you need to use $[[]] to wrap the command line section that uses the input. See the highlighted lines in the example above.
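
To illustrate, the optional max_epocs input above has no default value. The sketch below uses hypothetical resolved paths and shows how the wrapped section appears or disappears depending on whether a value is provided:

# when max_epocs is provided (for example 20), the $[[ ]] section is rendered:
python train.py --training_data <resolved_path> --max_epocs 20 ... --model_output <resolved_path>

# when max_epocs is omitted, the $[[--max_epocs ...]] section is dropped from the command:
python train.py --training_data <resolved_path> ... --model_output <resolved_path>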

Note

Optional output is not supported.

In the pipeline graph, optional inputs of the Data/Model type are represented by a dotted circle. Optional inputs of primitive types can be located under the Settings tab. Unlike required inputs, optional inputs don't have an asterisk next to them, signifying that they aren't mandatory.

Screenshot highlighting the optional input.

How to customize output path

By default, the output of a component is stored in azureml://datastores/${{default_datastore}}/paths/${{name}}/${{output_name}}. The {default_datastore} is the default datastore set for the pipeline; if it isn't set, it's the workspace blob storage. The {name} is the job name, which is resolved at job execution time. The {output_name} is the output name defined in the component YAML.

You can also customize where to store the output by defining the path of an output. Here's an example:

The pipeline.yml defines a pipeline that has three pipeline level outputs. The full YAML can be found in the train-score-eval pipeline with registered components example. You can use the following command to set a custom output path for the pipeline_job_trained_model output.

# define the custom output path using datastore uri
# add relative path to your blob container after "azureml://datastores/<datastore_name>/paths"
output_path="azureml://datastores/{datastore_name}/paths/{relative_path_of_container}"  

# create the job and define the output path using --set outputs.<output_name>.path
az ml job create -f ./pipeline.yml --set outputs.pipeline_job_trained_model.path=$output_path  
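
Alternatively, as a sketch with a placeholder datastore name, the custom path can be set directly on the pipeline output in the pipeline YAML instead of on the command line:

outputs:
  pipeline_job_trained_model:
    mode: upload
    path: azureml://datastores/<datastore_name>/paths/<relative_path>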

How to download the output

You can download a component's output or pipeline output by following the examples below.

Download pipeline level output

# Download all the outputs of the job
az ml job download --all -n <JOB_NAME> -g <RESOURCE_GROUP_NAME> -w <WORKSPACE_NAME> --subscription <SUBSCRIPTION_ID>

# Download specific output
az ml job download --output-name <OUTPUT_PORT_NAME> -n <JOB_NAME> -g <RESOURCE_GROUP_NAME> -w <WORKSPACE_NAME> --subscription <SUBSCRIPTION_ID>

Download child job's output

When you need to download the output of a child job (a component output that isn't promoted to pipeline level), you should first list all the child job entities of a pipeline job and then use similar code to download the output.

# List all child jobs in the job and print job details in table format
az ml job list --parent-job-name <JOB_NAME> -g <RESOURCE_GROUP_NAME> -w <WORKSPACE_NAME> --subscription <SUBSCRIPTION_ID> -o table

# Select needed child job name to download output
az ml job download --all -n <JOB_NAME> -g <RESOURCE_GROUP_NAME> -w <WORKSPACE_NAME> --subscription <SUBSCRIPTION_ID>

How to register output as named asset

You can register the output of a component or pipeline as a named asset by assigning a name and version to the output. The registered asset can be listed in your workspace through the studio UI, CLI, or SDK and can be referenced in future jobs.

Register pipeline output

display_name: register_pipeline_output
type: pipeline
jobs:
  node:
    type: command
    inputs:
      component_in_path:
        type: uri_file
        path: https://dprepdata.blob.core.chinacloudapi.cn/demo/Titanic.csv
    component: ../components/helloworld_component.yml
    outputs:
      component_out_path: ${{parent.outputs.component_out_path}}
outputs:
  component_out_path:
    type: mltable
    name: pipeline_output  # Define name and version to register pipeline output
    version: '1'
settings:
  default_compute: azureml:cpu-cluster

Register a child job's output

display_name: register_node_output
type: pipeline
jobs:
  node:
    type: command
    component: ../components/helloworld_component.yml
    inputs:
      component_in_path:
        type: uri_file
        path: 'https://dprepdata.blob.core.chinacloudapi.cn/demo/Titanic.csv'
    outputs:
      component_out_path:
        type: uri_folder
        name: 'node_output'  # Define name and version to register a child job's output
        version: '1'
settings:
  default_compute: azureml:cpu-cluster
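
After the pipeline job completes, the registered output shows up as a data asset in your workspace. As a sketch that assumes the names and versions used in the examples above, you can inspect it with the CLI:

# list data assets in the workspace
az ml data list -g <RESOURCE_GROUP_NAME> -w <WORKSPACE_NAME>

# show the asset registered from the pipeline output
az ml data show --name pipeline_output --version 1 -g <RESOURCE_GROUP_NAME> -w <WORKSPACE_NAME>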

Next steps