DevOps for a data ingestion pipeline

In most scenarios, a data ingestion solution is a composition of scripts, service invocations, and a pipeline orchestrating all the activities. In this article, you learn how to apply DevOps practices to the development lifecycle of a common data ingestion pipeline that prepares data for machine learning model training. The pipeline is built using the following Azure services:

  • Azure Data Factory: Reads the raw data and orchestrates data preparation.

  • Azure Pipelines: Automates a continuous integration and delivery process.

Data ingestion pipeline workflow

The data ingestion pipeline implements the following workflow:

  1. Raw data is read into an Azure Data Factory (ADF) pipeline.
  2. The ADF pipeline sends the data to an Azure Databricks cluster, which runs a Python notebook to transform the data (a minimal sketch of such a notebook follows this list).
  3. The data is stored to a blob container, where it can be used by Azure Machine Learning to train a model.
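
The following is a minimal, hypothetical sketch of such a transformation notebook. It assumes the raw and prepared data live in blob containers mounted to DBFS; the mount points, file name, and transformation logic are illustrative only, not taken from this article.

# Hypothetical Databricks notebook: read raw data, transform it, and write the
# prepared data to a blob-backed location. All paths and names are illustrative.
import pandas as pd

# Parameter passed in by the ADF Databricks Notebook activity (see later sections).
data_file_name = getArgument("data_file_name")

# Assumed DBFS mount of the raw-data blob container.
raw = pd.read_csv(f"/dbfs/mnt/rawdata/{data_file_name}")

# Example transformation: drop incomplete rows.
prepared = raw.dropna()

# Assumed DBFS mount of the container that Azure Machine Learning reads from.
prepared.to_csv(f"/dbfs/mnt/prepareddata/{data_file_name}", index=False)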

(Diagram: data ingestion pipeline workflow)

Continuous integration and delivery overview

As with many software solutions, there is a team (for example, Data Engineers) working on the solution. They collaborate and share the same Azure resources such as Azure Data Factory, Azure Databricks, and Azure Storage accounts. The collection of these resources is a Development environment. The data engineers contribute to the same source code base.

A continuous integration and delivery system automates the process of building, testing, and delivering (deploying) the solution. The Continuous Integration (CI) process performs the following tasks:

  • Assembles the code
  • Checks it with the code quality tests
  • Runs unit tests
  • Produces artifacts such as tested code and Azure Resource Manager templates

The Continuous Delivery (CD) process deploys the artifacts to the downstream environments.

(Diagram: CI/CD flow for data ingestion)

This article demonstrates how to automate the CI and CD processes with Azure Pipelines.

Source control management

Source control management is needed to track changes and enable collaboration between team members. For example, the code would be stored in an Azure DevOps, GitHub, or GitLab repository. The collaboration workflow is based on a branching model, for example, GitFlow.

Python Notebook Source Code

The data engineers work with the Python notebook source code either locally in an IDE (for example, Visual Studio Code) or directly in the Databricks workspace. Once the code changes are complete, they are merged to the repository following a branching policy.

Tip

We recommend storing the code in .py files rather than in .ipynb Jupyter notebook format. It improves the code readability and enables automatic code quality checks in the CI process.
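
For reference, Databricks represents a notebook saved as a .py file in a plain "source" format with comment markers, so it can be linted and unit tested like ordinary Python. A minimal, hypothetical example (the file contents are illustrative, not from this article):

# Databricks notebook source
# The marker above identifies this .py file as a Databricks notebook.
import pandas as pd

# COMMAND ----------
# Each "COMMAND" marker separates notebook cells.

def drop_incomplete_rows(df: pd.DataFrame) -> pd.DataFrame:
    """Keeping logic in small functions like this makes it easy to unit test in CI."""
    return df.dropna()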

Azure Data Factory Source Code

The source code of Azure Data Factory pipelines is a collection of JSON files generated by an Azure Data Factory workspace. Normally the data engineers work with a visual designer in the Azure Data Factory workspace rather than with the source code files directly.

Continuous integration (CI)

The ultimate goal of the Continuous Integration process is to gather the joint team work from the source code and prepare it for the deployment to the downstream environments. As with source control management, this process is different for the Python notebooks and Azure Data Factory pipelines.

Python Notebook CI

The CI process for the Python Notebooks gets the code from the collaboration branch (for example, master or develop) and performs the following activities:

  • Code linting
  • Unit testing
  • Saving the code as an artifact

The following code snippet demonstrates the implementation of these steps in an Azure DevOps yaml pipeline:

steps:
- script: |
   flake8 --output-file=$(Build.BinariesDirectory)/lint-testresults.xml --format junit-xml  
  workingDirectory: '$(Build.SourcesDirectory)'
  displayName: 'Run flake8 (code style analysis)'  
  
- script: |
   python -m pytest --junitxml=$(Build.BinariesDirectory)/unit-testresults.xml $(Build.SourcesDirectory)
  displayName: 'Run unit tests'

- task: PublishTestResults@2
  condition: succeededOrFailed()
  inputs:
    testResultsFiles: '$(Build.BinariesDirectory)/*-testresults.xml'
    testRunTitle: 'Linting & Unit tests'
    failTaskOnFailedTests: true
  displayName: 'Publish linting and unit test results'

- publish: $(Build.SourcesDirectory)
  artifact: di-notebooks

The pipeline uses flake8 to do the Python code linting. It runs the unit tests defined in the source code and publishes the linting and test results so they're available in the Azure Pipeline execution screen:

(Screenshot: linting and unit test results)
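
As an illustration of the kind of test the pytest step would collect, here is a hypothetical unit test for a small transformation function kept in one of the notebook .py files (the module path and function name are assumptions, not from this article):

# test_dataingestion.py -- hypothetical unit test collected by the CI pytest step.
import pandas as pd

# Assumed import path; in practice it matches wherever the notebook .py files live.
from dataingestion.transform import drop_incomplete_rows


def test_drop_incomplete_rows_removes_rows_with_missing_values():
    raw = pd.DataFrame({"feature": [1.0, None, 3.0], "target": [0, 1, 0]})

    prepared = drop_incomplete_rows(raw)

    # Only the two complete rows should remain.
    assert len(prepared) == 2
    assert prepared["feature"].notna().all()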

If the linting and unit testing is successful, the pipeline will copy the source code to the artifact repository to be used by the subsequent deployment steps.

Azure Data Factory CI

The CI process for an Azure Data Factory pipeline is a bottleneck for a data ingestion pipeline. There's no continuous integration. A deployable artifact for Azure Data Factory is a collection of Azure Resource Manager templates. The only way to produce those templates is to click the publish button in the Azure Data Factory workspace.

  1. The data engineers merge the source code from their feature branches into the collaboration branch, for example, master or develop.
  2. Someone with the granted permissions clicks the publish button to generate Azure Resource Manager templates from the source code in the collaboration branch.
  3. The workspace validates the pipelines (think of it as linting and unit testing), generates Azure Resource Manager templates (think of it as building), and saves the generated templates to a technical branch adf_publish in the same code repository (think of it as publishing artifacts). This branch is created automatically by the Azure Data Factory workspace.

For more information on this process, see Continuous integration and delivery in Azure Data Factory.

It's important to make sure that the generated Azure Resource Manager templates are environment agnostic. This means that all values that may differ between environments are parametrized. Azure Data Factory is smart enough to expose the majority of such values as parameters. For example, in the following template the connection properties to an Azure Machine Learning workspace are exposed as parameters:

{
    "$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentParameters.json#",
    "contentVersion": "1.0.0.0",
    "parameters": {
        "factoryName": {
            "value": "devops-ds-adf"
        },
        "AzureMLService_servicePrincipalKey": {
            "value": ""
        },
        "AzureMLService_properties_typeProperties_subscriptionId": {
            "value": "0fe1c235-5cfa-4152-17d7-5dff45a8d4ba"
        },
        "AzureMLService_properties_typeProperties_resourceGroupName": {
            "value": "devops-ds-rg"
        },
        "AzureMLService_properties_typeProperties_servicePrincipalId": {
            "value": "6e35e589-3b22-4edb-89d0-2ab7fc08d488"
        },
        "AzureMLService_properties_typeProperties_tenant": {
            "value": "72f988bf-86f1-41af-912b-2d7cd611db47"
        }
    }
}

However, you may want to expose your custom properties that are not handled by the Azure Data Factory workspace by default. In the scenario of this article, an Azure Data Factory pipeline invokes a Python notebook processing the data. The notebook accepts a parameter with the name of an input data file.

import pandas as pd
import numpy as np

data_file_name = getArgument("data_file_name")
data = pd.read_csv(data_file_name)

labels = np.array(data['target'])
...

This name is different for Dev, QA, UAT, and PROD environments. In a complex pipeline with multiple activities, there can be several custom properties. It's good practice to collect all those values in one place and define them as pipeline variables:

(Screenshot: pipeline variables in the Azure Data Factory workspace)

The pipeline activities may refer to the pipeline variables while actually using them:

(Screenshot: notebook activity parameters referring to the pipeline variables)

The Azure Data Factory workspace doesn't expose pipeline variables as Azure Resource Manager template parameters by default. The workspace uses the Default Parameterization Template, which dictates what pipeline properties should be exposed as Azure Resource Manager template parameters. To add pipeline variables to the list, update the "Microsoft.DataFactory/factories/pipelines" section of the Default Parameterization Template with the following snippet and place the resulting JSON file (arm-template-parameters-definition.json) in the root of the source folder:

"Microsoft.DataFactory/factories/pipelines": {
        "properties": {
            "variables": {
                "*": {
                    "defaultValue": "="
                }
            }
        }
    }

Doing so will force the Azure Data Factory workspace to add the variables to the parameters list when the publish button is clicked:

{
    "$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentParameters.json#",
    "contentVersion": "1.0.0.0",
    "parameters": {
        "factoryName": {
            "value": "devops-ds-adf"
        },
        ...
        "data-ingestion-pipeline_properties_variables_data_file_name_defaultValue": {
            "value": "driver_prediction_train.csv"
        }        
    }
}

The values in the JSON file are default values configured in the pipeline definition. They're expected to be overridden with the target environment values when the Azure Resource Manager template is deployed.

Continuous delivery (CD)

The Continuous Delivery process takes the artifacts and deploys them to the first target environment. It makes sure that the solution works by running tests. If successful, it continues to the next environment.

The CD Azure Pipeline consists of multiple stages representing the environments. Each stage contains deployments and jobs that perform the following steps:

  • Deploy a Python Notebook to Azure Databricks workspace
  • Deploy an Azure Data Factory pipeline
  • Run the pipeline
  • Check the data ingestion result

The pipeline stages can be configured with approvals and gates that provide additional control on how the deployment process evolves through the chain of environments.

Deploy a Python Notebook

The following code snippet defines an Azure Pipeline deployment that copies a Python notebook to a Databricks cluster:

- stage: 'Deploy_to_QA'
  displayName: 'Deploy to QA'
  variables:
  - group: devops-ds-qa-vg
  jobs:
  - deployment: "Deploy_to_Databricks"
    displayName: 'Deploy to Databricks'
    timeoutInMinutes: 0
    environment: qa
    strategy:
      runOnce:
        deploy:
          steps:
            - task: UsePythonVersion@0
              inputs:
                versionSpec: '3.x'
                addToPath: true
                architecture: 'x64'
              displayName: 'Use Python3'

            - task: configuredatabricks@0
              inputs:
                url: '$(DATABRICKS_URL)'
                token: '$(DATABRICKS_TOKEN)'
              displayName: 'Configure Databricks CLI'    

            - task: deploynotebooks@0
              inputs:
                notebooksFolderPath: '$(Pipeline.Workspace)/di-notebooks'
                workspaceFolder: '/Shared/devops-ds'
              displayName: 'Deploy (copy) data processing notebook to the Databricks cluster'       

The artifacts produced by the CI are automatically copied to the deployment agent and are available in the $(Pipeline.Workspace) folder. In this case, the deployment task refers to the di-notebooks artifact containing the Python notebook. This deployment uses the Databricks Azure DevOps extension to copy the notebook files to the Databricks workspace.

The Deploy_to_QA stage contains a reference to the devops-ds-qa-vg variable group defined in the Azure DevOps project. The steps in this stage refer to the variables from this variable group (for example, $(DATABRICKS_URL) and $(DATABRICKS_TOKEN)). The idea is that the next stage (for example, Deploy_to_UAT) will operate with the same variable names defined in its own UAT-scoped variable group.

Deploy an Azure Data Factory pipeline

A deployable artifact for Azure Data Factory is an Azure Resource Manager template. It's deployed with the Azure Resource Group Deployment task, as demonstrated in the following snippet:

  - deployment: "Deploy_to_ADF"
    displayName: 'Deploy to ADF'
    timeoutInMinutes: 0
    environment: qa
    strategy:
      runOnce:
        deploy:
          steps:
            - task: AzureResourceGroupDeployment@2
              displayName: 'Deploy ADF resources'
              inputs:
                azureSubscription: $(AZURE_RM_CONNECTION)
                resourceGroupName: $(RESOURCE_GROUP)
                location: $(LOCATION)
                csmFile: '$(Pipeline.Workspace)/adf-pipelines/ARMTemplateForFactory.json'
                csmParametersFile: '$(Pipeline.Workspace)/adf-pipelines/ARMTemplateParametersForFactory.json'
                overrideParameters: -data-ingestion-pipeline_properties_variables_data_file_name_defaultValue "$(DATA_FILE_NAME)"

The value of the data filename parameter comes from the $(DATA_FILE_NAME) variable defined in a QA stage variable group. Similarly, all parameters defined in ARMTemplateForFactory.json can be overridden. If they are not, then the default values are used.

Run the pipeline and check the data ingestion result

The next step is to make sure that the deployed solution is working. The following job definition runs an Azure Data Factory pipeline with a PowerShell script and executes a Python notebook on an Azure Databricks cluster. The notebook checks if the data has been ingested correctly and validates the result data file named $(bin_FILE_NAME).
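
The Invoke-ADFPipeline.ps1 script itself isn't listed in this article; conceptually it triggers a pipeline run and waits for it to reach a terminal state. As a rough, non-authoritative illustration of that logic, here is a sketch using the azure-mgmt-datafactory Python SDK instead of Azure PowerShell (subscription, resource group, factory, and pipeline names are placeholders):

# Hypothetical sketch: trigger an ADF pipeline run and poll until it finishes.
# The article's actual script does the equivalent with Azure PowerShell cmdlets.
import time

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

subscription_id = "<subscription-id>"   # placeholder
resource_group = "<resource-group>"     # placeholder, e.g. $(RESOURCE_GROUP)
factory_name = "<data-factory-name>"    # placeholder, e.g. $(DATA_FACTORY_NAME)
pipeline_name = "<pipeline-name>"       # placeholder, e.g. $(PIPELINE_NAME)

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), subscription_id)

# Start the pipeline run.
run = adf_client.pipelines.create_run(resource_group, factory_name, pipeline_name)

# Poll the run status until it leaves the queued/in-progress states.
while True:
    pipeline_run = adf_client.pipeline_runs.get(resource_group, factory_name, run.run_id)
    if pipeline_run.status not in ("Queued", "InProgress"):
        break
    time.sleep(30)

if pipeline_run.status != "Succeeded":
    raise RuntimeError(f"ADF pipeline run finished with status {pipeline_run.status}")

The Azure Pipelines job that invokes the script, runs the test notebook, and waits for its result is defined as follows: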

  - job: "Integration_test_job"
    displayName: "Integration test job"
    dependsOn: [Deploy_to_Databricks, Deploy_to_ADF]
    pool:
      vmImage: 'ubuntu-latest'
    timeoutInMinutes: 0
    steps:
    - task: AzurePowerShell@4
      displayName: 'Execute ADF Pipeline'
      inputs:
        azureSubscription: $(AZURE_RM_CONNECTION)
        ScriptPath: '$(Build.SourcesDirectory)/adf/utils/Invoke-ADFPipeline.ps1'
        ScriptArguments: '-ResourceGroupName $(RESOURCE_GROUP) -DataFactoryName $(DATA_FACTORY_NAME) -PipelineName $(PIPELINE_NAME)'
        azurePowerShellVersion: LatestVersion
    - task: UsePythonVersion@0
      inputs:
        versionSpec: '3.x'
        addToPath: true
        architecture: 'x64'
      displayName: 'Use Python3'

    - task: configuredatabricks@0
      inputs:
        url: '$(DATABRICKS_URL)'
        token: '$(DATABRICKS_TOKEN)'
      displayName: 'Configure Databricks CLI'    

    - task: executenotebook@0
      inputs:
        notebookPath: '/Shared/devops-ds/test-data-ingestion'
        existingClusterId: '$(DATABRICKS_CLUSTER_ID)'
        executionParams: '{"bin_file_name":"$(bin_FILE_NAME)"}'
      displayName: 'Test data ingestion'

    - task: waitexecution@0
      displayName: 'Wait until the testing is done'

The final task in the job checks the result of the notebook execution. If it returns an error, it sets the status of pipeline execution to failed.
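
The test-data-ingestion notebook isn't listed in this article either. A minimal sketch of what such a check might look like, assuming the prepared file lands in a DBFS-mounted blob container and a failure is signaled by an unhandled exception (the mount path and column name are assumptions):

# Hypothetical integration-test notebook executed on the Databricks cluster.
import pandas as pd

# Name of the expected result file, passed in through executionParams.
bin_file_name = getArgument("bin_file_name")

# Assumed DBFS mount of the blob container that the ADF pipeline writes to.
result = pd.read_csv(f"/dbfs/mnt/prepareddata/{bin_file_name}")

# Basic sanity checks; an unhandled exception fails the notebook run, which the
# waitexecution task then reports back to Azure Pipelines.
assert len(result) > 0, "Ingested data file is empty"
assert "target" in result.columns, "Expected 'target' column is missing"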

Putting pieces together

The complete CI/CD Azure Pipeline consists of the following stages:

  • CI
  • Deploy to QA
    • Deploy to Databricks + Deploy to ADF
    • Integration Test

It contains a number of Deploy stages equal to the number of target environments you have. Each Deploy stage contains two deployments that run in parallel and a job that runs after the deployments to test the solution on the environment.

A sample implementation of the pipeline is assembled in the following yaml snippet:

variables:
- group: devops-ds-vg

stages:
- stage: 'CI'
  displayName: 'CI'
  jobs:
  - job: "CI_Job"
    displayName: "CI Job"
    pool:
      vmImage: 'ubuntu-latest'
    timeoutInMinutes: 0
    steps:
    - task: UsePythonVersion@0
      inputs:
        versionSpec: '3.x'
        addToPath: true
        architecture: 'x64'
      displayName: 'Use Python3'
    - script: pip install --upgrade flake8 flake8_formatter_junit_xml
      displayName: 'Install flake8'
    - checkout: self
    - script: |
       flake8 --output-file=$(Build.BinariesDirectory)/lint-testresults.xml --format junit-xml
      workingDirectory: '$(Build.SourcesDirectory)'
      displayName: 'Run flake8 (code style analysis)'
    - script: |
       python -m pytest --junitxml=$(Build.BinariesDirectory)/unit-testresults.xml $(Build.SourcesDirectory)
      displayName: 'Run unit tests'
    - task: PublishTestResults@2
      condition: succeededOrFailed()
      inputs:
        testResultsFiles: '$(Build.BinariesDirectory)/*-testresults.xml'
        testRunTitle: 'Linting & Unit tests'
        failTaskOnFailedTests: true
      displayName: 'Publish linting and unit test results'

    # The CI stage produces two artifacts (notebooks and ADF pipelines).
    # The pipelines Azure Resource Manager templates are stored in a technical branch "adf_publish"
    - publish: $(Build.SourcesDirectory)/$(Build.Repository.Name)/code/dataingestion
      artifact: di-notebooks
    - checkout: git://${{variables['System.TeamProject']}}@adf_publish    
    - publish: $(Build.SourcesDirectory)/$(Build.Repository.Name)/devops-ds-adf
      artifact: adf-pipelines

- stage: 'Deploy_to_QA'
  displayName: 'Deploy to QA'
  variables:
  - group: devops-ds-qa-vg
  jobs:
  - deployment: "Deploy_to_Databricks"
    displayName: 'Deploy to Databricks'
    timeoutInMinutes: 0
    environment: qa
    strategy:
      runOnce:
        deploy:
          steps:
            - task: UsePythonVersion@0
              inputs:
                versionSpec: '3.x'
                addToPath: true
                architecture: 'x64'
              displayName: 'Use Python3'

            - task: configuredatabricks@0
              inputs:
                url: '$(DATABRICKS_URL)'
                token: '$(DATABRICKS_TOKEN)'
              displayName: 'Configure Databricks CLI'    

            - task: deploynotebooks@0
              inputs:
                notebooksFolderPath: '$(Pipeline.Workspace)/di-notebooks'
                workspaceFolder: '/Shared/devops-ds'
              displayName: 'Deploy (copy) data processing notebook to the Databricks cluster'             
  - deployment: "Deploy_to_ADF"
    displayName: 'Deploy to ADF'
    timeoutInMinutes: 0
    environment: qa
    strategy:
      runOnce:
        deploy:
          steps:
            - task: AzureResourceGroupDeployment@2
              displayName: 'Deploy ADF resources'
              inputs:
                azureSubscription: $(AZURE_RM_CONNECTION)
                resourceGroupName: $(RESOURCE_GROUP)
                location: $(LOCATION)
                csmFile: '$(Pipeline.Workspace)/adf-pipelines/ARMTemplateForFactory.json'
                csmParametersFile: '$(Pipeline.Workspace)/adf-pipelines/ARMTemplateParametersForFactory.json'
                overrideParameters: -data-ingestion-pipeline_properties_variables_data_file_name_defaultValue "$(DATA_FILE_NAME)"
  - job: "Integration_test_job"
    displayName: "Integration test job"
    dependsOn: [Deploy_to_Databricks, Deploy_to_ADF]
    pool:
      vmImage: 'ubuntu-latest'
    timeoutInMinutes: 0
    steps:
    - task: AzurePowerShell@4
      displayName: 'Execute ADF Pipeline'
      inputs:
        azureSubscription: $(AZURE_RM_CONNECTION)
        ScriptPath: '$(Build.SourcesDirectory)/adf/utils/Invoke-ADFPipeline.ps1'
        ScriptArguments: '-ResourceGroupName $(RESOURCE_GROUP) -DataFactoryName $(DATA_FACTORY_NAME) -PipelineName $(PIPELINE_NAME)'
        azurePowerShellVersion: LatestVersion
    - task: UsePythonVersion@0
      inputs:
        versionSpec: '3.x'
        addToPath: true
        architecture: 'x64'
      displayName: 'Use Python3'

    - task: configuredatabricks@0
      inputs:
        url: '$(DATABRICKS_URL)'
        token: '$(DATABRICKS_TOKEN)'
      displayName: 'Configure Databricks CLI'    

    - task: executenotebook@0
      inputs:
        notebookPath: '/Shared/devops-ds/test-data-ingestion'
        existingClusterId: '$(DATABRICKS_CLUSTER_ID)'
        executionParams: '{"bin_file_name":"$(bin_FILE_NAME)"}'
      displayName: 'Test data ingestion'

    - task: waitexecution@0
      displayName: 'Wait until the testing is done'                

Next steps