Tumor/Normal pipeline

The Azure Databricks tumor/normal pipeline is a GATK best practices-compliant pipeline for short read alignment and somatic variant calling using the MuTect2 variant caller.

Walkthrough

The pipeline consists of the following steps:

  1. Normal sample alignment using BWA-MEM.
  2. Tumor sample alignment using BWA-MEM.
  3. Variant calling with MuTect2.

Setup

The pipeline runs as an Azure Databricks job. You can set up a cluster policy to save the configuration:

{
  "num_workers": {
    "type": "unlimited",
    "defaultValue": 13
  },
  "node_type_id": {
    "type": "unlimited",
    "defaultValue": "Standard_F32s_v2"
  },
  "spark_env_vars.refGenomeId": {
    "type": "unlimited",
    "defaultValue": "grch38"
  },
  "spark_version": {
    "type": "regex",
    "pattern": ".*-hls.*",
    "defaultValue": "7.0.x-hls-scala2.12"
  }
}
  • The cluster configuration should use Databricks Runtime for Genomics.
  • The task should be the tumor/normal notebook found at the bottom of this page.
  • For best performance, use compute-optimized VMs with at least 60 GB of memory. We recommend Standard_F32s_v2 VMs.
  • If you're running base quality score recalibration, use general purpose (Standard_D32s_v3) instances instead, since this operation requires more memory.
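
The spark_version rule in the policy above restricts clusters to runtime versions whose name contains "-hls". The following minimal sketch shows how such a pattern behaves. It assumes the policy regex is matched against the whole version string; the helper name is_genomics_runtime and the second version string are illustrative only.

```python
import re

# Pattern copied from the cluster policy above.
SPARK_VERSION_PATTERN = ".*-hls.*"


def is_genomics_runtime(spark_version: str) -> bool:
    """Return True if the whole version string matches the policy pattern."""
    return re.fullmatch(SPARK_VERSION_PATTERN, spark_version) is not None


# The policy default matches; a version string without "-hls" does not.
print(is_genomics_runtime("7.0.x-hls-scala2.12"))
print(is_genomics_runtime("7.0.x-scala2.12"))
```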

Parameters

The pipeline accepts parameters that control its behavior. The most important and commonly changed parameters are documented here. To view all available parameters and their usage information, run the first cell of the pipeline notebook. New parameters are added regularly. Parameters can be set for all runs or per run.

Parameter | Default | Description
manifest | n/a | The manifest describing the input.
output | n/a | The path where pipeline output should be written.
replayMode | skip | If skip, stages are skipped if their output already exists. If overwrite, existing output is deleted.
exportVCF | false | If true, the pipeline writes results to a VCF file as well as Delta.
perSampleTimeout | 12h | A timeout applied per sample. After reaching this timeout, the pipeline continues on to the next sample. The value of this parameter must include a timeout unit: 's' for seconds, 'm' for minutes, or 'h' for hours. For example, '60m' results in a timeout of 60 minutes.
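
To make the perSampleTimeout format concrete, here is a minimal sketch that converts a value such as '60m' into seconds. It is not the pipeline's own parser. The function name parse_timeout_seconds is hypothetical, and the sketch assumes a whole number followed by exactly one unit letter.

```python
import re

# Seconds per unit, following the units listed for perSampleTimeout.
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}


def parse_timeout_seconds(value: str) -> int:
    """Convert a timeout such as '60m' or '12h' into seconds.

    Assumption: the value is a whole number followed by one of s, m, or h.
    """
    match = re.fullmatch(r"(\d+)([smh])", value.strip())
    if match is None:
        raise ValueError(f"Timeout must look like '30s', '60m' or '12h': {value!r}")
    amount, unit = match.groups()
    return int(amount) * _UNIT_SECONDS[unit]


print(parse_timeout_seconds("60m"))  # 60 minutes expressed in seconds
print(parse_timeout_seconds("12h"))  # the documented default
```

A value without a unit, such as '60', raises ValueError in this sketch, mirroring the requirement that the unit must be present.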

Tip

To optimize runtime, set spark.sql.shuffle.partitions in the Spark config to three times the number of cores in the cluster.
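
As a worked example of this rule, the sketch below computes the value for the cluster described in the policy above: 13 workers of type Standard_F32s_v2, which has 32 vCPUs each. It assumes that "cores of the cluster" means worker cores only; the function name is hypothetical.

```python
def recommended_shuffle_partitions(num_workers: int, cores_per_worker: int) -> int:
    """Return three times the total number of worker cores."""
    return 3 * num_workers * cores_per_worker


# 13 workers x 32 vCPUs (Standard_F32s_v2), as in the example policy.
partitions = recommended_shuffle_partitions(num_workers=13, cores_per_worker=32)

# Line to place in the cluster's Spark config.
print(f"spark.sql.shuffle.partitions {partitions}")  # spark.sql.shuffle.partitions 1248
```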

Reference genomes

You must configure the reference genome using an environment variable. To use GRCh37, set an environment variable like this:

refGenomeId=grch37

To use GRCh38, change grch37 to grch38.

To use a custom reference genome, see instructions in Custom reference genomes.

Manifest format

Note

Manifest blobs are supported in Databricks Runtime 6.6 for Genomics and above.

The manifest is a CSV file or blob describing where to find the input FASTQ or BAM files. An example:

pair_id,file_path,sample_id,label,paired_end,read_group_id
HG001,*_R1_*.normal.fastq.bgz,HG001_normal,normal,1,read_group_normal
HG001,*_R2_*.normal.fastq.bgz,HG001_normal,normal,2,read_group_normal
HG001,*_R1_*.tumor.fastq.bgz,HG001_tumor,tumor,1,read_group_tumor
HG001,*_R2_*.tumor.fastq.bgz,HG001_tumor,tumor,2,read_group_tumor

If your input consists of unaligned BAM files, leave the paired_end field empty:

pair_id,file_path,sample_id,label,paired_end,read_group_id
HG001,*.normal.bam,HG001_normal,normal,,read_group_normal
HG001,*.tumor.bam,HG001_tumor,tumor,,read_group_tumor

The tumor and normal samples for a given individual are grouped by the pair_id field. Within a pair, the tumor and normal samples must have different sample names and different read group names.
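
The pairing rule can be checked before submitting a job. The sketch below is a minimal illustration, not part of the pipeline: it assumes the column names shown in the manifest header, and the helper find_pair_conflicts and the sample manifest text are hypothetical.

```python
import csv
import io
from collections import defaultdict


def find_pair_conflicts(manifest_text: str) -> list:
    """Return messages for pairs whose tumor and normal rows share a
    sample_id or a read_group_id."""
    # pair_id -> label ("tumor" or "normal") -> set of values seen
    samples = defaultdict(lambda: defaultdict(set))
    read_groups = defaultdict(lambda: defaultdict(set))
    for row in csv.DictReader(io.StringIO(manifest_text)):
        samples[row["pair_id"]][row["label"]].add(row["sample_id"])
        read_groups[row["pair_id"]][row["label"]].add(row["read_group_id"])

    problems = []
    for pair_id in samples:
        if samples[pair_id]["tumor"] & samples[pair_id]["normal"]:
            problems.append(f"{pair_id}: tumor and normal share a sample_id")
        if read_groups[pair_id]["tumor"] & read_groups[pair_id]["normal"]:
            problems.append(f"{pair_id}: tumor and normal share a read_group_id")
    return problems


# Hypothetical manifest in which both samples reuse the same read group.
manifest = """pair_id,file_path,sample_id,label,paired_end,read_group_id
HG001,*.normal.bam,HG001_normal,normal,,rg1
HG001,*.tumor.bam,HG001_tumor,tumor,,rg1
"""
print(find_pair_conflicts(manifest))
```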

Tip

If the provided manifest is a file, the file_path field in each row may be an absolute path or a path relative to the manifest file. If the provided manifest is a blob, the file_path field must be an absolute path. You can include globs (*) to match many files.
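
The relative-path rule can be pictured with a small sketch. This is only an illustration of how a relative file_path relates to the manifest's location, not the pipeline's actual resolution logic. The helper resolve_file_path and the example paths are hypothetical, and the sketch assumes POSIX-style paths, treating anything containing "://" as already absolute.

```python
import posixpath


def resolve_file_path(manifest_path: str, file_path: str) -> str:
    """Resolve a manifest row's file_path against the manifest's directory.

    Absolute paths and URIs with a scheme are returned unchanged.
    Glob characters such as * are kept as-is.
    """
    if "://" in file_path or posixpath.isabs(file_path):
        return file_path
    return posixpath.join(posixpath.dirname(manifest_path), file_path)


# Hypothetical locations, for illustration only.
print(resolve_file_path("/mnt/genomics/run1/manifest.csv", "fastq/*_R1_*.normal.fastq.bgz"))
print(resolve_file_path("/mnt/genomics/run1/manifest.csv", "/mnt/shared/*.tumor.bam"))
```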

Additional usage info and troubleshooting

The tumor/normal pipeline shares many operational details with the other Azure Databricks pipelines. For more detailed usage information, such as output format structure, tips for running programmatically, steps for setting up custom reference genomes, and common issues, see DNASeq pipeline.

Note

The pipeline was renamed from TNSeq to MutSeq in Databricks Runtime 7.3 for Genomics and above.

MutSeq pipeline notebook

Get notebook

TNSeq pipeline notebook (Legacy)

Get notebook