Joint genotyping pipeline

The Azure Databricks joint genotyping pipeline is a GATK best practices compliant pipeline for joint genotyping using GenotypeGVCFs.

Walkthrough

The pipeline typically consists of the following steps:

  1. Ingest variants into Delta Lake.
  2. Joint-call the cohort with GenotypeGVCFs.

During variant ingest, single-sample gVCFs are processed in batches and the rows are stored in Delta Lake to provide fault tolerance, fast querying, and incremental joint genotyping. In the joint genotyping step, the gVCF rows are read from Delta Lake, split into bins, and distributed to partitions. For each variant site, the relevant gVCF rows for each sample are identified and used for regenotyping.
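The binning step can be pictured with a small, plain-Python sketch. It is only an illustration, not the pipeline's implementation: the bin width, the row fields (`sample`, `contig`, `start`, `end`), and the choice to copy a row into every bin it overlaps are all assumptions made for this example.

```python
from collections import defaultdict

BIN_SIZE = 5000  # hypothetical bin width in base pairs, not the pipeline's actual setting


def bins_for_row(contig, start, end, bin_size=BIN_SIZE):
    """Yield (contig, bin_index) for every bin a row overlaps (1-based, inclusive coordinates)."""
    first = (start - 1) // bin_size
    last = (end - 1) // bin_size
    for index in range(first, last + 1):
        yield (contig, index)


def group_rows_by_bin(rows, bin_size=BIN_SIZE):
    """Group gVCF-like rows by bin; a row spanning several bins is copied into each of them."""
    grouped = defaultdict(list)
    for row in rows:
        for key in bins_for_row(row["contig"], row["start"], row["end"], bin_size):
            grouped[key].append(row)
    return dict(grouped)


rows = [
    {"sample": "s1", "contig": "chr1", "start": 100, "end": 100},
    # A reference block that crosses the boundary between bin 0 and bin 1.
    {"sample": "s2", "contig": "chr1", "start": 4990, "end": 5010},
]
binned = group_rows_by_bin(rows)
```

Copying a spanning row into each overlapping bin lets every bin be processed on its own partition while still seeing all rows relevant to its sites.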

Setup

The pipeline runs as an Azure Databricks job. An Azure Databricks solutions architect will most likely work with you to set up the initial job. The necessary details are summarized in the following cluster policy and the notes after it:

{
  "autoscale.min_workers": {
    "type": "unlimited",
    "defaultValue": 1
  },
  "autoscale.max_workers": {
    "type": "unlimited",
    "defaultValue": 25
  },
  "node_type_id": {
    "type": "unlimited",
    "defaultValue": "Standard_L32s_v2"
  },
  "spark_env_vars.refGenomeId": {
    "type": "unlimited",
    "defaultValue": "grch38"
  },
  "spark_version": {
    "type": "regex",
    "pattern": ".*-hls.*",
    "defaultValue": "7.0.x-hls-scala2.12"
  }
}
  • The cluster configuration should use Databricks Runtime for Genomics.
  • The task should be the joint genotyping pipeline notebook found at the bottom of this page.
  • For best performance, use storage-optimized VMs. We recommend Standard_L32s_v2.
  • To reduce costs, enable autoscaling with a minimum of 1 worker and a maximum of 10-50 workers, depending on latency requirements.
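As a rough illustration of how these settings fit together, the following sketch builds a job definition shaped like a Databricks Jobs API create request. The job name, notebook path, and storage paths are hypothetical placeholders, and the field layout should be checked against the Jobs API version you use.

```python
import json

# Values for the cluster mirror the defaults in the policy above.
job_spec = {
    "name": "joint-genotyping",  # hypothetical job name
    "new_cluster": {
        "spark_version": "7.0.x-hls-scala2.12",  # Databricks Runtime for Genomics
        "node_type_id": "Standard_L32s_v2",      # storage-optimized VM
        "autoscale": {"min_workers": 1, "max_workers": 25},
        "spark_env_vars": {"refGenomeId": "grch38"},
    },
    "notebook_task": {
        # Hypothetical workspace path where the pipeline notebook was imported.
        "notebook_path": "/Shared/joint-genotyping-pipeline",
        "base_parameters": {
            "manifest": "dbfs:/genomics/manifest.txt",  # hypothetical path
            "output": "dbfs:/genomics/joint-output",    # hypothetical path
        },
    },
}

print(json.dumps(job_spec, indent=2))
```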

Parameters

The pipeline accepts parameters that control its behavior. The most important and most commonly changed parameters are documented here. To view all available parameters and their usage information, run the first cell of the pipeline notebook. New parameters are added regularly. Parameters can be set for all runs or per run.

| Parameter | Default | Description |
|---|---|---|
| manifest | n/a | The manifest describing the input. |
| output | n/a | The path where pipeline output is written. |
| replayMode | skip | One of: skip (stages are skipped if output already exists) or overwrite (existing output is deleted). |
| exportVCF | false | If true, the pipeline writes results in VCF as well as Delta Lake. |
| targetedRegions | n/a | Path to files containing regions to call. If omitted, calls all regions. |
| genotypeGivenAlleles | false | If true, regenotypes variant sites based on the alleles in the input gVCFs. |
| emitAllSites | false | If true, retains low-quality sites in the output. |
| gvcfDeltaOutput | n/a | If specified, gVCFs are ingested to a Delta table before genotyping. Specify this parameter only if you expect to joint-call the same gVCFs many times. |
| performValidation | false | If true, the system verifies that each record contains the necessary information for joint genotyping. In particular, it checks that the correct number of genotype probabilities is present. |
| validationStringency | STRICT | How to handle malformed records, both during loading and validation. One of: STRICT (fail the job), LENIENT (log a warning and drop the record), or SILENT (drop the record without a warning). |
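Before submitting a run, it can help to check parameter values against the table above. The helper below is a hypothetical pre-flight check, not part of the pipeline. It assumes parameter values are passed as strings, and it assumes that either `manifest` or `gvcfDeltaOutput` must be supplied.

```python
# Defaults and allowed values copied from the parameter table; values are strings by assumption.
DEFAULTS = {
    "replayMode": "skip",
    "exportVCF": "false",
    "genotypeGivenAlleles": "false",
    "emitAllSites": "false",
    "performValidation": "false",
    "validationStringency": "STRICT",
}
ALLOWED = {
    "replayMode": {"skip", "overwrite"},
    "validationStringency": {"STRICT", "LENIENT", "SILENT"},
}


def build_parameters(**overrides):
    """Merge overrides onto the documented defaults and reject values outside the documented sets."""
    params = {**DEFAULTS, **overrides}
    if "output" not in params:
        raise ValueError("output is required")
    if "manifest" not in params and "gvcfDeltaOutput" not in params:
        raise ValueError("provide manifest or gvcfDeltaOutput")
    for name, allowed in ALLOWED.items():
        if params[name] not in allowed:
            raise ValueError(f"{name} must be one of {sorted(allowed)}, got {params[name]!r}")
    return params


params = build_parameters(
    manifest="dbfs:/genomics/manifest.txt",  # hypothetical path
    output="dbfs:/genomics/joint-output",    # hypothetical path
    validationStringency="LENIENT",
)
```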

Tip

  • To keep rare variants, set genotypeGivenAlleles and emitAllSites to true. This is equivalent to changing the GATK setting genotyping_mode from DISCOVERY (choose the most probable alleles) to GENOTYPE_GIVEN_ALLELES (use the alleles present in the input gVCFs), and output_mode from EMIT_VARIANTS_ONLY (produces calls only at variant sites) to EMIT_ALL_SITES (produces calls at any callable site regardless of confidence).
  • To perform joint calling from an existing Delta table, set gvcfDeltaOutput to the table path and replayMode to skip. You can also provide the manifest, which is used to define the VCF schema and samples; otherwise, these are inferred from the Delta table. The targetedRegions and performValidation parameters are ignored in this setup.
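The two tips translate into small sets of parameter overrides. The sketch below is illustrative only: the base parameters and paths are hypothetical, values are assumed to be strings, and `ignored_when_reusing_delta` assumes the Delta table at `gvcfDeltaOutput` already exists.

```python
# Overrides for keeping rare variants (first tip).
RARE_VARIANT_OVERRIDES = {"genotypeGivenAlleles": "true", "emitAllSites": "true"}

# Parameters the second tip says are ignored when calling from an existing Delta table.
IGNORED_WITH_EXISTING_DELTA = ("targetedRegions", "performValidation")


def apply_overrides(base, overrides):
    """Return a copy of base with overrides applied, leaving base unchanged."""
    merged = dict(base)
    merged.update(overrides)
    return merged


def ignored_when_reusing_delta(params):
    """List supplied parameters that have no effect when reusing an existing gVCF Delta table."""
    if "gvcfDeltaOutput" in params and params.get("replayMode", "skip") == "skip":
        return [name for name in IGNORED_WITH_EXISTING_DELTA if name in params]
    return []


base = {"output": "dbfs:/genomics/joint-output", "performValidation": "true"}  # hypothetical
rare = apply_overrides(base, RARE_VARIANT_OVERRIDES)
reuse = apply_overrides(
    base,
    {"gvcfDeltaOutput": "dbfs:/genomics/gvcf-delta", "replayMode": "skip"},  # hypothetical path
)
```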

Output

The regenotyped variants are all written to Delta tables inside the provided output directory. If you configured the pipeline to export VCFs, the VCF files also appear under the output directory.

output
|---genotypes
    |---Delta files
|---genotypes.vcf
    |---VCF files
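Given this layout, the output locations can be derived from the `output` parameter. The following is a minimal sketch: the helper names are hypothetical, and `load_genotypes` assumes it is called with an active Spark session that has Delta Lake support, such as the `spark` object in a Databricks notebook.

```python
import posixpath


def output_paths(output, export_vcf=False):
    """Map the output directory to the locations shown in the layout above."""
    paths = {"genotypes_delta": posixpath.join(output, "genotypes")}
    if export_vcf:
        paths["genotypes_vcf"] = posixpath.join(output, "genotypes.vcf")
    return paths


def load_genotypes(spark, output):
    """Read the regenotyped variants from the Delta table under the output directory."""
    return spark.read.format("delta").load(output_paths(output)["genotypes_delta"])


paths = output_paths("dbfs:/genomics/joint-output", export_vcf=True)  # hypothetical path
```

In a notebook, `load_genotypes(spark, "dbfs:/genomics/joint-output")` would return a DataFrame that can be queried like any other Delta table.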

Reference genomes

You must configure the reference genome using environment variables. To use GRCh37, set the environment variable:

refGenomeId=grch37

To use GRCh38, change grch37 to grch38.

To use a custom reference genome, see the instructions in Custom reference genomes.

Manifest format

Note

Manifest blobs are supported in Databricks Runtime 6.6 for Genomics and above.

The manifest is a file or blob describing where to find the input gVCF files, with each path on a new row.

Tip

If the provided manifest is a file, each row may be an absolute path or a path relative to the manifest file. If the provided manifest is a blob, each row must be an absolute path. You can include globs (*) to match many files.
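The path rules above can be sketched as a small resolver. This is an illustration of the documented rules, not the pipeline's own parsing code: the function name, the test for what counts as an absolute path (a leading `/` or a URI scheme such as `dbfs:/`), and the example paths are all assumptions.

```python
import posixpath
import re

# A row is treated as absolute if it starts with "/" or with a URI scheme such as "dbfs:/".
_SCHEME = re.compile(r"^[A-Za-z][A-Za-z0-9+.-]*:/")


def resolve_manifest_rows(manifest_text, manifest_path=None):
    """Resolve manifest rows to absolute paths.

    Pass manifest_path when the manifest is a file, so relative rows resolve against
    its directory. Leave it as None for a manifest blob, where every row must be absolute.
    """
    base = posixpath.dirname(manifest_path) if manifest_path else None
    resolved = []
    for line in manifest_text.splitlines():
        row = line.strip()
        if not row:
            continue  # skip blank rows
        if row.startswith("/") or _SCHEME.match(row):
            resolved.append(row)
        elif base is None:
            raise ValueError(f"manifest blob rows must be absolute paths: {row!r}")
        else:
            resolved.append(posixpath.join(base, row))
    return resolved


manifest_text = "dbfs:/genomics/gvcfs/sample1.g.vcf.gz\nbatch2/*.g.vcf.gz\n"  # hypothetical rows
resolved = resolve_manifest_rows(manifest_text, manifest_path="dbfs:/genomics/manifest.txt")
```

Globs are passed through unchanged here; expanding them is left to whatever reads the files.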

Troubleshooting

Job fails with an ArrayIndexOutOfBoundsException

This error usually indicates that an input record has an incorrect number of genotype probabilities. Try setting the performValidation option to true and the validationStringency option to LENIENT or SILENT.
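To see what a correct count looks like, the VCF specification defines the number of genotypes for a site with N alleles and ploidy P as the binomial coefficient C(N + P - 1, P), which is N(N + 1)/2 for diploid samples. The sketch below applies that formula to spot suspect records before running the pipeline. It is an independent illustration with hypothetical function names, not the pipeline's own validation logic.

```python
from math import comb


def expected_genotype_count(num_alleles, ploidy=2):
    """Number of genotypes, and so of likelihood values, for the given allele count and ploidy."""
    return comb(num_alleles + ploidy - 1, ploidy)


def has_valid_likelihoods(alts, likelihoods, ploidy=2):
    """Check that a record's likelihood array length matches its alleles (REF plus each ALT)."""
    num_alleles = 1 + len(alts)
    return len(likelihoods) == expected_genotype_count(num_alleles, ploidy)


# A gVCF record with ALT alleles T and <NON_REF> has 3 alleles, so it needs 6 values.
ok = has_valid_likelihoods(["T", "<NON_REF>"], [0, 30, 300, 40, 310, 400])
truncated = has_valid_likelihoods(["T", "<NON_REF>"], [0, 30, 300])
```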

Additional usage info

The joint genotyping pipeline shares many operational details with the other Azure Databricks pipelines. For more detailed usage information, such as output format structure, tips for running programmatically, and steps for setting up custom reference genomes, see DNASeq pipeline.

Joint genotyping pipeline notebook

Get notebook