DNASeq pipeline


The following library versions are packaged in Databricks Runtime 7.0 for Genomics. For libraries included in earlier versions of Databricks Runtime for Genomics, see the release notes.

The Azure Databricks DNASeq pipeline is a GATK best practices compliant pipeline for short read alignment, variant calling, and variant annotation. It uses the following software packages, parallelized using Spark.

  • BWA v0.7.17
  • ADAM v0.32.0
  • GATK HaplotypeCaller v4.1.4.1
  • SnpEff v4.3

For more information about the pipeline implementation and expected runtimes and costs for various option combinations, see Building the Fastest DNASeq Pipeline at Scale.


The pipeline is run as an Azure Databricks job. You can set up a cluster policy to save the configuration:

  {
    "num_workers": {
      "type": "unlimited",
      "defaultValue": 13
    },
    "node_type_id": {
      "type": "unlimited",
      "defaultValue": "Standard_F32s_v2"
    },
    "spark_env_vars.refGenomeId": {
      "type": "unlimited",
      "defaultValue": "grch38"
    },
    "spark_version": {
      "type": "regex",
      "pattern": ".*-hls.*",
      "defaultValue": "7.0.x-hls-scala2.12"
    }
  }
  • The cluster configuration should use Databricks Runtime for Genomics.
  • The task should be the DNASeq notebook found at the bottom of this page.
  • For best performance, use compute-optimized VMs with at least 60GB of memory. We recommend Standard_F32s_v2 VMs.
  • If you're running base quality score recalibration, use general purpose (Standard_D32s_v3) instances instead, since this operation requires more memory.


The pipeline accepts parameters that control its behavior. The most important and most commonly changed parameters are documented here; the rest can be found in the DNASeq notebook. Parameters can be set for all runs or per run.

Parameter                 Default  Description
manifest                  n/a      The manifest describing the input.
output                    n/a      The path where pipeline output should be written.
replayMode                skip     One of:
                                   * skip: stages are skipped if output already exists.
                                   * overwrite: existing output is deleted.
exportVCF                 false    If true, the pipeline writes results in VCF as well as Delta Lake.
referenceConfidenceMode   NONE     One of:
                                   * NONE: only variant sites are included in the output.
                                   * GVCF: all sites are included, with adjacent reference sites banded.
                                   * BP_RESOLUTION: all sites are included.
perSampleTimeout          12h      A timeout applied per sample. After reaching this timeout, the pipeline continues on to the next sample. The value of this parameter must include a timeout unit: 's' for seconds, 'm' for minutes, or 'h' for hours. For example, '60m' results in a timeout of 60 minutes.
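Per-run parameters can be passed as notebook parameters when launching the job, as shown in the Run programmatically section below. A minimal sketch of such a parameter payload, using a hypothetical manifest path:

```json
{
  "manifest": "dbfs:/genomics/manifests/my-manifest.csv",
  "exportVCF": "true",
  "replayMode": "overwrite"
}
```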


To optimize runtime, set spark.sql.shuffle.partitions in the Spark config to three times the number of cores in the cluster.
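As a sketch, assuming the recommended cluster shape from the policy above (13 workers of Standard_F32s_v2, which have 32 cores each), the value works out as follows. The spark.conf call is shown as a comment because it requires a live Spark session:

```python
# Hypothetical cluster shape: 13 workers with 32 cores each (Standard_F32s_v2)
num_workers = 13
cores_per_worker = 32

# Recommended setting: three times the total core count of the cluster
shuffle_partitions = 3 * num_workers * cores_per_worker
print(shuffle_partitions)  # 1248

# In a notebook attached to the cluster, this would be applied as:
# spark.conf.set("spark.sql.shuffle.partitions", shuffle_partitions)
```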


You can customize the DNASeq pipeline by disabling read alignment, variant calling, or variant annotation. By default, all three stages are enabled.

val pipeline = new DNASeqPipeline(align = true, callVariants = true, annotate = true)

To disable variant annotation, set up the pipeline as follows:

val pipeline = new DNASeqPipeline(align = true, callVariants = true, annotate = false)

The permitted stage combinations are:

Read alignment  Variant calling  Variant annotation
true            true             true
true            true             false
true            false            false
false           true             true
false           true             false

Reference genomes

You must configure the reference genome using an environment variable. To use GRCh37, set an environment variable like this:
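The original snippet is not reproduced here. Based on the spark_env_vars.refGenomeId entry in the cluster policy above, the variable is presumably set in the cluster's Spark environment variables as:

```
refGenomeId=grch37
```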


To use GRCh38 instead, replace grch37 with grch38.

Custom reference genomes


Custom reference genome support is available in Databricks Runtime 6.6 for Genomics and above.

To use a reference build other than GRCh37 or GRCh38, follow these steps:

  1. Prepare the reference for use with BWA and GATK.

    The reference genome directory contents should include these files:
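    The original file listing is not reproduced here. As a sketch, a BWA- and GATK-ready reference directory typically contains the fasta, its fasta index and sequence dictionary, and the five BWA index files (the BWA extensions match those named in step 4 below):

```
<reference_name>.dict
<reference_name>.fa
<reference_name>.fa.fai
<reference_name>.fa.amb
<reference_name>.fa.ann
<reference_name>.fa.bwt
<reference_name>.fa.pac
<reference_name>.fa.sa
```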

  2. Upload the reference genome files to a directory in cloud storage or DBFS. If you upload the files to cloud storage, you must mount the directory to a location in DBFS.

  3. In your cluster configuration, set an environment variable REF_GENOME_PATH that points to the path of the fasta file in DBFS. For example:
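    The original example is not reproduced here; a sketch, assuming the reference directory from step 2 was mounted at a hypothetical DBFS location:

```
REF_GENOME_PATH=/mnt/reference-genome/<reference_name>.fa
```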


    The path must not include a dbfs: prefix.

    When you use a custom reference genome, the SnpEff annotation stage is skipped.


During cluster initialization, the Azure Databricks DNASeq pipeline uses the provided BWA index files to generate an index image file. If you plan to use the same reference genome many times, you can accelerate cluster startup by building the index image file ahead of time. This process reduces cluster startup time by about 30 seconds.

  1. Copy the reference genome directory to the driver node of a Databricks Runtime for Genomics cluster.

    %sh cp -r /dbfs/<reference_dir_path> /local_disk0/reference-genome
  2. Generate the index image file from the BWA index files.

    import org.broadinstitute.hellbender.utils.bwa._
    BwaMemIndex.createIndexImageFromIndexFiles("/local_disk0/reference-genome/<reference_name>.fa", "/local_disk0/reference-genome/<reference_name>.fa.img")
  3. Copy the index image file to the same directory as the reference fasta files.

    %sh cp /local_disk0/reference-genome/<reference_name>.fa.img /dbfs/<reference_dir_path>
  4. Delete the unneeded BWA index files (.amb, .ann, .bwt, .pac, .sa) from DBFS.

    %fs rm <file>

Manifest format


Manifest blobs are supported in Databricks Runtime 6.6 for Genomics and above.

The manifest is a CSV file or blob describing where to find the input FASTQ or BAM files. An example:
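The original example is not reproduced here. As an illustrative sketch: of these columns, file_path and paired_end are referenced elsewhere on this page, while the sample_id and read_group_id column names and all paths are assumptions:

```
file_path,sample_id,paired_end,read_group_id
/genomics/fastq/HG001_1.fastq.bgz,HG001,1,HG001_rg
/genomics/fastq/HG001_2.fastq.bgz,HG001,2,HG001_rg
```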


If your input consists of unaligned BAM files, omit the paired_end field:
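The original example is not reproduced here; an illustrative sketch with the same assumed columns as above, minus paired_end:

```
file_path,sample_id,read_group_id
/genomics/bam/HG001.bam,HG001,HG001_rg
```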



If the provided manifest is a file, the file_path field in each row can be an absolute path or a path relative to the manifest file. If the provided manifest is a blob, the file_path field must be an absolute path. You can include globs (*) to match many files.

Supported input formats

  • Parquet
  • BAM
  • FASTQ
    • bgzip *.fastq.bgz (recommended): bgzipped files with the *.fastq.gz extension are recognized as bgz.
    • uncompressed *.fastq
    • gzip *.fastq.gz


Gzipped files are not splittable. Choose autoscaling clusters to minimize cost for these files.

To block-compress a FASTQ, install htslib, which includes the bgzip executable.


The aligned reads, called variants, and annotated variants are all written out to Delta tables inside the provided output directory if the corresponding stages are enabled. Each table is partitioned by sample ID. In addition, if you configured the pipeline to export BAMs or VCFs, they'll appear under the output directory as well.

    <output directory>
    |---aligned reads
    |       |---Parquet files
    |---called variants
    |       |---Delta files
    |---annotated variants
    |       |---Delta files

When you run the pipeline on a new sample, it appears as a new partition. If you run the pipeline for a sample that already appears in the output directory, that partition is overwritten.

Since all the information is available in Delta Lake, you can easily analyze it with Spark in Python, R, Scala, or SQL. For example:


# Load the data
df = spark.read.format("delta").load("/genomics/output_dir/genotypes")
# Show all variants from chromosome 12
display(df.where("contigName == '12'").orderBy("sampleId", "start"))


-- Register the table in the catalog
CREATE TABLE genotypes
USING delta
LOCATION '/genomics/output_dir/genotypes'


Job is slow and few tasks are running

This usually indicates that the input FASTQ files are compressed with gzip instead of bgzip. Gzipped files are not splittable, so the input cannot be processed in parallel. Try using the parallel bgzip notebook to convert a directory of files from gzip to bgzip in parallel.

Run programmatically

In addition to using the UI, you can start runs of the pipeline programmatically using the Databricks CLI.

Find the job ID

After setting up the pipeline job in the UI, copy the Job ID so that you can pass it to the jobs run-now CLI command.

Here's an example bash script that you can adapt for your workflow:

# Generate a manifest file
cat <<HERE >manifest.csv
<manifest contents>
HERE

# Upload the file to DBFS
DBFS_PATH=dbfs:/genomics/manifests/$(date +"%Y-%m-%dT%H-%M-%S")-manifest.csv
databricks fs cp manifest.csv $DBFS_PATH

# Start a new run
databricks jobs run-now --job-id <job-id> --notebook-params "{\"manifest\": \"$DBFS_PATH\"}"

In addition to starting runs from the command line, you can use this pattern to invoke the pipeline from automated systems like Jenkins.

DNASeq pipeline notebook

Get notebook