Pre-packaged VEP annotation pipeline

Setup

Run VEP (release 96) as an Azure Databricks job. The necessary details are:

Parameters

The pipeline accepts a number of parameters that control its behavior. Every parameter can be set either once for all runs or separately for each run.

| Parameter | Default | Description |
|---|---|---|
| inputVcf | n/a | Path of the VCF file to annotate with VEP. |
| output | n/a | Path where pipeline output should be written. |
| replayMode | skip | One of: skip (if output already exists, stages are skipped) or overwrite (existing output is deleted). |
| exportVCF | false | If true, the pipeline writes results as both VCF and Delta Lake. |
| extraVepOptions | --everything --minimal --allele_number --fork 4 | Additional command line options to pass to VEP. Some options are set by the pipeline and cannot be overridden: --assembly, --cache, --dir_cache, --fasta, --format, --merged, --no_stats, --offline, --output_file, --refseq, --vcf. See all possible options on the VEP site. |
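Because some options cannot be overridden, it can help to check a candidate extraVepOptions value before submitting a run. The following is a minimal sketch, not part of the pipeline: the helper name find_reserved_vep_options is hypothetical, and the reserved list is copied from the table above.

```shell
#!/bin/bash
# Options the pipeline sets itself (copied from the parameter table).
RESERVED_VEP_OPTIONS="--assembly --cache --dir_cache --fasta --format --merged --no_stats --offline --output_file --refseq --vcf"

# Hypothetical helper: prints every reserved option found among its
# arguments (one per line) and returns 1 if at least one was found.
find_reserved_vep_options() {
  local found=0 arg name reserved
  for arg in "$@"; do
    name="${arg%%=*}"   # compare --opt=value by its option name only
    for reserved in $RESERVED_VEP_OPTIONS; do
      if [ "$name" = "$reserved" ]; then
        echo "$reserved"
        found=1
      fi
    done
  done
  return $found
}
```

For example, `find_reserved_vep_options --everything --minimal --allele_number --fork 4` prints nothing and returns 0, while passing `--cache` among the arguments prints `--cache` and returns 1.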

In addition, you must configure the reference genome and transcripts using environment variables. To use GRCh37 with merged Ensembl and RefSeq transcripts, set the environment variable:

refGenomeId=grch37_merged_vep_96

The refGenomeId values for all pairs of reference genome and transcript set are:

| Transcripts | grch37 | grch38 |
|---|---|---|
| Ensembl | grch37_vep_96 | grch38_vep_96 |
| RefSeq | grch37_refseq_vep_96 | grch38_refseq_vep_96 |
| Merged | grch37_merged_vep_96 | grch38_merged_vep_96 |
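The identifiers in this table follow a regular naming pattern, which can be sketched as a small lookup. The helper name ref_genome_id and its lowercase argument spellings (ensembl, refseq, merged) are assumptions made for illustration; only the six identifiers it prints come from the table.

```shell
#!/bin/bash
# Hypothetical helper: prints the refGenomeId for a reference genome
# (grch37 or grch38) and a transcript set (ensembl, refseq, or merged).
ref_genome_id() {
  local genome="$1" transcripts="$2"
  case "$genome" in
    grch37|grch38) ;;
    *) echo "unknown reference genome: $genome" >&2; return 1 ;;
  esac
  case "$transcripts" in
    ensembl)       echo "${genome}_vep_96" ;;
    refseq|merged) echo "${genome}_${transcripts}_vep_96" ;;
    *) echo "unknown transcript set: $transcripts" >&2; return 1 ;;
  esac
}
```

For example, `ref_genome_id grch37 merged` prints `grch37_merged_vep_96`, the value used in the setting shown earlier.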

LOFTEE

You can run VEP with plugins in order to extend, filter, or manipulate the VEP output. Set up the LOFTEE plugin by following the instructions below for your reference genome.

grch37

Create a LOFTEE cluster using an init script.

#!/bin/bash

# Download LOFTEE (master branch)
DIR_VEP_PLUGINS=/opt/vep/Plugins
mkdir -p "$DIR_VEP_PLUGINS"
cd "$DIR_VEP_PLUGINS"
# Add the LOFTEE directory to PERL5LIB in the Spark environment
echo "export PERL5LIB=$PERL5LIB:$DIR_VEP_PLUGINS/loftee" >> /databricks/spark/conf/spark-env.sh
git clone --depth 1 --branch master https://github.com/konradjk/loftee.git

We recommend creating a mount point to store any additional files in cloud storage; these files can then be accessed using the FUSE mount. In the commands below, replace <mount-point> with your mount point.

If desired, save the ancestral sequence at the mount point.

cd <mount-point>
wget https://s3.amazonaws.com/bcbio_nextgen/human_ancestor.fa.gz
wget https://s3.amazonaws.com/bcbio_nextgen/human_ancestor.fa.gz.fai
wget https://s3.amazonaws.com/bcbio_nextgen/human_ancestor.fa.gz.gzi

If desired, save the PhyloCSF database at the mount point.

cd <mount-point>
wget https://personal.broadinstitute.org/konradk/loftee_data/GRCh37/phylocsf_gerp.sql.gz
gunzip phylocsf_gerp.sql.gz

When running the VEP pipeline, provide the corresponding extra options.

--dir_plugins /opt/vep/Plugins --plugin LoF,loftee_path:/opt/vep/Plugins/loftee,human_ancestor_fa:<mount-point>/human_ancestor.fa.gz,conservation_file:<mount-point>/phylocsf_gerp.sql
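To avoid editing the placeholder by hand, the option string above can be generated from the mount point. This is a minimal sketch under two assumptions: the helper name loftee_options_grch37 is hypothetical, and both optional files (the ancestral sequence and the PhyloCSF database) were saved at the mount point.

```shell
#!/bin/bash
# Hypothetical helper: prints the LOFTEE options for grch37, with
# <mount-point> replaced by the first argument.
loftee_options_grch37() {
  local mount_point="${1%/}"   # drop one trailing slash, if present
  local plugin="LoF,loftee_path:/opt/vep/Plugins/loftee"
  plugin="${plugin},human_ancestor_fa:${mount_point}/human_ancestor.fa.gz"
  plugin="${plugin},conservation_file:${mount_point}/phylocsf_gerp.sql"
  echo "--dir_plugins /opt/vep/Plugins --plugin ${plugin}"
}
```

Calling `loftee_options_grch37 /dbfs/mnt/loftee` (an example path, not a required location) prints the options with that path filled in.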

grch38

Using an init script, create a LOFTEE cluster that can parse BigWig files.

#!/bin/bash

# Download LOFTEE
DIR_VEP_PLUGINS=/opt/vep/Plugins
mkdir -p $DIR_VEP_PLUGINS
cd $DIR_VEP_PLUGINS
echo export PERL5LIB=$PERL5LIB:$DIR_VEP_PLUGINS/loftee >> /databricks/spark/conf/spark-env.sh
git clone --depth 1 --branch grch38 https://github.com/konradjk/loftee.git

# Download Kent source tree
mkdir -p /tmp/bigfile
cd /tmp/bigfile
wget https://github.com/ucscGenomeBrowser/kent/archive/v335_base.tar.gz
tar xzf v335_base.tar.gz

# Build Kent source
export KENT_SRC=$PWD/kent-335_base/src
export MACHTYPE=$(uname -m)
export CFLAGS="-fPIC"
export MYSQLINC=`mysql_config --include | sed -e 's/^-I//g'`
export MYSQLLIBS=`mysql_config --libs`
cd $KENT_SRC/lib
echo 'CFLAGS="-fPIC"' > ../inc/localEnvironment.mk
make clean
make
cd ../jkOwnLib
make clean
make

# Install Bio::DB::BigFile
cpanm --notest Bio::Perl
cpanm --notest Bio::DB::BigFile

We recommend creating a mount point to store any additional files in cloud storage; these files can then be accessed using the FUSE mount. In the commands below, replace <mount-point> with your mount point.

Save the GERP scores BigWig at the mount point.

cd <mount-point>
wget https://personal.broadinstitute.org/konradk/loftee_data/GRCh38/gerp_conservation_scores.homo_sapiens.GRCh38.bw

If desired, save the ancestral sequence at the mount point.

cd <mount-point>
wget https://personal.broadinstitute.org/konradk/loftee_data/GRCh38/human_ancestor.fa.gz
wget https://personal.broadinstitute.org/konradk/loftee_data/GRCh38/human_ancestor.fa.gz.fai
wget https://personal.broadinstitute.org/konradk/loftee_data/GRCh38/human_ancestor.fa.gz.gzi

If desired, save the PhyloCSF database at the mount point.

cd <mount-point>
wget https://personal.broadinstitute.org/konradk/loftee_data/GRCh38/loftee.sql.gz
gunzip loftee.sql.gz

When running the VEP pipeline, provide the corresponding extra options.

--dir_plugins /opt/vep/Plugins --plugin LoF,loftee_path:/opt/vep/Plugins/loftee,gerp_bigwig:<mount-point>/gerp_conservation_scores.homo_sapiens.GRCh38.bw,human_ancestor_fa:<mount-point>/human_ancestor.fa.gz,conservation_file:<mount-point>/loftee.sql
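Since these options reference several files by path, a quick check that the files exist at the mount point can catch a mistyped path before a run. The sketch below is illustrative only: the helper name missing_loftee_files_grch38 is hypothetical, and it assumes all of the files from the download steps above, including the optional ones, were saved.

```shell
#!/bin/bash
# Hypothetical helper: prints the path of every expected grch38 LOFTEE
# file that is missing under the given mount point; returns 1 if any is.
missing_loftee_files_grch38() {
  local mount_point="${1%/}" missing=0 name
  for name in \
      gerp_conservation_scores.homo_sapiens.GRCh38.bw \
      human_ancestor.fa.gz \
      human_ancestor.fa.gz.fai \
      human_ancestor.fa.gz.gzi \
      loftee.sql; do
    if [ ! -f "${mount_point}/${name}" ]; then
      echo "${mount_point}/${name}"
      missing=1
    fi
  done
  return $missing
}
```

A return value of 0 with no output means every file named in the options above was found.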

VEP pipeline notebook

Get notebook