SAIGE 和 SAIGE-GENE(使用管道转换器) SAIGE and SAIGE-GENE (using Pipe Transformer)

SAIGE 是一种使用 R 语言编写的命令行工具,用于对生物样本库规模的数据执行遗传关联测试。SAIGE is a command-line tool written in R for performing genetic association tests on biobank-scale data. 使用 Glow 的管道转换器,运行用于基因组学的 Databricks Runtime上的单变量关联测试方法 SAIGE基于区域的关联测试方法 SAIGE-GENERun the single-variant association test method SAIGE and the region-based association test method SAIGE-GENE on Databricks Runtime for Genomics using Glow’s Pipe Transformer.

创建 SAIGE 群集 Create a SAIGE cluster

使用初始化脚本,在群集中的每个节点上安装 HTSLib 以及 SAIGE 包、其依赖项和关联的输入文件。Install HTSLib as well as the SAIGE package, its dependencies, and associated input files on every node in the cluster using an init script. 建议创建一个装入点,通过 FUSE 装载来存储和下载云存储中的这些文件。We recommend creating a mount point to store and download these files from cloud storage with the FUSE mount. 将脚本中的值替换为你的装入点和本地目录。Replace the values in the script with your mount point and local directory.

#!/usr/bin/env bash

set -ex
set -o pipefail

# Install SKAT for SAIGE-GENE
sudo apt-get -y install libboost-all-dev autoconf
Rscript -e 'install.packages("SKAT", repos="https://cran.us.r-project.org")'

# Install SAIGE
cd /opt
git clone https://github.com/weizhouUMICH/SAIGE
cd SAIGE
Rscript extdata/install_packages.R
R CMD INSTALL *.tar.gz

# Install HTSLIB
cd /opt
git clone https://github.com/samtools/htslib
cd htslib
autoheader
autoconf
./configure
make
make install

# Sync SAIGE input files
cp -r <mount-point>/* <local-dir>

运行关联测试Run association tests

若要使关联测试并行化,请使用管道转换器。To parallelize the association test, use the Pipe Transformer.

  • 命令:一个 JSON 数组,其中包含 SAIGESAIGE-GENE 脚本Command: A JSON array containing a SAIGE or SAIGE-GENE script
  • 输入格式化程序:VCFInput formatter: VCF
  • 输出格式化程序:CSVOutput formatter: CSV
    • 标头标志:trueHeader flag: true
    • 分隔符:空格Delimiter: space
import glow
import json

cmd = json.dumps(["bash", "-c", saige_script])
output_df = glow.transform("pipe", input_df, cmd=cmd, input_formatter='vcf', in_vcf_header=input_vcf,
                           output_formatter='csv', out_header='true', out_delimiter=' ')

可以直接从 VCF 或带有 VCF 行的 Delta 表中读取输入数据帧。You can read the input DataFrame directly from a VCF or from a Delta table with VCF rows.

单变量关联测试 (SAIGE)Single-variant association test (SAIGE)

以下单元格显示要通过管道传输的示例 SAIGE 脚本。The following cell shows an example SAIGE script to be piped. 单变量关联测试文档中介绍了这些选项。The options are explained in the single-variant association test docs.

#!/bin/sh
set -e

export tmpdir=$(mktemp -d -t vcf.XXXXXX)
cat - > ${tmpdir}/input.vcf
bgzip -c ${tmpdir}/input.vcf > ${tmpdir}/input.vcf.gz
tabix -p vcf ${tmpdir}/input.vcf.gz

Rscript /opt/SAIGE/extdata/step2_SPAtests.R \
    --vcfFile=${tmpdir}/input.vcf.gz \
    --vcfFileIndex=${tmpdir}/input.vcf.gz.tbi \
    --SAIGEOutputFile=${tmpdir}/output.txt \
    --sampleFile=<local-dir>/input/sampleIds.txt \
    --GMMATmodelFile=<local-dir>/step-1/output.rda \
    --varianceRatioFile=<local-dir>/step-1/output.varianceRatio.txt \
    --vcfField=GT \
    --chrom=22 \
    --minMAF=0.0001 \
    --minMAC=1 \
    --numLinesOutput=2 \
    >&2

cat ${tmpdir}/output.txt
rm -rf ${tmpdir}

基于区域的关联测试 (SAIGE-GENE) Region-based association test (SAIGE-GENE)

在进行管道传输前,按区域或基因对输入数据帧进行分区,每个分区包含已排序且不同的相关 VCF 行集。Before piping, partition the input DataFrame by region or gene, with each partition containing the sorted, distinct, and relevant set of VCF rows. 区域和对应的变量应与 SAIGE-GENE 脚本中引用的组文件中的区域和对应的变量匹配。The regions and corresponding variants should match those in the group file referenced in the SAIGE-GENE script. 区域可从任何源派生,例如 SnpEff 管道中的基因注释。Regions can be derived from any source, such as gene annotations from the SnpEff pipeline.

input_df = genotype_df.join(region_df, ["contigName", "start", "end"]) \
  .repartition("geneName") \
  .select("contigName", "start", "end", "referenceAllele", "alternateAlleles", "genotypes") \
  .sortWithinPartitions("contigName", "start")

以下单元格显示了一个要通过管道传输的示例 SAIGE-GENE 脚本。The following cell shows an example SAIGE-GENE script to be piped. 基于区域的关联测试文档中介绍了这些选项。The options are explained in the region-based association test docs.

#!/bin/sh
set -e

export tmpdir=$(mktemp -d -t vcf.XXXXXX)
uniq -w 100 - > ${tmpdir}/input.vcf
bgzip -c ${tmpdir}/input.vcf > ${tmpdir}/input.vcf.gz
tabix -p vcf ${tmpdir}/input.vcf.gz

Rscript /opt/SAIGE/extdata/step2_SPAtests.R \
    --vcfFile=${tmpdir}/input.vcf.gz \
    --vcfFileIndex=${tmpdir}/input.vcf.gz.tbi \
    --SAIGEOutputFile=${tmpdir}/output.txt \
    --groupFile=${tmpdir}/groupFile.txt \
    --sampleFile=<local-dir>/input/sampleIds.txt \
    --GMMATmodelFile=<local-dir>/step-1/output.rda \
    --varianceRatioFile=<local-dir>/step-1/output.varianceRatio.txt \
    --sparseSigmaFile=<local-dir>/step-1/output.varianceRatio.txt_relatednessCutoff_0.125_2000_randomMarkersUsed.sparseSigma.mtx \
    --vcfField=GT \
    --chrom=22 \
    --minMAF=0 \
    --minMAC=0.5 \
    --maxMAFforGroupTest=0.5 \
    --numLinesOutput=2 \
    >&2

cat ${tmpdir}/output.txt
rm -rf ${tmpdir}

在进行管道传输后,消除输出数据帧中每个基因的关联结果的歧义。After piping, disambiguate the association results for each gene in the output DataFrame.

from pyspark.sql import Window

window_spec = Window.partitionBy("Gene")
assoc_df = output_df.withColumn("numMarkerIDs", size(split("markerIDs", ";"))) \
  .withColumn("maxNumMarkerIDs", max("numMarkerIDs").over(window_spec)) \
  .filter("numMarkerIDs = maxNumMarkerIDs") \
  .drop("numMarkerIDs", "maxNumMarkerIDs") \
  .drop_duplicates(["Gene"])

示例笔记本 Example notebooks

NULL 逻辑混合模型拟合笔记本Null logistic mixed model fit notebook

获取笔记本Get notebook

SAIGE 笔记本SAIGE notebook

获取笔记本Get notebook

SAIGE-GENE 笔记本SAIGE-GENE notebook

获取笔记本Get notebook