Databricks Runtime 7.3 for Genomics

Databricks released this image in September 2020.

Databricks Runtime 7.3 for Genomics is a version of Databricks Runtime 7.3 optimized for working with genomic and biomedical data. It is a component of the Databricks Unified Analytics Platform for Genomics.

For more information, including instructions for creating a Databricks Runtime for Genomics cluster, see Databricks Runtime for Genomics. For more information on developing genomics applications, see Genomics.

New features

Databricks Runtime 7.3 for Genomics is built on top of Databricks Runtime 7.3. For information on what’s new in Databricks Runtime 7.3, see the Databricks Runtime 7.3 release notes.

Support for reading BGEN files with uncompressed or zstd-compressed genotypes

Glow now supports reading BGEN files whose SNP block probability data is either uncompressed or compressed with zstandard’s ZSTD_compress() function. This is in addition to the existing support for data compressed with zlib’s compress() function.
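For orientation, the BGEN format specification records the SNP block compression scheme in the two lowest bits of the flags field that closes the header block (0 for uncompressed, 1 for zlib, 2 for zstd). The sketch below is not Glow's implementation. It is a minimal standard-library illustration of how that field can be inspected, and the names bgen_compression and make_header are hypothetical helpers introduced here.

```python
import struct

# Values of the CompressedSNPBlocks field (flag bits 0-1) in the BGEN spec.
COMPRESSION = {0: "uncompressed", 1: "zlib", 2: "zstd"}


def bgen_compression(data: bytes) -> str:
    """Return the SNP block compression named in a BGEN header.

    `data` must hold at least the first 4 + L_H bytes of the file.
    """
    # Bytes 0-3: offset to the first variant block.
    # Bytes 4-7: header block length L_H (the header block starts at byte 4).
    _offset, header_len = struct.unpack_from("<II", data, 0)
    # The flags are the last 4 bytes of the header block.
    (flags,) = struct.unpack_from("<I", data, 4 + header_len - 4)
    return COMPRESSION.get(flags & 0b11, "unknown")


def make_header(compression: int, layout: int = 2) -> bytes:
    """Build a minimal synthetic header (hypothetical helper, for the demo only)."""
    flags = compression | (layout << 2)  # layout occupies flag bits 2-5
    # L_H = 20: length, variant count, sample count, magic, flags.
    header = struct.pack("<III4sI", 20, 0, 0, b"bgen", flags)
    return struct.pack("<I", 20) + header  # offset field, then header block


for code in (0, 1, 2):
    print(code, bgen_compression(make_header(code)))
```

With a real file, passing the first few kilobytes read in binary mode should be enough, provided the header's free data area is small.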

Improvements

Variant liftOver performance

Performing variant liftOver with Glow is now up to 12x faster.

Faster large-file upload to ABFS

Writing large files (such as VCF, BGEN, and BAM) to the Azure Blob File System (ABFS) is now up to 2x faster.

Performance of DNASeq pipeline on autoscaling clusters

The DNASeq pipeline is now better tuned for autoscaling clusters.

Pipelines output bgzipped VCFs by default

All genomics pipelines now compress output VCFs with bgzip by default. Previously, output VCFs were uncompressed by default. To change this behavior, set the vcfCompressionCodec pipeline option to a value other than the default, bgzf.
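Because a bgzipped (BGZF) file is a series of concatenated gzip members, downstream code that expected plain-text VCFs can usually switch to a gzip-aware reader. The sketch below is an illustration using only Python's standard library, not part of the pipelines themselves. The function name read_vcf_header and the sample records are hypothetical, and the demo file is ordinary multi-member gzip standing in for real bgzip output.

```python
import gzip
import os
import tempfile


def read_vcf_header(path):
    """Return the header lines (those starting with '#') of a gzip/bgzip VCF."""
    header = []
    # gzip.open reads concatenated gzip members transparently,
    # which is how BGZF blocks are laid out on disk.
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if not line.startswith("#"):
                break  # first data record reached; the header is complete
            header.append(line.rstrip("\n"))
    return header


# Demo: write a small VCF as two gzip members to mimic a blocked file.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.vcf.gz")
with gzip.open(demo_path, "wb") as out:
    out.write(b"##fileformat=VCFv4.2\n")
with gzip.open(demo_path, "ab") as out:  # append mode adds a second member
    out.write(b"#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n")
    out.write(b"1\t100\t.\tA\tG\t.\t.\t.\n")

for header_line in read_vcf_header(demo_path):
    print(header_line)
```

Random access by genomic region additionally needs an index built by a BGZF-aware tool, which this sketch does not cover.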

Refactors

TNSeq pipeline renamed to MutSeq

The Tumor/Normal pipeline has been renamed from TNSeq to MutSeq.

Libraries

The following sections list the libraries included in Databricks Runtime 7.3 for Genomics that differ from those included in Databricks Runtime 7.3.

Packaged libraries

Library      Version
ADAM         0.32.0
GATK         4.1.4.1
Hadoop-bam   7.9.2
samtools     1.9
VEP          96