Databricks Runtime 7.0 for Genomics

Databricks released this image in June 2020.

Databricks Runtime 7.0 for Genomics is a version of Databricks Runtime 7.0 optimized for working with genomic and biomedical data. It is a component of the Databricks Unified Analytics Platform for Genomics.

For more information, including instructions for creating a Databricks Runtime for Genomics cluster, see Databricks Runtime for Genomics. For more information on developing genomics applications, see Genomics.

New features

Databricks Runtime 7.0 for Genomics is built on top of Databricks Runtime 7.0. For information on what's new in Databricks Runtime 7.0, see the Databricks Runtime 7.0 release notes.

GloWGR: Whole genome regression

Glow now includes a scalable whole genome regression method, GloWGR. GloWGR is a distributed version of the single-node tool regenie. It is an enterprise-ready tool that provides accuracy equivalent to other whole genome regression methods, with an order-of-magnitude improvement in speed. For details, see whole genome regression in the open source Glow documentation.

Transformers accept non-string typed arguments

All Glow transformers, including the pipe transformer and variant normalizer, now accept arguments whose values are not strings. The Glow documentation for the pipe transformer reflects the new usage. For backwards compatibility, string values are still accepted for all arguments.
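The idea of accepting both a typed value and its legacy string form can be sketched in plain Python. This is an illustration only, not Glow's implementation: the helper name `normalize_cmd_sketch` is hypothetical, and the sketch assumes that a legacy string value holds a JSON-encoded array of strings.

```python
import json
from typing import List, Sequence, Union


def normalize_cmd_sketch(cmd: Union[str, Sequence[str]]) -> List[str]:
    """Return a command as a list of strings (hypothetical helper).

    Assumption: a string argument is a JSON-encoded array, standing in
    for the legacy string form; any other sequence is used directly.
    """
    parsed = json.loads(cmd) if isinstance(cmd, str) else list(cmd)
    # Reject anything that is not a flat list of strings.
    if not isinstance(parsed, list) or not all(isinstance(p, str) for p in parsed):
        raise ValueError("expected a list of strings or a JSON array of strings")
    return parsed


# Both forms normalize to the same list.
typed_form = normalize_cmd_sketch(["grep", "-v", "#INFO"])
string_form = normalize_cmd_sketch('["grep", "-v", "#INFO"]')
```

Normalizing at the boundary like this is one way to keep older string-valued callers working while newer callers pass typed values.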

NumPy ndarray literals

You can now pass literal NumPy 1D and 2D float-typed ndarrays to functions that expect DataFrame columns with types array<double> and DenseMatrix, respectively. The Glow genome-wide association study documentation demonstrates the new usage.

Mean substitution function

Glow now provides a mean_substitute function to substitute missing values in an array with the mean of the non-missing values.
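The substitution described above can be illustrated with a small pure-Python sketch. It is not Glow's implementation and does not call Glow: the function name `mean_substitute_sketch`, the `missing_value` parameter with its default of -1, and the choice to return the input unchanged when every value is missing are all assumptions made for this example.

```python
import math
from typing import List, Optional


def mean_substitute_sketch(values: List[Optional[float]],
                           missing_value: float = -1.0) -> List[Optional[float]]:
    """Replace missing entries with the mean of the non-missing entries.

    Assumption: an entry counts as missing if it is None, NaN, or equal
    to `missing_value` (a hypothetical sentinel, -1 by default).
    """
    def is_missing(v: Optional[float]) -> bool:
        if v is None:
            return True
        if isinstance(v, float) and math.isnan(v):
            return True
        return v == missing_value

    present = [float(v) for v in values if not is_missing(v)]
    if not present:
        # Sketch's choice: nothing to average, so return the input as-is.
        return list(values)

    mean = sum(present) / len(present)
    return [mean if is_missing(v) else float(v) for v in values]


# Genotype-like example: the third entry is missing.
filled = mean_substitute_sketch([0, 1, -1, 2])
```

Here the non-missing values 0, 1, and 2 have mean 1.0, so the missing entry is filled with 1.0.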

Improvements

Joint genotyping performance

The performance of the joint genotyping pipeline has improved by 5-20%. The improvement is particularly pronounced on cluster node types with many cores per node.

VCF reader ignores tabix index files

In previous releases, the VCF reader could fail when reading a directory of VCF files if the directory contained tabix index files. The reader would attempt to interpret the tabix files as VCF files and report an error. Now, the reader only uses index files to determine which data files to read.
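The distinction between data files and index files can be sketched as a simple path filter. This is an illustration of the idea only, not the reader's actual logic: the helper name `split_data_and_index` is hypothetical, and the sketch assumes that tabix index files are identified by a `.tbi` suffix.

```python
from typing import Iterable, List, Tuple

# Assumption: tabix index files carry this suffix.
TABIX_SUFFIX = ".tbi"


def split_data_and_index(paths: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Separate a directory listing into data files and tabix index files."""
    data_files: List[str] = []
    index_files: List[str] = []
    for path in paths:
        if path.endswith(TABIX_SUFFIX):
            index_files.append(path)  # never parsed as VCF
        else:
            data_files.append(path)
    return data_files, index_files


listing = ["chr1.vcf.gz", "chr1.vcf.gz.tbi", "chr2.vcf"]
data, indexes = split_data_and_index(listing)
```

Only the entries in `data` would be handed to a VCF parser; the entries in `indexes` are kept aside instead of being misread as variant data.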

Removed splitToBiallelic option from VCF reader

This option has been removed in favor of the split_multiallelics transformer. The transformer is faster and more accurate than the VCF reader option.
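To show what splitting a multiallelic variant means, the following pure-Python sketch turns one record with several alternate alleles into one record per alternate allele. It is a deliberately simplified illustration, not the split_multiallelics transformer: the function name `split_multiallelic_sketch` is hypothetical, the record is a plain dictionary, and the sketch ignores the genotype recoding, INFO field handling, and allele trimming that a real splitter has to perform.

```python
from typing import Any, Dict, List


def split_multiallelic_sketch(variant: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Emit one biallelic record per alternate allele (alleles only)."""
    alts = variant["alternateAlleles"]
    was_multiallelic = len(alts) > 1
    return [
        {
            **variant,
            "alternateAlleles": [alt],
            # Flag rows that originated from a multiallelic site.
            "splitFromMultiAllelic": was_multiallelic,
        }
        for alt in alts
    ]


site = {
    "contigName": "1",
    "start": 100,
    "referenceAllele": "A",
    "alternateAlleles": ["C", "T"],
}
rows = split_multiallelic_sketch(site)
```

Each output row keeps the original position and reference allele and carries exactly one alternate allele.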

Libraries

The following sections list the libraries included in Databricks Runtime 7.0 for Genomics that differ from those included in Databricks Runtime 7.0.

Upgraded libraries

  • ADAM: 0.30.0 to 0.32.0

Removed libraries

Hail is not included in Databricks Runtime 7.0 for Genomics because there is no Hail release based on Apache Spark 3.0.

Packaged libraries

Library      Version
ADAM         0.32.0
GATK         4.1.4.1
Hadoop-bam   7.9.2
samtools     1.9
VEP          96