RNASeq pipeline

Note

The following library versions are packaged in Databricks Runtime 7.0 for Genomics. For libraries included in earlier versions of Databricks Runtime for Genomics, see the release notes.

The Databricks RNASeq pipeline handles short read alignment and quantification using STAR v2.6.1a and ADAM v0.32.0.

Setup

The pipeline runs as an Azure Databricks job. You can set up a cluster policy to save the configuration:

{
  "num_workers": {
    "type": "unlimited",
    "defaultValue": 13
  },
  "node_type_id": {
    "type": "unlimited",
    "defaultValue": "Standard_F32s_v2"
  },
  "spark_env_vars.refGenomeId": {
    "type": "unlimited",
    "defaultValue": "grch38_star"
  },
  "spark_version": {
    "type": "regex",
    "pattern": ".*-hls.*",
    "defaultValue": "7.0.x-hls-scala2.12"
  }
}
  • The task should be the RNASeq notebook provided at the bottom of this page.
  • For best performance, use compute-optimized VMs with at least 60 GB of memory. We recommend Standard_F32s_v2 VMs.
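
The `spark_version` entry in the policy above restricts clusters to runtimes whose version string matches `.*-hls.*`. As a minimal sketch, the check can be reproduced locally in Python. The function name `is_genomics_runtime` is hypothetical, and the use of a whole-string match is an assumption about how the policy applies the pattern.

```python
import re

# Pattern copied from the "spark_version" entry of the cluster policy.
HLS_PATTERN = r".*-hls.*"


def is_genomics_runtime(spark_version: str) -> bool:
    """Return True if the version string matches the policy's regex.

    Assumption: the pattern is applied to the whole string, so this
    sketch uses re.fullmatch rather than re.search.
    """
    return re.fullmatch(HLS_PATTERN, spark_version) is not None


print(is_genomics_runtime("7.0.x-hls-scala2.12"))  # True
print(is_genomics_runtime("7.0.x-scala2.12"))      # False
```

Because the pattern begins and ends with `.*`, a whole-string match and a substring search give the same result here; the distinction only matters for stricter patterns.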

Parameters

The pipeline accepts a number of parameters that control its behavior. The most important and commonly changed parameters are documented here; the rest can be found in the RNASeq notebook. Every parameter can be set either for all runs or for an individual run.

| Parameter | Default | Description |
|---|---|---|
| manifest | n/a | The manifest describing the input. |
| output | n/a | The path where pipeline output should be written. |
| replayMode | skip | One of: skip (stages are skipped if output already exists) or overwrite (existing output is deleted). |
| perSampleTimeout | 12h | A timeout applied per sample. After reaching this timeout, the pipeline continues on to the next sample. The value must include a timeout unit: 's' for seconds, 'm' for minutes, or 'h' for hours. For example, '60m' results in a timeout of 60 minutes. |
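
The perSampleTimeout format (a number followed by 's', 'm', or 'h') can be illustrated with a small parser. This is a sketch for checking values before submitting a run, not the pipeline's own parsing code. The function name `parse_timeout` and the restriction to whole-number amounts are assumptions.

```python
import re

# Seconds per unit, following the units documented for perSampleTimeout.
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}


def parse_timeout(value: str) -> int:
    """Convert a timeout such as '60m' or '12h' into seconds.

    Assumption: the amount is a non-negative whole number directly
    followed by exactly one unit character.
    """
    match = re.fullmatch(r"(\d+)([smh])", value.strip())
    if match is None:
        raise ValueError(
            f"invalid timeout {value!r}: expected a number followed by s, m, or h"
        )
    amount, unit = match.groups()
    return int(amount) * _UNIT_SECONDS[unit]


print(parse_timeout("60m"))  # 3600
print(parse_timeout("12h"))  # 43200
```

A value without a unit, such as '60', raises a ValueError in this sketch, mirroring the requirement that the unit must be present.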

In addition, you must configure the reference genome using environment variables. To use GRCh37, set the environment variable:

refGenomeId=grch37_star

To use GRCh38 instead, set the environment variable as follows:

refGenomeId=grch38_star
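
As a minimal sketch, a notebook cell could confirm that the environment variable is set to one of the two values shown above before any work starts. The helper name `resolve_ref_genome` is hypothetical, and limiting the accepted values to `grch37_star` and `grch38_star` is an assumption based only on the values documented here.

```python
import os

# The two reference genome IDs documented on this page (assumed to be
# the only accepted values for the purposes of this sketch).
SUPPORTED_REF_GENOMES = {"grch37_star", "grch38_star"}


def resolve_ref_genome(env=None) -> str:
    """Return the configured refGenomeId or raise a descriptive error."""
    env = os.environ if env is None else env
    value = env.get("refGenomeId")
    if value is None:
        raise ValueError("refGenomeId is not set")
    if value not in SUPPORTED_REF_GENOMES:
        raise ValueError(f"unsupported refGenomeId: {value!r}")
    return value


# Passing a plain dict makes the check easy to try outside a cluster.
print(resolve_ref_genome({"refGenomeId": "grch38_star"}))  # grch38_star
```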

Walkthrough

The pipeline consists of two steps:

  1. Alignment: Map each short read to the reference genome using the STAR aligner.
  2. Quantification: Count how many reads correspond to each reference transcript.
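
To make the quantification step concrete, the toy example below counts reads per transcript from a list of already-aligned reads. It is an illustration of the idea only, not the pipeline's implementation: the input format, the transcript IDs, and the function name `count_reads_per_transcript` are all hypothetical, and real quantification also has to deal with cases such as reads that map to several transcripts.

```python
from collections import Counter


def count_reads_per_transcript(alignments):
    """Count reads assigned to each transcript.

    `alignments` is an iterable of (read_id, transcript_id) pairs, where
    transcript_id is None for reads that did not align to any transcript.
    """
    counts = Counter()
    for _read_id, transcript_id in alignments:
        if transcript_id is not None:
            counts[transcript_id] += 1
    return counts


# Hypothetical toy input: four reads, one of them unaligned.
toy_alignments = [
    ("read1", "tx_A"),
    ("read2", "tx_A"),
    ("read3", "tx_B"),
    ("read4", None),
]
print(dict(count_reads_per_transcript(toy_alignments)))  # {'tx_A': 2, 'tx_B': 1}
```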

Additional usage info and troubleshooting

The operational aspects of the RNASeq pipeline are very similar to those of the DNASeq pipeline. For more information about manifest format, output structure, programmatic usage, and common issues, see DNASeq pipeline.

RNASeq pipeline notebook

Get notebook