Hail 0.2

Note

Hail is not yet supported on Apache Spark 3.0 and is therefore not available in Databricks Runtime 7.x for Genomics. Hail is supported in all releases of Databricks Runtime 6.x for Genomics.

Hail is a library built on Apache Spark for analyzing large genomic datasets. Hail 0.2 is integrated into Databricks Runtime for Genomics.

Create a Hail cluster

To create a cluster with Hail installed:

  1. Set the following environment variable:

    ENABLE_HAIL=true
    

    This environment variable causes the cluster to launch with Hail 0.2, its dependencies, and Python 3.6 installed.
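As an illustration of where this variable could live in a scripted cluster definition, the sketch below builds a cluster specification as a Python dictionary and serializes it to JSON. It assumes the cluster is created through a JSON specification that accepts a `spark_env_vars` map. The cluster name, runtime version key, and node type are placeholders that are not taken from this page and must be replaced with values valid for your workspace.

```python
import json

# Assumption: a JSON cluster specification with a "spark_env_vars" map.
# Every value below except ENABLE_HAIL is a placeholder, not a real setting.
cluster_spec = {
    "cluster_name": "hail-cluster",                    # hypothetical name
    "spark_version": "<genomics-runtime-6.x-key>",     # placeholder, fill in
    "node_type_id": "<node-type-id>",                  # placeholder, fill in
    "num_workers": 2,                                  # arbitrary example size
    # The one setting this page requires: launch the cluster with Hail 0.2.
    "spark_env_vars": {"ENABLE_HAIL": "true"},
}

# Serialize the specification so it can be reviewed or submitted by other tooling.
payload = json.dumps(cluster_spec, indent=2)
print(payload)
```

The same variable can be set by hand in the cluster configuration UI; the dictionary form is only convenient when cluster definitions are kept under version control.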

Use Hail in a notebook

For the most part, Hail 0.2 code in Azure Databricks works as described in the Hail documentation. However, the Azure Databricks environment requires a few modifications.

Initialization

When initializing Hail, pass in the pre-created SparkContext and mark the initialization as idempotent. This setting enables multiple Azure Databricks notebooks to use the same Hail context.

Note

Enable skip_logging_configuration to save logs to the rolling driver log4j output. This setting is supported only in Databricks Runtime 6.6 for Genomics.

import hail as hl
hl.init(sc, idempotent=True, quiet=True, skip_logging_configuration=True)
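Because skip_logging_configuration is supported only in Databricks Runtime 6.6 for Genomics, a notebook shared across runtime versions may want to add the flag conditionally. The following is a minimal sketch, not part of the official guidance. The helper name `hail_init_kwargs` is hypothetical, and the sketch assumes the runtime version is available as a string such as "6.6" in the `DATABRICKS_RUNTIME_VERSION` environment variable.

```python
import os

def hail_init_kwargs(runtime_version):
    """Build keyword arguments for hl.init from a runtime version string.

    Hypothetical helper: adds skip_logging_configuration only for 6.6.
    """
    kwargs = {"idempotent": True, "quiet": True}
    # Compare only the major and minor components, e.g. "6.6" or "6.6.x".
    if runtime_version.split(".")[:2] == ["6", "6"]:
        kwargs["skip_logging_configuration"] = True
    return kwargs

# Assumption: the variable holds the runtime version; empty string if unset.
runtime = os.environ.get("DATABRICKS_RUNTIME_VERSION", "")
init_kwargs = hail_init_kwargs(runtime)

# In a notebook, where sc is the pre-created SparkContext:
#   import hail as hl
#   hl.init(sc, **init_kwargs)
```

On any other runtime the helper simply omits the flag, so the call reduces to `hl.init(sc, idempotent=True, quiet=True)`.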

Plotting

Hail uses the Bokeh library to create plots. The show function built into Bokeh does not work in Azure Databricks. To display a Bokeh plot generated by Hail, render it to HTML and pass the result to displayHTML, for example:

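from bokeh.embed import file_html
from bokeh.resources import CDN

# Assumes hl is already imported and mt is a loaded MatrixTable with a DP field
plot = hl.plot.histogram(mt.DP, range=(0, 30), bins=30, title='DP Histogram', legend='DP')
# Render the plot as a standalone HTML document and display it in the notebook
html = file_html(plot, CDN, "Chart")
displayHTML(html)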

See Bokeh for more information.

Limitations

  • When Hail support is enabled, your cluster uses Python 3.6, so notebooks written against different versions of Python may not work.
  • When Hail support is enabled, fewer Python libraries are installed by default. You can still use the Libraries feature to install new libraries.
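Both limitations can be checked at the top of a notebook before any analysis runs. The sketch below uses only the Python standard library. The function names and the example module list are hypothetical and should be adapted to what the notebook actually imports.

```python
import importlib.util
import sys

def python_matches(expected=(3, 6), actual=None):
    """Return True if the interpreter's major.minor version equals expected."""
    actual = sys.version_info if actual is None else actual
    return tuple(actual[:2]) == tuple(expected)

def missing_modules(names):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# Hypothetical list of top-level modules this notebook depends on.
required = ["json", "hail", "bokeh"]

if not python_matches():
    print("Warning: this cluster is not running Python 3.6")

absent = missing_modules(required)
if absent:
    print("Install these with the Libraries feature:", ", ".join(absent))
```

`missing_modules` is intended for top-level names only; a dotted name whose parent package is absent raises an error instead of being reported as missing.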

After you've set up a Hail cluster, try out the Hail overview notebook.

Hail overview notebook

Get notebook