TensorFlow

TensorFlow is an open-source framework for machine learning created by Google. It supports deep learning and general numerical computations on CPUs, GPUs, and clusters of GPUs. It is subject to the terms and conditions of the Apache 2.0 License.

The following sections provide guidance on installing TensorFlow on Azure Databricks and give an example of running TensorFlow programs.

Note

This article is not a comprehensive guide to TensorFlow. For that, see the TensorFlow website.

TensorFlow versions included in Databricks Runtime ML

Databricks Runtime for Machine Learning includes TensorFlow and TensorBoard, so you can use these libraries without installing any packages. The included TensorFlow versions are:

Databricks Runtime ML Version    TensorFlow Version
7.3                              2.3.0
7.0 - 7.2                        2.2.0
6.3 - 6.6                        1.15.0

Install TensorFlow

This section provides instructions for installing or downgrading TensorFlow on Databricks Runtime for Machine Learning and Databricks Runtime, so that you can try out the latest features in TensorFlow. Because of package dependencies, the installed version might be incompatible with other pre-installed packages. After installation, you can verify the installed version by running the following command in a Python notebook:

import tensorflow as tf
print([tf.__version__, tf.test.is_gpu_available()])
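
To turn the version check into a pass/fail test, the reported version string can be compared against the pin passed to %pip. The following is a minimal sketch, not an official utility: version_matches is a hypothetical helper name, and it treats a pin such as 2.3.* as a shell-style wildcard, which only approximates how pip interprets ==2.3.*.

```python
from fnmatch import fnmatchcase


def version_matches(installed, pin):
    """Return True if a version string satisfies a wildcard pin like '2.3.*'.

    Hypothetical helper: shell-style wildcard matching is an approximation
    of pip's '==2.3.*' rule, good enough for a quick sanity check.
    """
    return fnmatchcase(installed, pin)


# In a notebook, pass tf.__version__ as the first argument.
print(version_matches("2.3.0", "2.3.*"))   # True
print(version_matches("1.15.3", "2.3.*"))  # False
```

In a notebook, something like assert version_matches(tf.__version__, "2.3.*") placed right after the install cell would fail loudly if the expected version did not take effect.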

Install TensorFlow 2.3 on Databricks Runtime 7.2

Azure Databricks recommends installing TensorFlow using %pip and %conda magic commands. In a notebook, run:

%pip install tensorflow-cpu==2.3.*

Install TensorFlow 1.15 on Databricks Runtime 7.2

In a notebook, run:

%pip install tensorflow-cpu==1.15.*

Install TensorFlow 2.3 on Databricks Runtime 7.2 ML

In a notebook, run:

CPU

%pip install tensorflow-cpu==2.3.*

GPU

%pip install tensorflow-gpu==2.3.*

Install TensorFlow 1.15 on Databricks Runtime 7.2 ML

In a notebook, run:

CPU

%pip install tensorflow-cpu==1.15.*

GPU

The official TensorFlow 1.15 release is built against CUDA 10.0, which is not compatible with the CUDA 10.1 installed in Databricks Runtime 7.0 ML and above. Azure Databricks provides a custom build of TensorFlow 1.15.3 that is compatible with CUDA 10.1. Use the following command to install it:

%pip install https://databricks-prod-cloudfront.cloud.databricks.com/artifacts/tensorflow/runtime-7.x/tensorflow-1.15.3-cp37-cp37m-linux_x86_64.whl

Install TensorFlow 2.3 on Databricks Runtime 5.5 LTS ML

Use the init script that matches the cluster type:

CPU

#!/bin/bash

set -e

/databricks/python/bin/python -V
. /databricks/conda/etc/profile.d/conda.sh
conda activate /databricks/python

pip install --upgrade pip
pip install tensorflow-cpu==2.3.* setuptools==41.* grpcio==1.24.*

GPU

#!/bin/bash

set -e

apt-get remove -y --auto-remove cuda-toolkit-10-0
apt-get update
apt-get install -y --no-install-recommends --allow-downgrades \
  libnccl2=2.4.8-1+cuda10.1 \
  libnccl-dev=2.4.8-1+cuda10.1 \
  cuda-libraries-10-1 \
  libcudnn7=7.6.4.38-1+cuda10.1 \
  libcudnn7-dev=7.6.4.38-1+cuda10.1 \
  libcublas10=10.2.1.243-1 \
  libcublas-dev=10.2.1.243-1 \
  cuda-libraries-dev-10-1 \
  cuda-compiler-10-1
ln -sfn cuda-10.1 /usr/local/cuda

/databricks/python/bin/python -V
. /databricks/conda/etc/profile.d/conda.sh
conda activate /databricks/python

pip install --upgrade pip
pip install tensorflow==2.3.* setuptools==41.* grpcio==1.24.*

Install TensorFlow 2.3 on Databricks Runtime 5.5 LTS

Use the init script that matches the cluster type:

CPU

#!/bin/bash

set -e

/databricks/python/bin/python -V
/databricks/python/bin/pip install tensorflow-cpu==2.3.* setuptools==41.* pyasn1==0.4.6
/databricks/python/bin/pip uninstall -y numpy
rm -rf /databricks/python/lib/python3.5/site-packages/numpy
/databricks/python/bin/pip install numpy==1.18.4

GPU

#!/bin/bash

set -e

apt-get update
apt-get install -y gnupg-curl

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/cuda-repo-ubuntu1604_10.1.243-1_amd64.deb
dpkg -i cuda-repo-ubuntu1604_10.1.243-1_amd64.deb
apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/7fa2af80.pub

wget http://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1604/x86_64/nvidia-machine-learning-repo-ubuntu1604_1.0.0-1_amd64.deb
dpkg -i nvidia-machine-learning-repo-ubuntu1604_1.0.0-1_amd64.deb

apt-get update
apt-get install -y --no-install-recommends --allow-downgrades \
  libnccl2=2.4.8-1+cuda10.1 \
  libnccl-dev=2.4.8-1+cuda10.1 \
  cuda-libraries-10-1 \
  libcudnn7=7.6.4.38-1+cuda10.1 \
  libcudnn7-dev=7.6.4.38-1+cuda10.1 \
  libcublas10=10.2.1.243-1 \
  libcublas-dev=10.2.1.243-1 \
  cuda-libraries-dev-10-1 \
  cuda-compiler-10-1
ln -sfn cuda-10.1 /usr/local/cuda

/databricks/python/bin/python -V
/databricks/python/bin/pip install tensorflow==2.3.* setuptools==41.*
/databricks/python/bin/pip uninstall -y numpy
rm -rf /databricks/python/lib/python3.5/site-packages/numpy
/databricks/python/bin/pip install numpy==1.18.4

TensorFlow 2 known issues

TensorFlow 2 has a known incompatibility with Python pickling. You might encounter it if you use PySpark, HorovodRunner, Hyperopt, or any other package that depends on pickling. The workaround is to import TensorFlow modules explicitly inside your functions, so that the module is imported where the function runs rather than pickled together with it. Here is an example:

import tensorflow as tf

def bad_func(_):
  tf.keras.Sequential()

# You might see an error.
sc.parallelize(range(0)).foreach(bad_func)

def good_func(_):
  import tensorflow as tf
  tf.keras.Sequential()

# No error.
sc.parallelize(range(0)).foreach(good_func)

Install TensorFlow 1.15 on Databricks Runtime 5.5 LTS ML

Azure Databricks recommends installing TensorFlow 1.15 on Databricks Runtime 5.5 LTS ML using an init script.

Use the init script that matches the cluster type:

CPU

#!/bin/bash

set -e

/databricks/python/bin/python -V
. /databricks/conda/etc/profile.d/conda.sh
conda install -y conda=4.6
conda activate /databricks/python

conda install -y tensorflow-mkl=1.15 setuptools=41

GPU

#!/bin/bash

set -e

/databricks/python/bin/python -V
. /databricks/conda/etc/profile.d/conda.sh
conda install -y conda=4.6
conda activate /databricks/python

conda install -y tensorflow-gpu=1.15 setuptools=41

Install TensorFlow 1.15 on Databricks Runtime 5.5 LTS

Azure Databricks recommends installing TensorFlow 1.15 on Databricks Runtime 5.5 LTS using an init script.

Use the init script that matches the cluster type:

CPU

#!/bin/bash

set -e

/databricks/python/bin/python -V
/databricks/python/bin/pip install tensorflow-cpu==1.15.* setuptools==41.*

GPU

#!/bin/bash

set -e

apt-get update
apt-get install -y gnupg-curl

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/cuda-repo-ubuntu1604_10.0.130-1_amd64.deb
dpkg -i cuda-repo-ubuntu1604_10.0.130-1_amd64.deb
apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/7fa2af80.pub

wget http://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1604/x86_64/nvidia-machine-learning-repo-ubuntu1604_1.0.0-1_amd64.deb
dpkg -i nvidia-machine-learning-repo-ubuntu1604_1.0.0-1_amd64.deb

apt-get update
apt-get install -y --no-install-recommends cuda-libraries-10-0 libcudnn7=7.4.2.24-1+cuda10.0

/databricks/python/bin/python -V
/databricks/python/bin/pip install tensorflow-gpu==1.15.* setuptools==41.*

TensorBoard

TensorBoard is a suite of visualization tools for debugging, optimizing, and understanding TensorFlow, PyTorch, and other machine learning programs.

Use TensorBoard

Use TensorBoard on Databricks Runtime 7.2 and above

Starting TensorBoard in Azure Databricks is no different from starting it in a Jupyter notebook on your local computer.

  1. Load the %tensorboard magic command and define your log directory.

    %load_ext tensorboard
    experiment_log_dir = <log-directory>
    
  2. Invoke the %tensorboard magic command.

    %tensorboard --logdir $experiment_log_dir
    

    The TensorBoard server starts and displays the user interface inline in the notebook. It also provides a link to open TensorBoard in a new tab.

    The following screenshot shows the TensorBoard UI started in a populated log directory.

    [Screenshot: TensorBoard UI]

You can also start TensorBoard by using TensorBoard's notebook module directly.

from tensorboard import notebook
notebook.start("--logdir {}".format(experiment_log_dir))

Use TensorBoard on Databricks Runtime 7.1 and below

To start TensorBoard from your notebook, use the dbutils.tensorboard utility.

dbutils.tensorboard.start("/tmp/tensorflow_log_dir")

This command displays a link that, when clicked, opens TensorBoard in a new tab.

When started using this API, TensorBoard continues to run until you either stop it with dbutils.tensorboard.stop() or shut down your cluster.

Note

If you attach TensorFlow to your cluster as an Azure Databricks library, you may need to reattach your notebook before starting TensorBoard.

TensorBoard logs and directories

TensorBoard visualizes your machine learning programs by reading logs generated by TensorBoard callbacks and functions in TensorBoard or PyTorch. To generate logs for other machine learning libraries, you can write logs directly using TensorFlow file writers (see Module: tf.summary for TensorFlow 2.x and Module: tf.compat.v1.summary for the older API in TensorFlow 1.x).

To make sure that your experiment logs are reliably stored, Azure Databricks recommends writing logs to DBFS (that is, a log directory under /dbfs/) rather than to the ephemeral cluster file system. For each experiment, start TensorBoard in a unique directory. For each run of the machine learning code in the experiment that generates logs, set the TensorBoard callback or file writer to write to a subdirectory of the experiment directory. That way, the data in the TensorBoard UI is separated into runs.
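
The one-directory-per-experiment, one-subdirectory-per-run layout described above can be sketched as follows. This is an illustration under stated assumptions: new_run_dir is a hypothetical helper, the timestamp-based run name is just one possible naming scheme, and the demo writes to a temporary directory where a real experiment would use a path of your choosing under /dbfs/.

```python
import os
import tempfile
from datetime import datetime


def new_run_dir(experiment_log_dir, run_name=None):
    """Create and return a fresh subdirectory for one run of an experiment.

    Hypothetical helper; the default run name is a timestamp so that
    repeated runs do not write into the same directory.
    """
    if run_name is None:
        run_name = datetime.now().strftime("run-%Y%m%d-%H%M%S-%f")
    run_dir = os.path.join(experiment_log_dir, run_name)
    # exist_ok=False makes an accidental reuse of a run directory fail loudly.
    os.makedirs(run_dir, exist_ok=False)
    return run_dir


# Demo against a temporary directory (stand-in for a directory under /dbfs/).
experiment_log_dir = tempfile.mkdtemp(prefix="tb-experiment-")
first = new_run_dir(experiment_log_dir, "run-1")
second = new_run_dir(experiment_log_dir, "run-2")
print(sorted(os.listdir(experiment_log_dir)))  # ['run-1', 'run-2']
```

Each returned run directory would then be passed as the log directory of the TensorBoard callback or file writer for that run, while TensorBoard itself is started on experiment_log_dir.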

To get started using TensorBoard to log information for your machine learning program, read the official TensorBoard documentation.

Manage TensorBoard processes

TensorBoard processes started within an Azure Databricks notebook are not terminated when the notebook is detached or the REPL is restarted (for example, when you clear the state of the notebook). To manually kill a TensorBoard process, send it a termination signal using %sh kill -15 pid. Improperly killed TensorBoard processes may corrupt notebook.list().

To list the TensorBoard servers currently running on your cluster, with their corresponding log directories and process IDs, run notebook.list() from the TensorBoard notebook module.
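
The same termination signal can be sent from Python instead of a %sh cell. The sketch below assumes a POSIX system; terminate_process is a hypothetical helper name, and a sleeping child process stands in for a TensorBoard server so that the example is self-contained.

```python
import os
import signal
import subprocess
import sys


def terminate_process(pid):
    """Send SIGTERM (signal 15) to a process, equivalent to `kill -15 pid`."""
    os.kill(pid, signal.SIGTERM)


# Demo: a sleeping child process stands in for a TensorBoard server.
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
terminate_process(child.pid)
child.wait(timeout=10)

# On POSIX, a negative return code is the number of the signal that
# ended the process.
print(child.returncode)
```

For a real TensorBoard server, the process ID would come from the notebook.list() output rather than from a child process you started yourself.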

Known issues

  • The inline TensorBoard UI is inside an iframe. Browser security features prevent external links within the UI from working unless you open the link in a new tab.
  • The --window_title option of TensorBoard is overridden on Azure Databricks.
  • By default, TensorBoard scans a port range to select a port to listen on. If too many TensorBoard processes are running on the cluster, all ports in the port range may be unavailable. You can work around this limitation by specifying a port number with the --port argument. The specified port should be between 6006 and 6106.
  • For download links to work, you should open TensorBoard in a separate tab.
  • When using TensorBoard 1.15.0, the Projector tab is blank. As a workaround, to visit the projector page directly, replace #projector in the URL with data/plugin/projector/projector_binary.html.
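
For the port-range issue listed above, a free port can be picked before starting TensorBoard. This is a minimal sketch rather than anything TensorBoard provides: find_free_port is a hypothetical helper that tries to bind each port in the 6006 to 6106 range, and another process could still claim the port between this check and the TensorBoard start.

```python
import socket


def find_free_port(start=6006, end=6106):
    """Return the first port in [start, end] that can be bound, or None.

    Hypothetical helper; the check is best-effort because the port may be
    taken by another process right after it is released here.
    """
    for port in range(start, end + 1):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            try:
                sock.bind(("", port))
            except OSError:
                continue  # port is in use; try the next one
            return port
    return None


port = find_free_port()
print(port)
```

The resulting number is what you would pass to the --port argument when starting TensorBoard. A result of None means every port in the range is taken, in which case stopping unused TensorBoard processes is the next step.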

Use TensorFlow on a single node

To test and migrate single-machine TensorFlow workflows, you can start with a driver-only cluster on Azure Databricks by setting the number of workers to zero. Although Apache Spark is not functional under this setting, it is a cost-effective way to run single-machine TensorFlow workflows. The following notebook shows how to run TensorFlow (1.x and 2.x) with TensorBoard monitoring on a driver-only cluster.

TensorFlow 1.15/2.x notebook
