July 2019

These features and Azure Databricks platform improvements were released in July 2019.

Releases are staged. Your Azure Databricks account may not be updated until up to a week after the initial release date.

Coming soon: Databricks 6.0 will not support Python 2

In anticipation of the upcoming end of life of Python 2, announced for 2020, Python 2 will not be supported in Databricks Runtime 6.0. Earlier versions of Databricks Runtime will continue to support Python 2. We expect to release Databricks Runtime 6.0 later in 2019.

Preload the Databricks Runtime version on pool idle instances

July 30 - Aug 6, 2019: Version 2.103

You can now speed up pool-backed cluster launches by selecting a Databricks Runtime version to be loaded on idle instances in the pool. The field on the Pool UI is called Preloaded Spark Version.

Preloaded Spark version
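For readers who create pools programmatically rather than through the UI, the sketch below builds a JSON request body that names a runtime version to preload. It is a minimal illustration, not a definitive client: the helper name `build_pool_request` is hypothetical, and the field names (`instance_pool_name`, `node_type_id`, `min_idle_instances`, `preloaded_spark_versions`), the node type, and the version key are assumptions that should be checked against the Instance Pools API reference for your workspace.

```python
import json


def build_pool_request(name, node_type_id, min_idle, runtime_version):
    """Build a request body for creating a pool (hypothetical helper).

    All field names below are assumptions; verify them against the
    Instance Pools API reference before use.
    """
    return {
        "instance_pool_name": name,
        "node_type_id": node_type_id,
        "min_idle_instances": min_idle,
        # Runtime version to preload on idle instances (assumed field name).
        "preloaded_spark_versions": [runtime_version],
    }


# Example values are placeholders, not recommendations.
body = build_pool_request("demo-pool", "Standard_DS3_v2", 2, "5.5.x-scala2.11")
print(json.dumps(body, indent=2))
```

The sketch only constructs the payload; sending it requires an authenticated HTTP client and is left out on purpose.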

Custom cluster tags and pool tags play better together

July 30 - Aug 6, 2019: Version 2.103

Earlier this month, Azure Databricks introduced pools, a set of idle instances that help you spin up clusters fast. In the original release, pool-backed clusters inherited default and custom tags from the pool configuration, and you could not modify these tags at the cluster level. Now you can configure custom tags specific to a pool-backed cluster, and that cluster will apply all custom tags, whether inherited from the pool or assigned to that cluster specifically. You cannot add a cluster-specific custom tag with the same key name as a custom tag inherited from a pool (that is, you cannot override a custom tag that is inherited from the pool). For details, see Pool tags.
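The tag rule described above can be modeled in a few lines. This is only an illustrative sketch of the documented behavior, not Azure Databricks code: the function name `effective_custom_tags` and the example tag keys are hypothetical, and default tags are not modeled.

```python
def effective_custom_tags(pool_tags, cluster_tags):
    """Return the custom tags a pool-backed cluster would apply.

    Illustrative model only: pool tags are inherited, cluster tags are
    added, and a cluster tag may not reuse a key inherited from the pool.
    """
    conflicts = sorted(set(pool_tags) & set(cluster_tags))
    if conflicts:
        raise ValueError(
            "cluster tags cannot override pool tags: " + ", ".join(conflicts)
        )
    merged = dict(pool_tags)      # inherited from the pool
    merged.update(cluster_tags)   # assigned to this cluster specifically
    return merged


tags = effective_custom_tags(
    {"team": "data-eng"},         # hypothetical pool tag
    {"project": "churn-model"},   # hypothetical cluster tag
)
print(tags)
```

Passing a cluster tag with the key `team` in this example would raise `ValueError`, mirroring the rule that inherited pool tags cannot be overridden.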

MLflow 1.1 brings several UI and API improvements

July 30 - Aug 6, 2019: Version 2.103

MLflow 1.1 introduces several new features to improve UI and API usability:

  • The runs overview UI now lets you browse through multiple pages of runs if the number of runs exceeds 100. After the 100th run, click the Load more button to load the next 100 runs.

    Paged runs

  • The compare runs UI now provides a parallel coordinates plot. The plot allows you to observe relationships between an n-dimensional set of parameters and metrics. It visualizes all runs as lines that are color-coded based on the value of a metric (for example, accuracy), and shows the parameter values that each run took on.

    Parallel coordinates plot

  • Now you can add and edit tags from the run overview UI and view tags in the experiment search view.

  • The new Java MLflowContext API lets you create and log runs in a way that is similar to the Python API. This API contrasts with the existing low-level MlflowClient API, which simply wraps the REST APIs.

  • You can now delete tags from MLflow runs using the DeleteTag API.
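As a rough illustration of the DeleteTag call mentioned in the list above, the sketch below builds (but does not send) an HTTP request using only the Python standard library. The endpoint path `/api/2.0/mlflow/runs/delete-tag`, the body fields `run_id` and `key`, the bearer-token header, and the helper name `build_delete_tag_request` are assumptions based on the MLflow REST API conventions; confirm them in the MLflow REST API reference before relying on them.

```python
import json
import urllib.request


def build_delete_tag_request(host, token, run_id, key):
    """Build a DeleteTag request without sending it (hypothetical helper).

    The path and field names are assumptions; check the MLflow REST API docs.
    """
    url = host.rstrip("/") + "/api/2.0/mlflow/runs/delete-tag"
    payload = json.dumps({"run_id": run_id, "key": key}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        method="POST",
        headers={
            "Authorization": "Bearer " + token,
            "Content-Type": "application/json",
        },
    )


# Placeholder host, token, run ID, and tag key.
req = build_delete_tag_request(
    "https://example.azuredatabricks.net", "<token>", "abc123", "stage"
)
print(req.get_method(), req.full_url)
```

To execute the call, the request would be passed to `urllib.request.urlopen`; the MLflow Python client offers a higher-level alternative if it is installed.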

For details, see the MLflow 1.1 blog post. For the complete list of features and fixes, see the MLflow 1.1 Changelog.

pandas DataFrame display renders like it does in Jupyter

July 30 - Aug 6, 2019: Version 2.103

Now when you display a pandas DataFrame in a notebook, it renders the same way as it does in Jupyter.

Display pandas DataFrame

New regions

July 30, 2019

Azure Databricks is now available in the following additional regions:

  • Korea Central
  • South Africa North

Databricks Runtime 5.5 with Conda (Beta)

July 23, 2019

Databricks Runtime with Conda is in Beta. The contents of the supported environments may change in upcoming Beta releases. Changes can include the list of packages or versions of installed packages. Databricks Runtime 5.5 with Conda is built on top of Databricks Runtime 5.5 LTS.

The Databricks Runtime 5.5 with Conda release adds a new notebook-scoped library API to support updating the notebook’s Conda environment with a YAML specification (see Conda documentation).

See the complete release notes at Databricks Runtime 5.5 with Conda (Beta).

Updated metastore connection limit

July 16 - 23, 2019: Version 2.102

New Azure Databricks workspaces in eastus, eastus2, centralus, westus, westus2, westeurope, and northeurope will have a higher metastore connection limit of 250. Existing workspaces will continue to use the current metastore with no disruption and continue to have a connection limit of 100.

Set permissions on pools (Public Preview)

July 16 - 23, 2019: Version 2.102

The pool UI now supports setting permissions on who can manage pools and who can attach clusters to pools.

For details, see Pool access control.

Databricks Runtime 5.5 for Machine Learning

July 15, 2019

Databricks Runtime 5.5 ML is built on top of Databricks Runtime 5.5 LTS. It contains many popular machine learning libraries, including TensorFlow, PyTorch, Keras, and XGBoost, and provides distributed TensorFlow training using Horovod.

This release includes the following new features and improvements:

  • Added the MLflow 1.0 Python package
  • Upgraded machine learning libraries
    • TensorFlow upgraded from 1.12.0 to 1.13.1
    • PyTorch upgraded from 0.4.1 to 1.1.0
    • scikit-learn upgraded from 0.19.1 to 0.20.3
  • Single-node operation for HorovodRunner

For details, see Databricks Runtime 5.5 LTS ML.

Databricks Runtime 5.5

July 15, 2019

Databricks Runtime 5.5 is now available. Databricks Runtime 5.5 includes Apache Spark 2.4.3, upgraded Python, R, Java, and Scala libraries, and the following new features:

  • Delta Lake on Azure Databricks Auto Optimize GA
  • Delta Lake on Azure Databricks improved min, max, and count aggregation query performance
  • Faster model inference pipelines with improved binary file data source and scalar iterator pandas UDF (Public Preview)
  • Secrets API in R notebooks

For details, see Databricks Runtime 5.5 LTS.

Keep a pool of instances on standby for quick cluster launch (Public Preview)

July 9 - 11, 2019: Version 2.101

To reduce cluster start time, Azure Databricks now supports attaching a cluster to a pre-defined pool of idle instances. When attached to a pool, a cluster allocates its driver and worker nodes from the pool. If the pool does not have sufficient idle resources to accommodate the cluster’s request, the pool expands by allocating new instances from the cloud provider. When an attached cluster is terminated, the instances it used are returned to the pool and can be reused by a different cluster.
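The allocation behavior described above can be pictured with a small toy model. This is a simplified sketch, not how the service is implemented: the class `InstancePool` and its methods are hypothetical, and real pools have additional settings (such as capacity limits and idle timeouts) that are not modeled here.

```python
class InstancePool:
    """Toy model of a pool of idle instances (illustrative only)."""

    def __init__(self, idle):
        self.idle = idle          # instances waiting in the pool
        self.in_use = 0           # instances held by attached clusters
        self.provisioned = 0      # instances newly requested from the cloud provider

    def acquire(self, count):
        """Allocate driver and worker nodes for a cluster."""
        from_idle = min(count, self.idle)
        newly_provisioned = count - from_idle   # pool expands if idle is not enough
        self.idle -= from_idle
        self.provisioned += newly_provisioned
        self.in_use += count
        return from_idle, newly_provisioned

    def release(self, count):
        """Return a terminated cluster's instances to the pool."""
        if count > self.in_use:
            raise ValueError("cannot release more instances than are in use")
        self.in_use -= count
        self.idle += count


pool = InstancePool(idle=3)
allocation = pool.acquire(5)   # one driver plus four workers
pool.release(5)                # the cluster is terminated
print(allocation, pool.idle)
```

In this run the cluster takes three instances from the idle set and the pool expands by two; after termination all five instances sit idle and can serve the next cluster.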

Azure Databricks does not charge DBUs while instances are idle in the pool. Instance provider billing does apply; see pricing.

For details, see Pools.

Ganglia metrics

July 9 - 11, 2019: Version 2.101

Ganglia is a scalable distributed monitoring system that is now available on Azure Databricks clusters. Ganglia metrics help you to monitor cluster performance and health. You can access Ganglia metrics from the cluster details page:

Ganglia Metrics tab

For details on using and configuring Ganglia metrics, see Ganglia metrics.

Global series color

July 9 - 11, 2019: Version 2.101

You can now specify that the colors of a series should be consistent across all charts in your notebook. See Color consistency across charts.

Global series color