Azure Databricks Workspace concepts

This article introduces the fundamental concepts you need to understand to use the Azure Databricks Workspace effectively.

Workspace

The workspace is an environment for accessing all of your Azure Databricks assets. It organizes objects (notebooks, libraries, dashboards, and experiments) into folders and provides access to data objects and computational resources.

This section describes the objects contained in the Azure Databricks workspace folders.

Notebook

A web-based interface to documents that contain runnable commands, visualizations, and narrative text.

Dashboard

An interface that provides organized access to visualizations.

Library

A package of code available to the notebook or job running on your cluster. Databricks runtimes include many libraries, and you can add your own.

Experiment

A collection of MLflow runs for training a machine learning model.

Interface

This section describes the interfaces that Azure Databricks supports for accessing your assets: UI, API, and command-line (CLI).

UI

The Azure Databricks UI provides an easy-to-use graphical interface to workspace folders and their contained objects, data objects, and computational resources.

[Figure: Azure Databricks landing page]

REST API

There are two versions of the REST API: REST API 2.0 and REST API 1.2. REST API 2.0 is preferred: it supports most of the functionality of REST API 1.2, as well as additional functionality.
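As a rough illustration of what a REST API 2.0 call can look like, the sketch below builds (but does not send) a request to the `clusters/list` endpoint. The workspace URL and token are placeholders, the helper name `build_request` is hypothetical, and bearer-token authentication with a personal access token is assumed.

```python
import urllib.request

API_VERSION = "2.0"  # the preferred REST API version


def build_request(workspace_url: str, token: str, endpoint: str) -> urllib.request.Request:
    """Build an authenticated GET request for a REST API 2.0 endpoint (hypothetical helper)."""
    url = f"{workspace_url.rstrip('/')}/api/{API_VERSION}/{endpoint.lstrip('/')}"
    headers = {"Authorization": f"Bearer {token}"}
    return urllib.request.Request(url, headers=headers, method="GET")


# Placeholder values; substitute your own workspace URL and token.
request = build_request(
    "https://example-workspace.azuredatabricks.net",
    "<personal-access-token>",
    "clusters/list",
)
print(request.get_method(), request.full_url)
```

Sending the request (for example with `urllib.request.urlopen(request)`) is left out on purpose, because it needs a real workspace and a valid token.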

CLI

An open source project hosted on GitHub. The CLI is built on top of the REST API 2.0.

Data management

This section describes the objects that hold the data on which you perform analytics and that you feed into machine learning algorithms.

Databricks File System (DBFS)

A filesystem abstraction layer over a blob store. It contains directories, which can contain files (data files, libraries, and images) and other directories. DBFS is automatically populated with some datasets that you can use to learn Azure Databricks.
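To make the directory-and-file view concrete, here is a small sketch that translates a DBFS-style path into a local filesystem path. It assumes that DBFS paths are written with a `dbfs:/` scheme and that cluster nodes expose DBFS under a local `/dbfs` mount. The helper `to_local_path` and the example path are hypothetical.

```python
import posixpath

DBFS_SCHEME = "dbfs:"
LOCAL_MOUNT = "/dbfs"  # assumption: local mount point of DBFS on cluster nodes


def to_local_path(dbfs_path: str) -> str:
    """Translate 'dbfs:/dir/file' (or '/dir/file') into '/dbfs/dir/file'."""
    path = dbfs_path[len(DBFS_SCHEME):] if dbfs_path.startswith(DBFS_SCHEME) else dbfs_path
    # Normalize to a single leading slash and collapse '.' and '..' segments.
    normalized = posixpath.normpath("/" + path.lstrip("/"))
    return LOCAL_MOUNT if normalized == "/" else LOCAL_MOUNT + normalized


# Hypothetical path used purely for illustration.
print(to_local_path("dbfs:/data/raw/example.csv"))
```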

Database

A collection of information that is organized so that it can be easily accessed, managed, and updated.

Table

A representation of structured data. You query tables with Apache Spark SQL and Apache Spark APIs.

Metastore

The component that stores all the structure information of the various tables and partitions in the data warehouse, including column and column type information, the serializers and deserializers necessary to read and write data, and the corresponding files where the data is stored. Every Azure Databricks deployment has a central Hive metastore accessible by all clusters to persist table metadata. You also have the option to use an existing external Hive metastore.

Computation management

This section describes concepts that you need to know to run computations in Azure Databricks.

Cluster

A set of computation resources and configurations on which you run notebooks and jobs. There are two types of clusters: all-purpose and job.

  • You create an all-purpose cluster using the UI, CLI, or REST API. You can manually terminate and restart an all-purpose cluster. Multiple users can share such clusters to do collaborative interactive analysis.
  • The Azure Databricks job scheduler creates a job cluster when you run a job on a new job cluster and terminates the cluster when the job is complete. You cannot restart a job cluster.
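The lifecycle difference between the two cluster types can be sketched as a toy model. This is not the Azure Databricks API; the `Cluster` class and its methods are hypothetical and exist only to illustrate that an all-purpose cluster can be restarted after termination while a job cluster cannot.

```python
from dataclasses import dataclass
from enum import Enum


class ClusterType(Enum):
    ALL_PURPOSE = "all-purpose"
    JOB = "job"


@dataclass
class Cluster:
    """Toy model of a cluster's lifecycle (illustrative only)."""
    cluster_type: ClusterType
    running: bool = True

    def terminate(self) -> None:
        self.running = False

    def restart(self) -> None:
        # Job clusters are created and terminated by the job scheduler
        # and cannot be restarted afterwards.
        if self.cluster_type is ClusterType.JOB:
            raise RuntimeError("a job cluster cannot be restarted")
        self.running = True


shared = Cluster(ClusterType.ALL_PURPOSE)
shared.terminate()
shared.restart()  # allowed for all-purpose clusters
print(shared.running)
```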

Pool

A set of idle, ready-to-use instances that reduce cluster start and auto-scaling times. When attached to a pool, a cluster allocates its driver and worker nodes from the pool. If the pool does not have sufficient idle resources to accommodate the cluster's request, the pool expands by allocating new instances from the instance provider. When an attached cluster is terminated, the instances it used are returned to the pool and can be reused by a different cluster.
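The allocation behavior described above can be sketched with a small simulation. The `Pool` class below is a hypothetical model, not an Azure Databricks API. It only tracks counts: how many instances are idle and how many had to be requested from the instance provider.

```python
class Pool:
    """Toy model of an instance pool (illustrative only)."""

    def __init__(self, idle_instances: int) -> None:
        self.idle = idle_instances
        self.provisioned = 0  # instances requested from the instance provider

    def allocate(self, count: int) -> int:
        """Hand out `count` instances, expanding the pool if idle ones run short."""
        if count < 0:
            raise ValueError("count must be non-negative")
        shortfall = max(0, count - self.idle)
        self.provisioned += shortfall
        # Idle instances plus newly provisioned ones, minus those handed out.
        self.idle = self.idle + shortfall - count
        return count

    def release(self, count: int) -> None:
        """Return instances to the pool when an attached cluster terminates."""
        self.idle += count


pool = Pool(idle_instances=2)
first_cluster = pool.allocate(1 + 3)   # one driver and three workers: pool must expand by 2
pool.release(first_cluster)            # cluster terminated: instances become idle again
second_cluster = pool.allocate(1 + 2)  # served entirely from idle instances
print(pool.idle, pool.provisioned)
```

In this run the pool asks the provider for new instances only once; the second cluster reuses the instances returned by the first.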

Databricks runtime

The set of core components that run on the clusters managed by Azure Databricks. Azure Databricks offers several types of runtimes:

  • Databricks Runtime includes Apache Spark but also adds a number of components and updates that substantially improve the usability, performance, and security of big data analytics.
  • Databricks Runtime for Machine Learning is built on Databricks Runtime and provides a ready-to-go environment for machine learning and data science. It contains multiple popular libraries, including TensorFlow, Keras, PyTorch, and XGBoost.
  • Databricks Runtime for Genomics is a version of Databricks Runtime optimized for working with genomic and biomedical data.
  • Databricks Light is the Azure Databricks packaging of the open source Apache Spark runtime. It provides a runtime option for jobs that don't need the advanced performance, reliability, or autoscaling benefits provided by Databricks Runtime. You can select Databricks Light only when you create a cluster to run a JAR, Python, or spark-submit job; you cannot select this runtime for clusters on which you run interactive or notebook job workloads.

Job

A non-interactive mechanism for running a notebook or library either immediately or on a scheduled basis.

Workload

Azure Databricks identifies two types of workloads subject to different pricing schemes: data engineering (job) and data analytics (all-purpose).

  • Data engineering: An (automated) workload runs on a job cluster, which the Azure Databricks job scheduler creates for each workload.
  • Data analytics: An (interactive) workload runs on an all-purpose cluster. Interactive workloads typically run commands within an Azure Databricks notebook. However, running a job on an existing all-purpose cluster is also treated as an interactive workload.
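The classification rule above depends on the cluster a workload runs on, not on whether the workload is a job. A minimal sketch, using a hypothetical `workload_type` function and plain strings for the cluster types:

```python
def workload_type(cluster_type: str) -> str:
    """Map the cluster a workload runs on to its workload category (illustrative only)."""
    categories = {
        # Automated workload on a cluster created by the job scheduler.
        "job": "data engineering",
        # Interactive workload, including a job run on an existing all-purpose cluster.
        "all-purpose": "data analytics",
    }
    try:
        return categories[cluster_type]
    except KeyError:
        raise ValueError(f"unknown cluster type: {cluster_type!r}") from None


print(workload_type("job"))
print(workload_type("all-purpose"))
```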

Execution context

The state for a REPL environment for each supported programming language. The languages supported are Python, R, Scala, and SQL.

Model management

This section describes concepts that you need to know to train machine learning models.

Model

A mathematical function that represents the relationship between a set of predictors and an outcome. Machine learning consists of training and inference steps. You train a model using an existing dataset, and then use that model to predict the outcomes (inference) of new data.

Run

A collection of parameters, metrics, and tags related to training a machine learning model.

Experiment

The primary unit of organization and access control for runs; all MLflow runs belong to an experiment. An experiment lets you visualize, search, and compare runs, as well as download run artifacts or metadata for analysis in other tools.
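The relationship between runs and experiments can be sketched with plain data classes. This is a simplified model, not the MLflow API: the `Run` and `Experiment` classes, the `best_run` helper, and all parameter and metric names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Run:
    """One training attempt: its parameters, metrics, and tags."""
    params: dict[str, int]
    metrics: dict[str, float]
    tags: dict[str, str] = field(default_factory=dict)


@dataclass
class Experiment:
    """Groups runs so they can be searched and compared."""
    name: str
    runs: list[Run] = field(default_factory=list)

    def best_run(self, metric: str) -> Run:
        """Return the run with the highest value of `metric`."""
        candidates = [run for run in self.runs if metric in run.metrics]
        return max(candidates, key=lambda run: run.metrics[metric])


experiment = Experiment("example-experiment")
experiment.runs.append(Run(params={"max_depth": 3}, metrics={"accuracy": 0.91}))
experiment.runs.append(Run(params={"max_depth": 5}, metrics={"accuracy": 0.94}, tags={"owner": "alice"}))

best = experiment.best_run("accuracy")
print(best.params)
```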

Authentication and authorization

This section describes concepts that you need to know when you manage Azure Databricks users and their access to Azure Databricks assets.

User

A unique individual who has access to the system.

Group

A collection of users.

Access control list (ACL)

A list of permissions attached to a workspace, cluster, job, table, or experiment. An ACL specifies which users or system processes are granted access to the objects, as well as what operations are allowed on the assets. Each entry in a typical ACL specifies a subject and an operation.
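A minimal sketch of how an ACL made of subject and operation pairs might be evaluated. The `AclEntry` type, the `is_allowed` function, and the operation names are hypothetical, and the sketch assumes that a subject can be either a user or a group the user belongs to.

```python
from typing import Iterable, NamedTuple


class AclEntry(NamedTuple):
    subject: str    # a user or group name
    operation: str  # an operation permitted for that subject


def is_allowed(acl: Iterable[AclEntry], user: str, groups: Iterable[str], operation: str) -> bool:
    """Return True if the user, or any group the user belongs to, is granted the operation."""
    principals = {user, *groups}
    return any(entry.subject in principals and entry.operation == operation for entry in acl)


# Hypothetical ACL attached to a single notebook.
notebook_acl = [AclEntry("alice", "edit"), AclEntry("analysts", "read")]

print(is_allowed(notebook_acl, "bob", ["analysts"], "read"))  # True: granted through the group
print(is_allowed(notebook_acl, "bob", ["analysts"], "edit"))  # False: only alice may edit
```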