What is an Azure Machine Learning compute instance?

An Azure Machine Learning compute instance is a managed cloud-based workstation for data scientists.

Compute instances make it easy to get started with Azure Machine Learning development, and they provide management and enterprise readiness capabilities for IT administrators.

Use a compute instance as your fully configured and managed development environment in the cloud for machine learning. Compute instances can also be used as a compute target for training and inferencing during development and testing.

For production-grade model training, use an Azure Machine Learning compute cluster with multi-node scaling capabilities. For production-grade model deployment, use an Azure Kubernetes Service cluster.

Why use a compute instance?

A compute instance is a fully managed cloud-based workstation optimized for your machine learning development environment. It provides the following benefits:

Key benefit: Description
Productivity: You can build and deploy models using integrated notebooks and the following tools in Azure Machine Learning studio:
- Jupyter
- JupyterLab
- RStudio (preview)
Compute instance is fully integrated with the Azure Machine Learning workspace and studio. You can share notebooks and data with other data scientists in the workspace. You can also set up VS Code remote development using SSH.
Managed & secure: Reduce your security footprint and add compliance with enterprise security requirements. Compute instances provide robust management policies and secure networking configurations such as:
- Auto-provisioning from Resource Manager templates or the Azure Machine Learning SDK
- Role-based access control (RBAC)
- Virtual network support
- SSH policy to enable/disable SSH access
- TLS 1.2 enabled
Preconfigured for ML: Save time on setup tasks with preconfigured and up-to-date ML packages, deep learning frameworks, and GPU drivers.
Fully customizable: Broad support for Azure VM types, including GPUs, and persisted low-level customization such as installing packages and drivers make advanced scenarios straightforward.

Tools and environments

Important

Tools marked (preview) below are currently in public preview. The preview version is provided without a service level agreement, and it is not recommended for production workloads. Certain features might not be supported or might have constrained capabilities.

Azure Machine Learning compute instance enables you to author, train, and deploy models in a fully integrated notebook experience in your workspace.

These tools and environments are installed on the compute instance:

General tools & environments: Details
Drivers: CUDA, cuDNN, NVIDIA, Blob FUSE
Intel MPI library
Azure CLI
Azure Machine Learning samples
Docker
Nginx
NCCL 2.0
Protobuf

R tools & environments: Details
RStudio Server Open Source Edition (preview)
R kernel
Azure Machine Learning SDK for R: azuremlsdk, SDK samples

PYTHON tools & environments: Details
Anaconda Python
Jupyter and extensions
JupyterLab and extensions
Azure Machine Learning SDK for Python (from PyPI): Includes most of the azureml extra packages. To see the full list, open a terminal window on your compute instance and run:
conda list -n azureml_py36 azureml*
Other PyPI packages: jupytext, tensorboard, nbconvert, notebook, Pillow
Conda packages: cython, numpy, ipykernel, scikit-learn, matplotlib, tqdm, joblib, nodejs, nb_conda_kernels
Deep learning packages: PyTorch, TensorFlow, Keras, Horovod, MLflow, pandas-ml, scrapbook
ONNX packages: keras2onnx, onnx, onnxconverter-common, skl2onnx, onnxmltools
Azure Machine Learning Python & R SDK samples

Python packages are all installed in the Python 3.6 - AzureML environment.
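
As a complement to the conda command in the table above, the installed azureml packages can also be listed from inside Python. The following is a minimal sketch, not part of the product documentation. It assumes a kernel running Python 3.8 or later, because `importlib.metadata` is not available in the Python 3.6 environment named above; on that environment, use the conda command instead. The function name `installed_packages` is hypothetical.

```python
from importlib import metadata


def installed_packages(prefix="azureml"):
    """Return sorted (name, version) pairs for installed distributions
    whose name starts with the given prefix (case-insensitive)."""
    found = set()
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        # Skip distributions with missing metadata.
        if name and name.lower().startswith(prefix.lower()):
            found.add((name, dist.version))
    return sorted(found)


if __name__ == "__main__":
    for name, version in installed_packages():
        print(f"{name}=={version}")
```

On a machine without any azureml packages the function simply returns an empty list.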

Installing packages

You can install packages directly in a Jupyter notebook or RStudio:

  • RStudio: Use the Packages tab on the bottom right, or the Console tab on the top left.
  • Python: Add install code and execute it in a Jupyter notebook cell.
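
The Python option above can be sketched as follows. This is one possible form of "install code" for a notebook cell, assuming pip is available in the kernel's environment. Calling pip through `sys.executable` targets the interpreter that runs the current kernel, so the package lands in the same environment the notebook uses. The helper names and the package name `seaborn` are hypothetical examples.

```python
import subprocess
import sys


def pip_install_command(package):
    """Build a pip command bound to the interpreter of the running kernel."""
    return [sys.executable, "-m", "pip", "install", package]


def pip_install(package):
    """Install a package into the environment of the running kernel."""
    subprocess.check_call(pip_install_command(package))


# Example call, commented out so the cell has no side effects
# until you choose a package:
# pip_install("seaborn")
```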

Or you can access a terminal window in any of these ways:

  • RStudio: Select the Terminal tab on the top left.
  • JupyterLab: Select the Terminal tile under the Other heading in the Launcher tab.
  • Jupyter: Select New > Terminal on the top right in the Files tab.
  • SSH to the machine. Then install Python packages into the Python 3.6 - AzureML environment, and install R packages into the R environment.

Accessing files

Notebooks and R scripts are stored in the default storage account of your workspace in an Azure file share. These files are located under your "User files" directory. This storage makes it easy to share notebooks between compute instances. The storage account also keeps your notebooks safely preserved when you stop or delete a compute instance.

The Azure file share account of your workspace is mounted as a drive on the compute instance. This drive is the default working directory for Jupyter, JupyterLab, and RStudio. This means that the notebooks and other files you create in Jupyter, JupyterLab, or RStudio are automatically stored on the file share and are available to use in other compute instances as well.

The files in the file share are accessible from all compute instances in the same workspace. Any changes to these files on the compute instance are reliably persisted back to the file share.

You can also clone the latest Azure Machine Learning samples to your folder under the user files directory in the workspace file share.

Writing small files can be slower on network drives than writing to the local disk of the compute instance. If you are writing many small files, try using a directory directly on the compute instance, such as /tmp. Note that these files are not accessible from other compute instances.
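
One way to apply the advice above is to write the small files to local scratch space first and then copy a single archive to the file share. The sketch below illustrates that pattern only. The function name, file names, and the `share_dir` argument are hypothetical, and the path of the mounted file share on your compute instance is not specified here.

```python
import shutil
import tempfile
from pathlib import Path


def write_small_files_locally(count, share_dir, archive_name="results"):
    """Write many small files on local disk, then copy one archive to the share."""
    share_dir = Path(share_dir)
    share_dir.mkdir(parents=True, exist_ok=True)
    # tempfile uses the local temporary directory (typically /tmp on Linux).
    with tempfile.TemporaryDirectory() as scratch:
        work = Path(scratch) / "work"
        work.mkdir()
        for i in range(count):
            (work / f"part_{i:05d}.txt").write_text(f"record {i}\n")
        # Pack everything into one zip on local disk, so the network drive
        # receives a single large write instead of many small ones.
        local_zip = shutil.make_archive(
            str(Path(scratch) / archive_name), "zip", root_dir=work
        )
        return Path(shutil.copy2(local_zip, share_dir))
```

The local scratch directory is deleted when the function returns, so only the archive on the share remains.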

You can use the /tmp directory on the compute instance for your temporary data. However, do not write large data files to the OS disk of the compute instance. Use datastores instead. If you have installed the JupyterLab git extension, it can also slow down the compute instance.

Managing a compute instance

In your workspace in Azure Machine Learning studio, select Compute, then select Compute Instance on the top.

You can perform the following actions:

  • Create a compute instance.
  • Refresh the compute instances tab.
  • Start, stop, and restart a compute instance. You pay for the instance whenever it is running. Stop the compute instance when you are not using it to reduce cost. Stopping a compute instance deallocates it. Start it again when you need it.
  • Delete a compute instance.
  • Filter the list of compute instances to show only the ones you created. These are the compute instances you can access.

For each compute instance in your workspace that you have access to, you can:

  • Access Jupyter, JupyterLab, and RStudio on the compute instance.
  • SSH into the compute instance. SSH access is disabled by default but can be enabled when the compute instance is created. SSH access uses a public/private key mechanism. The tab gives you the details for the SSH connection, such as IP address, username, and port number.
  • Get details about a specific compute instance, such as IP address and region.
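
The SSH connection details listed above (IP address, username, port number) map directly onto the standard OpenSSH client options `-i` (private key file) and `-p` (port). The sketch below only assembles the command. The values shown are placeholders, not defaults of the service, and the helper name is hypothetical; substitute the details shown in the studio.

```python
import os


def ssh_command(ip_address, username, port, private_key_path):
    """Assemble the argument list for the OpenSSH client."""
    # Expand "~" here because no shell is involved when the list
    # is passed to subprocess.
    key = os.path.expanduser(private_key_path)
    return ["ssh", "-i", key, "-p", str(port), f"{username}@{ip_address}"]


# Placeholder values only (203.0.113.0/24 is a documentation address range).
args = ssh_command("203.0.113.10", "azureuser", 50000, "~/.ssh/id_rsa")
print(" ".join(args))
```

The resulting list can be passed to `subprocess.run` to open the session from a script.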

RBAC allows you to control which users in the workspace can create, delete, start, stop, and restart a compute instance. All users in the workspace contributor and owner roles can create, delete, start, stop, and restart compute instances across the workspace. However, only the creator of a specific compute instance is allowed to access Jupyter, JupyterLab, and RStudio on that compute instance. The compute instance is dedicated to its creator, who has root access and can open a terminal through Jupyter, JupyterLab, or RStudio. The compute instance has a single-user login for the creator, and all actions use that user's identity for RBAC and for attribution of experiment runs. SSH access is controlled through a public/private key mechanism.

These actions can be controlled by RBAC:

  • Microsoft.MachineLearningServices/workspaces/computes/read
  • Microsoft.MachineLearningServices/workspaces/computes/write
  • Microsoft.MachineLearningServices/workspaces/computes/delete
  • Microsoft.MachineLearningServices/workspaces/computes/start/action
  • Microsoft.MachineLearningServices/workspaces/computes/stop/action
  • Microsoft.MachineLearningServices/workspaces/computes/restart/action
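
As an illustration of how the actions above can be combined, the following sketch builds an Azure custom role definition that allows viewing, starting, stopping, and restarting computes but omits the write and delete actions. The role name, description, and function name are hypothetical, and the placeholders in angle brackets must be replaced with your own values. Treat it as a starting point under those assumptions, not as a recommended policy.

```python
import json

PREFIX = "Microsoft.MachineLearningServices/workspaces/computes"


def compute_operator_role(subscription_id, resource_group, workspace_name):
    """Return a custom role definition (as a dict) scoped to one workspace."""
    scope = (
        f"/subscriptions/{subscription_id}/resourceGroups/{resource_group}"
        f"/providers/Microsoft.MachineLearningServices/workspaces/{workspace_name}"
    )
    return {
        "Name": "Compute Instance Operator",  # hypothetical role name
        "IsCustom": True,
        "Description": "Can view, start, stop, and restart computes, "
                       "but cannot create or delete them.",
        "Actions": [
            f"{PREFIX}/read",
            f"{PREFIX}/start/action",
            f"{PREFIX}/stop/action",
            f"{PREFIX}/restart/action",
        ],
        "NotActions": [],
        "AssignableScopes": [scope],
    }


role = compute_operator_role("<subscription-id>", "<resource-group>", "<workspace-name>")
print(json.dumps(role, indent=2))
```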

Create a compute instance

In your workspace in Azure Machine Learning studio, create a new compute instance from either the Compute section or the Notebooks section when you are ready to run one of your notebooks.

Field: Description
Compute name:
  • Name is required and must be between 3 and 24 characters long.
  • Valid characters are upper and lower case letters, digits, and the - character.
  • Name must start with a letter.
  • Name must be unique across all existing computes within an Azure region. You will see an alert if the name you choose is not unique.
  • If the - character is used, it must be followed by at least one letter later in the name.
Virtual machine type: Choose CPU or GPU. This type cannot be changed after creation.
Virtual machine size: Supported virtual machine sizes might be restricted in your region. Check the availability list.
Enable/disable SSH access: SSH access is disabled by default. SSH access cannot be changed after creation. Make sure to enable access if you plan to debug interactively with VS Code Remote.
Advanced settings: Optional. Configure a virtual network. Specify the Resource group, Virtual network, and Subnet to create the compute instance inside an Azure Virtual Network (vnet). For more information, see these network requirements for vnet.
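
The naming rules for the Compute name field can be checked locally before you submit the form. The sketch below is a hypothetical helper, not part of any SDK. It reads the hyphen rule as "every - must have at least one letter somewhere after it", which is equivalent to checking the text after the last hyphen. Uniqueness within the Azure region cannot be verified offline and is not covered.

```python
import re

# A letter first, then 2 to 23 letters, digits, or hyphens (3 to 24 in total).
_NAME_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9-]{2,23}")


def is_valid_compute_name(name):
    """Check the length, character set, first-letter, and hyphen rules."""
    if not _NAME_PATTERN.fullmatch(name):
        return False
    last_hyphen = name.rfind("-")
    if last_hyphen != -1:
        # At least one letter must appear after the last hyphen.
        return any(ch.isalpha() for ch in name[last_hyphen + 1:])
    return True
```

For example, `is_valid_compute_name("gpu-1a")` is accepted, while `"my-ci-01"` is rejected because no letter follows its last hyphen.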

You can also create an instance

The dedicated cores per region per VM family quota and the total regional quota, which apply to compute instance creation, are unified and shared with the Azure Machine Learning training compute cluster quota. Stopping a compute instance does not release its quota, which ensures that you will be able to restart the compute instance.
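
The shared-quota behavior can be illustrated with simple arithmetic. The sketch below is a simplified model with hypothetical names and numbers. It assumes quota is counted in cores, that cluster usage is given as cores currently allocated, and that every compute instance counts against the quota whether it is running or stopped, as described above.

```python
def remaining_cores(family_quota, cluster_cores_in_use, instance_cores):
    """Cores still available for new compute in one VM family.

    instance_cores lists the core count of every compute instance,
    running or stopped, because stopping does not release quota.
    """
    used = cluster_cores_in_use + sum(instance_cores)
    return family_quota - used


# Hypothetical numbers: a 24-core quota, a cluster using 8 cores,
# one running 4-core instance and one stopped 4-core instance.
print(remaining_cores(24, 8, [4, 4]))  # prints 8
```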

Compute target

Compute instances can be used as a training compute target, similar to Azure Machine Learning compute training clusters.

A compute instance:

  • Has a job queue.
  • Runs jobs securely in a virtual network environment, without requiring enterprises to open an SSH port. The job executes in a containerized environment and packages your model dependencies in a Docker container.
  • Can run multiple small jobs in parallel (preview). Two jobs per core can run in parallel while the rest of the jobs are queued.
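
The parallel-job rule in the last item above can be worked through with a small calculation. This sketch only models the stated limit of two jobs per core; it is not how the service schedules jobs internally, and the function name is hypothetical.

```python
def split_jobs(job_count, cores):
    """Return (running, queued) under a limit of two parallel jobs per core."""
    slots = 2 * cores
    running = min(job_count, slots)
    return running, job_count - running


# Ten submitted jobs on a 4-core instance: 8 run in parallel, 2 wait.
print(split_jobs(10, 4))  # prints (8, 2)
```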

You can use a compute instance as a local inferencing deployment target for test/debug scenarios.

Note

Distributed training jobs are not supported on compute instances. Use compute clusters (how-to-set-up-training-targets.md#amlcompute) for distributed training.

For more details, see the notebook train-on-computeinstance. This notebook is also available in the studio Samples folder in training/train-on-computeinstance.

What happened to Notebook VM?

Compute instances are replacing the Notebook VM.

Any notebook files stored in the workspace file share and data in workspace datastores are accessible from a compute instance. However, any custom packages previously installed on a Notebook VM need to be reinstalled on the compute instance. Quota limits that apply to compute cluster creation also apply to compute instance creation.

New Notebook VMs cannot be created. However, you can still access and use the Notebook VMs you have created, with full functionality. Compute instances can be created in the same workspace as the existing Notebook VMs.

Next steps