Reinforcement learning (preview) with Azure Machine Learning


Azure Machine Learning Reinforcement Learning is currently a preview feature. Only the Ray and RLlib frameworks are supported at this time.

In this article, you learn how to train a reinforcement learning (RL) agent to play the video game Pong. You use the open-source Python library Ray RLlib with Azure Machine Learning to manage the complexity of distributed RL.

In this article, you learn how to:

  • Set up an experiment
  • Define head and worker nodes
  • Create an RL estimator
  • Submit an experiment to start a run
  • View results

This article is based on the RLlib Pong example, which can be found in the Azure Machine Learning notebook GitHub repository.


Run this code in either of these environments. We recommend you try an Azure Machine Learning compute instance for the fastest start-up experience. You can quickly clone and run the reinforcement learning sample notebooks on an Azure Machine Learning compute instance.

  • Azure Machine Learning compute instance

    • Learn how to clone sample notebooks in Tutorial: Setup environment and workspace.
      • Clone the how-to-use-azureml folder instead of tutorials.
    • Run the virtual network setup notebook located at /how-to-use-azureml/reinforcement-learning/setup/devenv_setup.ipynb to open the network ports used for distributed reinforcement learning.
    • Run the sample notebook /how-to-use-azureml/reinforcement-learning/atari-on-distributed-compute/pong_rllib.ipynb.
  • Your own Jupyter Notebook server

How to train a Pong-playing agent

Reinforcement learning (RL) is an approach to machine learning that learns by doing. While other machine learning techniques learn by passively taking input data and finding patterns within it, RL uses training agents to actively make decisions and learn from their outcomes.

Your training agents learn to play Pong in a simulated environment. At every frame of the game, a training agent decides whether to move the paddle up, move it down, or leave it in place. It looks at the state of the game (an RGB image of the screen) to make that decision.

RL uses rewards to tell the agent whether its decisions are successful. In this example, the agent gets a positive reward when it scores a point and a negative reward when a point is scored against it. Over many iterations, the training agent learns to choose the action, based on its current state, that maximizes the sum of expected future rewards. It's common to use deep neural networks (DNNs) to perform this optimization in RL.
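
The "sum of expected future rewards" the agent optimizes is typically a discounted return: rewards that arrive sooner count for more than rewards that arrive later. A minimal sketch in plain Python (the reward sequence and discount factor here are illustrative, not taken from the Pong run):

```python
def discounted_return(rewards, gamma=0.99):
    """Sum of future rewards, each discounted by gamma per step of delay."""
    total = 0.0
    # Walk backward so each step folds in the discounted value of everything after it
    for r in reversed(rewards):
        total = r + gamma * total
    return total

# A hypothetical stretch of play: lose a point, then win two points
print(discounted_return([-1, 1, 1], gamma=0.9))  # ≈ 0.71
```

A larger gamma makes the agent value long-term points more heavily, which matters in Pong, where a reward arrives only when a rally ends.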

Training ends when the agent reaches an average reward score of 18 in a training epoch. This means that the agent has beaten its opponent by an average of at least 18 points in matches played to 21.
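
Because the reward in this setup is the agent's score minus the opponent's score, and the winner always reaches 21 points, a mean reward of 18 implies the opponent averages no more than 3 points per match. The arithmetic, as a quick check:

```python
points_to_win = 21
target_reward_mean = 18

# reward = agent's score - opponent's score, and the winner always reaches 21
max_opponent_score = points_to_win - target_reward_mean
print(max_opponent_score)  # 3
```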

The process of iterating through simulation and retraining a DNN is computationally expensive and requires a lot of data. One way to improve the performance of RL jobs is to parallelize work so that multiple training agents can act and learn simultaneously. However, managing a distributed RL environment can be a complex undertaking.

Azure Machine Learning provides the framework to manage these complexities and scale out your RL workloads.

Set up the environment

Set up the local RL environment by:

  1. Loading the required Python packages
  2. Initializing your workspace
  3. Creating an experiment
  4. Specifying a configured virtual network

Import libraries

Import the Python packages you need to run the rest of this example.

# Azure ML Core imports
import azureml.core
from azureml.core import Workspace
from azureml.core import Experiment
from azureml.core.compute import AmlCompute
from azureml.core.compute import ComputeTarget
from azureml.core.runconfig import EnvironmentDefinition
from azureml.widgets import RunDetails
from azureml.tensorboard import Tensorboard

# Azure ML Reinforcement Learning imports
from azureml.contrib.train.rl import ReinforcementLearningEstimator, Ray
from azureml.contrib.train.rl import WorkerConfiguration

Initialize a workspace

Initialize a workspace object from the config.json file created in the prerequisites section. If you're executing this code in an Azure Machine Learning compute instance, the configuration file has already been created for you.

ws = Workspace.from_config()

Create a reinforcement learning experiment

Create an experiment to track your reinforcement learning run. In Azure Machine Learning, an experiment is a logical collection of related trials that organizes run logs, history, outputs, and more.


# choose a name for your experiment
experiment_name = 'rllib-pong-multi-node'

exp = Experiment(workspace=ws, name=experiment_name)

Specify a virtual network

For RL jobs that use multiple compute targets, you must specify a virtual network with open ports that allow worker nodes and head nodes to communicate with each other.

The virtual network can be in any resource group, but it should be in the same region as your workspace. For more information on setting up your virtual network, see the workspace setup notebook in the prerequisites section. Here, you specify the name of the virtual network in your resource group.

vnet_name = 'your_vnet'

Define head and worker compute targets

This example uses separate compute targets for the Ray head and worker nodes. These settings let you scale your compute resources up and down depending on your workload. Set the number of nodes, and the size of each node, based on your needs.

Head compute target

You can use a GPU-equipped head cluster to improve deep learning performance. The head node trains the neural network that the agent uses to make decisions. The head node also collects data points from the worker nodes to train the neural network.

The head compute uses a single STANDARD_NC6 virtual machine (VM). It has 6 virtual CPUs to distribute work across.

from azureml.core.compute import AmlCompute, ComputeTarget

# choose a name for the Ray head cluster
head_compute_name = 'head-gpu'
head_compute_min_nodes = 0
head_compute_max_nodes = 2

# This example uses GPU VM. For using CPU VM, set SKU to STANDARD_D2_V2
head_vm_size = 'STANDARD_NC6'

if head_compute_name in ws.compute_targets:
    head_compute_target = ws.compute_targets[head_compute_name]
    if head_compute_target and type(head_compute_target) is AmlCompute:
        print(f'found head compute target. just use it {head_compute_name}')
else:
    print('creating a new head compute target...')
    provisioning_config = AmlCompute.provisioning_configuration(vm_size=head_vm_size,
                                                                min_nodes=head_compute_min_nodes,
                                                                max_nodes=head_compute_max_nodes,
                                                                vnet_resourcegroup_name=ws.resource_group,
                                                                vnet_name=vnet_name,
                                                                subnet_name='default')

    # create the cluster
    head_compute_target = ComputeTarget.create(ws, head_compute_name, provisioning_config)

    # can poll for a minimum number of nodes and for a specific timeout.
    # if no min node count is provided it will use the scale settings for the cluster
    head_compute_target.wait_for_completion(show_output=True, min_node_count=None, timeout_in_minutes=20)

    # For a more detailed view of current AmlCompute status, use get_status()

Worker compute cluster

This example uses four STANDARD_D2_V2 VMs for the worker compute target. Each worker node has 2 available CPUs, for a total of 8 available CPUs.

GPUs aren't necessary for the worker nodes, since they aren't performing deep learning. The workers run the game simulations and collect data.

# choose a name for your Ray worker cluster
worker_compute_name = 'worker-cpu'
worker_compute_min_nodes = 0 
worker_compute_max_nodes = 4

# This example uses CPU VM. For using GPU VM, set SKU to STANDARD_NC6
worker_vm_size = 'STANDARD_D2_V2'

# Create the compute target if it hasn't been created already
if worker_compute_name in ws.compute_targets:
    worker_compute_target = ws.compute_targets[worker_compute_name]
    if worker_compute_target and type(worker_compute_target) is AmlCompute:
        print(f'found worker compute target. just use it {worker_compute_name}')
else:
    print('creating a new worker compute target...')
    provisioning_config = AmlCompute.provisioning_configuration(vm_size=worker_vm_size,
                                                                min_nodes=worker_compute_min_nodes,
                                                                max_nodes=worker_compute_max_nodes,
                                                                vnet_resourcegroup_name=ws.resource_group,
                                                                vnet_name=vnet_name,
                                                                subnet_name='default')

    # create the cluster
    worker_compute_target = ComputeTarget.create(ws, worker_compute_name, provisioning_config)

    # can poll for a minimum number of nodes and for a specific timeout.
    # if no min node count is provided it will use the scale settings for the cluster
    worker_compute_target.wait_for_completion(show_output=True, min_node_count=None, timeout_in_minutes=20)

    # For a more detailed view of current AmlCompute status, use get_status()

Create a reinforcement learning estimator

Use the ReinforcementLearningEstimator to submit a training job to Azure Machine Learning.

Azure Machine Learning uses estimator classes to encapsulate run configuration information. This lets you specify how to configure a script execution.

Define a worker configuration

The WorkerConfiguration object tells Azure Machine Learning how to initialize the worker cluster that runs the entry script.

# Pip packages we will use for both head and worker
pip_packages = ["ray[rllib]==0.8.3"]  # Latest version of Ray has fixes for issues related to object transfers

# Specify the Ray worker configuration
worker_conf = WorkerConfiguration(
    # Azure ML compute cluster to run Ray workers
    compute_target=worker_compute_target,
    # Number of worker nodes
    node_count=4,
    # GPU
    use_gpu=False,
    # PIP packages to use
    pip_packages=pip_packages
)

Define script parameters

The entry script accepts a list of parameters that defines how to execute the training job. Passing these parameters through the estimator, as a layer of encapsulation, makes it easy to change script parameters and run configurations independently of each other.

Specifying the correct num_workers makes the most of your parallelization efforts. Set the number of workers equal to the number of available CPUs. For this example, you can use the following calculation:

The head node is a Standard_NC6 with 6 vCPUs. The worker cluster is 4 Standard_D2_V2 VMs with 2 CPUs each, for a total of 8 CPUs. However, you must subtract 1 CPU from the worker count, since 1 must be dedicated to the head node role. 6 CPUs + 8 CPUs - 1 head CPU = 13 simultaneous workers. Azure Machine Learning uses head and worker clusters to distinguish compute resources. However, Ray does not distinguish between head and workers, and all CPUs are available as worker threads.
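
The calculation above can be written out directly; the node counts and sizes mirror the head and worker clusters defined earlier in this article:

```python
# CPUs contributed by each cluster (matching the VM SKUs chosen above)
head_node_cpus = 6        # one STANDARD_NC6
worker_node_cpus = 2      # per STANDARD_D2_V2 node
worker_node_count = 4

total_cpus = head_node_cpus + worker_node_cpus * worker_node_count

# Reserve one CPU for the head node role itself
num_workers = total_cpus - 1
print(num_workers)  # 13
```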

training_algorithm = "IMPALA"
rl_environment = "PongNoFrameskip-v4"

# Training script parameters
script_params = {
    # Training algorithm, IMPALA in this case
    "--run": training_algorithm,
    # Environment, Pong in this case
    "--env": rl_environment,
    # Add additional single quotes at both ends of string values, since there are spaces in the
    # string parameters; the outermost quotes are not passed to the scripts, as they are not actually part of the string
    # Number of GPUs
    # Number of Ray workers
    "--config": '\'{"num_gpus": 1, "num_workers": 13}\'',
    # Target episode reward mean to stop the training
    # Total training time in seconds
    "--stop": '\'{"episode_reward_mean": 18, "time_total_s": 3600}\'',
}
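
To see why the extra single quotes matter, the sketch below uses Python's shlex module to mimic shell-style word splitting: the outer quotes protect the embedded spaces and are stripped before the JSON reaches the entry script.

```python
import json
import shlex

# The value exactly as defined in script_params above
raw = '\'{"num_gpus": 1, "num_workers": 13}\''

# shlex.split removes the outer single quotes the way a shell would,
# leaving valid JSON for the entry script to parse
inner = shlex.split(raw)[0]
config = json.loads(inner)
print(config["num_workers"])  # 13
```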

Define the reinforcement learning estimator

Use the parameter list and the worker configuration object to construct the estimator.

# RL estimator
rl_estimator = ReinforcementLearningEstimator(
    # Location of source files
    source_directory='files',
    # Python script file
    entry_script="pong_rllib.py",
    # Parameters to pass to the script file
    # Defined above.
    script_params=script_params,
    # The Azure ML compute target set up for Ray head nodes
    compute_target=head_compute_target,
    # Pip packages
    pip_packages=pip_packages,
    # GPU usage
    use_gpu=True,
    # RL framework. Currently must be Ray.
    rl_framework=Ray(),
    # Ray worker configuration defined above.
    worker_configuration=worker_conf,
    # How long to wait for whole cluster to start
    cluster_coordination_timeout_seconds=3600,
    # Maximum time for the whole Ray job to run
    # This will cut off the run after an hour
    max_run_duration_seconds=3600,
    # Allow the docker container Ray runs in to make full use
    # of the shared memory available from the host OS.
    shm_size=24 * 1024 * 1024 * 1024
)

Entry script

The entry script trains a neural network using the OpenAI Gym environment PongNoFrameSkip-v4. OpenAI Gym environments are standardized interfaces for testing reinforcement learning algorithms on classic Atari games.

This example uses a training algorithm known as IMPALA (Importance Weighted Actor-Learner Architecture). IMPALA parallelizes each individual learning actor to scale across many compute nodes without sacrificing speed or stability.

Ray Tune orchestrates the IMPALA worker tasks.

import ray
import ray.tune as tune
from ray.rllib import train

import os
import sys

from azureml.core import Run
from utils import callbacks

DEFAULT_RAY_ADDRESS = 'localhost:6379'

if __name__ == "__main__":

    # Parse arguments
    train_parser = train.create_parser()

    args = train_parser.parse_args()
    print("Algorithm config:", args.config)

    if args.ray_address is None:
        args.ray_address = DEFAULT_RAY_ADDRESS

    ray.init(address=args.ray_address)

    tune.run(run_or_experiment=args.run,
             config={"env": args.env,
                     "num_gpus": args.config["num_gpus"],
                     "num_workers": args.config["num_workers"],
                     "callbacks": {"on_train_result": callbacks.on_train_result},
                     "sample_batch_size": 50,
                     "train_batch_size": 1000,
                     "num_sgd_iter": 2,
                     "num_data_loader_buffers": 2,
                     "model": {"dim": 42}},
             stop=args.stop,
             local_dir='./logs')

Logging callback function

The entry script uses a utility function to define a custom RLlib callback function that logs metrics to your Azure Machine Learning workspace. Learn how to view these metrics in the Monitor and view results section.

'''RLlib callbacks module:
    Common callback methods to be passed to RLlib trainer.
'''

from azureml.core import Run


def on_train_result(info):
    '''Callback on train result to record metrics returned by trainer.
    '''
    run = Run.get_context()
    run.log(name='episode_reward_mean',
            value=info["result"]["episode_reward_mean"])
    run.log(name='episodes_total',
            value=info["result"]["episodes_total"])
Submit a run

The Run object handles the run history of in-progress or completed jobs.

run = exp.submit(config=rl_estimator)


The run may take up to 30 to 45 minutes to complete.

Monitor and view results

Use the Azure Machine Learning Jupyter widget to see the status of your runs in real time. The widget shows two child runs: one for the head node and one for the worker nodes.

from azureml.widgets import RunDetails

RunDetails(run).show()

  1. Wait for the widget to load.
  2. Select the head run in the list of runs.

Select Click here to see the run in Azure Machine Learning studio for additional run information in the studio. You can access this information while the run is in progress or after it completes.


The episode_reward_mean plot shows the mean number of points scored per training epoch. You can see that the training agent initially performed poorly, losing its matches without scoring a single point (shown by a reward_mean of -21). Within 100 iterations, the training agent learned to beat the computer opponent by an average of 18 points.

If you browse the logs of the child run, you can see the evaluation results recorded in the driver_log.txt file. You may need to wait several minutes before these metrics become available on the Run page.

In a short amount of work, you have learned how to configure multiple compute resources to train a reinforcement learning agent that plays Pong very well against a computer opponent.

Next steps

In this article, you learned how to train a reinforcement learning agent using the IMPALA learning algorithm. To see additional examples, go to the Azure Machine Learning Reinforcement Learning GitHub repository.