Reinforcement learning (preview) with Azure Machine Learning

APPLIES TO: Basic edition, Enterprise edition (Upgrade to Enterprise edition)

Note

Azure Machine Learning Reinforcement Learning is currently a preview feature. Only the Ray and RLlib frameworks are supported at this time.

In this article, you learn how to train a reinforcement learning (RL) agent to play the video game Pong. You use the open-source Python library Ray RLlib with Azure Machine Learning to manage the complexity of distributed RL jobs.

In this article, you learn how to:

  • Set up an experiment
  • Define head and worker nodes
  • Create an RL estimator
  • Submit an experiment to start a run
  • View results

This article is based on the RLlib Pong example in the Azure Machine Learning notebooks GitHub repository.

Prerequisites

Run this code in either of the following environments. We recommend an Azure Machine Learning compute instance for the fastest start-up experience. The reinforcement learning sample notebooks are available to clone and run quickly on a compute instance.

  • Azure Machine Learning compute instance

    • Learn how to clone sample notebooks in Tutorial: Setup environment and workspace.
      • Clone the how-to-use-azureml folder instead of tutorials.
    • Run the virtual network setup notebook located at /how-to-use-azureml/reinforcement-learning/setup/devenv_setup.ipynb to open the network ports used for distributed reinforcement learning.
    • Run the sample notebook /how-to-use-azureml/reinforcement-learning/atari-on-distributed-compute/pong_rllib.ipynb.
  • Your own Jupyter Notebook server

How to train a Pong-playing agent

Reinforcement learning (RL) is an approach to machine learning that learns by doing. While other machine learning techniques learn by passively taking input data and finding patterns within it, RL uses training agents that actively make decisions and learn from the outcomes.

Your training agents learn to play Pong in a simulated environment. On every frame of the game, a training agent decides whether to move the paddle up, move it down, or keep it in place. It looks at the state of the game (an RGB image of the screen) to make that decision.

RL uses rewards to tell the agent whether its decisions are successful. In this environment, the agent receives a positive reward when it scores a point and a negative reward when a point is scored against it. Over many iterations, the training agent learns to choose the action, based on its current state, that maximizes the sum of expected future rewards.
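
The "sum of expected future rewards" is usually formalized as a discounted return. The short sketch below is not part of the sample notebook; it simply illustrates the idea for a hypothetical sequence of per-point rewards and an assumed discount factor gamma.

# Minimal illustration (not part of the sample notebook): the discounted return
# G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ...
def discounted_return(rewards, gamma=0.99):
    """Sum a sequence of per-step rewards, discounting later rewards by gamma."""
    total, weight = 0.0, 1.0
    for reward in rewards:
        total += weight * reward
        weight *= gamma
    return total

# In Pong, the reward is +1 when the agent scores and -1 when the opponent scores
print(discounted_return([-1, -1, +1]))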

In RL, a deep neural network (DNN) model is commonly used to perform this optimization. Initially, the learning agent performs poorly, but every game generates additional samples that further improve the model.

Training ends when the agent reaches an average reward score of 18 in a training epoch. This means that the agent has beaten its opponent by an average of at least 18 points in matches played to 21.

The process of iterating through simulation and retraining a DNN is computationally expensive and requires a large amount of data. One way to improve the performance of RL jobs is to parallelize the work so that multiple training agents can act and learn simultaneously. However, managing a distributed RL environment can be a complex undertaking.

Azure Machine Learning provides the framework that manages these complexities so you can scale out your RL workloads.

Set up the environment

Set up the local RL environment by loading the required Python packages, initializing your workspace, creating an experiment, and specifying a configured virtual network.

Import libraries

Import the Python packages needed to run the rest of this example.

# Azure ML Core imports
import azureml.core
from azureml.core import Workspace
from azureml.core import Experiment
from azureml.core.compute import AmlCompute
from azureml.core.compute import ComputeTarget
from azureml.core.runconfig import EnvironmentDefinition
from azureml.widgets import RunDetails
from azureml.tensorboard import Tensorboard

# Azure ML Reinforcement Learning imports
from azureml.contrib.train.rl import ReinforcementLearningEstimator, Ray
from azureml.contrib.train.rl import WorkerConfiguration

Initialize a workspace

The Azure Machine Learning workspace is the top-level resource for Azure Machine Learning. It provides a centralized place to work with all the artifacts you create.

Initialize a workspace object from the config.json file created in the prerequisites section. If you are executing this code in an Azure Machine Learning compute instance, the configuration file has already been created for you.

ws = Workspace.from_config()
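
If you run this code outside a compute instance and don't have a config.json file downloaded, you can also retrieve the workspace by name. The values below are placeholders for your own subscription ID, resource group, and workspace name.

from azureml.core import Workspace

# Placeholder values: replace with your own subscription, resource group, and workspace
ws = Workspace.get(name="your-workspace-name",
                   subscription_id="your-subscription-id",
                   resource_group="your-resource-group")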

Create a reinforcement learning experiment

Create an experiment to track your reinforcement learning run. In Azure Machine Learning, an experiment is a logical collection of related trials that organizes run logs, history, outputs, and more.

experiment_name='rllib-pong-multi-node'

exp = Experiment(workspace=ws, name=experiment_name)

Specify a virtual network

For RL jobs that use multiple compute targets, you must specify a virtual network with open ports that allow worker nodes and head nodes to communicate with each other. The virtual network can be in any resource group, but it should be in the same region as your workspace. For more information on setting up your virtual network, see the workspace setup notebook referenced in the prerequisites section. Here, you specify the name of the virtual network in your resource group.

vnet_name = 'your_vnet'

Define head and worker compute targets

This example uses separate compute targets for the Ray head and worker nodes. These settings let you scale your compute resources up and down depending on the expected workload. Set the number of nodes, and the size of each node, based on your experiment's needs.

Head compute target

This example uses a GPU-equipped head cluster to optimize deep learning performance. The head node trains the neural network that the agent uses to make decisions. The head node also collects data points from the worker nodes to further train the neural network.

The head compute uses a single STANDARD_NC6 virtual machine (VM). It has 6 virtual CPUs, which means it can distribute work across 6 working CPUs.

from azureml.core.compute import AmlCompute, ComputeTarget

# choose a name for the Ray head cluster
head_compute_name = 'head-gpu'
head_compute_min_nodes = 0
head_compute_max_nodes = 2

# This example uses GPU VM. For using CPU VM, set SKU to STANDARD_D2_V2
head_vm_size = 'STANDARD_NC6'

if head_compute_name in ws.compute_targets:
    head_compute_target = ws.compute_targets[head_compute_name]
    if head_compute_target and isinstance(head_compute_target, AmlCompute):
        print(f'Found existing head compute target: {head_compute_name}')
else:
    print('Creating a new head compute target...')
    provisioning_config = AmlCompute.provisioning_configuration(vm_size = head_vm_size,
                                                                min_nodes = head_compute_min_nodes, 
                                                                max_nodes = head_compute_max_nodes,
                                                                vnet_resourcegroup_name = ws.resource_group,
                                                                vnet_name = vnet_name,
                                                                subnet_name = 'default')

    # create the cluster
    head_compute_target = ComputeTarget.create(ws, head_compute_name, provisioning_config)
    
    # can poll for a minimum number of nodes and for a specific timeout. 
    # if no min node count is provided it will use the scale settings for the cluster
    head_compute_target.wait_for_completion(show_output=True, min_node_count=None, timeout_in_minutes=20)
    
    # For a more detailed view of current AmlCompute status, use get_status()
    print(head_compute_target.get_status().serialize())

Worker compute cluster

This example uses four STANDARD_D2_V2 VMs for the worker compute target. Each worker node has 2 available CPUs, for a total of 8 CPUs available to parallelize work.

GPUs aren't necessary for the worker nodes, because the workers aren't performing deep learning. The workers run the game simulations and collect data.

# choose a name for your Ray worker cluster
worker_compute_name = 'worker-cpu'
worker_compute_min_nodes = 0 
worker_compute_max_nodes = 4

# This example uses CPU VM. For using GPU VM, set SKU to STANDARD_NC6
worker_vm_size = 'STANDARD_D2_V2'

# Create the compute target if it hasn't been created already
if worker_compute_name in ws.compute_targets:
    worker_compute_target = ws.compute_targets[worker_compute_name]
    if worker_compute_target and isinstance(worker_compute_target, AmlCompute):
        print(f'Found existing worker compute target: {worker_compute_name}')
else:
    print('Creating a new worker compute target...')
    provisioning_config = AmlCompute.provisioning_configuration(vm_size = worker_vm_size,
                                                                min_nodes = worker_compute_min_nodes, 
                                                                max_nodes = worker_compute_max_nodes,
                                                                vnet_resourcegroup_name = ws.resource_group,
                                                                vnet_name = vnet_name,
                                                                subnet_name = 'default')

    # create the cluster
    worker_compute_target = ComputeTarget.create(ws, worker_compute_name, provisioning_config)
    
    # can poll for a minimum number of nodes and for a specific timeout. 
    # if no min node count is provided it will use the scale settings for the cluster
    worker_compute_target.wait_for_completion(show_output=True, min_node_count=None, timeout_in_minutes=20)
    
    # For a more detailed view of current AmlCompute status, use get_status()
    print(worker_compute_target.get_status().serialize())

Create a reinforcement learning estimator

This section shows how to use the ReinforcementLearningEstimator to submit a training job to Azure Machine Learning.

Azure Machine Learning uses estimator classes to encapsulate run configuration information. This lets you easily specify how to configure a script execution. For more information on the Azure Machine Learning estimator pattern, see How to train models using estimators.

Define a worker configuration

The WorkerConfiguration object tells Azure Machine Learning how to initialize the worker cluster that runs the entry script.

# Pip packages we will use for both head and worker
pip_packages=["ray[rllib]==0.8.3"] # Latest version of Ray has fixes for isses related to object transfers

# Specify the Ray worker configuration
worker_conf = WorkerConfiguration(
    
    # Azure ML compute cluster to run Ray workers
    compute_target=worker_compute_target, 
    
    # Number of worker nodes
    node_count=4,
    
    # GPU
    use_gpu=False, 
    
    # PIP packages to use
    pip_packages=pip_packages
)

Define script parameters

The entry script pong_rllib.py accepts a list of parameters that defines how to execute the training job. Passing these parameters through the estimator, as a layer of encapsulation, makes it easy to change the script parameters and the run configuration independently of each other.

Specifying the correct num_workers makes the most of your parallelization efforts. Set the number of workers to the number of available CPUs. For this example, you can calculate this as follows:

The head node is a Standard_NC6 with 6 vCPUs. The worker cluster is 4 Standard_D2_V2 VMs with 2 CPUs each, for a total of 8 CPUs. However, you must subtract 1 CPU from the worker count, because 1 CPU must be dedicated to the head node role. 6 CPUs + 8 CPUs - 1 head CPU = 13 simultaneous workers. Azure Machine Learning uses the head and worker clusters to distinguish compute resources. However, Ray does not distinguish between head and workers, and all CPUs are available for worker thread execution.
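
Written out as code, the worker count for this example is just the arithmetic above. The variable names below are illustrative and simply restate the cluster sizes chosen earlier in this article.

# Illustrative calculation of num_workers for this example's cluster sizes
head_vcpus = 6              # one STANDARD_NC6 head node
worker_nodes = 4            # worker cluster node count
vcpus_per_worker_node = 2   # STANDARD_D2_V2

num_workers = head_vcpus + worker_nodes * vcpus_per_worker_node - 1  # reserve 1 CPU for the head role
print(num_workers)  # 13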

training_algorithm = "IMPALA"
rl_environment = "PongNoFrameskip-v4"

# Training script parameters
script_params = {
    
    # Training algorithm, IMPALA in this case
    "--run": training_algorithm,
    
    # Environment, Pong in this case
    "--env": rl_environment,
    
    # Add additional single quotes at both ends of string values because the string
    # parameters contain spaces. The outermost quotes are not passed to the script,
    # because they are not actually part of the string.
    # Number of GPUs
    # Number of ray workers
    "--config": '\'{"num_gpus": 1, "num_workers": 13}\'',
    
    # Target episode reward mean to stop the training
    # Total training time in seconds
    "--stop": '\'{"episode_reward_mean": 18, "time_total_s": 3600}\'',
}

Define the reinforcement learning estimator

Use the parameter list and the worker configuration object to construct the estimator.

# RL estimator
rl_estimator = ReinforcementLearningEstimator(
    
    # Location of source files
    source_directory='files',
    
    # Python script file
    entry_script="pong_rllib.py",
    
    # Parameters to pass to the script file
    # Defined above.
    script_params=script_params,
    
    # The Azure ML compute target set up for Ray head nodes
    compute_target=head_compute_target,
    
    # Pip packages
    pip_packages=pip_packages,
    
    # GPU usage
    use_gpu=True,
    
    # RL framework. Currently must be Ray.
    rl_framework=Ray(),
    
    # Ray worker configuration defined above.
    worker_configuration=worker_conf,
    
    # How long to wait for whole cluster to start
    cluster_coordination_timeout_seconds=3600,
    
    # Maximum time for the whole Ray job to run
    # This will cut off the run after an hour
    max_run_duration_seconds=3600,
    
    # Allow the docker container Ray runs in to make full use
    # of the shared memory available from the host OS.
    shm_size=24*1024*1024*1024
)

Entry script

The entry script pong_rllib.py trains a neural network using the OpenAI Gym environment PongNoFrameskip-v4. OpenAI Gym provides standardized interfaces for testing reinforcement learning algorithms, in this case on classic Atari games.
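
To get a feel for the environment outside of the training job, the short sketch below (not part of the entry script) creates the same Gym environment locally and takes one random step. It assumes the gym package and its Atari dependencies are installed, and it uses the classic Gym API that matches the pinned Ray 0.8.3 release.

import gym

# Create the same Atari environment the entry script trains on
env = gym.make("PongNoFrameskip-v4")

observation = env.reset()                   # RGB image of the screen
action = env.action_space.sample()          # random paddle action: up, down, or stay
observation, reward, done, info = env.step(action)
print(reward)                               # nonzero only when a point is scored
env.close()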

This example uses a training algorithm known as IMPALA (Importance Weighted Actor-Learner Architecture). IMPALA parallelizes each individual learning actor to scale across many compute nodes without sacrificing speed or stability.

Ray Tune orchestrates the IMPALA worker tasks.

import ray
import ray.tune as tune
from ray.rllib import train

import os
import sys

from azureml.core import Run
from utils import callbacks

DEFAULT_RAY_ADDRESS = 'localhost:6379'

if __name__ == "__main__":

    # Parse arguments
    train_parser = train.create_parser()

    args = train_parser.parse_args()
    print("Algorithm config:", args.config)

    if args.ray_address is None:
        args.ray_address = DEFAULT_RAY_ADDRESS

    ray.init(address=args.ray_address)

    tune.run(run_or_experiment=args.run,
             config={
                 "env": args.env,
                 "num_gpus": args.config["num_gpus"],
                 "num_workers": args.config["num_workers"],
                 "callbacks": {"on_train_result": callbacks.on_train_result},
                 "sample_batch_size": 50,
                 "train_batch_size": 1000,
                 "num_sgd_iter": 2,
                 "num_data_loader_buffers": 2,
                 "model": {
                    "dim": 42
                 },
             },
             stop=args.stop,
             local_dir='./logs')

Logging callback function

The entry script uses a utility function to define a custom RLlib callback function that logs metrics to your Azure Machine Learning workspace. Learn how to view these metrics in the Monitor and view results section.

'''RLlib callbacks module:
    Common callback methods to be passed to RLlib trainer.
'''
from azureml.core import Run

def on_train_result(info):
    '''Callback on train result to record metrics returned by trainer.
    '''
    run = Run.get_context()
    run.log(
        name='episode_reward_mean',
        value=info["result"]["episode_reward_mean"])
    run.log(
        name='episodes_total',
        value=info["result"]["episodes_total"])

Submit a run

Run handles the run history of in-progress or completed jobs.

run = exp.submit(config=rl_estimator)

Note

The run may take 30 to 45 minutes to complete.
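
While you wait, the run object returned by exp.submit() can also be inspected or stopped from code. The calls below are standard azureml.core.Run methods; use them as needed.

# Optional: check on or cancel the submitted run from code
print(run.id)             # unique run ID
print(run.get_status())   # for example, 'Running' or 'Completed'

# Uncomment to stop the run early
# run.cancel()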

Monitor and view results

Use the Azure Machine Learning Jupyter widget to see the status of your runs in real time. In this example, the widget shows two child runs: one for the head and one for the workers.

from azureml.widgets import RunDetails

RunDetails(run).show()
run.wait_for_completion()

  1. Wait for the widget to load.
  2. Select the head run in the list of runs.

Select "Click here to see the run in Azure Machine Learning studio" for additional run information in the studio. You can access this information while the run is in progress or after it completes.

(Figure: line chart displayed by the run details widget.)

The episode_reward_mean plot shows the mean number of points scored per training epoch. You can see that the training agent initially performed poorly, losing its matches without scoring a single point (shown by a reward_mean of -21). Within 100 iterations, the training agent learned to beat the computer opponent by an average of 18 points.

If you browse the logs of the child run, you can see the evaluation results recorded in the driver_log.txt file. You may need to wait several minutes before these metrics become available on the Run page.
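
You can also retrieve the logged metrics and log files with the SDK instead of the studio. The sketch below uses standard Run methods; the exact path of driver_log.txt within a child run can vary, so the file-name match is illustrative.

# Retrieve metrics logged by the on_train_result callback from each child run
for child in run.get_children():
    print(child.id, child.get_metrics().get('episode_reward_mean'))

    # Download any driver log files recorded for this child run (path may vary)
    for file_name in child.get_file_names():
        if file_name.endswith('driver_log.txt'):
            child.download_file(file_name, output_file_path='./driver_log.txt')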

In a short amount of time, you have learned how to configure multiple compute resources to train a reinforcement learning agent that plays Pong very well.

Next steps

In this article, you learned how to train a reinforcement learning agent by using the IMPALA learning algorithm. To see additional examples, go to the Azure Machine Learning Reinforcement Learning GitHub repository.