Execute Python machine learning scripts in Azure Machine Learning Studio (classic)

Note

The Notebooks (preview) feature in Studio (classic) will be shut down on April 13, 2020. After April 13, the Notebooks tab will be removed along with any saved notebooks.

Python is a valuable tool in the tool chest of many data scientists. It's used in every stage of a typical machine learning workflow, including data exploration, feature extraction, model training and validation, and deployment.

This article describes how to use the Execute Python Script module to run Python code in your Azure Machine Learning Studio (classic) experiments and web services.

Using the Execute Python Script module

The primary interface to Python in Studio (classic) is the Execute Python Script module. It accepts up to three inputs and produces up to two outputs, similar to the Execute R Script module. Python code is entered into the parameter box through a specially named entry-point function called azureml_main.

Figure: The Execute Python Script module

Figure: Sample Python code in the module parameter box
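
The screenshots are not reproduced in this text. As a minimal sketch of what the parameter box might contain, assuming a simple pass-through script: only the function name azureml_main comes from the module itself, while the parameter names dataframe1 and dataframe2 are illustrative.

```python
import pandas as pd


def azureml_main(dataframe1=None, dataframe2=None):
    """Entry point called by the Execute Python Script module.

    dataframe1 and dataframe2 are illustrative names for the first and
    second input ports; both are optional.
    """
    # Pass the first input through unchanged. The trailing comma wraps
    # the DataFrame in a one-element tuple, which is a Python sequence.
    return dataframe1,


if __name__ == "__main__":
    # Local smoke test; in Studio (classic) the module calls azureml_main.
    df = pd.DataFrame({"x": [1, 2, 3]})
    (out,) = azureml_main(df)
    print(out)
```

Developing the function locally like this, then pasting it into the parameter box, keeps the entry point easy to test outside Studio (classic).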

Input parameters

Inputs to the Python module are exposed as Pandas DataFrames. The azureml_main function accepts up to two optional Pandas DataFrames as parameters.

The mapping between input ports and function parameters is positional:

  • The first connected input port is mapped to the first parameter of the function.
  • The second input (if connected) is mapped to the second parameter of the function.
  • The third input is used to import additional Python modules.

More detailed semantics of how the input ports are mapped to parameters of the azureml_main function are shown below.

Figure: Table of input port configurations and the resulting Python signature

Output return values

The azureml_main function must return a single Pandas DataFrame packaged in a Python sequence such as a tuple, list, or NumPy array. The first element of this sequence is returned to the first output port of the module. The second output port of the module is used for visualizations and does not require a return value. This scheme is shown below.

Figure: Mapping input ports to parameters and the return value to the output port
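
The positional mapping and the return convention can be sketched together. This is an illustrative example rather than code from the original figure: it assumes both inputs share the same columns and simply stacks their rows, and the parameter names are placeholders.

```python
import pandas as pd


def azureml_main(dataframe1=None, dataframe2=None):
    # dataframe1 comes from the first connected input port,
    # dataframe2 from the second (None if that port is not connected).
    if dataframe2 is None:
        combined = dataframe1
    else:
        # Stack the rows of both inputs; ignore_index gives a fresh
        # 0..n-1 index on the result.
        combined = pd.concat([dataframe1, dataframe2], ignore_index=True)

    # Return a sequence; its first element goes to the first output port.
    return [combined]
```

A list is used here instead of a tuple only to show that any Python sequence is acceptable.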

Translation of input and output data types

Studio datasets are not the same as Pandas DataFrames. As a result, input datasets in Studio (classic) are converted to Pandas DataFrames, and output DataFrames are converted back to Studio (classic) datasets. During this conversion process, the following translations are also performed:

Python data type | Studio translation procedure
Strings and numerics | Translated as is
Pandas 'NA' | Translated as 'Missing value'
Index vectors | Unsupported*
Non-string column names | Call str on column names
Duplicate column names | Add numeric suffix: (1), (2), (3), and so on

* All input data frames in the Python function always have a 64-bit numerical index from 0 to the number of rows minus 1.
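
Because index vectors are unsupported, any information held in a DataFrame index needs to be moved into an ordinary column before the frame is returned. The following is a hedged sketch of one way to do that; prepare_for_output is a hypothetical helper, not part of Studio (classic), and the "group" column is an assumed input column.

```python
import pandas as pd


def prepare_for_output(df):
    """Hypothetical helper: flatten the index and stringify column names."""
    # Move the index into a regular column and fall back to the default
    # 0..n-1 integer index. (On a frame that already has a default index
    # this adds a column named "index".)
    out = df.reset_index()
    # Non-string column names are converted with str during translation;
    # doing it here makes the resulting names explicit.
    out.columns = [str(c) for c in out.columns]
    return out


def azureml_main(dataframe1=None, dataframe2=None):
    # Assumption: the input has a "group" column plus numeric columns.
    # groupby moves "group" into the index of the result.
    summary = dataframe1.groupby("group").sum()
    return prepare_for_output(summary),
```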

Importing existing Python script modules

The backend used to execute Python is based on Anaconda, a widely used scientific Python distribution. It comes with close to 200 of the most common Python packages used in data-centric workloads. Studio (classic) does not currently support package management systems such as pip or conda for installing and managing external libraries. If you need to incorporate additional libraries, use the following scenario as a guide.

A common use case is to incorporate existing Python scripts into Studio (classic) experiments. The Execute Python Script module accepts a zip file containing Python modules at the third input port. The execution framework unzips the file at runtime and adds the contents to the library path of the Python interpreter. The azureml_main entry-point function can then import these modules directly.

As an example, consider the file Hello.py containing a simple "Hello, World" function.

Figure: User-defined function in the Hello.py file

Next, we create a file Hello.zip that contains Hello.py:

Figure: Zip file containing user-defined Python code

Upload the zip file as a dataset into Studio (classic). Then create and run an experiment that uses the Python code in the Hello.zip file by attaching it to the third input port of the Execute Python Script module, as shown in the following image.

Figure: Sample experiment with Hello.zip as an input to an Execute Python Script module

Figure: User-defined Python code uploaded as a zip file

The module output shows that the zip file has been unpacked and that the function print_hello has been run.

Figure: Module output showing the user-defined function
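
The workflow above can be approximated locally. The sketch below is an emulation, not the Studio (classic) runtime: it writes a Hello.py whose body is an assumption (the original appears only as an image), zips it, then unzips it and extends sys.path the way the text describes the execution framework doing.

```python
import os
import sys
import tempfile
import zipfile

# Build Hello.py and Hello.zip in a scratch directory. In practice you
# would create these by hand and upload Hello.zip as a dataset.
work_dir = tempfile.mkdtemp()
hello_py = os.path.join(work_dir, "Hello.py")
with open(hello_py, "w") as f:
    f.write('def print_hello():\n    print("Hello, World")\n')

hello_zip = os.path.join(work_dir, "Hello.zip")
with zipfile.ZipFile(hello_zip, "w") as zf:
    zf.write(hello_py, arcname="Hello.py")

# Emulate the execution framework: unzip the file and add its contents
# to the interpreter's library path.
extract_dir = tempfile.mkdtemp()
with zipfile.ZipFile(hello_zip) as zf:
    zf.extractall(extract_dir)
sys.path.insert(0, extract_dir)


def azureml_main(dataframe1=None, dataframe2=None):
    # Modules from the zip file can now be imported directly.
    import Hello
    Hello.print_hello()
    return dataframe1,
```

In Studio (classic) only the azureml_main function would go into the parameter box; the unzip and path setup happen automatically for the third input port.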

Accessing Azure Storage Blobs

You can access data stored in an Azure Blob Storage account using these steps:

  1. Download the Azure Blob Storage package for Python locally.
  2. Upload the package as a zip file to your Studio (classic) workspace as a dataset.
  3. Create your BlockBlobService object with protocol='http':

     from azure.storage.blob import BlockBlobService

     # Create the BlockBlobService that is used to call the Blob service for the storage account
     block_blob_service = BlockBlobService(account_name='account_name', account_key='account_key', protocol='http')

  4. Disable Secure transfer required in the Configuration settings tab of your storage account.

Figure: Disabling "Secure transfer required" in the Azure portal

Operationalizing Python scripts

Any Execute Python Script modules used in a scoring experiment are called when the experiment is published as a web service. For example, the image below shows a scoring experiment that contains the code to evaluate a single Python expression.

Figure: Studio workspace for a web service

Figure: Python Pandas expression

A web service created from this experiment would take the following actions:

  1. Take a Python expression as input (as a string).
  2. Send the Python expression to the Python interpreter.
  3. Return a table containing both the expression and the evaluated result.
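
The three actions above can be sketched as a single entry-point function. This is an assumption-laden illustration, not the code from the original figure: it assumes the expression arrives in the first cell of the first input, and the output column names are invented for the example.

```python
import pandas as pd


def azureml_main(dataframe1=None, dataframe2=None):
    # Assumption: the expression arrives as a string in the first cell
    # of the first input.
    expression = str(dataframe1.iloc[0, 0])

    # eval executes arbitrary code. It is used here only to mirror the
    # scenario in the text and is unsafe for untrusted input.
    value = eval(expression, {"pd": pd})

    # Return both the expression and its result as strings
    # (the column names are illustrative).
    result = pd.DataFrame({"expression": [expression], "result": [str(value)]})
    return result,
```

Converting the result to a string keeps the output column a plain string column regardless of what the expression evaluates to.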

Working with visualizations

Plots created using Matplotlib can be returned by the Execute Python Script module. However, plots aren't automatically redirected to images as they are when using R, so you must explicitly save any plots to PNG files.

To generate images from Matplotlib, take the following steps:

  1. Switch the backend to "AGG" from the default Qt-based renderer.
  2. Create a new figure object.
  3. Get the axis and generate all plots into it.
  4. Save the figure to a PNG file.

The following images illustrate this process by creating a scatter plot matrix with the scatter_matrix function in Pandas.

Figure: Code to save a Matplotlib figure as an image

Figure: Click Visualize on an Execute Python Script module to view the figures

Figure: Visualizing plots for a sample experiment using Python code
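
Since the code is shown only as an image, here is a minimal sketch of the four steps, assuming the first input contains numeric columns. The file name scatter.png is an arbitrary choice for the example.

```python
import matplotlib
matplotlib.use("agg")  # Step 1: select the non-interactive AGG backend.

import matplotlib.pyplot as plt
import pandas as pd
from pandas.plotting import scatter_matrix


def azureml_main(dataframe1=None, dataframe2=None):
    fig = plt.figure()                 # Step 2: create a new figure object.
    ax = fig.gca()                     # Step 3: get the axis ...
    scatter_matrix(dataframe1, ax=ax)  # ... and generate the plots into it.
    fig.savefig("scatter.png")         # Step 4: save the figure as a PNG.
    plt.close(fig)                     # Release the figure's memory.
    return dataframe1,
```

When given a single axis for a multi-panel plot, pandas may warn that it clears the containing figure to lay out the grid; the saved image is unaffected.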

It's possible to return multiple figures by saving them into different images. The Studio (classic) runtime picks up all images and concatenates them for visualization.

Advanced examples

The Anaconda environment installed in Studio (classic) contains common packages such as NumPy, SciPy, and scikit-learn. These packages can be used effectively for data processing in a machine learning pipeline.

For example, the following experiment and script illustrate the use of ensemble learners in scikit-learn to compute feature importance scores for a dataset. The scores can be used to perform supervised feature selection before the data is fed into another model.

Here is the Python function used to compute the importance scores and order the features based on the scores:

Figure: Function to rank features by score
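
The original function is available only as an image. The sketch below shows one way such a function could look, under stated assumptions: it uses ExtraTreesClassifier as the ensemble learner (the original may use a different one), treats the last column of the input as the label, and picks the output column names for illustration.

```python
import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier


def azureml_main(dataframe1=None, dataframe2=None):
    # Assumption: the last column holds the label, the rest are features.
    label_col = dataframe1.columns[-1]
    features = dataframe1.drop(columns=[label_col])
    labels = dataframe1[label_col]

    # Fit a tree ensemble; random_state makes the scores reproducible.
    model = ExtraTreesClassifier(n_estimators=100, random_state=0)
    model.fit(features, labels)

    # Pair each feature name with its importance and sort descending.
    scores = pd.DataFrame({
        "feature": features.columns,
        "importance": model.feature_importances_,
    })
    scores = scores.sort_values("importance", ascending=False)
    return scores.reset_index(drop=True),
```

The importances from a scikit-learn tree ensemble are normalized to sum to 1, so the returned table can be read as each feature's relative share.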

The following experiment then computes and returns the importance scores of features in the "Pima Indian Diabetes" dataset in Azure Machine Learning Studio (classic):

Figure: Experiment to rank features in the Pima Indian Diabetes dataset using Python

Figure: Visualization of the output of the Execute Python Script module

Limitations

The Execute Python Script module currently has the following limitations:

Sandboxed execution

The Python runtime is currently sandboxed and doesn't allow persistent access to the network or the local file system. All files saved locally are isolated and deleted once the module finishes. The Python code cannot access most directories on the machine it runs on; the exception is the current directory and its subdirectories.

Lack of sophisticated development and debugging support

The Python module currently does not support IDE features such as IntelliSense and debugging. If the module fails at runtime, the full Python stack trace is available, but it must be viewed in the output log for the module. We currently recommend that you develop and debug Python scripts in an environment such as IPython and then import the code into the module.

Single data frame output

The Python entry point is only permitted to return a single data frame as output. It is not currently possible to return arbitrary Python objects such as trained models directly back to the Studio (classic) runtime. As with Execute R Script, which has the same limitation, it is possible in many cases to pickle objects into a byte array and then return that inside a data frame.
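
A hedged sketch of the pickling workaround follows. The helper names and the column name are hypothetical, and the Base64 step is an assumption of this example: it stores the pickled bytes as plain text, since strings are translated as is.

```python
import base64
import pickle

import pandas as pd


def object_to_frame(obj):
    """Hypothetical helper: pickle obj and wrap it in a one-cell DataFrame."""
    payload = pickle.dumps(obj)
    # Encode the bytes as ASCII text so the cell holds an ordinary string.
    encoded = base64.b64encode(payload).decode("ascii")
    return pd.DataFrame({"pickled_object": [encoded]})


def frame_to_object(df):
    """Inverse of object_to_frame, for use in a downstream module."""
    encoded = df.loc[0, "pickled_object"]
    return pickle.loads(base64.b64decode(encoded))


def azureml_main(dataframe1=None, dataframe2=None):
    # Stand-in for a trained model; any picklable object works the same way.
    model = {"weights": [0.1, 0.2], "bias": 0.5}
    return object_to_frame(model),
```

A downstream Execute Python Script module could call frame_to_object on its input to recover the object, provided both modules run compatible Python and library versions.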

Inability to customize Python installation

Currently, the only way to add custom Python modules is via the zip file mechanism described earlier. While this is feasible for small modules, it's cumbersome for large modules (especially modules with native DLLs) or a large number of modules.

Next steps

For more information, see the Python Developer Center.