Execute R Script module

This article describes how to use the Execute R Script module to run R code in your Azure Machine Learning designer pipeline.

With R, you can perform tasks that existing modules don't currently support, such as:

  • Create custom data transformations
  • Use your own metrics to evaluate predictions
  • Build models using algorithms that aren't implemented as standalone modules in the designer

R version support

Azure Machine Learning designer uses the CRAN (Comprehensive R Archive Network) distribution of R. The R version currently in use is 3.5.1.

Supported R packages

The R environment is preinstalled with more than 100 packages. For a complete list, see the section Preinstalled R packages.

You can also add the following code to any Execute R Script module to see the installed packages.

azureml_main <- function(dataframe1, dataframe2){
  print("R script run.")
  dataframe1 <- data.frame(installed.packages())
  return(list(dataset1=dataframe1, dataset2=dataframe2))
}
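
The full installed.packages() output has many columns. As a minimal sketch (a variation on the sample above, not part of the module's template), the following version keeps only the package name and version and sorts by name, which is usually easier to read in the output preview:

```r
# Sketch: list installed packages with only the name and version columns.
azureml_main <- function(dataframe1, dataframe2) {
  pkgs <- installed.packages()[, c("Package", "Version"), drop = FALSE]
  # Drop the matrix row names so the result has plain integer row names.
  rownames(pkgs) <- NULL
  dataframe1 <- as.data.frame(pkgs, stringsAsFactors = FALSE)
  # Sort alphabetically by package name.
  dataframe1 <- dataframe1[order(dataframe1$Package), , drop = FALSE]
  rownames(dataframe1) <- NULL
  return(list(dataset1 = dataframe1, dataset2 = dataframe2))
}
```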

Note

If your pipeline contains multiple Execute R Script modules that need packages that aren't in the preinstalled list, install the packages in each module.

Installing R packages

To install additional R packages, use the install.packages() function. Packages are installed separately for each Execute R Script module and aren't shared with other Execute R Script modules.

Note

Specify the CRAN repository when you're installing packages, such as install.packages("zoo",repos = "http://cran.us.r-project.org").

This sample shows how to install zoo:

# R version: 3.5.1
# The script MUST contain a function named azureml_main,
# which is the entry point for this module.

# Note that functions dependent on the X11 library,
# such as "View," are not supported because the X11 library
# is not preinstalled.

# The entry point function MUST have two input arguments.
# If the input port is not connected, the corresponding
# dataframe argument will be null.
#   Param<dataframe1>: a R DataFrame
#   Param<dataframe2>: a R DataFrame
azureml_main <- function(dataframe1, dataframe2){
  print("R script run.")
  
  if(!require(zoo)) install.packages("zoo",repos = "http://cran.us.r-project.org")
  library(zoo)
  # Return datasets as a Named List
  return(list(dataset1=dataframe1, dataset2=dataframe2))
}

Note

Before you install a package, check whether it already exists so you don't repeat an installation. Repeat installations might cause web service requests to time out.
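
One way to package the "check first, then install" pattern is a small helper. The sketch below is illustrative only: ensure_package and its installer argument are hypothetical names introduced here, not part of the module or of any SDK. It relies on requireNamespace() to test for the package without attaching it.

```r
# Hypothetical helper: install a package only when it is not already available.
# `installer` is injectable so the logic can be exercised without network access.
ensure_package <- function(pkg,
                           repos = "http://cran.us.r-project.org",
                           installer = utils::install.packages) {
  if (requireNamespace(pkg, quietly = TRUE)) {
    return(FALSE)  # already installed, nothing to do
  }
  installer(pkg, repos = repos)
  return(TRUE)     # an installation was attempted
}

azureml_main <- function(dataframe1, dataframe2) {
  ensure_package("zoo")
  library(zoo)
  return(list(dataset1 = dataframe1, dataset2 = dataframe2))
}
```

The return value only reports whether an installation was attempted, so a caller can log it if needed.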

Uploading files

The Execute R Script module supports uploading files by using the Azure Machine Learning R SDK.

The following sample shows how to upload an image file in Execute R Script:


# R version: 3.5.1
# The script MUST contain a function named azureml_main,
# which is the entry point for this module.

# Note that functions dependent on the X11 library,
# such as "View," are not supported because the X11 library
# is not preinstalled.

# The entry point function MUST have two input arguments.
# If the input port is not connected, the corresponding
# dataframe argument will be null.
#   Param<dataframe1>: a R DataFrame
#   Param<dataframe2>: a R DataFrame
azureml_main <- function(dataframe1, dataframe2){
  print("R script run.")

  # Generate a jpeg graph
  img_file_name <- "rect.jpg"
  jpeg(file=img_file_name)
  example(rect)
  dev.off()

  upload_files_to_run(names = list(file.path("graphic", img_file_name)), paths=list(img_file_name))


  # Return datasets as a Named List
  return(list(dataset1=dataframe1, dataset2=dataframe2))
}

After the pipeline run is finished, you can preview the image in the right panel of the module.

Figure: Preview of the uploaded image

Accessing registered datasets

You can use the following sample code to access the registered datasets in your workspace:

azureml_main <- function(dataframe1, dataframe2){
  print("R script run.")
  run <- get_current_run()
  ws <- run$experiment$workspace
  dataset <- azureml$core$dataset$Dataset$get_by_name(ws, "YOUR DATASET NAME")
  dataframe2 <- dataset$to_pandas_dataframe()
  # Return datasets as a Named List
  return(list(dataset1=dataframe1, dataset2=dataframe2))
}

How to configure Execute R Script

The Execute R Script module contains sample code that you can use as a starting point. To configure the Execute R Script module, provide a set of inputs and code to run.

Figure: Diagram of inputs for an R module

Datasets stored in the designer are automatically converted to an R data frame when loaded with this module.

  1. Add the Execute R Script module to your pipeline.

  2. Connect any inputs that the script needs. Inputs are optional and can include data and additional R code.

    • Dataset1: Reference the first input as dataframe1. The input dataset must be formatted as a CSV, TSV, or ARFF file, or you can connect an Azure Machine Learning dataset.

    • Dataset2: Reference the second input as dataframe2. This dataset also must be formatted as a CSV, TSV, or ARFF file, or as an Azure Machine Learning dataset.

    • Script Bundle: The third input accepts .zip files. A zipped file can contain multiple files and multiple file types.

  3. In the R script text box, type or paste valid R script.

    Note

    Be careful when writing your script. Make sure it has no errors, such as syntax errors, undeclared variables, or calls to functions from packages that haven't been loaded. Pay extra attention to the preinstalled package list at the end of this article. To use packages that aren't listed, install them in your script. An example is install.packages("zoo",repos = "http://cran.us.r-project.org").

    To help you get started, the R Script text box is prepopulated with sample code, which you can edit or replace.

    # R version: 3.5.1
    # The script MUST contain a function named azureml_main,
    # which is the entry point for this module.
    
    # Note that functions dependent on the X11 library,
    # such as "View," are not supported because the X11 library
    # is not preinstalled.
    
    # The entry point function MUST have two input arguments.
    # If the input port is not connected, the corresponding
    # dataframe argument will be null.
    #   Param<dataframe1>: a R DataFrame
    #   Param<dataframe2>: a R DataFrame
    azureml_main <- function(dataframe1, dataframe2){
      print("R script run.")

      # If a .zip file is connected to the third input port, it's
      # unzipped under "./Script Bundle". This directory is added
      # to sys.path.

      # Return datasets as a Named List
      return(list(dataset1=dataframe1, dataset2=dataframe2))
    }
    

    The entry point function must have the two input arguments dataframe1 and dataframe2 (documented in the sample as Param<dataframe1> and Param<dataframe2>), even when these arguments aren't used in the function.

    Note

    The data passed to the Execute R Script module is referenced as dataframe1 and dataframe2, which differs from the designer's own naming (dataset1 and dataset2). Make sure that input data is referenced correctly in your script.

    Note

    Existing R code might need minor changes to run in a designer pipeline. For example, input data that you provide in CSV format should be explicitly converted to a dataset before you can use it in your code. Data and column types used in the R language also differ in some ways from the data and column types used in the designer.

    If your script is larger than 16 KB, use the Script Bundle port to avoid errors like CommandLine exceeds the limit of 16597 characters.

    Bundle the script and other custom resources into a zip file, and upload the zip file as a File Dataset to the studio. Then drag the dataset module from the My datasets list in the left module pane of the designer authoring page, and connect it to the Script Bundle port of the Execute R Script module.

    The following sample code consumes the script in the script bundle:

    azureml_main <- function(dataframe1, dataframe2){
      # Source the custom R script: my_script.R
      source("./Script Bundle/my_script.R")

      # Use the function defined in my_script.R
      dataframe1 <- my_func(dataframe1)

      # Read a text file from the bundle and return its lines as a data frame
      sample <- readLines("./Script Bundle/my_sample.txt")
      return(list(dataset1=dataframe1, dataset2=data.frame("Sample"=sample)))
    }
    
  4. For Random Seed, enter a value to use inside the R environment as the random seed value. This parameter is equivalent to calling set.seed(value) in R code.

  5. Submit the pipeline.
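
Two points from the steps above can be illustrated together: an unconnected input port arrives as NULL, and a fixed seed makes random operations repeatable. The following is a minimal sketch under stated assumptions: the value 42 stands in for whatever you enter as Random Seed, and replacing NULL with an empty data frame is one possible convention, not something the module requires.

```r
azureml_main <- function(dataframe1, dataframe2) {
  # An unconnected input port arrives as NULL; substitute an empty data frame
  # so the rest of the script can treat both inputs uniformly (assumption).
  if (is.null(dataframe1)) dataframe1 <- data.frame()
  if (is.null(dataframe2)) dataframe2 <- data.frame()

  # Assumption: 42 stands in for the module's Random Seed value.
  set.seed(42)

  # Shuffle the rows of the first input; the order is the same on every run.
  if (nrow(dataframe1) > 0) {
    dataframe1 <- dataframe1[sample(nrow(dataframe1)), , drop = FALSE]
  }
  return(list(dataset1 = dataframe1, dataset2 = dataframe2))
}
```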

Results

Execute R Script modules can return multiple outputs, but they must be provided as R data frames. Data frames are automatically converted to datasets in the designer for compatibility with other modules.

Standard messages and errors from R are returned to the module's log.

If you need to print results in the R script, you can find the printed results in 70_driver_log under the Outputs+logs tab in the right panel of the module.
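
Because outputs must be data frames, a result such as a named vector of metrics has to be wrapped before it is returned. The sketch below assumes, purely for illustration, that the first input has numeric columns named actual and predicted; it prints the metrics to the log and returns them as a two-column data frame.

```r
azureml_main <- function(dataframe1, dataframe2) {
  # Assumption: dataframe1 has numeric columns named "actual" and "predicted".
  errors <- dataframe1$actual - dataframe1$predicted
  metrics <- c(mae = mean(abs(errors)), rmse = sqrt(mean(errors^2)))

  # Printed output goes to the module log.
  print(metrics)

  # A named vector is not a data frame; wrap it before returning it.
  metrics_df <- data.frame(metric = names(metrics),
                           value = unname(metrics),
                           stringsAsFactors = FALSE)
  return(list(dataset1 = metrics_df, dataset2 = dataframe2))
}
```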

Sample scripts

There are many ways to extend your pipeline by using custom R scripts. This section provides sample code for common tasks.

Add an R script as an input

The Execute R Script module supports arbitrary R script files as inputs. To use them, you must upload them to your workspace as part of a .zip file.

  1. To upload a .zip file that contains R code to your workspace, go to the Datasets asset page. Select Create dataset, and then select From local file and the File dataset type option.

  2. Verify that the zipped file is available in the My Datasets list under the Datasets category in the left module tree.

  3. Connect the dataset to the Script Bundle input port.

  4. All files in the .zip file are available during pipeline run time.

    If the script bundle file contains a directory structure, the structure is preserved. But you must alter your code to prepend the directory ./Script Bundle to the path.
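
Prepending the bundle directory by hand is easy to get wrong when the archive has subfolders. A small helper built on file.path() can do it consistently. The name bundle_path and the utils/my_script.R layout are hypothetical, used only for illustration.

```r
# Hypothetical helper: build a path inside the unzipped script bundle.
bundle_path <- function(...) {
  file.path("./Script Bundle", ...)
}

# A file stored as utils/my_script.R inside the .zip file would be sourced as:
# source(bundle_path("utils", "my_script.R"))
```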

Process data

The following sample shows how to scale and normalize input data:

# R version: 3.5.1
# The script MUST contain a function named azureml_main,
# which is the entry point for this module.

# Note that functions dependent on the X11 library,
# such as "View," are not supported because the X11 library
# is not preinstalled.

# The entry point function MUST have two input arguments.
# If the input port is not connected, the corresponding
# dataframe argument will be null.
#   Param<dataframe1>: a R DataFrame
#   Param<dataframe2>: a R DataFrame
azureml_main <- function(dataframe1, dataframe2){
  print("R script run.")
  # If a .zip file is connected to the third input port, it's
  # unzipped under "./Script Bundle". This directory is added
  # to sys.path.
  series <- dataframe1$width
  # Find the maximum and minimum values of the width column in dataframe1
  max_v <- max(series)
  min_v <- min(series)
  # calculate the scale and bias
  scale <- max_v - min_v
  bias <- min_v / scale
  # apply min-max normalizing
  dataframe1$width <- dataframe1$width / scale - bias
  dataframe2$width <- dataframe2$width / scale - bias
  # Return datasets as a Named List
  return(list(dataset1=dataframe1, dataset2=dataframe2))
}
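
Min-max normalization maps each value x to (x - min) / (max - min), which is algebraically the same as dividing by the scale and subtracting a bias of min / scale. As a sketch of a reusable variant (min_max_normalize is a name introduced here, not part of the sample), the function below also handles missing values and a constant column, where the range is zero:

```r
# Sketch: min-max normalization of a numeric vector to the range [0, 1].
min_max_normalize <- function(x) {
  rng <- range(x, na.rm = TRUE)   # c(min, max), ignoring missing values
  span <- rng[2] - rng[1]
  if (span == 0) {
    # Constant column: avoid dividing by zero and map every value to 0.
    return(x - rng[1])
  }
  (x - rng[1]) / span
}

# Example: apply it to one column inside the entry point.
# dataframe1$width <- min_max_normalize(dataframe1$width)
```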

Read a .zip file as input

This sample shows how to use a dataset in a .zip file as an input to the Execute R Script module.

  1. Create the data file in CSV format, and name it mydatafile.csv.
  2. Create a .zip file and add the CSV file to the archive.
  3. Upload the zipped file to your Azure Machine Learning workspace.
  4. Connect the resulting dataset to the Script Bundle input of your Execute R Script module.
  5. Use the following code to read the CSV data from the zipped file.

azureml_main <- function(dataframe1, dataframe2){
  print("R script run.")
  mydataset <- read.csv("./Script Bundle/mydatafile.csv", encoding="UTF-8")
  # Return datasets as a Named List
  return(list(dataset1=mydataset, dataset2=dataframe2))
}
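
If the file name or the port connection is wrong, read.csv() fails with a generic message. A hedged sketch of a more defensive reader follows; read_bundle_csv and its bundle_dir argument are hypothetical names, and the default directory is the one the module uses for unzipped bundles.

```r
# Hypothetical helper: read a CSV file from the script bundle with a clear error
# message when the file is missing.
read_bundle_csv <- function(file_name, bundle_dir = "./Script Bundle") {
  path <- file.path(bundle_dir, file_name)
  if (!file.exists(path)) {
    stop("File not found in script bundle: ", path)
  }
  read.csv(path, encoding = "UTF-8", stringsAsFactors = FALSE)
}

# Inside azureml_main:
# mydataset <- read_bundle_csv("mydatafile.csv")
```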

Replicate rows

This sample shows how to replicate positive records in a dataset to balance the sample:

azureml_main <- function(dataframe1, dataframe2){
  # negative samples (first column equals -1)
  data.set <- dataframe1[dataframe1[,1]==-1,]
  # positive samples (first column equals 1)
  pos <- dataframe1[dataframe1[,1]==1,]
  # append the positive samples 20 times to balance the sample
  for (i in 1:20) data.set <- rbind(data.set,pos)
  row.names(data.set) <- NULL
  # Return datasets as a Named List
  return(list(dataset1=data.set, dataset2=dataframe2))
}
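
Calling rbind() in a loop copies the growing data frame on every iteration. An alternative sketch, assuming the same convention that the first column holds labels of -1 and 1, builds a single row index with rep() and subsets once. The function name replicate_positives is introduced here for illustration.

```r
# Sketch: replicate positive rows by index instead of repeated rbind() calls.
replicate_positives <- function(df, times = 20) {
  neg_idx <- which(df[, 1] == -1)
  pos_idx <- which(df[, 1] == 1)
  # Negative rows once, followed by the block of positive rows `times` times.
  idx <- c(neg_idx, rep(pos_idx, times = times))
  out <- df[idx, , drop = FALSE]
  row.names(out) <- NULL
  out
}

# Inside azureml_main:
# data.set <- replicate_positives(dataframe1, times = 20)
```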

Pass R objects between Execute R Script modules

You can pass R objects between instances of the Execute R Script module by using the internal serialization mechanism. This example assumes that you want to move the R object named A between two Execute R Script modules.

  1. Add the first Execute R Script module to your pipeline. Then enter the following code in the R Script text box to create a serialized object A as a column in the module's output data table:

    azureml_main <- function(dataframe1, dataframe2){
      print("R script run.")
      # ... code that creates the object A goes here ...
    
      serialized <- as.integer(serialize(A,NULL))  
      data.set <- data.frame(serialized,stringsAsFactors=FALSE)
    
      return(list(dataset1=data.set, dataset2=dataframe2))
    }
    

    The explicit conversion to integer type is needed because the serialization function outputs data in the R raw format, which the designer doesn't support.

  2. Add a second instance of the Execute R Script module, and connect it to the output port of the previous module.

  3. Type the following code in the R Script text box to extract object A from the input data table:

    azureml_main <- function(dataframe1, dataframe2){
      print("R script run.")
      A <- unserialize(as.raw(dataframe1$serialized))  
      # Return datasets as a Named List
      return(list(dataset1=dataframe1, dataset2=dataframe2))
    }
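
The round trip can be checked locally before it is split across two modules. The following sketch uses an arbitrary example list as the object A and confirms that converting the raw bytes to integers and back restores an identical object:

```r
# Example object; any serializable R object could take its place.
A <- list(name = "example", values = c(1.5, 2.5, 3.5))

# First module: raw bytes -> integers (0 to 255) -> one data frame column.
serialized <- as.integer(serialize(A, NULL))
data.set <- data.frame(serialized, stringsAsFactors = FALSE)

# Second module: integers -> raw bytes -> original object.
restored <- unserialize(as.raw(data.set$serialized))

stopifnot(identical(restored, A))
```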
    

Preinstalled R packages

The following preinstalled R packages are currently available:

Package       Version
askpass       1.1
assertthat    0.2.1
backports     1.1.4
base          3.5.1
base64enc     0.1-3
BH            1.69.0-1
bindr         0.1.1
bindrcpp      0.2.2
bitops        1.0-6
boot          1.3-22
broom         0.5.2
callr         3.2.0
caret         6.0-84
caTools       1.17.1.2
cellranger    1.1.0
class         7.3-15
cli           1.1.0
clipr         0.6.0
cluster       2.0.7-1
codetools     0.2-16
colorspace    1.4-1
compiler      3.5.1
crayon        1.3.4
curl          3.3
data.table    1.12.2
datasets      3.5.1
DBI           1.0.0
dbplyr        1.4.1
digest        0.6.19
dplyr         0.7.6
e1071         1.7-2
evaluate      0.14
fansi         0.4.0
forcats       0.3.0
foreach       1.4.4
foreign       0.8-71
fs            1.3.1
gdata         2.18.0
generics      0.0.2
ggplot2       3.2.0
glmnet        2.0-18
glue          1.3.1
gower         0.2.1
gplots        3.0.1.1
graphics      3.5.1
grDevices     3.5.1
grid          3.5.1
gtable        0.3.0
gtools        3.8.1
haven         2.1.0
highr         0.8
hms           0.4.2
htmltools     0.3.6
httr          1.4.0
ipred         0.9-9
iterators     1.0.10
jsonlite      1.6
KernSmooth    2.23-15
knitr         1.23
labeling      0.3
lattice       0.20-38
lava          1.6.5
lazyeval      0.2.2
lubridate     1.7.4
magrittr      1.5
markdown      1
MASS          7.3-51.4
Matrix        1.2-17
methods       3.5.1
mgcv          1.8-28
mime          0.7
ModelMetrics  1.2.2
modelr        0.1.4
munsell       0.5.0
nlme          3.1-140
nnet          7.3-12
numDeriv      2016.8-1.1
openssl       1.4
parallel      3.5.1
pillar        1.4.1
pkgconfig     2.0.2
plogr         0.2.0
plyr          1.8.4
prettyunits   1.0.2
processx      3.3.1
prodlim       2018.04.18
progress      1.2.2
ps            1.3.0
purrr         0.3.2
quadprog      1.5-7
quantmod      0.4-15
R6            2.4.0
randomForest  4.6-14
RColorBrewer  1.1-2
Rcpp          1.0.1
RcppRoll      0.3.0
readr         1.3.1
readxl        1.3.1
recipes       0.1.5
rematch       1.0.1
reprex        0.3.0
reshape2      1.4.3
reticulate    1.12
rlang         0.4.0
rmarkdown     1.13
ROCR          1.0-7
rpart         4.1-15
rstudioapi    0.1
rvest         0.3.4
scales        1.0.0
selectr       0.4-1
spatial       7.3-11
splines       3.5.1
SQUAREM       2017.10-1
stats         3.5.1
stats4        3.5.1
stringi       1.4.3
stringr       1.3.1
survival      2.44-1.1
sys           3.2
tcltk         3.5.1
tibble        2.1.3
tidyr         0.8.3
tidyselect    0.2.5
tidyverse     1.2.1
timeDate      3043.102
tinytex       0.13
tools         3.5.1
tseries       0.10-47
TTR           0.23-4
utf8          1.1.4
utils         3.5.1
vctrs         0.1.0
viridisLite   0.3.0
whisker       0.3-2
withr         2.1.2
xfun          0.8
xml2          1.2.0
xts           0.11-2
yaml          2.2.0
zeallot       0.1.0
zoo           1.8-6

Next steps

See the set of modules available to Azure Machine Learning.