Reference: Ubuntu (Linux) Data Science Virtual Machine

The following sections list the tools available on your Ubuntu Data Science Virtual Machine (DSVM).

Deep learning libraries


The Microsoft Cognitive Toolkit (CNTK) is an open-source deep learning toolkit. Python bindings are available in the root and py35 Conda environments. It also has a command-line tool (cntk) that's already in the path.

Sample Python notebooks are available in JupyterHub. To run a basic sample at the command line, run the following commands in the shell:

cd /home/[USERNAME]/notebooks/CNTK/HelloWorld-LogisticRegression
cntk configFile=lr_bs.cntk makeMode=false command=Train

For more information, see the CNTK section of GitHub and the CNTK wiki.


Caffe is a deep learning framework from the Berkeley Vision and Learning Center. It's available in /opt/caffe. You can find examples in /opt/caffe/examples.


Caffe2 is a deep learning framework from Facebook that is built on Caffe. It's available in Python 2.7 in the Conda root environment. To activate it, run the following command from the shell:

source /anaconda/bin/activate root

Some example notebooks are available in JupyterHub.


H2O is a fast, in-memory, distributed machine learning and predictive analytics platform. A Python package is installed in both the root and py35 Anaconda environments. An R package is also installed.

To open H2O from the command line, run java -jar /dsvm/tools/h2o/current/h2o.jar. You can configure various command-line options. To get started, open the Flow web UI at http://localhost:54321. Sample notebooks are also available in JupyterHub.
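
Before opening the Flow web UI in a browser, it can help to confirm that H2O has finished starting. The following is a minimal sketch using only the Python standard library. The helper name is_port_open is hypothetical, and the check only tests that something accepts TCP connections on the port, not that it is H2O.

```python
import socket


def is_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolved host names.
        return False


if __name__ == "__main__":
    # 54321 is the Flow port named in the text above.
    print("H2O Flow reachable:", is_port_open("localhost", 54321))
```

The same helper can be pointed at other local services that this page mentions, such as port 5000 for DIGITS or port 8000 for JupyterHub.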


Keras is a high-level neural network API in Python. It can run on top of TensorFlow, Microsoft Cognitive Toolkit, or Theano. It's available in the root and py35 Python environments.


MXNet is a deep learning framework designed for both efficiency and flexibility. R and Python bindings are included on the DSVM. Sample notebooks are included in JupyterHub, and sample code is available in /dsvm/samples/mxnet.


The NVIDIA Deep Learning GPU Training System, known as DIGITS, is a system to simplify common deep learning tasks. These tasks include managing data, designing and training neural networks on GPU systems, and monitoring performance in real time with advanced visualization.

DIGITS is available as a service called digits. Start the service and browse to http://localhost:5000 to get started.

DIGITS is also installed as a Python module in the Conda root environment.


TensorFlow is Google's deep learning library. It's an open-source software library for numerical computation using data flow graphs. TensorFlow is available in the py35 Python environment, and some sample notebooks are included in JupyterHub.


Theano is a Python library for efficient numerical computation. It's available in the root and py35 Python environments.


Torch is a scientific computing framework with wide support for machine learning algorithms. It's available in /dsvm/tools/torch, and the th interactive session and LuaRocks package manager are available at the command line. Examples are available in /dsvm/samples/torch.

PyTorch is also available in the root Anaconda environment. Examples are in /dsvm/samples/pytorch.

Microsoft Machine Learning Server

R is one of the most popular languages for data analysis and machine learning. If you want to use R for your analytics, the VM has Microsoft Machine Learning Server with Microsoft R Open and Math Kernel Library. Math Kernel Library optimizes math operations common in analytical algorithms. Microsoft R Open is 100 percent compatible with CRAN R, and any of the R libraries published in CRAN can be installed on Microsoft R Open.

Machine Learning Server lets you scale R models and operationalize them as web services. You can edit your R programs in one of the default editors, like RStudio, vi, or Emacs. Emacs is pre-installed, and the Emacs ESS (Emacs Speaks Statistics) package simplifies working with R files within the Emacs editor.

To open the R console, enter R in the shell. This command takes you to an interactive environment. To develop your R program, you typically use an editor like Emacs or vi, and then run the scripts within R. With RStudio, you have a full graphical IDE to develop your R program.

There's also an R script that installs the Top 20 R packages, if you want them. Run this script from the R interactive interface, which you open by entering R in the shell.


Anaconda Python is installed with Python 2.7 and 3.5 environments. The 2.7 environment is called root, and the 3.5 environment is called py35. This distribution contains the base Python along with about 300 of the most popular math, engineering, and data analytics packages.

The py35 environment is the default. To activate the root (2.7) environment, use this command:

source activate root

To activate the py35 environment again, use this command:

source activate py35

To invoke a Python interactive session, enter python in the shell.
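
Because two environments are installed, it is easy to lose track of which one a session is using. The sketch below, which is an illustration rather than part of the DSVM tooling, reports the running interpreter. The function name describe_interpreter is hypothetical, and the code avoids newer syntax so that it should run under both the 2.7 and 3.5 environments.

```python
import sys


def describe_interpreter():
    """Summarize the interpreter that is currently running."""
    return {
        "executable": sys.executable,  # path of the python binary in use
        "version": "{}.{}".format(sys.version_info[0], sys.version_info[1]),
        "prefix": sys.prefix,  # root directory of the active environment
    }


if __name__ == "__main__":
    for key, value in sorted(describe_interpreter().items()):
        print("{} = {}".format(key, value))
```

Running it after each source activate command shows whether the switch took effect.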

Install additional Python libraries by using Conda or pip. For pip, activate the correct environment first if you don't want the default:

source activate root
pip install <package>

Or specify the full path to pip:

/anaconda/bin/pip install <package>

For Conda, you should always specify the environment name (py35 or root):

conda install <package> -n py35
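
When installs are scripted, the "always name the environment" rule can be enforced in code. The following is a small sketch under that assumption. The helper conda_install_command and the package name seaborn are hypothetical and used only for illustration.

```python
import shlex

# Environment names given in the text above.
KNOWN_ENVS = ("root", "py35")


def conda_install_command(package, env):
    """Build the argument list for `conda install <package> -n <env>`."""
    if env not in KNOWN_ENVS:
        raise ValueError("unknown environment: {!r}".format(env))
    return ["conda", "install", package, "-n", env]


if __name__ == "__main__":
    argv = conda_install_command("seaborn", "py35")
    # Print the command instead of running it; pass argv to
    # subprocess.check_call to execute it for real.
    print(" ".join(shlex.quote(arg) for arg in argv))
```

Rejecting unknown names up front avoids installing into an unintended environment because of a typo.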

If you're on a graphical interface or have X11 forwarding set up, you can enter pycharm to open the PyCharm Python IDE. You can use the default text editors. In addition, you can use Spyder, a Python IDE that's bundled with Anaconda Python distributions. Spyder needs a graphical desktop or X11 forwarding. The graphical desktop has a shortcut to Spyder.

Jupyter notebook

The Anaconda distribution also comes with a Jupyter notebook, an environment for sharing code and analysis. You access the Jupyter notebook through JupyterHub, and you sign in by using your local Linux username and password.

The Jupyter notebook server has been pre-configured with Python 2, Python 3, and R kernels. Use the Jupyter Notebook desktop icon to open the browser and access the notebook server. If you're on the VM via SSH or the X2Go client, you can also access the Jupyter notebook server at https://localhost:8000/.


If you get any certificate warnings, continue.

You can access the Jupyter notebook server from any host. Enter https://<VM DNS name or IP address>:8000/.


Port 8000 is opened in the firewall by default when the VM is provisioned.
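
A script that probes the notebook server has to deal with the same certificate warning that the browser shows. The sketch below assumes the warning comes from a self-signed certificate, which the text does not state explicitly. The function names are hypothetical. Skipping verification is appropriate only for a server you control.

```python
import ssl
import urllib.request


def jupyterhub_url(host, port=8000):
    """Build the notebook server URL from a DNS name or IP address."""
    return "https://{}:{}/".format(host, port)


def unverified_context():
    """SSL context that skips certificate checks (assumed self-signed cert)."""
    ctx = ssl.create_default_context()
    ctx.check_hostname = False  # must be disabled before changing verify_mode
    ctx.verify_mode = ssl.CERT_NONE
    return ctx


def hub_status(host, port=8000, timeout=5.0):
    """Return the HTTP status of the landing page, or None on any error."""
    try:
        response = urllib.request.urlopen(
            jupyterhub_url(host, port), context=unverified_context(), timeout=timeout
        )
        with response:
            return response.status
    except OSError:
        # URLError and HTTPError are OSError subclasses in Python 3.
        return None
```

Calling hub_status("localhost") from an SSH session on the VM gives a quick check that the server is up before you try a remote browser.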

We have packaged two sample notebooks: one in Python and one in R. You can see the link to the samples on the notebook home page after you authenticate to the Jupyter notebook by using your local Linux username and password. You can create a new notebook by selecting New, and then selecting the appropriate language kernel. If you don't see the New button, select the Jupyter icon on the upper left to go to the home page of the notebook server.

Apache Spark standalone

A standalone instance of Apache Spark is preinstalled on the Linux DSVM to help you develop Spark applications locally before you test and deploy them on large clusters.

You can run PySpark programs through the Jupyter kernel. When you open Jupyter, select the New button and you should see a list of available kernels. Spark - Python is the PySpark kernel that lets you build Spark applications by using the Python language. You can also use a Python IDE like PyCharm or Spyder to build your Spark program.

In this standalone instance, the Spark stack runs within the calling client program. This makes troubleshooting faster and easier than developing on a Spark cluster.

Jupyter provides a sample PySpark notebook. You can find it in the SparkML directory under the home directory of Jupyter ($HOME/notebooks/SparkML/pySpark).

If you're programming in R for Spark, you can use Microsoft Machine Learning Server, SparkR, or sparklyr.

Before you run in a Spark context in Microsoft Machine Learning Server, you need to do a one-time setup step to enable a local single-node Hadoop HDFS and Yarn instance. By default, Hadoop services are installed but disabled on the DSVM. To enable them, run the following commands as root the first time:

echo -e 'y\n' | ssh-keygen -t rsa -P '' -f ~hadoop/.ssh/id_rsa
cat ~hadoop/.ssh/id_rsa.pub >> ~hadoop/.ssh/authorized_keys
chmod 0600 ~hadoop/.ssh/authorized_keys
chown hadoop:hadoop ~hadoop/.ssh/id_rsa
chown hadoop:hadoop ~hadoop/.ssh/id_rsa.pub
chown hadoop:hadoop ~hadoop/.ssh/authorized_keys
systemctl start hadoop-namenode hadoop-datanode hadoop-yarn

You can stop the Hadoop-related services when you don't need them by running systemctl stop hadoop-namenode hadoop-datanode hadoop-yarn.

The /dsvm/samples/MRS directory provides a sample that demonstrates how to develop and test Microsoft Machine Learning Server in a remote Spark context (the standalone Spark instance on the DSVM).

IDEs and editors

You have a choice of several code editors, including vi/Vim, Emacs, PyCharm, RStudio, and IntelliJ.

PyCharm, RStudio, and IntelliJ are graphical editors. To use them, you need to be signed in to a graphical desktop. You open them by using desktop and application menu shortcuts.

Vim and Emacs are text-based editors. On Emacs, the ESS add-on package makes working with R easier within the Emacs editor. You can find more information on the ESS website.

LaTeX is installed through the texlive package, along with an Emacs add-on package called AUCTeX. This package simplifies authoring your LaTeX documents within Emacs.


Graphical SQL client

SQuirrel SQL, a graphical SQL client, can connect to various databases (such as Microsoft SQL Server and MySQL) and run SQL queries. You can run SQuirrel SQL from a graphical desktop session (through the X2Go client, for example) by using a desktop icon. Alternatively, you can start the client from the shell.


Before the first use, set up your drivers and database aliases. The JDBC drivers are located at /usr/share/java/jdbcdrivers.

For more information, see SQuirrel SQL.

Command-line tools for accessing Microsoft SQL Server

The ODBC driver package for SQL Server comes with two command-line tools:

  • bcp: The bcp tool bulk copies data between an instance of Microsoft SQL Server and a data file in a user-specified format. You can use the bcp tool to import large numbers of new rows into SQL Server tables, or to export data out of tables into data files. To import data into a table, you must either use a format file created for that table or understand the structure of the table and the types of data that are valid for its columns.

    For more information, see Connecting with bcp.

  • sqlcmd: You can enter Transact-SQL statements by using the sqlcmd tool. You can also enter system procedures and script files at the command prompt. This tool uses ODBC to run Transact-SQL batches.

    For more information, see Connecting with sqlcmd.


    There are some differences in this tool between Linux and Windows platforms. See the documentation for details.
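
To illustrate how sqlcmd might be driven from a script, the sketch below assembles an argument list with the -S (server), -d (database), -U (login), and -Q (run query and exit) options. The helper name and the server, database, and login values are hypothetical. The password is deliberately left out so that sqlcmd prompts for it instead of exposing it in the process list.

```python
def sqlcmd_argv(server, database, user, query):
    """Build an argument list that runs one Transact-SQL query and exits."""
    return [
        "sqlcmd",
        "-S", server,    # server name or address
        "-d", database,  # initial database
        "-U", user,      # SQL login; sqlcmd prompts for the password
        "-Q", query,     # run the query, then exit
    ]


if __name__ == "__main__":
    # Hypothetical connection details, for illustration only.
    argv = sqlcmd_argv("myserver.example.com", "mydb", "dsvmuser", "SELECT @@VERSION")
    print(argv)
```

Passing the list to subprocess.call(argv) runs the query without going through a shell, which sidesteps quoting problems in the query text.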

Database access libraries

Libraries are available in R and Python for database access:

  • In R, you can use the RODBC package or dplyr package to query or run SQL statements on the database server.
  • In Python, the pyodbc library provides database access with ODBC as the underlying layer.
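
With pyodbc, most of the setup effort goes into the ODBC connection string. The following sketch builds one from its parts. The driver name "ODBC Driver 17 for SQL Server" is an assumption (the name registered on your VM may differ), and the helper name and connection values are hypothetical.

```python
def sql_server_connection_string(server, database, user, password,
                                 driver="ODBC Driver 17 for SQL Server"):
    """Assemble a semicolon-separated ODBC connection string.

    The default driver name is an assumption; use the name registered
    on your machine. Values that contain ';' would need extra quoting,
    which this sketch does not handle.
    """
    parts = [
        ("DRIVER", "{" + driver + "}"),
        ("SERVER", server),
        ("DATABASE", database),
        ("UID", user),
        ("PWD", password),
    ]
    return ";".join("{}={}".format(key, value) for key, value in parts)


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    print(sql_server_connection_string("myserver.example.com", "mydb", "dsvmuser", "secret"))
```

The resulting string can be passed to pyodbc.connect, after which a cursor from the connection runs SQL statements.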

Azure tools

The following Azure tools are installed on the VM:

  • Azure CLI: The Azure command-line interface lets you create and manage Azure resources through shell commands. To open the Azure tools, enter azure help. For more information, see the Azure CLI documentation page.

  • Azure Storage Explorer: Azure Storage Explorer is a graphical tool that you can use to browse through the objects that you have stored in your Azure storage account, and to upload and download data to and from Azure blobs. You can access Storage Explorer from the desktop shortcut icon. You can also open it from a shell prompt by entering StorageExplorer. You must be signed in from an X2Go client, or have X11 forwarding set up.

  • Azure libraries: The following are some of the pre-installed libraries.

    • Python: The Azure-related libraries in Python are azure, azureml, pydocumentdb, and pyodbc. With the first three libraries, you can access Azure storage services, Azure Machine Learning, and Azure Cosmos DB (a NoSQL database on Azure). The fourth library, pyodbc (along with the Microsoft ODBC driver for SQL Server), enables access to SQL Server, Azure SQL Database, and Azure SQL Data Warehouse from Python by using an ODBC interface. Enter pip list to see all the installed libraries. Be sure to run this command in both the Python 2.7 and 3.5 environments.
    • R: The Azure-related libraries in R are AzureML and RODBC.
    • Java: The list of Azure Java libraries can be found in the directory /dsvm/sdk/AzureSDKJava on the VM. The key libraries are the Azure storage and management APIs, Azure Cosmos DB, and JDBC drivers for SQL Server.

You can access the Azure portal from the pre-installed Firefox browser. On the Azure portal, you can create, manage, and monitor Azure resources.

Azure Machine Learning

Azure Machine Learning is a fully managed cloud service that enables you to build, deploy, and share predictive analytics solutions. You can build your experiments and models from Azure Machine Learning Studio (classic). You can access it from a web browser on the Data Science Virtual Machine by visiting Microsoft Azure Machine Learning.

After you sign in to Azure Machine Learning Studio (classic), you can use an experimentation canvas to build a logical flow for the machine learning algorithms. You also have access to a Jupyter notebook that is hosted on Azure Machine Learning and can work seamlessly with the experiments in Azure Machine Learning Studio (classic).

Operationalize the machine learning models that you have built by wrapping them in a web service interface. Operationalizing machine learning models enables clients written in any language to invoke predictions from those models. For more information, see the Machine Learning documentation.
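
Once a model is behind a web service interface, a client only needs to send JSON over HTTPS. The sketch below builds such a request with the Python standard library without sending it. The endpoint URL, the API key, the bearer-token header, and the payload shape are all assumptions for illustration; the real values and schema come from the service you deploy.

```python
import json
import urllib.request


def build_scoring_request(url, api_key, payload):
    """Build (but do not send) a JSON POST request for a scoring endpoint."""
    body = json.dumps(payload).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        # Bearer-token authentication is an assumption about the service.
        "Authorization": "Bearer " + api_key,
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


if __name__ == "__main__":
    # Hypothetical endpoint, key, and input columns.
    request = build_scoring_request(
        "https://example.invalid/score",
        "YOUR_API_KEY",
        {"inputs": [{"sqft": 1200, "age": 15}]},
    )
    print(request.get_method(), request.full_url)
```

Sending the request with urllib.request.urlopen and decoding the response body with the json module would return the predictions, in whatever format the service defines.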

You can also build your models in R or Python on the VM, and then deploy them in production on Azure Machine Learning. We have installed libraries in R (AzureML) and Python (azureml) to enable this functionality.

For information on how to deploy models in R and Python into Azure Machine Learning, see Ten things you can do on the Data Science Virtual Machine.


These instructions were written for the Windows version of the Data Science Virtual Machine. But the information provided there on deploying models to Azure Machine Learning is applicable to the Linux VM.

Machine learning tools

The VM comes with machine learning tools and algorithms that have been pre-compiled and pre-installed locally. These include:

  • Vowpal Wabbit: A fast online learning algorithm.

  • xgboost: A tool that provides optimized, boosted tree algorithms.

  • Rattle: An R-based graphical tool for easy data exploration and modeling.

  • Python: Anaconda Python comes bundled with machine learning libraries like scikit-learn. You can install other libraries by using the pip install command.

  • LightGBM: A fast, distributed, high-performance gradient boosting framework based on decision tree algorithms.

  • R: A rich library of machine learning functions is available for R. Pre-installed functions and packages include lm, glm, randomForest, and rpart. You can install other libraries by running this command:

      install.packages(<lib name>)

Here is some additional information about the first three machine learning tools in the list.

Vowpal Wabbit

Vowpal Wabbit is a machine learning system that uses techniques such as online, hashing, allreduce, reductions, learning2search, active, and interactive learning.

To run the tool on a basic example, use the following commands:

cp -r /dsvm/tools/VowpalWabbit/demo vwdemo
cd vwdemo
vw house_dataset

There are other, larger demos in that directory. For more information on Vowpal Wabbit, see this section of GitHub and the Vowpal Wabbit wiki.
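
To try the tool on your own data, you first need a file in Vowpal Wabbit's text input format. The sketch below assumes the basic form of that format, a label followed by a pipe and space-separated name:value features. The helper name to_vw_line and the sample feature names are hypothetical.

```python
def to_vw_line(label, features):
    """Format one example as 'label | name:value name:value ...'.

    Features are sorted by name so that the output is deterministic.
    """
    for name in features:
        # These characters are separators in the format itself.
        if any(ch in name for ch in " :|"):
            raise ValueError("invalid feature name: {!r}".format(name))
    pairs = " ".join(
        "{}:{}".format(name, value) for name, value in sorted(features.items())
    )
    return "{} | {}".format(label, pairs)


if __name__ == "__main__":
    # Hypothetical rows, for illustration only.
    rows = [(1, {"price": 0.23, "sqft": 0.25}), (0, {"price": 0.18, "sqft": 0.15})]
    for label, features in rows:
        print(to_vw_line(label, features))
```

Writing these lines to a file gives an input that can be passed to vw in the same way as the demo data set.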


The xgboost library is designed and optimized for boosted (tree) algorithms. The objective of this library is to push the computation limits of machines to the extremes needed to provide large-scale tree boosting that is scalable, portable, and accurate.

It's provided as a command-line tool and an R library. To use this library in R, you can start an interactive R session (by entering R in the shell) and load the library.

Here's a simple example that you can run at the R prompt:


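# Load xgboost and its bundled mushroom (agaricus) sample data.
library(xgboost)
data(agaricus.train, package='xgboost')
data(agaricus.test, package='xgboost')
train <- agaricus.train
test <- agaricus.test
# Train a small boosted-tree binary classifier (2 rounds, tree depth 2).
bst <- xgboost(data = train$data, label = train$label, max.depth = 2,
                eta = 1, nthread = 2, nrounds = 2, objective = "binary:logistic")
# Predict probabilities for the test set.
pred <- predict(bst, test$data)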

To run the xgboost command-line tool, run these commands in the shell:

cp -r /dsvm/tools/xgboost/demo/binary_classification/ xgboostdemo
cd xgboostdemo
xgboost mushroom.conf

A .model file is written to the specified directory. You can find information about this demo example on GitHub.

For more information about xgboost, see the xgboost documentation page and its GitHub repository.


Rattle (the R Analytical Tool To Learn Easily) provides GUI-based data exploration and modeling. It presents statistical and visual summaries of data, transforms data so that it can be readily modeled, builds both unsupervised and supervised models from the data, presents the performance of models graphically, and scores new data sets. It also generates R code that replicates the operations in the UI. You can run this code directly in R or use it as a starting point for further analysis.

To run Rattle, you need to be in a graphical desktop sign-in session. On the terminal, enter R to open the R environment. At the R prompt, load the rattle package and launch its GUI.


Now a graphical interface opens with a set of tabs. Use the following quickstart steps in Rattle to use a sample weather data set and build a model. In some of the steps, you're prompted to automatically install and load some required R packages that are not already on the system.


If you don't have access to install the package in the system directory (the default), you might see a prompt on your R console window to install packages to your personal library. Answer y if you see these prompts.

  1. Select Execute.
  2. A dialog box appears, asking you if you want to use the example weather data set. Select Yes to load the example.
  3. Select the Model tab.
  4. Select Execute to build a decision tree.
  5. Select Draw to display the decision tree.
  6. Select the Forest option, and select Execute to build a random forest.
  7. Select the Evaluate tab.
  8. Select the Risk option, and select Execute to display two Risk (Cumulative) performance plots.
  9. Select the Log tab to show the generated R code for the preceding operations. (Because of a bug in the current release of Rattle, you need to insert a # character in front of Export this log in the text of the log.)
  10. Select the Export button to save the R script file named weather_script.R to the home folder.

You can exit Rattle and R. Now you can modify the generated R script. Or, use the script as it is, and run it anytime to repeat everything that was done within the Rattle UI. Especially for beginners in R, this is a way to quickly do analysis and machine learning in a simple graphical interface, while automatically generating R code that you can modify or learn from.

Next steps

Have additional questions? Consider creating a support ticket.