Data platforms supported on the Data Science Virtual Machine

With a Data Science Virtual Machine (DSVM), you can build your analytics against a wide range of data platforms. In addition to interfaces to remote data platforms, the DSVM provides a local instance for rapid development and prototyping.

The following data platform tools are supported on the DSVM.

SQL Server Developer Edition

Category Value
What is it? A local relational database instance
Supported DSVM editions Windows 2016: SQL Server 2017, Windows 2019: SQL Server 2019
Typical uses
  • Rapid development locally with a smaller dataset
  • Run in-database R
Links to samples
  • A small sample of a New York City dataset is loaded into the SQL database:
    nyctaxi
  • A Jupyter sample showing Microsoft Machine Learning Server and in-database analytics can be found at:
    ~notebooks/SQL_R_Services_End_to_End_Tutorial.ipynb
Related tools on the DSVM
  • SQL Server Management Studio
  • ODBC/JDBC drivers
  • pyodbc, RODBC
  • Apache Drill

Note

SQL Server Developer Edition can be used only for development and test purposes. You need a license or one of the SQL Server VMs to run it in production.

Setup

The database server is already preconfigured, and the Windows services related to SQL Server (like SQL Server (MSSQLSERVER)) are set to run automatically. The only manual step is enabling in-database analytics by using Microsoft Machine Learning Server. You enable analytics by running the following command once in SQL Server Management Studio (SSMS). Log in as the machine administrator, open a new query in SSMS, make sure the selected database is master, and then run:

CREATE LOGIN [%COMPUTERNAME%\SQLRUserGroup] FROM WINDOWS 
    (Replace %COMPUTERNAME% with your VM name.)
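
Because the login name depends on the VM's computer name, the statement can be generated instead of edited by hand. The following is a minimal Python sketch, assuming it runs on the DSVM itself. The helper name `create_login_statement` is hypothetical and is not part of any DSVM tooling; it only builds the statement text, which you would still run in SSMS against the master database.

```python
import os
import platform


def create_login_statement(computer_name=None):
    """Return the CREATE LOGIN statement for this VM's SQLRUserGroup.

    Hypothetical helper for illustration only; it does not connect to SQL Server.
    """
    # Windows sets COMPUTERNAME; fall back to the network node name otherwise.
    name = computer_name or os.environ.get("COMPUTERNAME") or platform.node()
    if not name:
        raise ValueError("could not determine the computer name")
    # Inside a bracketed T-SQL identifier, a closing bracket is escaped by doubling it.
    principal = f"{name}\\SQLRUserGroup".replace("]", "]]")
    return f"CREATE LOGIN [{principal}] FROM WINDOWS"
```

Calling `create_login_statement("MYVM")` produces the statement for a VM named MYVM, where MYVM is a placeholder name.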
   

To run SQL Server Management Studio, search for "SQL Server Management Studio" in the program list, or use Windows Search to find and run it. When prompted for credentials, select Windows Authentication and enter the machine name or localhost in the SQL Server Name field.

How to use and run it

By default, the database server with the default database instance runs automatically. You can use tools like SQL Server Management Studio on the VM to access the SQL Server database locally. Local administrator accounts have admin access on the database.

The DSVM also comes with ODBC and JDBC drivers to talk to SQL Server, Azure SQL databases, and Azure SQL Data Warehouse from applications written in multiple languages, including Python and Machine Learning Server.
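
As an illustration of the ODBC route from Python, the following sketch connects to the local default instance with Windows Authentication through pyodbc. The driver name "ODBC Driver 17 for SQL Server" is an assumption about what is installed on a given image (`pyodbc.drivers()` lists the available names), and both function names are hypothetical.

```python
def build_connection_string(server="localhost", database="master",
                            driver="ODBC Driver 17 for SQL Server"):
    """Build an ODBC connection string that uses Windows Authentication."""
    # The driver name is an assumption; check pyodbc.drivers() on your VM.
    return (
        f"DRIVER={{{driver}}};"
        f"SERVER={server};"
        f"DATABASE={database};"
        "Trusted_Connection=yes;"
    )


def fetch_server_version(connection_string):
    """Connect with pyodbc and return the SQL Server version string."""
    import pyodbc  # imported lazily so the string helper works without pyodbc

    connection = pyodbc.connect(connection_string)
    try:
        cursor = connection.cursor()
        cursor.execute("SELECT @@VERSION")
        return cursor.fetchone()[0]
    finally:
        connection.close()
```

On the DSVM, `fetch_server_version(build_connection_string())` should return the version banner of the local instance, provided the account running Python has access to the database.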

How is it configured and installed on the DSVM?

SQL Server is installed in the standard way. It can be found at C:\Program Files\Microsoft SQL Server. The in-database Machine Learning Server instance is found at C:\Program Files\Microsoft SQL Server\MSSQL13.MSSQLSERVER\R_SERVICES. The DSVM also has a separate standalone Machine Learning Server instance, which is installed at C:\Program Files\Microsoft\R Server\R_SERVER. These two Machine Learning Server instances don't share libraries.

Apache Spark 2.x (Standalone)

Category Value
What is it? A standalone (single node in-process) instance of the popular Apache Spark platform; a system for fast, large-scale data processing and machine learning
Supported DSVM editions Linux
Typical uses
  • Rapid development of Spark/PySpark applications locally with a smaller dataset, and later deployment on large Spark clusters such as Azure HDInsight
  • Test Microsoft Machine Learning Server Spark context
  • Use SparkML or Microsoft's open-source MMLSpark library to build ML applications
Links to samples
  • Jupyter samples:
    ~/notebooks/SparkML/pySpark
    ~/notebooks/MMLSpark
  • Microsoft Machine Learning Server (Spark context):
    /dsvm/samples/MRS/MRSSparkContextSample.R
Related tools on the DSVM
  • PySpark, Scala
  • Jupyter (Spark/PySpark Kernels)
  • Microsoft Machine Learning Server, SparkR, Sparklyr
  • Apache Drill

How to use it

You can submit Spark jobs on the command line by running the spark-submit or pyspark command. You can also work in Jupyter by creating a new notebook with the Spark kernel.
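
A command-line submission can be sketched in Python as follows. The sketch assumes spark-submit is on the PATH and that the local master URL `local[*]` suits the single-node instance; the function names and the tiny job are hypothetical illustrations, not samples shipped with the DSVM.

```python
import subprocess


def spark_submit_command(script_path, master="local[*]", script_args=()):
    """Return the argument list for submitting a script with spark-submit."""
    # "local[*]" (an assumption) runs Spark in-process using all local cores.
    return ["spark-submit", "--master", master, script_path, *script_args]


def submit(script_path, **kwargs):
    """Run spark-submit and raise CalledProcessError if the job fails."""
    return subprocess.run(spark_submit_command(script_path, **kwargs), check=True)


def count_even_numbers(limit):
    """Tiny PySpark job: count the even numbers in the range [0, limit)."""
    from pyspark.sql import SparkSession  # lazy import: requires PySpark

    # No master is set here, so the one passed to spark-submit applies.
    spark = SparkSession.builder.appName("dsvm-sketch").getOrCreate()
    try:
        return spark.range(limit).filter("id % 2 = 0").count()
    finally:
        spark.stop()
```

A script that calls `count_even_numbers` could be saved to a file and passed to `submit`, or the function body could be pasted into a notebook that uses the PySpark kernel.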

You can use Spark from R by using libraries like SparkR, Sparklyr, and Microsoft Machine Learning Server, which are available on the DSVM. See the links to samples in the preceding table.

Setup

Before running in a Spark context in Microsoft Machine Learning Server on the Ubuntu Linux DSVM edition, you must complete a one-time setup step to enable a local single-node Hadoop HDFS and Yarn instance. By default, Hadoop services are installed but disabled on the DSVM. To enable them, run the following commands as root the first time:

echo -e 'y\n' | ssh-keygen -t rsa -P '' -f ~hadoop/.ssh/id_rsa
cat ~hadoop/.ssh/id_rsa.pub >> ~hadoop/.ssh/authorized_keys
chmod 0600 ~hadoop/.ssh/authorized_keys
chown hadoop:hadoop ~hadoop/.ssh/id_rsa
chown hadoop:hadoop ~hadoop/.ssh/id_rsa.pub
chown hadoop:hadoop ~hadoop/.ssh/authorized_keys
systemctl start hadoop-namenode hadoop-datanode hadoop-yarn

You can stop the Hadoop-related services when you no longer need them by running systemctl stop hadoop-namenode hadoop-datanode hadoop-yarn.
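
If you start and stop these services from a script, a thin wrapper around systemctl keeps the three unit names in one place. This is a minimal sketch that assumes it runs as root on the Linux DSVM; the names `HADOOP_UNITS`, `systemctl_command`, and `run_systemctl` are hypothetical.

```python
import subprocess

# Unit names taken from the systemctl commands in this section.
HADOOP_UNITS = ("hadoop-namenode", "hadoop-datanode", "hadoop-yarn")


def systemctl_command(action, units=HADOOP_UNITS):
    """Return the systemctl argument list for the Hadoop-related units."""
    if action not in {"start", "stop", "is-active"}:
        raise ValueError(f"unsupported action: {action}")
    return ["systemctl", action, *units]


def run_systemctl(action):
    """Run systemctl and return the completed process without raising."""
    # check=False because "is-active" exits non-zero when units are inactive.
    return subprocess.run(
        systemctl_command(action), capture_output=True, text=True, check=False
    )
```

For example, `run_systemctl("is-active").stdout` lists one state per unit, which is a quick way to confirm whether the services are running before you open a Spark context.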

A sample that demonstrates how to develop and test MRS in a remote Spark context (which is the standalone Spark instance on the DSVM) is available in the /dsvm/samples/MRS directory.

How is it configured and installed on the DSVM?

Platform Install Location ($SPARK_HOME)
Linux /dsvm/tools/spark-X.X.X-bin-hadoopX.X

Libraries to access data from Azure Blob storage or Azure Data Lake Storage, using the Microsoft MMLSpark machine-learning libraries, are preinstalled in $SPARK_HOME/jars. These JARs are automatically loaded when Spark starts up. By default, Spark uses data on the local disk.

For the Spark instance on the DSVM to access data stored in Blob storage or Azure Data Lake Storage, you must create and configure the core-site.xml file based on the template found in $SPARK_HOME/conf/core-site.xml.template. You must also have the appropriate credentials to access Blob storage and Azure Data Lake Storage. (Note that the template files use placeholders for Blob storage and Azure Data Lake Storage configurations.)
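
To show the general shape of such a file, the following sketch generates Hadoop-style configuration XML for a Blob storage account key. It assumes the `fs.azure.account.key.<account>.blob.core.windows.net` property used by the Hadoop Azure connector; the DSVM template may list different or additional properties, so treat the template as the authority. The function names, the account name, and the key value are hypothetical placeholders.

```python
import xml.etree.ElementTree as ET


def blob_account_key_property(account_name, account_key):
    """Return the property that holds the access key for one storage account."""
    # Property name follows the Hadoop Azure (WASB) connector convention (assumption).
    name = f"fs.azure.account.key.{account_name}.blob.core.windows.net"
    return {name: account_key}


def build_core_site(properties):
    """Return Hadoop-style configuration XML for a dict of names and values."""
    configuration = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(configuration, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(configuration, encoding="unicode")


# Placeholder values only; never commit a real account key to source control.
core_site_xml = build_core_site(
    blob_account_key_property("examplestorage", "PLACEHOLDER_KEY")
)
```

The resulting string could be written to $SPARK_HOME/conf/core-site.xml after you compare it with the placeholders in the template.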