Data science with a Windows Data Science Virtual Machine

The Windows Data Science Virtual Machine (DSVM) is a powerful data science development environment where you can perform data exploration and modeling tasks. The environment comes already built and bundled with several popular data analytics tools that make it easy to get started with your analysis for on-premises, cloud, or hybrid deployments.

The DSVM works closely with Azure services. It can read and process data that's already stored on Azure, in Azure Synapse (formerly SQL DW), Azure Data Lake, Azure Storage, or Azure Cosmos DB. It can also take advantage of other analytics tools, such as Azure Machine Learning.

In this article, you'll learn how to use your DSVM to perform data science tasks and interact with other Azure services. Here are some of the things you can do on the DSVM:

  • Use a Jupyter notebook to experiment with your data in a browser by using Python 2, Python 3, and Microsoft R. (Microsoft R is an enterprise-ready version of R designed for performance.)

  • Explore data and develop models locally on the DSVM by using Microsoft Machine Learning Server and Python.

  • Administer your Azure resources by using the Azure portal or PowerShell.

  • Extend your storage space and share large-scale datasets/code across your whole team by creating an Azure Files share as a mountable drive on your DSVM.

  • Share code with your team by using GitHub. Access your repository by using the pre-installed Git clients: Git Bash and Git GUI.

  • Access Azure data and analytics services like Azure Blob storage, Azure Cosmos DB, Azure Synapse (formerly SQL DW), and Azure SQL Database.

  • Build reports and a dashboard by using the Power BI Desktop instance that's pre-installed on the DSVM, and deploy them in the cloud.

  • Install additional tools on your virtual machine.

Note

Additional usage charges apply for many of the data storage and analytics services listed in this article. For details, see the Azure pricing page.

Prerequisites

Note

This article has been updated to use the new Azure PowerShell Az module. You can still use the AzureRM module, which will continue to receive bug fixes until at least December 2020. To learn more about the new Az module and AzureRM compatibility, see Introducing the new Azure PowerShell Az module. For Az module installation instructions, see Install Azure PowerShell.

Use Jupyter notebooks

The Jupyter Notebook provides a browser-based IDE for data exploration and modeling. You can use Python 2, Python 3, or R (both open source and Microsoft R Server) in a Jupyter notebook.

To start the Jupyter Notebook, select the Jupyter Notebook icon on the Start menu or on the desktop. You can also run the command jupyter notebook from a DSVM command prompt, in the directory where you have existing notebooks or where you want to create new ones.

After you start Jupyter, navigate to the /notebooks directory for example notebooks that are pre-packaged into the DSVM. Now you can:

  • Select the notebook to see the code.
  • Run each cell by selecting Shift+Enter.
  • Run the entire notebook by selecting Cell > Run.
  • Create a new notebook by selecting the Jupyter icon (upper-left corner), selecting the New button on the right, and then choosing the notebook language (also known as the kernel).

Note

Currently, the Python 2.7, Python 3.6, R, Julia, and PySpark kernels are supported in Jupyter. The R kernel supports programming in both open-source R and Microsoft R.

When you're in the notebook, you can explore your data, build the model, and test the model by using your choice of libraries.
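For instance, a first exploration cell in a Python notebook often starts with pandas. The snippet below uses a small inline sample (hypothetical column names and values) in place of a real dataset, so the cell runs standalone:

```python
import pandas as pd

# Small inline sample standing in for data you'd normally load with
# pd.read_csv(...); the column names and values are hypothetical.
df = pd.DataFrame({
    "trip_distance": [1.2, 3.4, 0.8, 7.5],
    "passenger_count": [1, 2, 1, 3],
})

print(df.shape)       # number of (rows, columns)
print(df.head(2))     # first rows, for a quick sanity check
print(df.describe())  # per-column summary statistics
```

From here, the same cells extend naturally to plotting with matplotlib or fitting a first model.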

Explore data and develop models with Microsoft Machine Learning Server

You can use languages like R and Python to do your data analytics right on the DSVM.

For R, you can use an IDE like RStudio, which you can find on the Start menu or on the desktop. Or you can use R Tools for Visual Studio. Microsoft has provided additional libraries on top of open-source CRAN R to enable scalable analytics, including the ability to analyze data larger than available memory through parallel chunked analysis.

For Python, you can use an IDE like Visual Studio Community Edition, which has the Python Tools for Visual Studio (PTVS) extension pre-installed. By default, only Python 3.6, the root Conda environment, is configured in PTVS. To enable Anaconda Python 2.7, take the following steps:

  1. Create custom environments for each version by going to Tools > Python Tools > Python Environments, and then selecting + Custom in Visual Studio Community Edition.
  2. Give a description and set the environment prefix path as c:\anaconda\envs\python2 for Anaconda Python 2.7.
  3. Select Auto Detect > Apply to save the environment.

See the PTVS documentation for more details on how to create Python environments.

Now you're set up to create a new Python project. Go to File > New > Project > Python and select the type of Python application you're building. You can set the Python environment for the current project to the desired version (Python 2.7 or 3.6) by right-clicking Python environments and then selecting Add/Remove Python Environments. You can find more information about working with PTVS in the product documentation.
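After switching environments, a quick check in the interactive window confirms which interpreter is active (generic Python, not PTVS-specific):

```python
import sys

# Show the interpreter path and version of the currently active environment.
print(sys.executable)
print("Python %d.%d" % sys.version_info[:2])
```

If the version printed isn't the one you expect, revisit the environment selection for the project.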

Manage Azure resources

The DSVM doesn't just allow you to build your analytics solution locally on the virtual machine. It also allows you to access services on the Azure cloud platform. Azure provides several compute, storage, data analytics, and other services that you can administer and access from your DSVM.

To administer your Azure subscription and cloud resources, you have two options: the Azure portal and PowerShell.

Extend storage by using shared file systems

Data scientists can share large datasets, code, or other resources within the team. The DSVM has about 45 GB of space available. To extend your storage, you can use Azure Files and either mount it on one or more DSVM instances or access it via a REST API. You can also use the Azure portal or Azure PowerShell to add extra dedicated data disks.

Note

The maximum space on the Azure Files share is 5 TB. The size limit for each file is 1 TB.

You can use this script in Azure PowerShell to create an Azure Files share:

# Authenticate to Azure.
Connect-AzAccount
# Select your subscription.
Get-AzSubscription -SubscriptionName "<your subscription name>" | Select-AzSubscription
# Create a new resource group.
New-AzResourceGroup -Name <dsvmdatarg> -Location "<Azure Data Center Name. For example, China East 2>"
# Create a new storage account. You can reuse an existing storage account if you want.
New-AzStorageAccount -Name <mydatadisk> -ResourceGroupName <dsvmdatarg> -Location "<Azure Data Center Name. For example, China East 2>" -SkuName "Standard_LRS"
# Set your current working storage account.
Set-AzCurrentStorageAccount -ResourceGroupName "<dsvmdatarg>" -StorageAccountName <mydatadisk>

# Create an Azure Files share.
$s = New-AzStorageShare <teamsharename>
# Create a directory under the file share. You can give it any name.
New-AzStorageDirectory -Share $s -Path <directory name>
# List the share to confirm that everything worked.
Get-AzStorageFile -Share $s

Now that you have created an Azure Files share, you can mount it in any virtual machine in Azure. We recommend that you put the VM in the same Azure datacenter as the storage account, to avoid latency and data transfer charges. Here are the commands to mount the drive on the DSVM:

# Get the storage key of the storage account that has the Azure Files share from the Azure portal. Store it securely on the VM to avoid being prompted in the next command.
cmdkey /add:<mydatadisk>.file.core.chinacloudapi.cn /user:<mydatadisk> /pass:<storage key>

# Mount the Azure Files share as drive Z on the VM. You can choose another drive letter if you want.
net use z: \\<mydatadisk>.file.core.chinacloudapi.cn\<teamsharename>

Now you can access this drive as you would any normal drive on the VM.

Share code in GitHub

GitHub is a code repository where you can find code samples and source code for a wide variety of tools and technologies, shared by the developer community. It uses Git as the technology to track and store versions of the code files. GitHub is also a platform where you can create your own repository to store your team's shared code and documentation, implement version control, and control who has access to view and contribute code.

Visit the GitHub help pages for more information on using Git. You can use GitHub as one of the ways to collaborate with your team, use code developed by the community, and contribute code back to the community.

The DSVM comes loaded with command-line and GUI client tools for accessing GitHub repositories. The command-line tool that works with Git and GitHub is called Git Bash. Visual Studio is installed on the DSVM and has the Git extensions. You can find icons for these tools on the Start menu and on the desktop.

To download code from a GitHub repository, you use the git clone command. For example, to download the data science repository published by Microsoft into the current directory, you can run the following command in Git Bash:

git clone https://github.com/Azure/DataScienceVM.git

In Visual Studio, you can do the same clone operation. The following screenshot shows how to access Git and GitHub tools in Visual Studio:

Screenshot of Visual Studio showing the GitHub connection

You can find more information on using Git to work with your GitHub repository in the resources available on github.com. The cheat sheet is a useful reference.

Access Azure data and analytics services

Azure Blob storage

Azure Blob storage is a reliable, economical cloud storage service for data big and small. This section describes how you can move data to Blob storage and access data stored in an Azure blob.

Prerequisites

  • Azure 门户创建 Azure Blob 存储帐户。Create your Azure Blob storage account from the Azure portal.

    Azure 门户中存储帐户创建流程的屏幕截图

  • 确认已预安装命令行 AzCopy 工具:C:\Program Files (x86)\Microsoft SDKs\Azure\AzCopy\azcopy.exeConfirm that the command-line AzCopy tool is pre-installed: C:\Program Files (x86)\Microsoft SDKs\Azure\AzCopy\azcopy.exe. 包含 azcopy.exe 的目录已在 PATH 环境变量中,因此运行此工具时不用键入完整命令路径。The directory that contains azcopy.exe is already on your PATH environment variable, so you can avoid typing the full command path when running this tool. 有关 AzCopy 工具的详细信息,请参阅 AzCopy 文档For more information on the AzCopy tool, see the AzCopy documentation.

  • 启动 Azure 存储资源管理器工具。Start the Azure Storage Explorer tool. 可从存储资源管理器网页下载它。You can download it from the Storage Explorer webpage.

    Azure 存储资源管理器访问存储帐户时的屏幕截图

Move data from a VM to an Azure blob: AzCopy

To move data between your local files and Blob storage, you can use AzCopy on the command line or in PowerShell:

    AzCopy /Source:C:\myfolder /Dest:https://<mystorageaccount>.blob.core.chinacloudapi.cn/<mycontainer> /DestKey:<storage account key> /Pattern:abc.txt

Replace C:\myfolder with the path where your file is stored, mystorageaccount with your Blob storage account name, mycontainer with the container name, and storage account key with your Blob storage access key. You can find your storage account credentials in the Azure portal.

Run the AzCopy command in PowerShell or from a command prompt. Here is some example usage of the AzCopy command:

# Copy *.sql from a local machine to an Azure blob
"C:\Program Files (x86)\Microsoft SDKs\Azure\AzCopy\azcopy" /Source:"c:\Aaqs\Data Science Scripts" /Dest:https://[ENTER STORAGE ACCOUNT].blob.core.chinacloudapi.cn/[ENTER CONTAINER] /DestKey:[ENTER STORAGE KEY] /S /Pattern:*.sql

# Copy back all files from an Azure blob container to a local machine

"C:\Program Files (x86)\Microsoft SDKs\Azure\AzCopy\azcopy" /Dest:"c:\Aaqs\Data Science Scripts\temp" /Source:https://[ENTER STORAGE ACCOUNT].blob.core.chinacloudapi.cn/[ENTER CONTAINER] /SourceKey:[ENTER STORAGE KEY] /S

After you run the AzCopy command to copy to an Azure blob, your file will appear in Azure Storage Explorer.

Screenshot of a storage account showing the uploaded CSV file

Move data from a VM to an Azure blob: Azure Storage Explorer

You can also upload data from local files on your VM by using Azure Storage Explorer:

  • To upload data to a container, select the target container and select the Upload button.

    Screenshot of the Upload button in Azure Storage Explorer

  • Select the ellipsis (...) to the right of the Files box, select one or multiple files to upload from the file system, and select Upload to begin uploading the files.

    Screenshot of the Upload files dialog box

Read data from an Azure blob: Python ODBC

You can use the BlobService library to read data directly from a blob in a Jupyter notebook or in a Python program.

First, import the required packages:

import pandas as pd
from pandas import Series, DataFrame
import numpy as np
import matplotlib.pyplot as plt
from time import time
import pyodbc
import os
from azure.storage.blob import BlobService
import tables
import time
import zipfile
import random

Then, plug in your Blob storage account credentials and read data from the blob:

CONTAINERNAME = 'xxx'
STORAGEACCOUNTNAME = 'xxxx'
STORAGEACCOUNTKEY = 'xxxxxxxxxxxxxxxx'
BLOBNAME = 'nyctaxidataset/nyctaxitrip/trip_data_1.csv'
localfilename = 'trip_data_1.csv'
LOCALDIRECTORY = os.getcwd()
LOCALFILE =  os.path.join(LOCALDIRECTORY, localfilename)

#download from blob
t1 = time.time()
blob_service = BlobService(account_name=STORAGEACCOUNTNAME,account_key=STORAGEACCOUNTKEY)
blob_service.get_blob_to_path(CONTAINERNAME,BLOBNAME,LOCALFILE)
t2 = time.time()
print(("It takes %s seconds to download "+BLOBNAME) % (t2 - t1))

#unzip downloaded files if needed
#with zipfile.ZipFile(ZIPPEDLOCALFILE, "r") as z:
#    z.extractall(LOCALDIRECTORY)

df1 = pd.read_csv(LOCALFILE, header=0)
df1.columns = ['medallion','hack_license','vendor_id','rate_code','store_and_fwd_flag','pickup_datetime','dropoff_datetime','passenger_count','trip_time_in_secs','trip_distance','pickup_longitude','pickup_latitude','dropoff_longitude','dropoff_latitude']
print('the size of the data is: %d rows and %d columns' % df1.shape)

The data is read as a data frame:

Screenshot of the first 10 rows of data

Azure Synapse Analytics (formerly SQL DW) and databases

Azure Synapse Analytics (formerly SQL DW) is an elastic data warehouse as a service, with an enterprise-class SQL Server experience.

You can provision Azure Synapse Analytics by following the instructions in this article. After you provision your SQL data warehouse, you can use this walkthrough to do data upload, exploration, and modeling by using data within the SQL data warehouse.

Azure Cosmos DB

Azure Cosmos DB is a NoSQL database in the cloud. You can use it to work with documents like JSON, and to store and query the documents.

Use the following prerequisite steps to access Azure Cosmos DB from the DSVM:

  1. The Azure Cosmos DB Python SDK is already installed on the DSVM. To update it, run pip install pydocumentdb --upgrade from a command prompt.

  2. Create an Azure Cosmos DB account and database from the Azure portal.

  3. Download the Azure Cosmos DB Data Migration Tool from the Microsoft Download Center and extract it to a directory of your choice.

  4. Import JSON data (volcano data) stored in a public blob into Azure Cosmos DB by passing the following command parameters to the migration tool. (Use dtui.exe from the directory where you installed the Azure Cosmos DB Data Migration Tool.) Enter the source and target location with these parameters:

    /s:JsonFile /s.Files:https://data.humdata.org/dataset/a60ac839-920d-435a-bf7d-25855602699d/resource/7234d067-2d74-449a-9c61-22ae6d98d928/download/volcano.json /t:DocumentDBBulk /t.ConnectionString:AccountEndpoint=https://[DocDBAccountName].documents.azure.com:443/;AccountKey=[KEY];Database=volcano /t.Collection:volcano1

After you import the data, you can go to Jupyter and open the notebook titled DocumentDBSample. It contains Python code to access Azure Cosmos DB and do some basic querying. You can learn more about Azure Cosmos DB by visiting the service's documentation page.

Use Power BI reports and dashboards

You can visualize the volcano JSON file from the preceding Azure Cosmos DB example in Power BI Desktop to gain visual insights into the data. Detailed steps are available in the Power BI article. Here are the high-level steps:

  1. Open Power BI Desktop and select Get Data. Specify the URL as: https://cahandson.blob.core.windows.net/samples/volcano.json.
  2. You should see the JSON records imported as a list. Convert the list to a table so Power BI can work with it.
  3. Expand the columns by selecting the expand (arrow) icon.
  4. Notice that Location is a Record field. Expand the record and select only the coordinates. Coordinates is a list column.
  5. Add a new column to convert the list-typed coordinates column into a comma-separated LatLong column. Concatenate the two elements in the coordinates list field by using the formula Text.From([coordinates]{1})&","&Text.From([coordinates]{0}).
  6. Convert the Elevation column to decimal and select the Close and Apply buttons.

Instead of performing the preceding steps manually, you can paste the following code. It scripts out the steps used in the Advanced Editor in Power BI to write the data transformations in a query language.

let
    Source = Json.Document(Web.Contents("https://cahandson.blob.core.chinacloudapi.cn/samples/volcano.json")),
    #"Converted to Table" = Table.FromList(Source, Splitter.SplitByNothing(), null, null, ExtraValues.Error),
    #"Expanded Column1" = Table.ExpandRecordColumn(#"Converted to Table", "Column1", {"Volcano Name", "Country", "Region", "Location", "Elevation", "Type", "Status", "Last Known Eruption", "id"}, {"Volcano Name", "Country", "Region", "Location", "Elevation", "Type", "Status", "Last Known Eruption", "id"}),
    #"Expanded Location" = Table.ExpandRecordColumn(#"Expanded Column1", "Location", {"coordinates"}, {"coordinates"}),
    #"Added Custom" = Table.AddColumn(#"Expanded Location", "LatLong", each Text.From([coordinates]{1})&","&Text.From([coordinates]{0})),
    #"Changed Type" = Table.TransformColumnTypes(#"Added Custom",{{"Elevation", type number}})
in
    #"Changed Type"
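If you prefer to prototype the same transformation in Python before building it in Power Query M, the LatLong step can be sketched with pandas as follows (the sample records and values are hypothetical, shaped like the volcano JSON):

```python
import pandas as pd

# Hypothetical records shaped like the volcano dataset; coordinates
# are stored [longitude, latitude], as in GeoJSON.
records = [
    {"Volcano Name": "Abu", "Elevation": 641,
     "Location": {"coordinates": [131.6, 34.5]}},
    {"Volcano Name": "Acamarachi", "Elevation": 6046,
     "Location": {"coordinates": [-67.62, -23.3]}},
]

df = pd.DataFrame(records)

# Mirror the M formula Text.From([coordinates]{1})&","&Text.From([coordinates]{0}):
# element 1 (latitude) first, then element 0 (longitude), comma-separated.
df["LatLong"] = df["Location"].apply(
    lambda loc: "%s,%s" % (loc["coordinates"][1], loc["coordinates"][0]))

print(df[["Volcano Name", "LatLong"]])
```

The list indices are swapped deliberately: the source stores longitude first, while Power BI's map visuals expect latitude,longitude.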

You now have the data in your Power BI data model. Your Power BI Desktop instance should appear as follows:

Screenshot of Power BI Desktop

You can start building reports and visualizations by using the data model. You can follow the steps in this Power BI article to build a report.

Scale the DSVM dynamically

You can scale the DSVM up and down to meet your project's needs. If you don't need to use the VM in the evening or on weekends, you can shut down the VM from the Azure portal.

Note

You continue to incur compute charges if you use only the operating system's shutdown button on the VM; shut it down from the Azure portal instead.

You might need to handle some large-scale analysis and need more CPU, memory, or disk capacity. If so, you can choose from a range of VM sizes, in terms of CPU cores, GPU-based instances for deep learning, memory capacity, and disk types (including solid-state drives), to meet your compute and budgetary needs. The full list of VMs, along with their hourly compute pricing, is available on the Azure Virtual Machines pricing page.

Add more tools

Tools prebuilt into the DSVM can address many common data-analytics needs. This saves you time, because you don't have to install and configure your environments one by one. It also saves you money, because you pay only for the resources that you use.

You can use the other Azure data and analytics services profiled in this article to enhance your analytics environment. In some cases, you might need additional tools, including some proprietary partner tools. You have full administrative access on the virtual machine to install any new tools that you need. You can also install additional packages in Python and R that are not pre-installed. For Python, you can use either conda or pip. For R, you can use install.packages() in the R console, or use the IDE and select Packages > Install Packages.
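As a small convenience sketch (not a DSVM feature), a Python script can check whether a package is importable before invoking pip, so re-runs skip packages that are already installed:

```python
import importlib.util
import subprocess
import sys

def ensure_package(name):
    """Install a package with pip only if it can't already be imported."""
    if importlib.util.find_spec(name) is None:
        subprocess.check_call([sys.executable, "-m", "pip", "install", name])
        return "installed"
    return "already present"

# The standard library's json module is always importable, so no install runs here.
print(ensure_package("json"))
```

Note that the import name and the pip package name can differ (for example, the package pyyaml is imported as yaml), so this check is a heuristic rather than a guarantee.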

Deep learning

In addition to the framework-based samples, you can get a set of comprehensive walkthroughs that have been validated on the DSVM. These walkthroughs help you jump-start your development of deep-learning applications in domains like image and text/language understanding.

  • Running neural networks across different frameworks: This walkthrough shows how to migrate code from one framework to another. It also demonstrates how to compare models and runtime performance across frameworks.

  • A how-to guide to build an end-to-end solution to detect products within images: Image detection is a technique that can locate and classify objects within images. This technology has the potential to bring huge rewards in many real-life business domains. For example, retailers can use this technique to determine which product a customer has picked up from the shelf. This information in turn helps stores manage product inventory.

  • Deep learning for audio: This tutorial shows how to train a deep-learning model for audio event detection on the urban sounds dataset. It also provides an overview of how to work with audio data.

  • Classification of text documents: This walkthrough demonstrates how to build and train two neural network architectures: Hierarchical Attention Network and Long Short-Term Memory (LSTM) network. These neural networks use the Keras API for deep learning to classify text documents.

Summary

This article described some of the things you can do on the Microsoft Data Science Virtual Machine. There are many more things you can do to make the DSVM an effective analytics environment.