Scenarios for advanced analytics in Azure Machine Learning

This article outlines the variety of sample data sources and target scenarios that can be handled by the Team Data Science Process (TDSP). The TDSP provides a systematic approach for teams to collaborate on building intelligent applications. The scenarios presented here illustrate the options available in the data processing workflow, which depend on the data characteristics, source locations, and target repositories in Azure.

A decision tree for selecting the sample scenario appropriate for your data and objective is presented in the last section.

Each of the following sections presents a sample scenario. For each scenario, a possible data science or advanced analytics flow and the supporting Azure resources are listed.

Note

For all of the following scenarios, you need to:

Scenario #1: Small to medium tabular dataset in local files

Small to medium local files

Additional Azure resources: None

  1. Sign in to Azure Machine Learning Studio.
  2. Upload a dataset.
  3. Build an Azure Machine Learning experiment flow starting with the uploaded dataset(s).

Scenario #2: Small to medium dataset of local files that require processing

Small to medium local files that require processing

Additional Azure resources: Azure Virtual Machine (IPython Notebook server)

  1. Create an Azure Virtual Machine running IPython Notebook.
  2. Upload data to an Azure storage container.
  3. Pre-process and clean the data in IPython Notebook, accessing the data from the Azure storage container.
  4. Transform the data to a cleaned tabular form.
  5. Save the transformed data in Azure blobs.
  6. Sign in to Azure Machine Learning Studio.
  7. Read the data from Azure blobs using the Import Data module.
  8. Build an Azure Machine Learning experiment flow starting with the ingested dataset(s).
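Steps 3 and 4 above can be sketched with only the Python standard library. The column names and the completeness rule below are hypothetical stand-ins for whatever cleaning a real dataset needs:

```python
import csv
import io

def clean_rows(raw_csv_text):
    """Parse raw CSV text, drop incomplete rows, and trim whitespace --
    a minimal stand-in for the pre-process/clean/transform steps."""
    reader = csv.DictReader(io.StringIO(raw_csv_text))
    cleaned = []
    for row in reader:
        # Skip rows where any field is empty (a simple completeness rule).
        if any(v is None or v.strip() == "" for v in row.values()):
            continue
        cleaned.append({k: v.strip() for k, v in row.items()})
    return cleaned

# Hypothetical sample input; row 2 is dropped for its missing city.
raw = "id,city,amount\n1, Seattle ,10\n2,,20\n3,Boston,30\n"
rows = clean_rows(raw)
```

The cleaned rows would then be written back out as a tabular file and saved to Azure blobs for the Import Data module to pick up.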

Scenario #3: Large dataset of local files, targeting Azure Blobs

Large local files

Additional Azure resources: Azure Virtual Machine (IPython Notebook server)

  1. Create an Azure Virtual Machine running IPython Notebook.
  2. Upload data to an Azure storage container.
  3. Pre-process and clean the data in IPython Notebook, accessing the data from Azure blobs.
  4. Transform the data to a cleaned tabular form, if needed.
  5. Explore the data, and create features as needed.
  6. Extract a small-to-medium data sample.
  7. Save the sampled data in Azure blobs.
  8. Sign in to Azure Machine Learning Studio.
  9. Read the data from Azure blobs using the Import Data module.
  10. Build an Azure Machine Learning experiment flow starting with the ingested dataset(s).
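Step 6 (extracting a small-to-medium sample) can be done in a single streaming pass with reservoir sampling, so the large file never has to fit in memory. A minimal sketch:

```python
import random

def reservoir_sample(lines, k, seed=0):
    """Uniformly sample k items from a stream of unknown length
    (reservoir sampling); only k items are ever held in memory."""
    rng = random.Random(seed)
    sample = []
    for i, line in enumerate(lines):
        if i < k:
            sample.append(line)
        else:
            # Replace a reservoir slot with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                sample[j] = line
    return sample

# In practice `lines` would be an open file object streaming the large
# local file; a generator stands in for it here.
sample = reservoir_sample((f"record-{i}" for i in range(100000)), k=100)
```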

Scenario #4: Small to medium dataset of local files, targeting SQL Server in an Azure Virtual Machine

Small to medium local files to SQL DB in Azure

Additional Azure resources: Azure Virtual Machine (SQL Server / IPython Notebook server)

  1. Create an Azure Virtual Machine running SQL Server + IPython Notebook.

  2. Upload data to an Azure storage container.

  3. Pre-process and clean the data in the Azure storage container using IPython Notebook.

  4. Transform the data to a cleaned tabular form, if needed.

  5. Save the data to VM-local files (the IPython Notebook is running on the VM, so local drives refer to VM drives).

  6. Load the data into a SQL Server database running on the Azure VM.

    Option #1: Using SQL Server Management Studio.

    • Log in to the SQL Server VM.
    • Run SQL Server Management Studio.
    • Create the database and target tables.
    • Use one of the bulk import methods to load the data from the VM-local files.

    Option #2: Using IPython Notebook – not advisable for medium and larger datasets.

    • Use an ODBC connection string to access SQL Server on the VM.
    • Create the database and target tables.
    • Use one of the bulk import methods to load the data from the VM-local files.
  7. Explore the data and create features as needed. The features do not need to be materialized in the database tables; only note the query needed to create them.

  8. Decide on a data sample size, if needed and/or desired.

  9. Sign in to Azure Machine Learning Studio.

  10. Read the data directly from SQL Server using the Import Data module. Paste the query that extracts fields, creates features, and samples the data (if needed) directly into the Import Data module.

  11. Build an Azure Machine Learning experiment flow starting with the ingested dataset(s).
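The query pasted into the Import Data module in step 10 can be composed programmatically. A sketch in which the table name, feature expression, and sampling percentage are all illustrative:

```python
def build_import_query(table, feature_exprs, sample_percent=None):
    """Compose the T-SQL pasted into the Import Data module: select the
    raw fields, add computed feature columns, and optionally sample.
    Table and column names here are hypothetical."""
    select_list = ", ".join(
        ["*"] + [f"{expr} AS {name}" for name, expr in feature_exprs.items()]
    )
    query = f"SELECT {select_list} FROM {table}"
    if sample_percent is not None:
        # TABLESAMPLE draws an approximate percentage of data pages,
        # usually good enough for building an experiment in the Studio.
        query += f" TABLESAMPLE ({sample_percent} PERCENT)"
    return query

q = build_import_query(
    "dbo.Trips",
    {"tip_rate": "tip_amount / NULLIF(fare_amount, 0)"},
    sample_percent=1,
)
```

This keeps the feature unmaterialized, as step 7 suggests: the feature exists only in the query text, not in the database tables.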

Scenario #5: Large dataset in local files, targeting SQL Server in an Azure VM

Large local files to SQL DB in Azure

Additional Azure resources: Azure Virtual Machine (SQL Server / IPython Notebook server)

  1. Create an Azure Virtual Machine running SQL Server and an IPython Notebook server.

  2. Upload data to an Azure storage container.

  3. (Optional) Pre-process and clean the data.

    a. Pre-process and clean the data in IPython Notebook, accessing the data from Azure blobs.

    b. Transform the data to a cleaned tabular form, if needed.

    c. Save the data to VM-local files (the IPython Notebook is running on the VM, so local drives refer to VM drives).

  4. Load the data into a SQL Server database running on the Azure VM.

    a. Log in to the SQL Server VM.

    b. If the data is not already saved there, download the data files from the Azure storage container to a local VM folder.

    c. Run SQL Server Management Studio.

    d. Create the database and target tables.

    e. Use one of the bulk import methods to load the data.

    f. If table joins are required, create indexes to expedite the joins.

    Note

    For faster loading of large data sizes, it is recommended that you create partitioned tables and bulk import the data in parallel. For more information, see Parallel Data Import to SQL Partitioned Tables.

  5. Explore the data and create features as needed. The features do not need to be materialized in the database tables; only note the query needed to create them.

  6. Decide on a data sample size, if needed and/or desired.

  7. Sign in to Azure Machine Learning Studio.

  8. Read the data directly from SQL Server using the Import Data module. Paste the query that extracts fields, creates features, and samples the data (if needed) directly into the Import Data module.

  9. Build an Azure Machine Learning experiment flow starting with the ingested dataset(s).
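The note after step 4 recommends partitioned tables and parallel bulk import for large data. A hedged sketch of generating one BULK INSERT statement per partition file and running them concurrently; `run_sql` is a stand-in for whatever executes T-SQL (for example an ODBC cursor), and the table name and file paths are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor

def bulk_insert_statement(table, data_file):
    """Compose a BULK INSERT for one partition file; TABLOCK enables
    minimally logged, parallel loads into a partitioned table."""
    return (
        f"BULK INSERT {table} FROM '{data_file}' "
        "WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\\n', TABLOCK)"
    )

def load_partitions(table, files, run_sql, workers=4):
    """Run one BULK INSERT per partition file in parallel."""
    stmts = [bulk_insert_statement(table, f) for f in files]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_sql, stmts))

# A collecting stub stands in for a real T-SQL executor here.
executed = []
load_partitions("dbo.Trips",
                ["D:\\data\\part1.csv", "D:\\data\\part2.csv"],
                run_sql=executed.append)
```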

Scenario #6: Large dataset in an on-premises SQL Server database, targeting SQL Server in an Azure Virtual Machine

Large on-premises SQL DB to SQL DB in Azure

Additional Azure resources: Azure Virtual Machine (SQL Server / IPython Notebook server)

  1. Create an Azure Virtual Machine running SQL Server and an IPython Notebook server.

  2. Use one of the data export methods to export the data from SQL Server to dump files.

    Note

    If you decide to move all of the data from the on-premises database, an alternative (faster) method is to move the full database to the SQL Server instance in Azure. Skip the steps to export the data, create the database, and load/import the data into the target database, and follow the alternate method instead.

  3. Upload the dump files to an Azure storage container.

  4. Load the data into a SQL Server database running on an Azure Virtual Machine.

    a. Log in to the SQL Server VM.

    b. Download the data files from the Azure storage container to a local VM folder.

    c. Run SQL Server Management Studio.

    d. Create the database and target tables.

    e. Use one of the bulk import methods to load the data.

    f. If table joins are required, create indexes to expedite the joins.

    Note

    For faster loading of large data sizes, create partitioned tables and bulk import the data in parallel. For more information, see Parallel Data Import to SQL Partitioned Tables.

  5. Explore the data and create features as needed. The features do not need to be materialized in the database tables; only note the query needed to create them.

  6. Decide on a data sample size, if needed and/or desired.

  7. Sign in to Azure Machine Learning Studio.

  8. Read the data directly from SQL Server using the Import Data module. Paste the query that extracts fields, creates features, and samples the data (if needed) directly into the Import Data module.

  9. Build an Azure Machine Learning experiment flow starting with the ingested dataset(s).

Alternate method to copy a full database from an on-premises SQL Server to SQL Server in an Azure VM

Detach the on-premises DB and attach it to SQL DB in Azure

Additional Azure resources: Azure Virtual Machine (SQL Server / IPython Notebook server)

To replicate the entire SQL Server database in your SQL Server VM, copy the database from one location/server to the other, assuming that the database can be taken temporarily offline. You may use SQL Server Management Studio Object Explorer or the equivalent Transact-SQL commands.

  1. Detach the database at the source location. For more information, see Detach a database.
  2. In Windows Explorer or a Windows Command Prompt window, copy the detached database file(s) and log file(s) to the target location on the SQL Server VM in Azure.
  3. Attach the copied files to the target SQL Server instance. For more information, see Attach a Database.

Move a Database Using Detach and Attach (Transact-SQL)
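If the detach/attach move is scripted rather than done through Object Explorer, the equivalent Transact-SQL can be composed as follows; the database name and file paths are illustrative:

```python
def detach_attach_sql(db, mdf_path, ldf_path):
    """Compose the Transact-SQL equivalents of the detach/attach move.
    `mdf_path`/`ldf_path` are the copied file locations on the target
    SQL Server VM; all names here are hypothetical."""
    # Run on the source instance before copying the files.
    detach = f"EXEC sp_detach_db @dbname = N'{db}';"
    # Run on the target instance after the files are in place.
    attach = (
        f"CREATE DATABASE {db} ON "
        f"(FILENAME = N'{mdf_path}'), (FILENAME = N'{ldf_path}') "
        "FOR ATTACH;"
    )
    return detach, attach

detach, attach = detach_attach_sql(
    "SalesDB", "F:\\Data\\SalesDB.mdf", "F:\\Data\\SalesDB_log.ldf")
```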

Scenario #7: Big data in local files, targeting a Hive database in an Azure HDInsight Hadoop cluster

Big data in local files to Hive in Azure HDInsight

Additional Azure resources: Azure HDInsight Hadoop cluster and Azure Virtual Machine (IPython Notebook server)

  1. Create an Azure Virtual Machine running an IPython Notebook server.

  2. Create an Azure HDInsight Hadoop cluster.

  3. (Optional) Pre-process and clean the data.

    a. Pre-process and clean the data in IPython Notebook, accessing the data from Azure blobs.

    b. Transform the data to a cleaned tabular form, if needed.

    c. Save the data to VM-local files (the IPython Notebook is running on the VM, so local drives refer to VM drives).

  4. Upload the data to the default container of the Hadoop cluster selected in step 2.

  5. Load the data into a Hive database in the Azure HDInsight Hadoop cluster.

    a. Log in to the head node of the Hadoop cluster.

    b. Open the Hadoop Command Line.

    c. Change to the Hive root directory with the command cd %hive_home%\bin in the Hadoop Command Line.

    d. Run Hive queries to create the database and tables, and load the data from blob storage into the Hive tables.

    Note

    If the data is large, create the Hive table with partitions. Then use a for loop in the Hadoop Command Line on the head node to load the data into the Hive table, partition by partition.

  6. Explore the data and create features as needed in the Hadoop Command Line. The features do not need to be materialized in the database tables; only note the query needed to create them.

    a. Log in to the head node of the Hadoop cluster.

    b. Open the Hadoop Command Line.

    c. Change to the Hive root directory with the command cd %hive_home%\bin in the Hadoop Command Line.

    d. Run Hive queries in the Hadoop Command Line on the head node of the Hadoop cluster to explore the data and create features as needed.

  7. If needed and/or desired, sample the data to fit into Azure Machine Learning Studio.

  8. Sign in to Azure Machine Learning Studio.

  9. Read the data directly from Hive using the Import Data module. Paste the Hive query that extracts fields, creates features, and samples the data (if needed) directly into the Import Data module.

  10. Build an Azure Machine Learning experiment flow starting with the ingested dataset(s).
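The partition-by-partition load suggested in the note for step 5 amounts to generating one LOAD DATA statement per partition. A sketch in which the table schema, storage path, and partition column are all hypothetical:

```python
def hive_partition_load_statements(table, container_path, dates):
    """Generate the HiveQL for a partitioned load: one CREATE TABLE
    plus one LOAD DATA per partition, mirroring the for loop run in
    the Hadoop Command Line on the head node."""
    create = (
        f"CREATE TABLE IF NOT EXISTS {table} "
        "(trip_id STRING, fare DOUBLE) "
        "PARTITIONED BY (pickup_date STRING) "
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','"
    )
    loads = [
        f"LOAD DATA INPATH '{container_path}/{d}/' "
        f"INTO TABLE {table} PARTITION (pickup_date='{d}')"
        for d in dates
    ]
    return [create] + loads

stmts = hive_partition_load_statements(
    "nyctaxi.trips", "wasb:///data/trips", ["2013-01-01", "2013-01-02"])
```

Each generated statement would be submitted through the Hive CLI on the head node; loading partition by partition keeps any single load small.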

Decision tree for scenario selection


The following diagram summarizes the scenarios described above and the Advanced Analytics Process and Technology choices that take you to each of the itemized scenarios. Data processing, exploration, feature engineering, and sampling may take place in one or more methods/environments (at the source, intermediate, and/or target environments), and may proceed iteratively as needed. The diagram serves only as an illustration of some of the possible flows and does not provide an exhaustive enumeration.

Sample DS process walkthrough scenarios

Advanced Analytics in action: examples

For end-to-end Azure Machine Learning walkthroughs that employ the Advanced Analytics Process and Technology with public datasets, see: