How to identify scenarios and plan for advanced analytics data processing

What resources do you need to create an environment that can perform advanced analytics processing on a dataset? This article suggests a series of questions that can help you identify the tasks and resources relevant to your scenario.

To learn about the order of the high-level steps for predictive analytics, see What is the Team Data Science Process (TDSP). Each step requires specific resources for the tasks relevant to your particular scenario.

Answer key questions in the following areas to identify your scenario:

  • data logistics
  • data characteristics
  • dataset quality
  • preferred tools and languages

Logistics questions: data locations and movement

The logistics questions cover the following items:

  • data source location
  • target destination in Azure
  • requirements for moving the data, including the schedule, amount, and resources involved

You may need to move the data several times during the analytics process. A common scenario is to move local data into some form of storage on Azure and then into Machine Learning Studio.

What is your data source?

Is your data local or in the cloud? Possible locations include:

  • a publicly available HTTP address
  • a local or network file location
  • a SQL Server database
  • an Azure storage container

What is the Azure destination?

Where does your data need to be for processing or modeling?

  • Azure Blob Storage
  • SQL Azure databases
  • SQL Server on Azure VM
  • HDInsight (Hadoop on Azure) or Hive tables
  • Azure Machine Learning
  • Mountable Azure virtual hard disks

How are you going to move the data?

For procedures and resources to ingest or load data into different storage and processing environments, see:

Does the data need to be moved on a regular schedule or modified during migration?

Consider using Azure Data Factory (ADF) when data needs to be continually migrated. ADF can be helpful for:

  • a hybrid scenario that involves both on-premises and cloud resources
  • a scenario where business logic transacts, modifies, or changes the data while it is being migrated

For more information, see Move data from a SQL Server database to SQL Azure with Azure Data Factory.

How much of the data is to be moved to Azure?

Large datasets may exceed the storage capacity of certain environments. For an example, see the discussion of size limits for Machine Learning Studio (classic) in the next section. In such cases, you might use a sample of the data during the analysis. For details of how to down-sample a dataset in various Azure environments, see Sample data in the Team Data Science Process.
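As an illustration of working with a sample when a dataset is too large for the target environment, the following is a minimal Python sketch of reservoir sampling over CSV rows. It is not the procedure from the Team Data Science Process article. The function name `reservoir_sample_rows`, the `seed` parameter, and the assumption that the first row is a header are all hypothetical choices made for this example.

```python
import csv
import io
import random


def reservoir_sample_rows(lines, k, seed=0):
    """Return (header, sample) where sample holds up to k uniformly chosen data rows.

    `lines` is any iterable of CSV text lines (for example an open file).
    Assumption: the first row is a header. Only k rows are kept in memory,
    so the input can be larger than available RAM.
    """
    rng = random.Random(seed)
    reader = csv.reader(lines)
    header = next(reader, [])
    sample = []
    for i, row in enumerate(reader):
        if i < k:
            # Fill the reservoir with the first k rows.
            sample.append(row)
        else:
            # Replace an existing entry with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                sample[j] = row
    return header, sample


if __name__ == "__main__":
    # Synthetic in-memory data standing in for a large file.
    data = io.StringIO("id,value\n" + "".join(f"{n},{n * n}\n" for n in range(1000)))
    header, rows = reservoir_sample_rows(data, k=5)
    print(header, len(rows))
```

To sample a file on disk, pass an open file object instead of the `io.StringIO` buffer, for example `open(path, newline="")`, where `path` is whatever location your data uses.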

Data characteristics questions: type, format, and size

These questions are key to planning your storage and processing environments. They will help you choose the appropriate scenario for your data type and understand any restrictions.

What are the data types?

  • Numerical
  • Categorical
  • Strings
  • Binary

How is your data formatted?

  • Comma-separated (CSV) or tab-separated (TSV) flat files
  • Compressed or uncompressed
  • Azure blobs
  • Hadoop Hive tables
  • SQL Server tables

How large is your data?

  • Small: Less than 2 GB
  • Medium: Greater than 2 GB and less than 10 GB
  • Large: Greater than 10 GB
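The size bands above can be expressed as a small helper, sketched here in Python. The function name `size_category` is hypothetical, and two details are assumptions because the list does not state them: 1 GB is taken as 1024^3 bytes, and the exact boundary values (2 GB and 10 GB) are assigned to the larger band.

```python
GB = 1024 ** 3  # Assumption: binary gigabytes; the source does not say which unit is meant.


def size_category(num_bytes):
    """Map a dataset size in bytes to 'small', 'medium', or 'large'.

    Assumption: sizes of exactly 2 GB or 10 GB fall into the larger band,
    since the source leaves the boundaries unspecified.
    """
    if num_bytes < 0:
        raise ValueError("size must be non-negative")
    if num_bytes < 2 * GB:
        return "small"
    if num_bytes < 10 * GB:
        return "medium"
    return "large"


if __name__ == "__main__":
    for size in (500 * 1024 ** 2, 5 * GB, 50 * GB):
        print(size, size_category(size))
```

For a local file, `os.path.getsize` returns a byte count that can be passed directly to the helper.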

Take the Azure Machine Learning Studio (classic) environment for example:

Data quality questions: exploration and pre-processing

What do you know about your data?

Understand the basic characteristics of your data:

  • What patterns or trends it exhibits
  • What outliers it has
  • How many values are missing

This step is important to help you:

  • Determine how much pre-processing is needed
  • Formulate hypotheses that suggest the most appropriate features or type of analysis
  • Formulate plans for additional data collection

Useful techniques for data inspection include calculating descriptive statistics and plotting visualizations. For details of how to explore a dataset in various Azure environments, see Explore data in the Team Data Science Process.
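A minimal Python sketch of the descriptive-statistics side of data inspection follows. It counts missing values, computes basic summary figures, and flags outliers for one numeric column. The function name `summarize_column`, the markers treated as missing (`""` and `None`), and the 1.5 × IQR outlier rule are assumptions chosen for illustration; the source does not prescribe a particular method.

```python
import statistics


def summarize_column(values, missing=("", None)):
    """Summarize one numeric column given as a list of raw values.

    Assumptions: entries equal to "" or None count as missing, at least two
    values are present, and outliers are points outside 1.5 * IQR from the
    quartiles (a common convention, not taken from the source).
    """
    present = [float(v) for v in values if v not in missing]
    q1, median, q3 = statistics.quantiles(present, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return {
        "count": len(present),
        "missing": len(values) - len(present),
        "mean": statistics.mean(present),
        "median": median,
        "stdev": statistics.stdev(present),
        "outliers": [v for v in present if v < low or v > high],
    }


if __name__ == "__main__":
    column = [10, 12, 11, 13, 12, 11, 10, 95, "", None]
    for name, value in summarize_column(column).items():
        print(name, value)
```

A summary like this gives a first indication of how much pre-processing a column needs before any plots are drawn.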

Does the data require preprocessing or cleaning?

You might need to preprocess and clean your data before you can use the dataset effectively for machine learning. Raw data is often noisy and unreliable, and it might be missing values. Using such data for modeling can produce misleading results. For a description, see Tasks to prepare data for enhanced machine learning.
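As one example of a cleaning step for missing values, the sketch below replaces missing entries in a numeric column with the median of the observed values. Median imputation is only one of several possible strategies and is not stated in the source. The function name `impute_median` and the markers treated as missing are hypothetical.

```python
import statistics


def impute_median(values, missing=("", None)):
    """Return the column as floats, with missing entries replaced by the median.

    Assumption: entries equal to "" or None are missing. Other strategies
    (dropping rows, mean imputation, model-based imputation) may suit a
    given dataset better.
    """
    observed = [float(v) for v in values if v not in missing]
    if not observed:
        raise ValueError("column has no observed values to impute from")
    fill = statistics.median(observed)
    return [fill if v in missing else float(v) for v in values]


if __name__ == "__main__":
    print(impute_median([1, "", 3, None, 5]))
```

Whether to impute, drop, or flag missing values depends on how many values are missing and why, which is what the exploration step is meant to reveal.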

Tools and languages questions

There are many options for languages, development environments, and tools. Be aware of your needs and preferences.

What languages do you prefer to use for analysis?

  • R
  • Python
  • SQL

What tools should you use for data analysis?

Identify your advanced analytics scenario

After you have answered the questions in the previous section, you are ready to determine which scenario best fits your case. The sample scenarios are outlined in Scenarios for advanced analytics in Azure Machine Learning.

Next steps