Tasks to prepare data for enhanced machine learning

Pre-processing and cleaning data are important tasks that typically must be conducted before a dataset can be used effectively for machine learning. Raw data is often noisy and unreliable, and may be missing values. Using such data for modeling can produce misleading results. These tasks are part of the Team Data Science Process (TDSP) and typically follow an initial exploration of a dataset, which is used to discover and plan the pre-processing required. For more detailed instructions on the TDSP process, see the steps outlined in the Team Data Science Process.

Pre-processing and cleaning tasks, like the data exploration task, can be carried out in a wide variety of environments, such as SQL, Hive, or Azure Machine Learning Studio (classic), and with various tools and languages, such as R or Python, depending on where your data is stored and how it is formatted. Since TDSP is iterative in nature, these tasks can take place at various steps in the workflow of the process.

This article introduces various data processing concepts and tasks that can be undertaken either before or after ingesting data into Azure Machine Learning Studio (classic).

Why pre-process and clean data?

Real-world data is gathered from various sources and processes, and it may contain irregularities or corrupt data that compromise the quality of the dataset. The typical data quality issues that arise are:

  • Incomplete: Data lacks attributes or contains missing values.
  • Noisy: Data contains erroneous records or outliers.
  • Inconsistent: Data contains conflicting records or discrepancies.

Quality data is a prerequisite for quality predictive models. To avoid "garbage in, garbage out" and improve data quality and therefore model performance, it is imperative to conduct a data health screen to spot data issues early and decide on the corresponding data processing and cleaning steps.

What are some typical data health screens that are employed?

We can check the general quality of data by checking:

  • The number of records.
  • The number of attributes (or features).
  • The attribute data types (nominal, ordinal, or continuous).
  • The number of missing values.
  • Well-formedness of the data.
    • If the data is in TSV or CSV format, check that the column separators and line separators always correctly separate columns and lines.
    • If the data is in HTML or XML format, check whether the data is well formed based on their respective standards.
    • Parsing may also be necessary in order to extract structured information from semi-structured or unstructured data.
  • Inconsistent data records. Check that values fall within their allowed ranges. For example, if the data contains student GPAs, check whether each GPA is in the designated range, say 0–4.
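The checks above can be sketched in Python using pandas as one option; the CSV sample here is hypothetical, containing one missing value and one out-of-range GPA:

```python
import io
import pandas as pd

# Hypothetical student records: one missing GPA and one GPA above 4.
csv_data = io.StringIO(
    "student_id,major,gpa\n"
    "1,math,3.8\n"
    "2,physics,\n"
    "3,history,4.7\n"
)
df = pd.read_csv(csv_data)

print("Records:", len(df))                # the number of records
print("Attributes:", len(df.columns))     # the number of attributes (features)
print(df.dtypes)                          # attribute data types
print(df.isna().sum())                    # missing values per column

# Range check: count GPAs outside the designated 0-4 range.
out_of_range = ((df["gpa"] < 0) | (df["gpa"] > 4)).sum()
print("Out-of-range GPAs:", out_of_range)
```

Running a screen like this before modeling surfaces the record counts, data types, missing values, and range violations described above in a few lines.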

When you find issues with data, processing steps are necessary. These often involve cleaning missing values, data normalization, discretization, and text processing to remove and/or replace embedded characters that may affect data alignment, mixed data types in common fields, and so on.

Azure Machine Learning consumes well-formed tabular data. If the data is already in tabular form, data pre-processing can be performed directly with Azure Machine Learning Studio (classic). If the data is not in tabular form, say it is in XML, parsing may be required to convert the data to tabular form.

What are some of the major tasks in data pre-processing?

  • Data cleaning: Fill in missing values, detect and remove noisy data and outliers.
  • Data transformation: Normalize data to reduce dimensions and noise.
  • Data reduction: Sample data records or attributes for easier data handling.
  • Data discretization: Convert continuous attributes to categorical attributes for ease of use with certain machine learning methods.
  • Text cleaning: Remove embedded characters that may cause data misalignment, for example, embedded tabs in a tab-separated data file, embedded new lines that may break records, and so on.

The sections below detail some of these data processing steps.

How to deal with missing values?

To deal with missing values, it is best to first identify the reason the values are missing, in order to better handle the problem. Typical missing value handling methods are:

  • Deletion: Remove records with missing values.
  • Dummy substitution: Replace missing values with a dummy value: for example, unknown for categorical values or 0 for numerical values.
  • Mean substitution: If the missing data is numerical, replace the missing values with the mean.
  • Frequent substitution: If the missing data is categorical, replace the missing values with the most frequent item.
  • Regression substitution: Use a regression method to replace missing values with regressed values.
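The first four methods above can be sketched with pandas; the records here are hypothetical, with missing values in both a numerical and a categorical column (regression substitution would additionally need a fitted model, so it is omitted):

```python
import numpy as np
import pandas as pd

# Hypothetical records with missing numerical and categorical values.
df = pd.DataFrame({
    "gpa":   [3.2, np.nan, 3.8, 2.9, np.nan],
    "major": ["math", "physics", None, "math", "math"],
})

# Deletion: remove records that contain any missing value.
deleted = df.dropna()

# Dummy substitution: "unknown" for categorical, 0 for numerical.
dummy = df.fillna({"gpa": 0, "major": "unknown"})

# Mean substitution for the numerical column.
mean_filled = df.assign(gpa=df["gpa"].fillna(df["gpa"].mean()))

# Frequent (mode) substitution for the categorical column.
mode_filled = df.assign(major=df["major"].fillna(df["major"].mode()[0]))
```

Which method is appropriate depends on why the values are missing; deletion, for instance, can bias the dataset if values are not missing at random.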

How to normalize data?

Data normalization re-scales numerical values to a specified range. Popular data normalization methods include:

  • Min-max normalization: Linearly transform the data to a range, say between 0 and 1, where the minimum value is scaled to 0 and the maximum value to 1.
  • Z-score normalization: Scale data based on the mean and standard deviation: divide the difference between the data and the mean by the standard deviation.
  • Decimal scaling: Scale the data by moving the decimal point of the attribute value.
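A minimal NumPy sketch of the three methods, on a hypothetical attribute (note that NumPy's `std` defaults to the population standard deviation):

```python
import numpy as np

# Hypothetical attribute values.
x = np.array([120.0, 250.0, 180.0, 310.0, 95.0])

# Min-max normalization: linearly rescale to [0, 1].
min_max = (x - x.min()) / (x.max() - x.min())

# Z-score normalization: subtract the mean, divide by the standard deviation.
z_score = (x - x.mean()) / x.std()

# Decimal scaling: divide by 10**j, where j is the smallest integer that
# brings every scaled absolute value below 1 (here j = 3, since max is 310).
j = int(np.ceil(np.log10(np.abs(x).max())))
decimal_scaled = x / 10 ** j
```

After min-max scaling the values span exactly [0, 1]; after z-score scaling they have zero mean and unit standard deviation.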

How to discretize data?

Data can be discretized by converting continuous values to nominal attributes or intervals. Some ways of doing this are:

  • Equal-width binning: Divide the range of all possible values of an attribute into N groups of the same size, and assign each value the number of the bin it falls in.
  • Equal-height binning: Divide the range of all possible values of an attribute into N groups, each containing the same number of instances, and assign each value the number of the bin it falls in.
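As a sketch, pandas offers `cut` for equal-width and `qcut` for equal-height (quantile) binning; the GPA values here are hypothetical:

```python
import pandas as pd

# Hypothetical continuous GPA values.
gpa = pd.Series([2.1, 2.5, 2.8, 3.0, 3.3, 3.6, 3.8, 4.0])

# Equal-width binning: 4 bins spanning equal value ranges;
# labels=False returns the bin number for each value.
equal_width = pd.cut(gpa, bins=4, labels=False)

# Equal-height binning: 4 bins each holding the same number of instances.
equal_height = pd.qcut(gpa, q=4, labels=False)
```

With equal-height binning, each of the 4 bins here receives exactly 2 of the 8 values, whereas equal-width bins may hold unequal counts when the data is skewed.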

How to reduce data?

There are various methods to reduce data size for easier data handling. Depending on the data size and the domain, the following methods can be applied:

  • Record sampling: Sample the data records and choose only a representative subset of the data.
  • Attribute sampling: Select only a subset of the most important attributes from the data.
  • Aggregation: Divide the data into groups and store the numbers for each group. For example, the daily revenue numbers of a restaurant chain over the past 20 years can be aggregated to monthly revenue to reduce the size of the data.
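Record sampling and aggregation can be sketched with pandas on a hypothetical daily-revenue table (90 days rather than 20 years, to keep the example small):

```python
import pandas as pd

# Hypothetical daily revenue records for a restaurant chain.
daily = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=90, freq="D"),
    "revenue": range(90),
})

# Record sampling: keep a 10% subset of the rows
# (random_state fixed for reproducibility).
sample = daily.sample(frac=0.1, random_state=0)

# Aggregation: roll daily revenue up to monthly totals.
monthly = daily.set_index("date")["revenue"].resample("MS").sum()
```

The aggregated series keeps the total revenue intact while shrinking 90 daily records down to 3 monthly ones.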

How to clean text data?

Text fields in tabular data may include characters that affect column alignment and/or record boundaries. For example, embedded tabs in a tab-separated file cause column misalignment, and embedded newline characters break record lines. Improper handling of text encoding while writing or reading text leads to information loss and the inadvertent introduction of unreadable characters (for example, nulls), and may also affect text parsing. Careful parsing and editing may be required to clean text fields for proper alignment and/or to extract structured data from unstructured or semi-structured text data.
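As a minimal sketch of the tab/newline problem described above, using a hypothetical raw field value:

```python
import re

# A hypothetical raw field from a tab-separated file, containing an
# embedded tab and an embedded newline that would misalign columns
# and break the record line.
raw_field = "Joe's Diner\t123 Main St.\nSuite 4"

# Replace embedded tabs, carriage returns, and newlines with spaces so
# the field no longer breaks column alignment or record boundaries.
cleaned = re.sub(r"[\t\r\n]+", " ", raw_field)
```

In practice, such cleaning should happen before the file is written (or during a careful parse), since once a record line is broken the field boundaries may be ambiguous.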

Data exploration offers an early view into the data. A number of data issues can be uncovered during this step, and corresponding methods can be applied to address them. It is important to ask questions such as what the source of an issue is and how the issue may have been introduced. This also helps you decide on the data processing steps needed to resolve the issues. The kind of insights you intend to derive from the data can also be used to prioritize the data processing effort.

References

Data Mining: Concepts and Techniques, Third Edition, Jiawei Han, Micheline Kamber, and Jian Pei, Morgan Kaufmann, 2011.