Split Data module

This article describes a module in Azure Machine Learning designer.

Use the Split Data module to divide a dataset into two distinct sets.

This module is useful when you need to separate data into training and testing sets. You can also customize the way that data is divided. Some options support randomization of data. Others are tailored for a certain data type or model type.

Configure the module

Tip

Before you choose the splitting mode, read all options to determine the type of split you need. If you change the splitting mode, all other options might be reset.

  1. Add the Split Data module to your pipeline in the designer. You can find this module under Data Transformation, in the Sample and Split category.

  2. Splitting mode: Choose one of the following modes, depending on the type of data you have and how you want to divide it. Each splitting mode has different options.

    • Split Rows: Use this option if you just want to divide the data into two parts. You can specify the percentage of data to put in each split. By default, the data is divided 50/50.

      You can also randomize the selection of rows in each group, and use stratified sampling. In stratified sampling, you must select a single column of data whose values you want apportioned equally between the two result datasets.

    • Regular Expression Split: Choose this option when you want to divide your dataset by testing a single column for a value.

      For example, if you're analyzing sentiment, you can check for the presence of a particular product name in a text field. You can then divide the dataset into rows with the target product name and rows without the target product name.

    • Relative Expression Split: Use this option whenever you want to apply a condition to a numeric column. The column can be a date/time field, a column that contains ages or dollar amounts, or even a percentage. For example, you might want to divide your dataset based on the cost of the items, group people by age ranges, or separate data by a calendar date.

Split rows

  1. Add the Split Data module to your pipeline in the designer, and connect the dataset that you want to split.

  2. For Splitting mode, select Split Rows.

  3. Fraction of rows in the first output dataset: Use this option to determine how many rows will go into the first (left side) output. All other rows will go into the second (right side) output.

    This value is the fraction of rows sent to the first output dataset, so you must enter a decimal number between 0 and 1.

    For example, if you enter 0.75 as the value, the dataset will be split 75/25. In this split, 75 percent of the rows will be sent to the first output dataset. The remaining 25 percent will be sent to the second output dataset.

  4. Select the Randomized split option if you want to randomize selection of data into the two groups. This is the preferred option when you're creating training and test datasets.

  5. Random Seed: Enter a non-negative integer value to start the pseudorandom sequence of instances to be used. This default seed is used in all modules that generate random numbers.

    Specifying a seed makes the results reproducible. If you need to repeat the results of a split operation, you should specify a seed for the random number generator. Otherwise the random seed is set by default to 0, which means the initial seed value is obtained from the system clock. As a result, the distribution of data might be slightly different each time you perform a split.

  6. Stratified split: Set this option to True to ensure that the two output datasets contain a representative sample of the values in the strata column or stratification key column.

    With stratified sampling, the data is divided such that each output dataset gets roughly the same percentage of each target value. For example, you might want to ensure that your training and testing sets are roughly balanced with regard to the outcome or to some other column (such as gender).

  7. Submit the pipeline.
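
The options in these steps can be illustrated outside the designer. The following is a minimal Python sketch, not the module's actual implementation: the function `split_rows` and its parameters (`fraction`, `randomized`, `seed`, `strata_key`) are hypothetical names chosen to mirror the settings described above. One difference to note is that this sketch treats `seed=None` as "no fixed seed", whereas the module uses 0 for that purpose.

```python
import random
from collections import defaultdict

def split_rows(rows, fraction, randomized=True, seed=None, strata_key=None):
    """Return (first, second), with roughly `fraction` of the rows in `first`."""
    if not 0 <= fraction <= 1:
        raise ValueError("fraction must be a decimal number between 0 and 1")
    rng = random.Random(seed)  # a fixed seed makes the shuffle reproducible

    def split_group(group):
        group = list(group)
        if randomized:
            rng.shuffle(group)
        cut = round(len(group) * fraction)
        return group[:cut], group[cut:]

    if strata_key is None:
        return split_group(rows)

    # Stratified split: divide each stratum separately so that both outputs
    # keep roughly the same share of every value in the strata column.
    strata = defaultdict(list)
    for row in rows:
        strata[strata_key(row)].append(row)
    first, second = [], []
    for group in strata.values():
        part_a, part_b = split_group(group)
        first.extend(part_a)
        second.extend(part_b)
    return first, second

# 100 rows, 25 labeled "pos" and 75 labeled "neg" (illustrative data).
rows = [{"id": i, "label": "pos" if i % 4 == 0 else "neg"} for i in range(100)]
train, test = split_rows(rows, 0.75, seed=42, strata_key=lambda r: r["label"])
```

Calling `split_rows` again with the same seed returns the same two lists, which is the reproducibility that the Random Seed option is meant to provide.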

Select a regular expression

  1. Add the Split Data module to your pipeline, and connect the dataset that you want to split as its input.

  2. For Splitting mode, select Regular Expression Split.

  3. In the Regular expression box, enter a valid regular expression.

    The regular expression should follow Python syntax for regular expressions.

  4. Submit the pipeline.

    Based on the regular expression that you provide, the dataset is divided into two sets of rows: rows with values that match the expression and all remaining rows.

The following examples demonstrate how to divide a dataset by using the Regular expression option.
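
Because the module expects Python regular expression syntax, the effect of a pattern can be previewed with Python's `re` module. The sketch below is illustrative only: `split_by_regex`, the sample rows, and the product name `Contoso` are hypothetical, and it assumes "contains" semantics (`re.search`), which may not match the module's behavior in every case.

```python
import re

def split_by_regex(rows, column, pattern):
    """Return (matching_rows, remaining_rows) by testing one column."""
    regex = re.compile(pattern)
    matching, remaining = [], []
    for row in rows:
        # re.search looks for the pattern anywhere in the value (assumed semantics).
        target = matching if regex.search(str(row[column])) else remaining
        target.append(row)
    return matching, remaining

reviews = [
    {"Text": "The Contoso phone is great"},
    {"Text": "Battery life could be better"},
    {"Text": "I returned my Contoso tablet"},
]
# Rows that mention the product name go to the first output, the rest to the second.
with_product, without_product = split_by_regex(reviews, "Text", r"\bContoso\b")
```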

Single whole word

This example puts into the first dataset all rows that contain the text Gryphon in the column Text. It puts other rows into the second output of Split Data.

    \"Text" Gryphon  

Substring

This example tests the values in the second column of the dataset against the specified pattern. The column is denoted here by the index value of 1. The match is case-sensitive.

(\1) ^[a-f]

The first result dataset contains all rows where the index column begins with one of these characters: a, b, c, d, e, f. All other rows are directed to the second output.
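
As a rough check of what this pattern selects, the same test can be run in plain Python. The rows below are made-up sample data, and addressing the column as `row[1]` is only an analogy for the module's `(\1)` index syntax.

```python
import re

# Each row is a tuple; index 1 is the second column.
rows = [("r1", "apple"), ("r2", "Banana"), ("r3", "fig"), ("r4", "grape")]

starts_a_to_f = re.compile(r"^[a-f]")  # ^ anchors the match at the start of the value

first_output = [row for row in rows if starts_a_to_f.search(row[1])]
second_output = [row for row in rows if not starts_a_to_f.search(row[1])]
```

Here "Banana" lands in the second output because the character class `[a-f]` contains only lowercase letters and the match is case-sensitive.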

Select a relative expression

  1. Add the Split Data module to your pipeline, and connect the dataset that you want to split as its input.

  2. For Splitting mode, select Relative Expression Split.

  3. In the Relational expression box, enter an expression that performs a comparison operation on a single column.

    For Numeric column:

    • The column contains numbers of any numeric data type, including date and time data types.
    • The expression can reference a maximum of one column name.
    • Use the ampersand character, &, for the AND operation. Use the pipe character, |, for the OR operation.
    • The following operators are supported: <, >, <=, >=, ==, !=.
    • You can't group operations by using ( and ).

    For String column:

    • The following operators are supported: ==, !=.

  4. Submit the pipeline.

    The expression divides the dataset into two sets of rows: rows with values that meet the condition, and all remaining rows.

The following examples demonstrate how to divide a dataset by using the Relative Expression option in the Split Data module.
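
The AND and OR operators described in step 3 behave like ordinary Boolean conditions on one column. The following Python sketch shows that idea under stated assumptions: `split_by_condition`, the `Age` column, and the age bounds are hypothetical, and the lambdas stand in for relational expressions rather than reproducing the module's own syntax.

```python
def split_by_condition(rows, predicate):
    """Return (rows that meet the condition, all remaining rows)."""
    meets, remaining = [], []
    for row in rows:
        (meets if predicate(row) else remaining).append(row)
    return meets, remaining

people = [{"Age": 12}, {"Age": 18}, {"Age": 40}, {"Age": 65}, {"Age": 70}]

# AND (&): both comparisons on the same column must hold.
working_age, others = split_by_condition(
    people, lambda r: r["Age"] >= 18 and r["Age"] <= 65
)

# OR (|): either comparison on the same column may hold.
outside_range, inside_range = split_by_condition(
    people, lambda r: r["Age"] < 18 or r["Age"] > 65
)
```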

Calendar year

A common scenario is to divide a dataset by years. The following expression selects all rows where the values in the column Year are greater than 2010.

\"Year" > 2010

The date expression must account for all date parts that are included in the data column. The format of dates in the data column must be consistent.

For example, in a date column that uses the format mmddyyyy, the expression should be something like this:

\"Date" > 1/1/2010

Column index

The following expression demonstrates how you can use the column index to select all rows in the first column of the dataset that contain values less than or equal to 30, but not equal to 20.

(\0)<=30 & !=20
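
The three expressions in this section correspond to simple comparisons, which can be sketched in Python as follows. The sample rows, the helper `partition`, and the assumption that dates are stored as text in a single month/day/year format are all illustrative and not taken from the module.

```python
from datetime import datetime

def partition(rows, predicate):
    """Return (rows that meet the condition, all remaining rows)."""
    meets = [row for row in rows if predicate(row)]
    remaining = [row for row in rows if not predicate(row)]
    return meets, remaining

# Counterpart of: \"Year" > 2010
years = [{"Year": 2008}, {"Year": 2010}, {"Year": 2011}]
after_2010, up_to_2010 = partition(years, lambda r: r["Year"] > 2010)

# Counterpart of: \"Date" > 1/1/2010
# Parsing every value with one format is what keeps the comparison meaningful;
# comparing the raw strings would wrongly rank "12/31/2009" after "01/01/2010".
DATE_FORMAT = "%m/%d/%Y"  # assumed format for this sketch
cutoff = datetime.strptime("1/1/2010", DATE_FORMAT)
dated = [{"Date": "12/31/2009"}, {"Date": "03/15/2011"}]
later, earlier = partition(
    dated, lambda r: datetime.strptime(r["Date"], DATE_FORMAT) > cutoff
)

# Counterpart of: (\0)<=30 & !=20  (index 0 is the first column)
table = [(10, "a"), (20, "b"), (30, "c"), (31, "d")]
selected, rest = partition(table, lambda row: row[0] <= 30 and row[0] != 20)
```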

Next steps

See the set of modules available to Azure Machine Learning.