Partition and Sample module

This article describes a module in Azure Machine Learning designer.

Use the Partition and Sample module to perform sampling on a dataset or to create partitions from your dataset.

Sampling is an important tool in machine learning because it lets you reduce the size of a dataset while maintaining the same ratio of values. This module supports several related tasks that are important in machine learning:

  • Dividing your data into multiple subsections of the same size.

    You might use the partitions for cross-validation, or to assign cases to random groups.

  • Separating data into groups and then working with data from a specific group.

    After you randomly assign cases to different groups, you might need to modify the features that are associated with only one group.

  • Sampling.

    You can extract a percentage of the data, apply random sampling, or choose a column to use for balancing the dataset and perform stratified sampling on its values.

  • Creating a smaller dataset for testing.

    If you have a lot of data, you might want to use only the first n rows while setting up the pipeline, and then switch to using the full dataset when you build your model. You can also use sampling to create a smaller dataset for use in development.

Configure the module

This module supports the following methods for dividing your data into partitions or for sampling. Choose the method first, and then set the additional options that the method requires.

  • Head
  • Sampling
  • Assign to folds
  • Pick fold

Get the top N rows from a dataset

Use this mode to get only the first n rows. This option is useful if you want to test a pipeline on a small number of rows, and you don't need the data to be balanced or sampled in any way.

  1. Add the Partition and Sample module to your pipeline in the interface, and connect the dataset.

  2. Partition or sample mode: Set this option to Head.

  3. Number of rows to select: Enter the number of rows to return.

    The number of rows must be a non-negative integer. If the number of selected rows is larger than the number of rows in the dataset, the entire dataset is returned.

  4. Submit the pipeline.

The module outputs a single dataset that contains only the specified number of rows. The rows are always read from the top of the dataset.
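The behavior of Head mode can be approximated outside the designer with a few lines of Python. This is only a minimal sketch, not the module's implementation: the function name `head` and the list-of-dicts row format are assumptions made for illustration.

```python
from itertools import islice


def head(rows, n):
    """Return the first n rows in their original order (hypothetical helper)."""
    if not isinstance(n, int) or n < 0:
        raise ValueError("number of rows must be a non-negative integer")
    # islice stops early when the dataset has fewer than n rows,
    # so asking for too many rows simply returns the whole dataset.
    return list(islice(rows, n))


rows = [{"id": i} for i in range(5)]
first_three = head(rows, 3)    # rows with id 0, 1, 2
everything = head(rows, 100)   # all five rows, because 100 > len(rows)
```

No shuffling or balancing happens here, which matches the description above: rows are taken strictly from the top.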

Create a sample of data

This option supports simple random sampling or stratified random sampling. It's useful if you want to create a smaller representative sample dataset for testing.

  1. Add the Partition and Sample module to your pipeline, and connect the dataset.

  2. Partition or sample mode: Set this option to Sampling.

  3. Rate of sampling: Enter a value between 0 and 1. This value specifies the fraction of rows from the source dataset that should be included in the output dataset.

    For example, if you want only half of the original dataset, enter 0.5 to indicate that the sampling rate should be 50 percent.

    The rows of the input dataset are shuffled and selectively placed in the output dataset, according to the specified ratio.

  4. Random seed for sampling: Optionally, enter an integer to use as a seed value.

    This option is important if you want the rows to be divided the same way every time. The default value is 0, meaning that a starting seed is generated based on the system clock. With the default, results can differ slightly each time you run the pipeline.

  5. Stratified split for sampling: Select this option if it's important that the rows in the dataset are divided evenly by some key column before sampling.

    For Stratification key column for sampling, select a single strata column to use when dividing the dataset. The rows in the dataset are then divided as follows:

    1. All input rows are grouped (stratified) by the values in the specified strata column.

    2. Rows are shuffled within each group.

    3. Each group is selectively added to the output dataset to meet the specified ratio.

  6. Submit the pipeline.

    With this option, the module outputs a single dataset that contains a representative sampling of the data. The remaining, unsampled portion of the dataset is not output.
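The sampling steps above can be sketched in plain Python. This is an illustration under stated assumptions rather than the module's actual code: the function `sample_rows`, its parameter names, and the rounding of each group's sample size are all hypothetical. The seed handling mirrors the text, where 0 means that no fixed seed is used.

```python
import random
from collections import defaultdict


def sample_rows(rows, rate, seed=0, strata_key=None):
    """Return roughly rate * len(rows) rows, optionally stratified by one column."""
    if not 0 <= rate <= 1:
        raise ValueError("rate must be between 0 and 1")
    # Assumption: seed 0 means "not fixed", so results vary between runs.
    rng = random.Random(None if seed == 0 else seed)

    if strata_key is None:
        groups = [list(rows)]  # simple random sampling: one group
    else:
        by_value = defaultdict(list)
        for row in rows:  # step 1: group rows by the strata column
            by_value[row[strata_key]].append(row)
        groups = list(by_value.values())

    sampled = []
    for group in groups:
        rng.shuffle(group)  # step 2: shuffle within each group
        # Step 3: keep the requested share of each group (rounding is an assumption).
        sampled.extend(group[:round(len(group) * rate)])
    return sampled


rows = [{"id": i, "label": "a"} for i in range(8)]
rows += [{"id": i, "label": "b"} for i in range(8, 12)]
simple = sample_rows(rows, 0.5, seed=42)
stratified = sample_rows(rows, 0.5, seed=42, strata_key="label")
```

With the stratified call, each label contributes half of its rows, so the 2:1 ratio between "a" and "b" in the input is preserved in the sample. Only the sampled rows are returned, in line with the note that the unsampled portion is not output.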

Split data into partitions

Use this option when you want to divide the dataset into subsets of the data. This option is also useful when you want to create a custom number of folds for cross-validation, or to split rows into several groups.

  1. Add the Partition and Sample module to your pipeline, and connect the dataset.

  2. For Partition or sample mode, select Assign to Folds.

  3. Use replacement in the partitioning: Select this option if you want the sampled row to be put back into the pool of rows for potential reuse. As a result, the same row might be assigned to several folds.

    If you don't use replacement (the default option), the sampled row is not put back into the pool of rows for potential reuse. As a result, each row can be assigned to only one fold.

  4. Randomized split: Select this option if you want rows to be randomly assigned to folds.

    If you don't select this option, rows are assigned to folds through the round-robin method.

  5. Random seed: Optionally, enter an integer to use as the seed value. This option is important if you want the rows to be divided the same way every time. Otherwise, the default value of 0 means that a random starting seed will be used.

  6. Specify the partitioner method: Indicate how you want data to be apportioned to each partition, by using these options:

    • Partition evenly: Use this option to place an equal number of rows in each partition. To specify the number of output partitions, enter a whole number in the Specify number of folds to split evenly into box.

    • Partition with customized proportions: Use this option to specify the size of each partition as a comma-separated list.

      For example, assume that you want to create three partitions. The first partition will contain 50 percent of the data. The remaining two partitions will each contain 25 percent of the data. In the List of proportions separated by comma box, enter these numbers: .5, .25, .25

      The partition sizes should add up to exactly 1.

      If you enter numbers that add up to less than 1, an extra partition is created to hold the remaining rows. For example, if you enter the values .2 and .3, a third partition is created to hold the remaining 50 percent of all rows.

      If you enter numbers that add up to more than 1, an error is raised when you run the pipeline.

  7. Stratified split: Select this option if you want the rows to be stratified when split, and then choose the strata column.

  8. Submit the pipeline.

    With this option, the module outputs multiple datasets. The datasets are partitioned according to the rules that you specified.
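The fold assignment rules described in these steps can be sketched as follows. The helper names `partition_evenly` and `complete_proportions` are hypothetical, and the sketch covers only the case without replacement and without stratification. Producing the randomized split by shuffling the round-robin labels is an assumption about one reasonable way to do it, not a statement about how the module works internally.

```python
import random


def partition_evenly(n_rows, n_folds, randomized=False, seed=0):
    """Return one 1-based fold index per row, with near-equal fold sizes."""
    if n_folds < 1:
        raise ValueError("n_folds must be at least 1")
    # Round-robin: row 0 -> fold 1, row 1 -> fold 2, ..., then wrap around.
    folds = [(i % n_folds) + 1 for i in range(n_rows)]
    if randomized:
        # Assumption: seed 0 means "no fixed seed", as the text describes.
        rng = random.Random(None if seed == 0 else seed)
        rng.shuffle(folds)  # same fold sizes, random row-to-fold mapping
    return folds


def complete_proportions(proportions, tol=1e-9):
    """Validate custom proportions and add a remainder partition if needed."""
    total = sum(proportions)
    if total > 1 + tol:
        raise ValueError("proportions add up to more than 1")
    result = list(proportions)
    if total < 1 - tol:
        result.append(1 - total)  # extra partition that holds the remaining rows
    return result


round_robin = partition_evenly(7, 3)
shuffled = partition_evenly(7, 3, randomized=True, seed=7)
three_parts = complete_proportions([0.2, 0.3])
```

Here `round_robin` is `[1, 2, 3, 1, 2, 3, 1]`, and `three_parts` gains a third entry of about 0.5, matching the ".2 and .3" example in the text. Each row receives exactly one fold index, which corresponds to the default behavior without replacement.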

Use data from a predefined partition

Use this option when you have divided a dataset into multiple partitions and now want to load each partition in turn for further analysis or processing.

  1. Add the Partition and Sample module to the pipeline.

  2. Connect the module to the output of a previous instance of Partition and Sample. That instance must have used the Assign to Folds option to generate some number of partitions.

  3. Partition or sample mode: Select Pick Fold.

  4. Specify which fold to be sampled from: Select a partition to use by entering its index. Partition indices are 1-based. For example, if you divided the dataset into three parts, the partitions would have the indices 1, 2, and 3.

    If you enter an invalid index value, a design-time error is raised: "Error 0018: Dataset contains invalid data."

    In addition to grouping the dataset by folds, you can separate the dataset into two groups: a target fold, and everything else. To do this, enter the index of a single fold, and then select the option Pick complement of the selected fold to get everything but the data in the specified fold.

  5. If you're working with multiple partitions, you must add more instances of the Partition and Sample module to handle each partition.

    For example, the Partition and Sample module in the second row is set to Assign to Folds, and the module in the third row is set to Pick Fold.

    (Figure: Partition and Sample)

  6. Submit the pipeline.

    With this option, the module outputs a single dataset that contains only the rows assigned to that fold.
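Picking a fold, or its complement, amounts to filtering rows by their fold index. The sketch below assumes that the fold designations are available as a list of 1-based indices parallel to the rows. In the designer they live in the dataset's metadata, so this representation and the function name `pick_fold` are purely illustrative.

```python
def pick_fold(rows, fold_ids, fold, complement=False):
    """Return the rows in the given 1-based fold, or all other rows."""
    n_folds = max(fold_ids)
    if not 1 <= fold <= n_folds:
        # The designer reports an invalid index as Error 0018;
        # this sketch raises a plain ValueError instead.
        raise ValueError(f"fold index must be between 1 and {n_folds}")
    # Keep a row when its membership in the fold differs from `complement`:
    # complement=False keeps the fold, complement=True keeps everything else.
    return [row for row, f in zip(rows, fold_ids) if (f == fold) != complement]


rows = ["a", "b", "c", "d", "e", "f", "g"]
fold_ids = [1, 2, 3, 1, 2, 3, 1]
validation = pick_fold(rows, fold_ids, 2)                   # ['b', 'e']
training = pick_fold(rows, fold_ids, 2, complement=True)    # ['a', 'c', 'd', 'f', 'g']
```

Using one fold for validation and its complement for training is the usual cross-validation pattern, repeated once per fold index.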

Note

You can't view the fold designations directly. They're present only in the metadata.

Next steps

See the set of modules available to Azure Machine Learning.