Train Vowpal Wabbit Model

This article describes how to use the Train Vowpal Wabbit Model module in Azure Machine Learning designer to create a machine learning model by using Vowpal Wabbit.

To use Vowpal Wabbit for machine learning, prepare your input data in one of the text formats that Vowpal Wabbit requires. Use this module to specify Vowpal Wabbit command-line arguments.

When the pipeline runs, an instance of Vowpal Wabbit is loaded into the experiment runtime together with the specified data. When training is complete, the model is serialized back to the workspace. You can use the model immediately to score data.

To incrementally train an existing model on new data, connect a saved model to the Pre-trained Vowpal Wabbit model input port of Train Vowpal Wabbit Model, and connect the new data to the other input port.

What is Vowpal Wabbit?

Vowpal Wabbit (VW) is a fast, parallel machine learning framework that was developed for distributed computing by Yahoo! Research. Later it was ported to Windows and adapted by John Langford (Microsoft Research) for scientific computing in parallel architectures.

Features of Vowpal Wabbit that are important for machine learning include continuous learning (online learning), dimensionality reduction, and interactive learning. Vowpal Wabbit is also an option when you cannot fit the data into memory.

The primary users of Vowpal Wabbit are data scientists who have previously used the framework for machine learning tasks such as classification, regression, topic modeling, or matrix factorization. The Azure wrapper for Vowpal Wabbit has performance characteristics very similar to the on-premises version, so you can use the features and native performance of Vowpal Wabbit and easily publish the trained model as an operationalized service.

The Feature Hashing module also includes functionality provided by Vowpal Wabbit that lets you transform text datasets into binary features by using a hashing algorithm.

How to configure Vowpal Wabbit Model

This section describes how to train a new model, and how to add new data to an existing model.

Unlike other modules in designer, this module both specifies the module parameters and trains the model. If you have an existing model, you can add it as an optional input to incrementally train the model.

Prepare the input data

To train a model using this module, the input dataset must consist of a single text column in one of the two supported formats: SVMLight or VW. This doesn't mean that Vowpal Wabbit analyzes only text data, only that the features and values must be prepared in the required text file format.

The data can be read from two kinds of datasets: a file dataset or a tabular dataset. Either kind must be in SVMLight or VW format. The Vowpal Wabbit data format has the advantage that it does not require a columnar layout, which saves space when dealing with sparse data. For more information about this format, see the Vowpal Wabbit wiki page.
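As a rough illustration of the two text formats, the following Python sketch turns a sparse record into one line of VW text and one line of SVMLight text. The helper names to_vw_line and to_svmlight_line, the namespace "f", and the sample feature names are hypothetical and only serve the example; consult the Vowpal Wabbit wiki for the full format rules.

```python
def to_vw_line(label, features, namespace="f"):
    """Format one example as a line of VW text: 'label |namespace name:value ...'.

    Zero-valued features are omitted, which is what keeps sparse data compact.
    """
    parts = []
    for name, value in features.items():
        if any(ch in name for ch in " :|"):
            raise ValueError(f"feature name {name!r} contains a reserved character")
        if value == 0:
            continue  # sparse representation: skip zeros
        parts.append(f"{name}:{value:g}")
    return f"{label} |{namespace} " + " ".join(parts)


def to_svmlight_line(label, indexed_features):
    """Format one example as SVMLight text: 'label index:value ...' (indices ascending)."""
    parts = [f"{i}:{v:g}" for i, v in sorted(indexed_features.items()) if v != 0]
    return f"{label} " + " ".join(parts)


vw_line = to_vw_line(1, {"price": 0.23, "sqft": 0, "age": 5})
svm_line = to_svmlight_line(-1, {3: 1.5, 1: 2})
print(vw_line)
print(svm_line)
```

Each formatted line would then be one row of the single text column that the module expects.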

Create and train a Vowpal Wabbit model

  1. Add the Train Vowpal Wabbit Model module to your experiment.

  2. Add the training dataset and connect it to Training data. If the training dataset is a directory that contains the training data file, specify the file name in Name of the training data file. If the training dataset is a single file, leave Name of the training data file empty.

  3. In the VW arguments text box, type the command-line arguments for the Vowpal Wabbit executable.

    For example, you might add -l to specify the learning rate, or -b to indicate the number of hashing bits.

    For more information, see the Vowpal Wabbit parameters section.

  4. Name of the training data file: Type the name of the file that contains the input data. This argument is used only when the training dataset is a directory.

  5. Specify file type: Indicate which format your training data uses. Vowpal Wabbit supports these two input file formats:

    • VW represents the internal format used by Vowpal Wabbit. See the Vowpal Wabbit wiki page for details.
    • SVMLight is a format used by some other machine learning tools.
  6. Output readable model file: Select this option if you want the module to save the readable model to the run records. This argument corresponds to the --readable_model parameter in the VW command line.

  7. Output inverted hash file: Select this option if you want the module to save the inverted hashing function to a file in the run records. This argument corresponds to the --invert_hash parameter in the VW command line.

  8. Submit the pipeline.

Retrain an existing Vowpal Wabbit model

Vowpal Wabbit supports incremental training by adding new data to an existing model. There are two ways to get an existing model for retraining:

  • Use the output of another Train Vowpal Wabbit Model module in the same pipeline.

  • Locate a saved model in the Datasets category of the designer's left navigation pane, and drag it into your pipeline.

  1. Add the Train Vowpal Wabbit Model module to your pipeline.

  2. Connect the previously trained model to the Pre-trained Vowpal Wabbit Model input port of the module.

  3. Connect the new training data to the Training data input port of the module.

  4. In the parameters pane of Train Vowpal Wabbit Model, specify the format of the new training data, and also the training data file name if the input dataset is a directory.

  5. Select the Output readable model file and Output inverted hash file options if the corresponding files need to be saved in the run records.

  6. Submit the pipeline.

  7. Select the module, and then select Register dataset under the Outputs+logs tab in the right pane to preserve the updated model in your Azure Machine Learning workspace. If you don't specify a new name, the updated model overwrites the existing saved model.

Technical notes

This section contains implementation details, tips, and answers to frequently asked questions.

Advantages of Vowpal Wabbit

Vowpal Wabbit provides extremely fast learning over non-linear features like n-grams.

Vowpal Wabbit uses online learning techniques such as stochastic gradient descent (SGD) to fit a model one record at a time. Thus it iterates very quickly over raw data and can develop a good predictor faster than most other models. This approach also avoids having to read all training data into memory.
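The one-record-at-a-time idea can be sketched in Python with plain SGD on squared loss. This is an illustration of online learning in general, not Vowpal Wabbit's actual update rule or code; the names sgd_online and toy_stream and the synthetic data are assumptions made for the example.

```python
def sgd_online(stream, n_features, learning_rate=0.1):
    """Fit a linear model with SGD, consuming one (features, label) record at a time.

    `stream` can be any iterator, so the full dataset never has to sit in memory.
    `features` is a sparse dict mapping weight index -> value.
    """
    weights = [0.0] * n_features
    for features, label in stream:
        prediction = sum(weights[i] * v for i, v in features.items())
        error = prediction - label
        # Gradient of 0.5 * error**2 with respect to weights[i] is error * v.
        for i, v in features.items():
            weights[i] -= learning_rate * error * v
    return weights


def toy_stream(n_records):
    """Yield synthetic records for y = 2*x + 1; index 1 is a constant bias feature."""
    for k in range(n_records):
        x = (k % 5) / 4.0
        yield {0: x, 1: 1.0}, 2.0 * x + 1.0


weights = sgd_online(toy_stream(5000), n_features=2)
print(weights)  # should approach [2.0, 1.0]
```

Because the loop only touches the weights of features present in the current record, the cost per record scales with the number of non-zero features rather than the total feature count.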

Vowpal Wabbit converts all data to hashes, not just text data but also other categorical variables. Using hashes makes lookup of regression weights more efficient, which is critical for effective stochastic gradient descent.
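A minimal Python sketch of the hashing idea follows. It uses zlib.crc32 purely as a stand-in hash function, which is an assumption for the example and not the hash that Vowpal Wabbit itself uses; the helper names are hypothetical. The bits parameter plays the role that -b plays on the VW command line: it fixes the weight table at 2**bits slots.

```python
import zlib


def hash_feature(name, bits=18):
    """Map a feature name to a slot in a weight table of size 2**bits.

    zlib.crc32 is a stand-in hash chosen only for this illustration.
    """
    return zlib.crc32(name.encode("utf-8")) & ((1 << bits) - 1)


def hash_features(features, bits=18):
    """Convert {name: value} to {index: value}; colliding names share a slot."""
    hashed = {}
    for name, value in features.items():
        index = hash_feature(name, bits)
        hashed[index] = hashed.get(index, 0.0) + value
    return hashed


record = {"color=red": 1.0, "size=large": 1.0, "price": 0.23}
print(hash_features(record))
```

Looking up a weight is then a direct array index instead of a dictionary search over feature names; the trade-off is that a smaller bits value raises the chance that unrelated features collide.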

Supported and unsupported parameters

This section describes support for Vowpal Wabbit command-line parameters in Azure Machine Learning designer.

Generally, all but a limited set of arguments are supported. For a complete list of arguments, see the Vowpal Wabbit wiki page.

The following parameters are not supported:

  • The input/output options specified in https://github.com/JohnLangford/vowpal_wabbit/wiki/Command-line-arguments

    These properties are already configured automatically by the module.

  • Additionally, any option that generates multiple outputs or takes multiple inputs is disallowed. These include --cbt, --lda, and --wap.

  • Only supervised learning algorithms are supported. Therefore, these options are not supported: --active, --rank, --search, and so on.
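Before pasting a string into the VW arguments text box, it can help to screen it against the options named in the list above. The Python sketch below does this with a hypothetical helper, find_unsupported. It covers only the six options this article names explicitly, does not check the input/output options, and is not the validation that designer itself performs.

```python
import shlex

# Only the options named explicitly in this article; the list is not exhaustive.
UNSUPPORTED = {"--cbt", "--lda", "--wap", "--active", "--rank", "--search"}


def find_unsupported(arguments):
    """Return the tokens in a VW argument string that match a known unsupported option."""
    tokens = shlex.split(arguments)
    # Handle both '--lda 10' and '--lda=10' spellings.
    return [token for token in tokens if token.split("=", 1)[0] in UNSUPPORTED]


print(find_unsupported("-l 0.5 -b 20"))
print(find_unsupported("--lda 10 -b 18 --active"))
```

An empty result means none of the listed options were found, not that every argument in the string is supported.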


Because the goal of the service is to support experienced users of Vowpal Wabbit, input data must be prepared ahead of time using the Vowpal Wabbit native text format, rather than the dataset format used by other modules.

Next steps

See the set of modules available to Azure Machine Learning.