Latent Dirichlet Allocation module

This article describes how to use the Latent Dirichlet Allocation module in Azure Machine Learning designer (preview) to group otherwise unclassified text into categories.

Latent Dirichlet Allocation (LDA) is often used in natural language processing to find texts that are similar. Another common term is topic modeling.

This module takes a column of text and generates these outputs:

  • The source text, together with a score for each category

  • A feature matrix that contains extracted terms and coefficients for each category

  • A transformation, which you can save and reapply to new text used as input

This module uses the scikit-learn library. For more information about scikit-learn, see the GitHub repository, which includes tutorials and an explanation of the algorithm.

More about Latent Dirichlet Allocation

LDA is generally not a method for classification. But it uses a generative approach, so you don't need to provide known class labels and then infer the patterns. Instead, the algorithm generates a probabilistic model that's used to identify groups of topics. You can use the probabilistic model to classify either existing training cases or new cases that you provide to the model as input.

You might prefer a generative model because it avoids making strong assumptions about the relationship between the text and categories. It uses only the distribution of words to mathematically model topics.

The theory is discussed in this paper, available as a PDF download: Latent Dirichlet Allocation: Blei, Ng, and Jordan.

The implementation in this module is based on the scikit-learn library for LDA.

For more information, see the Technical notes section.
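The following is a minimal sketch of this approach using scikit-learn directly. The corpus, parameter values, and variable names are illustrative, not the module's exact internals.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# A tiny illustrative corpus of product reviews.
corpus = [
    "the battery life of this phone is great",
    "terrible battery, the phone died in hours",
    "the camera takes sharp photos in low light",
    "low light photos look noisy on this camera",
]

# Convert text to token counts; LDA operates on word-count features.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

# Fit an online LDA model with two topics.
lda = LatentDirichletAllocation(n_components=2, learning_method="online", random_state=0)
doc_topics = lda.fit_transform(X)  # shape (n_documents, n_topics); each row sums to 1

# Show the most probable words for each discovered topic.
terms = vectorizer.get_feature_names_out()
for topic_idx, weights in enumerate(lda.components_):
    top = weights.argsort()[::-1][:4]
    print(f"Topic {topic_idx}:", [terms[i] for i in top])
```

No class labels are supplied anywhere in this sketch; the topics emerge only from word co-occurrence, which is the generative behavior described above.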

How to configure Latent Dirichlet Allocation

This module requires a dataset that contains a column of text, either raw or preprocessed.

  1. Add the Latent Dirichlet Allocation module to your pipeline.

  2. As input for the module, provide a dataset that contains one or more text columns.

  3. For Target columns, choose one or more columns that contain text to analyze.

    You can choose multiple columns, but they must be of the string data type.

    Because LDA creates a large feature matrix from the text, you'll typically analyze a single text column.

  4. For Number of topics to model, enter an integer between 1 and 1000 that indicates how many categories or topics you want to derive from the input text.

    By default, 5 topics are created.

  5. For N-grams, specify the maximum length of N-grams generated during hashing.

    The default is 2, meaning that both bigrams and unigrams are generated.

  6. Select the Normalize option to convert output values to probabilities.

    Rather than being represented as integers, values in the output and feature datasets will be transformed as follows:

    • Values in the dataset will be represented as the probability P(topic|document).

    • Values in the feature topic matrix will be represented as the probability P(word|topic).

    Note

    In Azure Machine Learning designer (preview), the scikit-learn library no longer supports unnormalized doc_topic_distr output, as of version 0.19. In this module, the Normalize parameter can be applied only to the feature topic matrix output. The transformed dataset output is always normalized.

  7. If you want to set the following advanced parameters, select the Show all options option and set it to True.

    These parameters are specific to the scikit-learn implementation of LDA. There are some good tutorials about LDA in scikit-learn, as well as the official scikit-learn documentation. For a sketch of how these options map to scikit-learn arguments, see the example after this list.

    • Rho parameter. Provide a prior probability for the sparsity of topic distributions. This parameter corresponds to sklearn's topic_word_prior parameter. Use the value 1 if you expect that the distribution of words is flat; that is, all words are assumed equiprobable. If you think most words appear sparsely, you might set it to a lower value.

    • Alpha parameter. Specify a prior probability for the sparsity of per-document topic weights. This parameter corresponds to sklearn's doc_topic_prior parameter.

    • Estimated number of documents. Enter a number that represents your best estimate of the number of documents (rows) that will be processed. This parameter lets the module allocate a hash table of sufficient size. It corresponds to the total_samples parameter in scikit-learn.

    • Size of the batch. Enter a number that indicates how many rows to include in each batch of text sent to the LDA model. This parameter corresponds to the batch_size parameter in scikit-learn.

    • Initial value of iteration used in learning update schedule. Specify the starting value that downweights the learning rate for early iterations in online learning. This parameter corresponds to the learning_offset parameter in scikit-learn.

    • Power applied to the iteration during updates. Indicate the power applied to the iteration count to control the learning rate during online updates. This parameter corresponds to the learning_decay parameter in scikit-learn.

    • Number of passes over the data. Specify the maximum number of times the algorithm will cycle over the data. This parameter corresponds to the max_iter parameter in scikit-learn.

  8. Select the option Build dictionary of ngrams or Build dictionary of ngrams prior to LDA if you want to create the n-gram list in an initial pass before classifying the text.

    If you create the initial dictionary beforehand, you can later use the dictionary when reviewing the model. Being able to map results to text rather than numerical indices is generally easier for interpretation. However, saving the dictionary takes longer and uses additional storage.

  9. For Maximum size of ngram dictionary, enter the total number of rows that can be created in the n-gram dictionary.

    This option is useful for controlling the size of the dictionary. But if the number of ngrams in the input exceeds this size, collisions may occur.

  10. Submit the pipeline. The LDA module uses Bayes' theorem to determine which topics might be associated with individual words. Words are not exclusively associated with any one topic or group. Instead, each n-gram has a learned probability of being associated with any of the discovered classes.
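As mentioned in step 7, the advanced options correspond to arguments of scikit-learn's LatentDirichletAllocation estimator. The following sketch shows one plausible mapping; the values are illustrative, and scikit-learn recommends a learning_offset greater than 1.0, so the module's default of 0 is replaced with a valid value here.

```python
from sklearn.decomposition import LatentDirichletAllocation

lda = LatentDirichletAllocation(
    n_components=5,          # Number of topics to model
    topic_word_prior=0.01,   # Rho parameter
    doc_topic_prior=0.01,    # Alpha parameter
    total_samples=1000,      # Estimated number of documents
    batch_size=32,           # Size of the batch
    learning_offset=10.0,    # Initial value of iteration in the learning update schedule
    learning_decay=0.5,      # Power applied to the iteration during updates
    max_iter=25,             # Number of passes over the data
    learning_method="online",
    random_state=0,
)
```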

Results

The module has two outputs:

  • Transformed dataset: This output contains the input text, a specified number of discovered categories, and the scores for each text example for each category.

  • Feature topic matrix: The leftmost column contains the extracted text feature. A column for each category contains the score for that feature in that category. (A sketch of how these two outputs relate to the scikit-learn model appears after this list.)
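Continuing the fitted model and variables (lda, doc_topics, corpus, terms) from the sketch in the "More about Latent Dirichlet Allocation" section, the two outputs could be assembled roughly as follows. This illustrates the output shapes only, not the designer's internal code.

```python
import pandas as pd

topic_cols = [f"Topic {i + 1}" for i in range(lda.n_components)]

# Transformed dataset: the input text plus a score for each discovered category.
transformed = pd.DataFrame(doc_topics, columns=topic_cols)
transformed.insert(0, "Text", corpus)

# Feature topic matrix: leftmost column is the extracted term, one column per topic.
feature_topic = pd.DataFrame(lda.components_.T, columns=topic_cols)
feature_topic.insert(0, "Feature", terms)

print(transformed.head())
print(feature_topic.head())
```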

LDA transformation

This module also outputs the LDA transformation that applies LDA to the dataset.

You can save this transformation and reuse it for other datasets. This technique might be useful if you've trained on a large corpus and want to reuse the coefficients or categories.

To reuse this transformation, select the Register dataset icon in the right panel of the Latent Dirichlet Allocation module to keep the module under the Datasets category in the module list. Then you can connect this module to the Apply Transformation module to reuse the transformation.
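Outside the designer, a roughly equivalent pattern is to persist the fitted vectorizer and LDA model together and reapply them to new text. This is a sketch under that assumption, not the designer's internal storage format; the file name is arbitrary.

```python
import joblib
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

corpus = ["sample review text one", "another sample review"]
transform = make_pipeline(
    CountVectorizer(),
    LatentDirichletAllocation(n_components=2, random_state=0),
)
transform.fit(corpus)

joblib.dump(transform, "lda_transform.joblib")           # save once
reloaded = joblib.load("lda_transform.joblib")           # reload later
scores = reloaded.transform(["new unseen review text"])  # reapply to new data
print(scores)
```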

Refining an LDA model or results

Typically, you can't create a single LDA model that will meet all needs. Even a model designed for one task might require many iterations to improve accuracy. We recommend that you try all these methods to improve your model:

  • Changing the model parameters
  • Using visualization to understand the results
  • Getting the feedback of subject matter experts to determine whether the generated topics are useful

Qualitative measures can also be useful for assessing the results. To evaluate topic modeling results, consider:

  • Accuracy. Are similar items really similar?
  • Diversity. Can the model discriminate between similar items when required for the business problem?
  • Scalability. Does it work on a wide range of text categories or only on a narrow target domain?

You can often improve the accuracy of models based on LDA by using natural language processing to clean, summarize and simplify, or categorize text. For example, the following techniques, all supported in Azure Machine Learning, can improve classification accuracy:

  • Stop word removal

  • Case normalization

  • Lemmatization or stemming

  • Named entity recognition

For more information, see Preprocess Text.

In the designer, you can also use R or Python libraries for text processing: Execute R Script, Execute Python Script.
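For example, a lightweight preprocessing pass like the following sketch could run in an Execute Python Script module before LDA. The stop word list and the suffix-stripping stemmer are deliberately crude illustrations; a real pipeline would typically use the Preprocess Text module or a library such as nltk or spaCy.

```python
import re

STOP_WORDS = {"the", "a", "an", "is", "in", "of", "this", "and"}

def preprocess(text: str) -> str:
    tokens = re.findall(r"[a-z']+", text.lower())             # case normalization
    tokens = [t for t in tokens if t not in STOP_WORDS]       # stop word removal
    tokens = [re.sub(r"(ing|ed|s)$", "", t) for t in tokens]  # naive stemming
    return " ".join(t for t in tokens if t)

print(preprocess("The cameras performed well in low lighting"))
# -> "camera perform well low light"
```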

Technical notes

This section contains implementation details, tips, and answers to frequently asked questions.

Implementation details

By default, the distributions of outputs for a transformed dataset and feature-topic matrix are normalized as probabilities (a numpy sketch follows this list):

  • The transformed dataset is normalized as the conditional probability of topics given a document. In this case, the sum of each row equals 1.

  • The feature-topic matrix is normalized as the conditional probability of words given a topic. In this case, the sum of each column equals 1.
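The following numpy sketch illustrates both normalizations; the matrices are random stand-ins for real LDA outputs.

```python
import numpy as np

rng = np.random.default_rng(0)
doc_topic = rng.random((4, 3))   # 4 documents x 3 topics (unnormalized scores)
topic_word = rng.random((3, 6))  # 3 topics x 6 words (unnormalized weights)

# Transformed dataset: P(topic|document) -- each document's row sums to 1.
p_topic_given_doc = doc_topic / doc_topic.sum(axis=1, keepdims=True)
print(p_topic_given_doc.sum(axis=1))  # -> [1. 1. 1. 1.]

# Feature-topic matrix: P(word|topic). With words as rows and topics as
# columns, each topic's column sums to 1.
p_word_given_topic = (topic_word / topic_word.sum(axis=1, keepdims=True)).T
print(p_word_given_topic.sum(axis=0))  # -> [1. 1. 1.]
```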

Tip

Occasionally, the module might return an empty topic. Most often, the cause is pseudo-random initialization of the algorithm. If this happens, you can try changing related parameters. For example, change the maximum size of the N-gram dictionary or the number of bits to use for feature hashing.

LDA and topic modeling

Latent Dirichlet Allocation is often used for content-based topic modeling, which basically means learning categories from unclassified text. In content-based topic modeling, a topic is a distribution over words.

For example, assume that you've provided a corpus of customer reviews that includes many products. The text of reviews that have been submitted by customers over time contains many terms, some of which are used in multiple topics.

A topic that the LDA process identifies might represent reviews for an individual product, or it might represent a group of product reviews. To LDA, the topic itself is just a probability distribution over a set of words.

Terms are rarely exclusive to any one product. They can refer to other products, or be general terms that apply to everything ("great", "awful"). Other terms might be noise words. However, the LDA method doesn't try to capture all words in the universe or to understand how words are related, aside from probabilities of co-occurrence. It can only group words that are used in the target domain.

After the term indexes are computed, a distance-based similarity measure compares individual rows of text to determine whether two pieces of text are similar. For example, you might find that the product has multiple names that are strongly correlated. Or, you might find that strongly negative terms are usually associated with a particular product. You can use the similarity measure both to identify related terms and to create recommendations.
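For instance, cosine similarity over the learned topic scores is one such distance-based comparison. The vectors below are illustrative stand-ins for rows of the module's transformed output.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Topic scores for three documents (each row sums to 1, as in the normalized output).
doc_topics = np.array([
    [0.90, 0.05, 0.05],  # mostly topic 0
    [0.85, 0.10, 0.05],  # also mostly topic 0 -- similar to the first
    [0.05, 0.10, 0.85],  # mostly topic 2 -- different
])

sim = cosine_similarity(doc_topics)
print(np.round(sim, 2))
# Rows 0 and 1 score near 1.0 (similar); rows 0 and 2 score much lower.
```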

Module parameters

| Name | Type | Range | Optional | Default | Description |
|------|------|-------|----------|---------|-------------|
| Target column(s) | Column Selection | | Required | StringFeature | Target column name or index. |
| Number of topics to model | Integer | [1;1000] | Required | 5 | Model the document distribution against N topics. |
| N-grams | Integer | [1;10] | Required | 2 | Order of N-grams generated during hashing. |
| Normalize | Boolean | True or False | Required | True | Normalize output to probabilities. The transformed dataset will be P(topic\|document) and the feature topic matrix will be P(word\|topic). |
| Show all options | Boolean | True or False | Required | False | Presents additional parameters specific to scikit-learn online LDA. |
| Rho parameter | Float | [0.00001;1.0] | Applies when the Show all options check box is selected | 0.01 | Topic word prior distribution. |
| Alpha parameter | Float | [0.00001;1.0] | Applies when the Show all options check box is selected | 0.01 | Document topic prior distribution. |
| Estimated number of documents | Integer | [1;int.MaxValue] | Applies when the Show all options check box is selected | 1000 | Estimated number of documents. Corresponds to the total_samples parameter. |
| Size of the batch | Integer | [1;1024] | Applies when the Show all options check box is selected | 32 | Size of the batch. |
| Initial value of iteration used in learning rate update schedule | Integer | [0;int.MaxValue] | Applies when the Show all options check box is selected | 0 | Initial value that downweights the learning rate for early iterations. Corresponds to the learning_offset parameter. |
| Power applied to the iteration during updates | Float | [0.0;1.0] | Applies when the Show all options check box is selected | 0.5 | Power applied to the iteration count to control the learning rate. Corresponds to the learning_decay parameter. |
| Number of training iterations | Integer | [1;1024] | Applies when the Show all options check box is selected | 25 | Number of training iterations. |
| Build dictionary of ngrams | Boolean | True or False | Applies when the Show all options check box is not selected | True | Builds a dictionary of ngrams prior to computing LDA. Useful for model inspection and interpretation. |
| Maximum size of ngram dictionary | Integer | [1;int.MaxValue] | Applies when the option Build dictionary of ngrams is True | 20000 | Maximum size of the ngrams dictionary. If the number of tokens in the input exceeds this size, collisions might occur. |
| Number of bits to use for feature hashing | Integer | [1;31] | Applies when the Show all options check box is not selected and Build dictionary of ngrams is False | 12 | Number of bits to use for feature hashing. |
| Build dictionary of ngrams prior to LDA | Boolean | True or False | Applies when the Show all options check box is selected | True | Builds a dictionary of ngrams prior to LDA. Useful for model inspection and interpretation. |
| Maximum number of ngrams in dictionary | Integer | [1;int.MaxValue] | Applies when the Show all options check box is selected and the option Build dictionary of ngrams is True | 20000 | Maximum size of the dictionary. If the number of tokens in the input exceeds this size, collisions might occur. |
| Number of hash bits | Integer | [1;31] | Applies when the Show all options check box is selected and the option Build dictionary of ngrams is False | 12 | Number of bits to use during feature hashing. |

Next steps

See the set of modules available to Azure Machine Learning.

For a list of errors specific to the modules, see Exceptions and error codes for the designer.