Extract data with entities

An entity extracts data from a user utterance at prediction runtime. An optional, secondary purpose is to boost the prediction of the intent or of other entities by using the entity as a feature.

There are several types of entities:

  • Machine-learning entity - this is the primary entity. Design your schema with this entity type before using other entities.
  • Non-machine-learning entity used as a required feature - for exact text matches, pattern matches, or detection by prebuilt entities.
  • Pattern.any - extracts free-form text, such as book titles, from a pattern.

Machine-learning entities provide the widest range of data extraction choices. Non-machine-learning entities work by text matching and are used as a required feature for a machine-learning entity or intent.

Entities represent data

Entities are data you want to pull from the utterance, such as names, dates, product names, or any significant group of words. An utterance can include many entities or none at all. A client application may need the data to perform its task.

Entities need to be labeled consistently across all training utterances for each intent in a model.

You can define your own entities, or use prebuilt entities to save time on common concepts such as datetimeV2, ordinal, email, and phone number.

Utterance | Entity | Data
Buy 3 tickets to New York | Prebuilt number | 3
Buy 3 tickets to New York | | New York
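
As a rough illustration of how a client application might consume extracted data like this, the Python sketch below reads entity values out of a prediction result. The dictionary shape, the entity name `Destination`, and the helper `first_value` are assumptions made for illustration. They are not taken from the LUIS response schema, so check the actual prediction response before relying on any field name.

```python
from typing import Any, Dict, List, Optional

# Hypothetical prediction result for "Buy 3 tickets to New York".
# The keys and entity names are illustrative assumptions, not the LUIS schema.
prediction: Dict[str, Any] = {
    "query": "Buy 3 tickets to New York",
    "entities": {
        "number": [3],
        "Destination": ["New York"],
    },
}


def first_value(entities: Dict[str, List[Any]], name: str) -> Optional[Any]:
    """Return the first extracted value for an entity, or None if it is absent."""
    values = entities.get(name, [])
    return values[0] if values else None


ticket_count = first_value(prediction["entities"], "number")
destination = first_value(prediction["entities"], "Destination")
print(ticket_count, destination)
```

Returning `None` for a missing entity reflects the point above that an utterance can contain no entities at all, so the client should not assume every value is present.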

While intents are required, entities are optional. You do not need to create an entity for every concept in your app, only for those where the client application needs the data or where the entity acts as a hint or signal to another entity or intent.

As your application develops and you identify a new need for data, you can add appropriate entities to your LUIS model later.

Entity represents data extraction

The entity represents a data concept inside the utterance. An intent classifies the entire utterance.

Consider the following four utterances:

Utterance | Intent predicted | Entities extracted | Explanation
Help | help | - | Nothing to extract.
Send something | sendSomething | - | Nothing to extract. The model does not have a required feature to extract something in this context, and there is no recipient stated.
Send Bob a present | sendSomething | Bob, present | The model extracts Bob by adding a required feature of prebuilt entity personName. A machine-learning entity has been used to extract present.
Send Bob a box of chocolates | sendSomething | Bob, box of chocolates | The two important pieces of data, Bob and the box of chocolates, have been extracted by machine-learning entities.
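
The sketch below shows one way a client application could act on results like those in the table: the intent selects the action, and the entities supply the data for it. The function `handle_prediction` and the entity names `recipient` and `item` are hypothetical, chosen only for this example.

```python
from typing import Dict, List


def handle_prediction(intent: str, entities: Dict[str, List[str]]) -> str:
    """Decide what the client does next from the intent and extracted entities."""
    if intent == "help":
        return "Show help"
    if intent == "sendSomething":
        recipient = entities.get("recipient", [])
        item = entities.get("item", [])
        if not recipient or not item:
            # Not enough data was extracted, so the client has to ask for it.
            return "Ask the user who should receive what"
        return f"Send {item[0]} to {recipient[0]}"
    return "Unknown intent"


# Hypothetical results mirroring the four utterances above.
print(handle_prediction("help", {}))
print(handle_prediction("sendSomething", {}))
print(handle_prediction("sendSomething", {"recipient": ["Bob"], "item": ["present"]}))
print(handle_prediction(
    "sendSomething", {"recipient": ["Bob"], "item": ["box of chocolates"]}
))
```

The second call corresponds to "Send something": the intent is predicted, but with no entities extracted the client falls back to prompting the user.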

Label entities in all intents

Entities extract data regardless of the predicted intent. Make sure you label all example utterances in all intents. Missing entity labels in the None intent cause confusion, even if the other intents have far more training utterances.

Design entities for decomposition

Machine-learning entities allow you to design your app schema for decomposition, breaking a large concept into subentities.

Designing for decomposition allows LUIS to return a deep degree of entity resolution to your client application. This allows your client application to focus on business rules and leave data resolution to LUIS.

A machine-learning entity triggers based on the context learned through example utterances.

Machine-learning entities are the top-level extractors. Subentities are child entities of machine-learning entities.

Effective machine-learning entities

To build machine-learning entities effectively:

  • Your labeling should be consistent across the intents. This includes even utterances you provide in the None intent that include this entity. Otherwise, the model will not be able to determine the sequences effectively.
  • If you have a machine-learning entity with subentities, make sure that the different orders and variants of the entity and subentities are presented in the labeled utterances. Labeled example utterances should include all valid forms, including cases where entities appear, are absent, or are reordered within the utterance.
  • Avoid overfitting the entities to a very fixed set. Overfitting happens when the model doesn't generalize well, and it is a common problem in machine-learning models. It implies the app would not work adequately on new data. You should therefore vary the labeled example utterances so the app can generalize beyond the limited examples you provide. Vary the different subentities enough that the model learns the concept rather than just the examples shown.
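
To make the second guideline concrete, the sketch below checks which subentity orderings a set of labeled examples actually covers. The data structure (a list of dictionaries with `text` and `labels`), the subentity names, and the function `observed_orderings` are assumptions for illustration, not a LUIS export format.

```python
from typing import Any, Dict, List, Set, Tuple

# Hypothetical labeled examples: each label is (subentity name, start character index).
examples: List[Dict[str, Any]] = [
    {"text": "Send Bob a present", "labels": [("recipient", 5), ("item", 11)]},
    {"text": "Send a present to Bob", "labels": [("item", 7), ("recipient", 18)]},
    {"text": "Send a present", "labels": [("item", 7)]},
]


def observed_orderings(labeled: List[Dict[str, Any]]) -> Set[Tuple[str, ...]]:
    """Collect the distinct left-to-right orders in which subentities are labeled."""
    orderings: Set[Tuple[str, ...]] = set()
    for example in labeled:
        # Sort labels by where they start in the utterance text.
        ordered = sorted(example["labels"], key=lambda label: label[1])
        orderings.add(tuple(name for name, _ in ordered))
    return orderings


for ordering in sorted(observed_orderings(examples)):
    print(ordering)
```

Here the examples cover both orders of `recipient` and `item` as well as an utterance where `recipient` is absent. A result containing only one ordering would suggest the labeled utterances are too uniform.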

Effective prebuilt entities

To build effective entities that extract common data, such as the data provided by the prebuilt entities, we recommend the following process.

Improve the extraction of data by bringing your own data to an entity as a feature. That way, the additional labels from your data help the model learn the context in which the data (for example, person names) appears in your application.

Types of entities

A subentity of a parent should be a machine-learning entity. The subentity can use a non-machine-learning entity as a feature.

Choose the entity based on how the data should be extracted and how it should be represented after it is extracted.

Entity type | Purpose
Machine-learned | Extract nested, complex data learned from labeled examples.
List | List of items and their synonyms, extracted with exact text match.
Pattern.any | Entity whose end is difficult to determine because the entity is free-form. Only available in patterns.
Prebuilt | Already trained to extract a specific kind of data, such as a URL or email. Some of these prebuilt entities are defined in the open-source Recognizers-Text project. If your specific culture or entity isn't currently supported, contribute to the project.
Regular Expression | Uses a regular expression for exact text match.

Extraction versus resolution

Entities extract data as the data appears in the utterance. Entities do not change or resolve the data. An entity does not indicate whether the extracted text is a valid value for that entity.

There are ways to bring resolution into the extraction, but be aware that doing so limits the app's tolerance for variations and mistakes.

List entities and regular expression (text-matching) entities can be used as required features of a subentity, where they act as a filter on the extraction. Use this carefully so that you do not hinder the app's ability to predict.
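
The filtering effect can be pictured with a small conceptual simulation. This is not how LUIS implements required features. The city list, the flight-number pattern, and both function names are invented for the example.

```python
import re
from typing import List

# Hypothetical list entity: the only values the subentity may return.
KNOWN_CITIES = {"seattle", "cairo", "new york"}
# Hypothetical regular expression entity for a flight number such as "BA123".
FLIGHT_NUMBER = re.compile(r"[A-Z]{2}\d{3,4}")


def filter_by_list(candidates: List[str]) -> List[str]:
    """Keep only candidates that exactly match a list value (ignoring case)."""
    return [c for c in candidates if c.lower() in KNOWN_CITIES]


def filter_by_regex(candidates: List[str]) -> List[str]:
    """Keep only candidates that match the whole regular expression."""
    return [c for c in candidates if FLIGHT_NUMBER.fullmatch(c)]


# "Seatle" is a misspelling a learned model might still recognize from
# context, but an exact-match filter discards it.
print(filter_by_list(["Seattle", "Seatle", "Cairo"]))
print(filter_by_regex(["BA123", "ba123", "flight 12"]))
```

The misspelled "Seatle" and the lowercase "ba123" are both dropped, which is the trade-off described above: an exact-match filter gains precision at the cost of tolerance for variations and mistakes.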

An utterance may contain two or more occurrences of an entity, where the meaning of the data depends on context within the utterance. An example is an utterance for booking a flight that has two geographical locations, origin and destination.

Book a flight from Seattle to Cairo

The two locations need to be extracted in a way that lets the client application know the type of each location in order to complete the ticket purchase.

To extract the origin and destination, create two subentities as part of the ticket order machine-learning entity. For each of the subentities, create a required feature that uses geographyV2.
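
From the client's side, a decomposed entity of this kind could be read roughly as follows. The nested dictionary and the names `TicketOrder`, `Origin`, `Destination`, and `read_route` are assumptions for illustration only. The real response format may differ.

```python
from typing import Any, Dict, List, Optional

# Hypothetical extraction result for "Book a flight from Seattle to Cairo".
# Entity and subentity names are assumed, not taken from a real LUIS app.
entities: Dict[str, List[Dict[str, Any]]] = {
    "TicketOrder": [
        {"Origin": ["Seattle"], "Destination": ["Cairo"]},
    ]
}


def read_route(extracted: Dict[str, List[Dict[str, Any]]]) -> Optional[Dict[str, str]]:
    """Return the origin and destination of the first ticket order, if both exist."""
    orders = extracted.get("TicketOrder", [])
    if not orders:
        return None
    origin = orders[0].get("Origin", [])
    destination = orders[0].get("Destination", [])
    if not origin or not destination:
        # One of the two locations was not extracted.
        return None
    return {"origin": origin[0], "destination": destination[0]}


route = read_route(entities)
print(route)
```

Because each location arrives under its own subentity name, the client never has to guess which city is the origin and which is the destination.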

Using required features to constrain entities

Learn more about required features.

Pattern.any entity

A Pattern.any entity is only available in a pattern.

Exceeding app limits for entities

If you need more than the limit, contact support. To do so, gather detailed information about your system, go to the LUIS website, and then select Support. If your Azure subscription includes support services, contact Azure technical support.

Entity prediction status and errors

The LUIS portal indicates when the predicted entity for an example utterance differs from the entity you labeled. This difference is based on the currently trained model.

The text in error is highlighted within the example utterance, and an error indicator, shown as a red triangle, appears to the right of the example utterance line.

Use this information to resolve entity errors using one or more of the following:

  • The highlighted text is mislabeled. To fix this, review the label, correct it, and retrain.
  • Create a feature for the entity to help identify the entity's concept.
  • Add more example utterances and label them with the entity.
  • Review active learning suggestions for any utterances received at the prediction endpoint that can help identify the entity's concept.

Next steps

Learn concepts about good utterances.

See Add entities to learn more about how to add entities to your LUIS app.

See Tutorial: Extract structured data from user utterance with machine-learning entities in Language Understanding (LUIS) to learn how to extract structured data from an utterance using the machine-learning entity.