Machine-learning features

In machine learning, a feature is a distinguishing trait or attribute of data that your system observes and learns from.

Machine-learning features give LUIS important cues for where to look for things that distinguish a concept. They're hints that LUIS can use, but they aren't hard rules. LUIS uses these hints together with the labels to find the data.

A feature can be described as a function, like f(x) = y. In the example utterance, the feature tells you where to look for the distinguishing trait. Use this information to help create your schema.

Types of features

Features are a necessary part of your schema design. LUIS supports both phrase lists and models as features:

  • Phrase list feature
  • Model (intent or entity) as a feature

Find features in your example utterances

Because LUIS is a language-based application, the features are text-based. Choose text that indicates the trait you want to distinguish. For LUIS, the smallest unit is the token. For the English language, a token is a contiguous span of letters and numbers that has no spaces or punctuation.

Because spaces and punctuation aren't tokens, focus on the text clues that you can use as features. Remember to include variations of words, such as:

  • plural forms
  • verb tenses
  • abbreviations
  • spellings and misspellings

Determine whether the text, because it distinguishes a trait, has to:

  • Match an exact word or phrase: Consider adding a regular expression entity or a list entity as a feature to the entity or intent.
  • Match a well-known concept like dates, times, or people's names: Use a prebuilt entity as a feature to the entity or intent.
  • Learn new examples over time: Use a phrase list of some examples of the concept as a feature to the entity or intent.

Create a phrase list for a concept

A phrase list is a list of words or phrases that describes a concept. A phrase list is applied as a case-insensitive match at the token level.

When adding a phrase list, you can set the feature as global. A global feature applies to the entire app.

When to use a phrase list

Use a phrase list when you need your LUIS app to generalize and identify new items for the concept. Phrase lists are like domain-specific vocabulary. They enhance the quality of understanding for intents and entities.

How to use a phrase list

With a phrase list, LUIS considers context and generalizes to identify items that are similar to, but aren't an exact match for, the text. Follow these steps to use a phrase list:

  1. Start with a machine-learning entity:
    1. Add example utterances.
    2. Label with a machine-learning entity.
  2. Add a phrase list:
    1. Add words with similar meaning. Don't add every possible word or phrase. Instead, add a few words or phrases at a time. Then retrain and publish.
    2. Review and add suggested words.

A typical scenario for a phrase list

A typical scenario for a phrase list is to boost words related to a specific idea.

Medical terms are a good example of words that might need a phrase list to boost their significance. These terms can have specific physical, chemical, therapeutic, or abstract meanings. Without a phrase list, LUIS won't know that the terms are important to your subject domain.

To extract the medical terms:

  1. Create example utterances and label medical terms within those utterances.
  2. Create a phrase list with examples of the terms within the subject domain. This phrase list should include the actual term you labeled and other terms that describe the same concept.
  3. Add the phrase list to the entity or subentity that extracts the concept used in the phrase list. The most common scenario is a component (child) of a machine-learning entity. If the phrase list should be applied across all intents or entities, mark the phrase list as a global phrase list. The enabledForAllModels flag controls this model scope in the API.
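
As a rough illustration of step 3, the sketch below builds a description of a phrase list and sets its scope. Only the enabledForAllModels flag is named in the text above. The other field names (name, phrases), the helper build_phrase_list, and the sample medical terms are hypothetical and chosen only for illustration, so check the authoring API reference for the actual request shape.

```python
import json


def build_phrase_list(name, phrases, is_global=False):
    """Build an illustrative phrase-list description (hypothetical shape)."""
    # Phrase lists match case-insensitively, so drop case-insensitive duplicates.
    seen = set()
    unique = []
    for phrase in phrases:
        cleaned = phrase.strip()
        key = cleaned.lower()
        if key and key not in seen:
            seen.add(key)
            unique.append(cleaned)
    return {
        "name": name,                      # assumed field name
        "phrases": unique,                 # assumed field name
        "enabledForAllModels": is_global,  # flag named in the text above
    }


# Sample terms are made up for this sketch.
medical_terms = build_phrase_list(
    "MedicalTerms",
    ["ibuprofen", "Ibuprofen", "acetaminophen", "naproxen"],
    is_global=False,
)
print(json.dumps(medical_terms, indent=2))
```

Leaving is_global as False keeps the phrase list scoped to the models you attach it to; setting it to True corresponds to marking the phrase list as global.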

Token matches for a phrase list

A phrase list always applies at the token level. The following table shows how a phrase list that has the word Ann applies to variations of the same characters in that order.

Token variation of Ann | Phrase list match when the token is found
ANN, aNN | Yes - token is Ann
Ann's | Yes - token is Ann
Anne | No - token is Anne
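
The matching behavior in the table can be approximated with a small sketch. It assumes the English rule stated earlier (a token is a contiguous run of letters and numbers) and compares tokens case-insensitively. The names tokenize and phrase_list_matches are hypothetical, and this is not LUIS's actual tokenizer.

```python
import re

# Assumption: a token is a run of ASCII letters and digits, per the English rule above.
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+")


def tokenize(text):
    """Split text into tokens, discarding spaces and punctuation."""
    return TOKEN_PATTERN.findall(text)


def phrase_list_matches(text, phrase):
    """Return True if the phrase's tokens appear, in order and adjacent, in the text."""
    phrase_tokens = [t.lower() for t in tokenize(phrase)]
    text_tokens = [t.lower() for t in tokenize(text)]
    n = len(phrase_tokens)
    if n == 0:
        return False
    return any(
        text_tokens[i:i + n] == phrase_tokens
        for i in range(len(text_tokens) - n + 1)
    )


for variation in ["ANN", "aNN", "Ann's", "Anne"]:
    print(variation, phrase_list_matches(variation, "Ann"))
```

In this sketch, "Ann's" matches because the apostrophe splits it into the tokens "Ann" and "s", while "Anne" is a single, different token.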

A model as a feature helps another model

You can add a model (intent or entity) as a feature to another model (intent or entity). By adding an existing intent or entity as a feature, you're adding a well-defined concept that has labeled examples.

When adding a model as a feature, you can set the feature as:

  • Required. A required feature has to be found in order for the model to be returned from the prediction endpoint.
  • Global. A global feature applies to the entire app.

When to use an entity as a feature to an intent

Add an entity as a feature to an intent when the detection of that entity is significant for the intent.

For example, if the intent is for booking a flight, like BookFlight, and the entity is ticket information (such as the number of seats, origin, and destination), then finding the ticket-information entity should add significant weight to the prediction of the BookFlight intent.

When to use an entity as a feature to another entity

An entity (A) should be added as a feature to another entity (B) when the detection of entity (A) is significant for the prediction of entity (B).

For example, if a shipping-address entity contains a street-address subentity, then finding the street-address subentity adds significant weight to the prediction for the shipping-address entity.

  • Shipping address (machine-learning entity):

    • Street number (subentity)
    • Street address (subentity)
    • City (subentity)
    • State or Province (subentity)
    • Country/Region (subentity)
    • Postal code (subentity)

Nested subentities with features

A machine-learning subentity indicates to its parent entity that a concept is present. The parent can be another subentity or the top entity. The value of the subentity acts as a feature to its parent.

A subentity can have both a phrase list and a model (another entity) as a feature.

When the subentity has a phrase list, it boosts the vocabulary of the concept but won't add any information to the JSON response of the prediction.

When the subentity has a feature of another entity, the JSON response includes the extracted data of that other entity.

Required features

A required feature has to be found in order for the model to be returned from the prediction endpoint. Use a required feature when you know your incoming data must match the feature.

If the utterance text doesn't match the required feature, it won't be extracted.

A required feature uses a non-machine-learning entity:

  • Regular-expression entity
  • List entity
  • Prebuilt entity

If you're confident that your model will be found in the data, set the feature as required. A required feature doesn't return anything if it isn't found.

Continuing with the example of the shipping address:

Shipping address (machine-learning entity)

  • Street number (subentity)
  • Street address (subentity)
  • Street name (subentity)
  • City (subentity)
  • State or Province (subentity)
  • Country/Region (subentity)
  • Postal code (subentity)

Required feature using prebuilt entities

The city, state, and country/region are generally closed sets, meaning they don't change much over time. These entities could have the relevant recommended features, and those features could be marked as required. That means the entire shipping address isn't returned if the entities that have required features aren't found.

What if the city, state, or country/region are in the utterance, but they're in a location or are slang that LUIS doesn't expect? If you want to provide post-processing to help resolve the entity when LUIS returns a low-confidence score, don't mark the feature as required.

Another example of a required feature for the shipping address is to make the street number a required, prebuilt number. This allows a user to enter "1 Microsoft Way" or "One Microsoft Way". Both resolve to the numeral "1" for the street-number subentity.
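
To make the street-number example concrete, here is a minimal sketch of the idea. The resolve_number function is a toy stand-in for the prebuilt number entity (it only knows digits and a handful of number words), and the sketch assumes, purely for simplicity, that the street number is the first token of the address. None of these names come from LUIS.

```python
import re

# Toy stand-in for the prebuilt number entity: digits plus a few number words.
NUMBER_WORDS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5}


def resolve_number(token):
    """Resolve a token to an integer, or None if it isn't a recognized number."""
    if re.fullmatch(r"[0-9]+", token):
        return int(token)
    return NUMBER_WORDS.get(token.lower())


def extract_street_number(address_text):
    """Return the street number only when the required number feature is found."""
    tokens = re.findall(r"[A-Za-z0-9]+", address_text)
    if not tokens:
        return None
    # Simplifying assumption: the street number is the first token.
    return resolve_number(tokens[0])


for text in ["1 Microsoft Way", "One Microsoft Way", "Microsoft Way"]:
    print(repr(text), "->", extract_street_number(text))
```

The last input has no number, so the required feature isn't found and nothing is returned for the street-number subentity.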

Required feature using list entities

A list entity is used as a list of canonical names along with their synonyms. As a required feature, if the utterance doesn't include either the canonical name or a synonym, then the entity isn't returned by the prediction endpoint.

Suppose that your company only ships to a limited set of countries/regions. You can create a list entity that includes several ways for your customer to reference the country/region. If LUIS doesn't find an exact match within the text of the utterance, then the entity (that has the required feature of the list entity) isn't returned in the prediction.

Canonical name | Synonyms
United States | U.S., U.S.A, US, USA, 0

A client application, such as a chat bot, can ask a follow-up question to help. This helps the customer understand that the country/region selection is limited and required.
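
The list-entity lookup and the follow-up question can be sketched as follows. The data mirrors the canonical name and synonyms listed above. The function names and the wording of the follow-up prompt are hypothetical, and the comparison here is an exact string match including case, which is a choice made for this sketch.

```python
# Canonical name mapped to its synonyms, mirroring the list entity above.
SHIPPING_COUNTRIES = {
    "United States": ["U.S.", "U.S.A", "US", "USA", "0"],
}


def resolve_country(candidate):
    """Return the canonical name for an exact match, or None if nothing matches."""
    for canonical, synonyms in SHIPPING_COUNTRIES.items():
        if candidate == canonical or candidate in synonyms:
            return canonical
    return None


def reply_for(candidate):
    """Confirm the destination, or ask a follow-up question when no match is found."""
    canonical = resolve_country(candidate)
    if canonical is None:
        options = ", ".join(SHIPPING_COUNTRIES)
        return f"We only ship to: {options}. Which country/region should we ship to?"
    return f"Shipping to {canonical}."


for text in ["USA", "U.S.", "Canada"]:
    print(repr(text), "->", reply_for(text))
```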

Required feature using regular expression entities

A regular expression entity that's used as a required feature provides rich text-matching capabilities.

In the shipping address example, you can create a regular expression that captures the syntax rules of the country/region postal codes.
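
As one possible illustration, the sketch below uses a pattern for United States ZIP codes (five digits, optionally followed by a hyphen and four more digits). The choice of country and the name find_postal_code are assumptions made for the example; other countries/regions need their own patterns, and the regular expression dialect accepted by LUIS may differ from Python's.

```python
import re

# Assumed format: US ZIP code, for example 98052 or 98052-6399.
US_ZIP_PATTERN = re.compile(r"\b[0-9]{5}(?:-[0-9]{4})?\b")


def find_postal_code(utterance):
    """Return the first ZIP-like span in the utterance, or None if there isn't one."""
    match = US_ZIP_PATTERN.search(utterance)
    return match.group(0) if match else None


for text in [
    "Ship to 1 Microsoft Way, Redmond, WA 98052",
    "Ship to 1 Microsoft Way, Redmond, WA 98052-6399",
    "Ship to 1 Microsoft Way, Redmond",
]:
    print(find_postal_code(text))
```

When the pattern is used as a required feature, the third utterance would yield no postal-code subentity because no span matches.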

Global features

While the most common use is to apply a feature to a specific model, you can configure the feature as a global feature to apply it to your entire application.

The most common use for a global feature is to add additional vocabulary to the app. For example, if your customers use a primary language, but expect to be able to use another language within the same utterance, you can add a feature that includes words from the secondary language.

Because the user expects to use the secondary language across any intent or entity, add words from the secondary language to the phrase list. Configure the phrase list as a global feature.

Combine features for added benefit

You can use more than one feature to describe a trait or concept. A common pairing is to use one or more phrase lists together with a model as a feature.

Example: ticket-booking entity features for a travel app

As a basic example, consider an app for booking a flight with a flight-reservation intent and a ticket-booking entity. The ticket-booking entity captures the information to book an airplane ticket in a reservation system.

The machine-learning entity for ticket booking has two subentities to capture origin and destination. The features need to be added to each subentity, not the top-level entity.

[Image: ticket-booking entity schema]

The ticket-booking entity is a machine-learning entity, with subentities including Origin and Destination. These subentities both indicate a geographical location. To help extract the locations, and distinguish between Origin and Destination, each subentity should have features.

Type | Origin subentity | Destination subentity
Model as a feature | geographyV2 prebuilt entity | geographyV2 prebuilt entity
Phrase list | Origin words: start at, begin from, leave | Destination words: to, arrive, land at, go, going, stay, heading
Phrase list | Airport codes - same list for both origin and destination | Airport codes - same list for both origin and destination
Phrase list | Airport names - same list for both origin and destination | Airport names - same list for both origin and destination

If you anticipate that people will use both airport codes and airport names, then LUIS should have phrase lists that cover both types of phrases. Airport codes may be more common in text entered in a chatbot, while airport names may be more common in spoken conversation, such as with a speech-enabled chatbot.

The matching details of the features are returned only for models, not for phrase lists, because only models are returned in the prediction JSON.
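
The feature layout described above can be written down as a small data structure. This is only a planning sketch: the Subentity class, its field names, and the sample airport codes and names are hypothetical, and it doesn't call any LUIS API. It does encode the two points from the text: both subentities share the same airport lists, and only the model feature would show up in prediction output.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Illustrative values only; a real app would use fuller lists.
AIRPORT_CODES = ["SEA", "JFK", "LHR"]
AIRPORT_NAMES = [
    "Seattle-Tacoma International Airport",
    "John F. Kennedy International Airport",
    "Heathrow Airport",
]


@dataclass
class Subentity:
    name: str
    model_features: List[str] = field(default_factory=list)
    phrase_lists: Dict[str, List[str]] = field(default_factory=dict)

    def features_in_prediction(self) -> List[str]:
        # Only models appear in the prediction JSON; phrase lists never do.
        return list(self.model_features)


def make_location_subentity(name: str, cue_words: List[str]) -> Subentity:
    """Build a location subentity with its own cue words and the shared airport lists."""
    return Subentity(
        name=name,
        model_features=["geographyV2"],
        phrase_lists={
            f"{name} words": cue_words,
            "Airport codes": AIRPORT_CODES,
            "Airport names": AIRPORT_NAMES,
        },
    )


origin = make_location_subentity("Origin", ["start at", "begin from", "leave"])
destination = make_location_subentity(
    "Destination", ["to", "arrive", "land at", "go", "going", "stay", "heading"]
)

for sub in (origin, destination):
    print(sub.name, sorted(sub.phrase_lists), sub.features_in_prediction())
```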

Ticket-booking labeling in the intent

After you create the machine-learning entity, you need to add example utterances to an intent, and label the parent entity and all subentities.

For the ticket-booking example, label the example utterances in the intent with the TicketBooking entity and any subentities in the text.

[Image: labeled example utterances]

Example: pizza ordering app

For a second example, consider an app for a pizza restaurant, which receives pizza orders including the details of the type of pizza someone is ordering. Each detail of the pizza should be extracted, if possible, in order to complete the order processing.

The machine-learning entity in this example is more complex, with nested subentities, phrase lists, prebuilt entities, and custom entities.

[Image: pizza order entity schema]

This example uses features at the subentity level and at the child-of-subentity level. Which level gets what kind of phrase list or model as a feature is an important part of your entity design.

While subentities can have many phrase lists as features that help detect the entity, each subentity has only one model as a feature. In this pizza app, those models are primarily lists.

[Image: pizza order intent with labeled example utterances]

The correctly labeled example utterances are displayed in a way that shows how the entities are nested.

Best practices

Learn best practices.

Next steps

  • Extend your app models at prediction runtime.
  • See Add features to learn more about how to add features to your LUIS app.