Data collection for your app

A Language Understanding (LUIS) app needs data as part of app development.

Data used in LUIS

LUIS uses text as data to train and test your LUIS app to classify intents and extract entities. You need a data set large enough to create separate training and test sets, each with the diversity and distribution described below. The data in these sets should not overlap.
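
As a minimal sketch of the non-overlap requirement, the following Python splits a list of utterances into disjoint training and test lists. The function name `split_utterances`, the 20% test fraction, and the sample utterances are assumptions for illustration, not part of LUIS. A plain random split like this does not by itself preserve the intent distribution discussed later.

```python
import random

def split_utterances(utterances, test_fraction=0.2, seed=0):
    """Split utterances into disjoint training and test lists."""
    # Drop exact duplicates first so the same text cannot land in both sets.
    unique = list(dict.fromkeys(utterances))
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    rng.shuffle(unique)
    n_test = max(1, round(len(unique) * test_fraction))
    return unique[n_test:], unique[:n_test]

# Hypothetical sample data.
utterances = [
    "i need to take friday off",
    "how many vacation days do i have left",
    "approve leave for my team member",
    "tell me a joke",
    "i need to take friday off",  # duplicate, removed before splitting
]
train_set, test_set = split_utterances(utterances)
```

Deduplicating before the split is a design choice: it guarantees that no utterance can appear in both sets, at the cost of discarding repeat counts.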

Training data selection for example utterances

Select utterances for your training set based on the following criteria:

  • Real data is best:

    • Real data from client application: Select utterances that are real data from your client application. If the customer sends a web form with their inquiry today, and you're building a bot, you can start by using the web form data.
    • Crowd-sourced data: If you don't have any existing data, consider crowd-sourcing utterances. Try to crowd-source utterances from your actual user population for your scenario to get the best approximation of the real data your application will see. Crowd-sourced human utterances are better than computer-generated utterances. A data set of synthetic utterances generated from specific patterns lacks much of the natural variation found in utterances that people write, and it won't generalize well in production.
  • Data diversity:

    • Region diversity: Make sure the data for each intent is as diverse as possible, including phrasing (word choice) and grammar. If you are teaching an intent about HR policies on vacation days, make sure you have utterances that represent the terms used in all regions you're serving. For example, in Europe people might ask about taking a holiday and in the US people might ask about taking vacation days.
    • Language diversity: If you have users with various native languages who are communicating in a second language, make sure to have utterances that represent non-native speakers.
    • Input diversity: Consider your data input path. If you are collecting data from one person, department, or input device (microphone), you are likely missing diversity that will be important for your app to learn about all input paths.
    • Punctuation diversity: Consider that people use varying levels of punctuation in text applications, and make sure your data reflects that variety. If you're using data that comes from speech, it won't have any punctuation, so your data shouldn't either.
  • Data distribution: Make sure the data spread across intents represents the same spread of data your client application receives. If your LUIS app will classify utterances that are requests to schedule a leave (50%), but it will also see utterances inquiring about leave days left (20%), approving leaves (20%), and some out-of-scope utterances and chit chat (10%), then your data set should have the same percentages of each type of utterance.

  • Use all data forms: If your LUIS app will take data in multiple forms, make sure to include those forms in your training utterances. For example, if your client application takes both speech and typed text input, you need to have speech-to-text generated utterances as well as typed utterances. People speak differently from how they type, and speech recognition errors differ from typos. All of this variation should be represented in your training data.

  • Positive and negative examples: To teach a LUIS app, it must learn what the intent is (positive) and what it is not (negative). In LUIS, an utterance can be a positive example for only a single intent. When an utterance is added to an intent, LUIS automatically makes that same example utterance a negative example for all the other intents.

  • Data outside of application scope: If your application will see utterances that fall outside of your defined intents, make sure to provide those. The examples that aren't assigned to a particular defined intent will be labeled with the None intent. It's important to have realistic examples for the None intent to properly predict utterances that are outside the scope of the defined intents.

    For example, if you are creating an HR bot focused on leave time and you have three intents:

    • schedule or edit a leave
    • inquire about available leave days
    • approve/disapprove leave

    You want to make sure you have utterances that cover all three of those intents, and also utterances outside that scope that the application should still serve, such as:

    • What are my medical benefits?
    • Who is my HR rep?
    • tell me a joke
  • Rare examples: Your app will need to have rare examples as well as common examples. If your app has never seen rare examples, it won't be able to identify them in production. If you're using real data, you will be able to more accurately predict how your LUIS app will work in production.
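
The data distribution criterion in the list above can be checked mechanically. The sketch below, under stated assumptions, compares each intent's share of a labeled data set against the expected production share and reports the difference. The intent names, the 5% tolerance, and both function names are hypothetical; the expected percentages are the ones from the leave example above.

```python
from collections import Counter

def intent_distribution(labeled_utterances):
    """Return each intent's share of the data set as a fraction of 1."""
    counts = Counter(intent for _, intent in labeled_utterances)
    total = sum(counts.values())
    return {intent: count / total for intent, count in counts.items()}

def distribution_gaps(actual, expected, tolerance=0.05):
    """Return intents whose actual share differs from the expected share by more than tolerance."""
    gaps = {}
    for intent in set(actual) | set(expected):
        diff = actual.get(intent, 0.0) - expected.get(intent, 0.0)
        if abs(diff) > tolerance:
            gaps[intent] = round(diff, 2)  # positive: over-represented
    return gaps

# Expected shares from the leave example; intent names are hypothetical.
expected = {"ScheduleLeave": 0.5, "InquireLeaveDays": 0.2, "ApproveLeave": 0.2, "None": 0.1}

# Hypothetical, deliberately skewed sample with no None examples.
sample = [(f"i need day {i} off", "ScheduleLeave") for i in range(8)] + [
    ("how many days do i have left", "InquireLeaveDays"),
    ("approve the leave request", "ApproveLeave"),
]
gaps = distribution_gaps(intent_distribution(sample), expected)
```

Here `gaps` flags every intent: ScheduleLeave is over-represented, and the other three, including None, are under-represented.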

Quality instead of quantity

Consider the quality of your existing data before you add more data. With LUIS, you're using machine teaching. Your LUIS app uses the combination of your labels and the machine learning features you define. It doesn't simply rely on the quantity of labels to make the best prediction. What matters most is the diversity of the examples and how well they represent what your LUIS app will see in production.

Preprocessing data

The following preprocessing steps will help build a better LUIS app:

  • Remove duplicates: Duplicate utterances won't hurt, but they don't help either, so removing them will save labeling time.
  • Apply same client-app preprocessing: If your client application, which calls the LUIS prediction endpoint, applies data processing at runtime before sending the text to LUIS, you should train the LUIS app on data that is processed in the same way.
  • Don't apply new cleanup processes that the client app doesn't use: If your client app accepts speech-generated text directly without any cleanup, such as grammar or punctuation correction, your utterances need to reflect the same, including any missing punctuation and any other misrecognition you'll need to account for.
  • Don't clean up data: Don't get rid of malformed input that you might get from garbled speech recognition, accidental keypresses, or mistyped/misspelled text. If your app will see inputs like these, it's important for it to be trained and tested on them. Add a malformed input intent if you wouldn't expect your app to understand such input. Label this data to help your LUIS app predict the correct response at runtime. Your client application can then choose an appropriate response to unintelligible utterances, such as Please try again.
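
The preprocessing steps above can be sketched as follows. This is an illustration under assumptions, not a LUIS API: `prepare_training_utterances` is a hypothetical helper, and `client_preprocess` stands in for whatever processing your client application already applies at runtime. Only exact duplicates are removed; misspellings and missing punctuation are left untouched.

```python
def prepare_training_utterances(utterances, client_preprocess=None):
    """Apply the client's own preprocessing (if any), then drop exact duplicates."""
    if client_preprocess is not None:
        # Mirror only what the client app really does before calling LUIS.
        utterances = [client_preprocess(u) for u in utterances]
    # dict.fromkeys keeps the first occurrence of each string, in order.
    # No lowercasing, spell-fixing, or punctuation stripping is applied here.
    return list(dict.fromkeys(utterances))

# Hypothetical raw utterances.
raw = [
    "whats my leave balance",
    "What's my leave balance?",
    "whats my leave balance",  # exact duplicate, removed
    "aprove leav for jon",     # misspelled, kept on purpose
]
deduped = prepare_training_utterances(raw)
```

If the client trims whitespace before sending text, for example, passing `client_preprocess=str.strip` applies the same step to the training data.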

Labeling data

  • Label text as if it was correct: The example utterances should have all forms of an entity labeled. This includes text that is misspelled, mistyped, and mistranslated.
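
To illustrate labeling a misspelled form of an entity, the sketch below records the character span of the text exactly as the user typed it. The helper `label_entity`, the entity name `LeaveDate`, and the dictionary keys (with an inclusive end position) are assumptions chosen for illustration; check the label format your LUIS tooling expects before reusing them.

```python
def label_entity(text, entity_text, entity_name):
    """Return a label covering entity_text inside text, using inclusive character positions."""
    start = text.find(entity_text)
    if start == -1:
        raise ValueError(f"{entity_text!r} not found in {text!r}")
    return {
        "entity": entity_name,                    # hypothetical entity name
        "startPos": start,                        # index of the first character
        "endPos": start + len(entity_text) - 1,   # index of the last character
    }

# The misspelling "tommorow" is labeled as written, not corrected first.
utterance = "book leave for tommorow"
label = label_entity(utterance, "tommorow", "LeaveDate")
```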

Data review after LUIS app is in production

Review endpoint utterances to monitor real utterance traffic once you have deployed an app to production. This allows you to update your training utterances with real data, which will improve your app. Any app built with crowd-sourced or non-real scenario data will need to be improved based on its real use.

Test data selection for batch testing

All of the principles listed above for training utterances apply to utterances you should use for your test set. Ensure the distribution across intents and entities mirrors the real distribution as closely as possible.

Don't reuse utterances from your training set in your test set. Doing so biases your results and won't give you an accurate indication of how your LUIS app will perform in production.
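
One way to audit an existing test set for reused utterances is a simple membership check, sketched below. The function name and sample utterances are hypothetical, and the comparison is verbatim, so near-duplicates that differ in casing or punctuation are not detected.

```python
def find_leaked_utterances(train_utterances, test_utterances):
    """Return test utterances that also appear verbatim in the training set."""
    train_lookup = set(train_utterances)  # set gives constant-time membership checks
    return [u for u in test_utterances if u in train_lookup]

# Hypothetical training and test utterances.
train_batch = ["i need to take friday off", "how many vacation days do i have left"]
test_batch = ["can i take next week off", "how many vacation days do i have left"]
leaked = find_leaked_utterances(train_batch, test_batch)
```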

Once the first version of your app is published, you should update your test set with utterances from real traffic so that your test set reflects your production distribution and you can monitor realistic performance over time.

Next steps