Text split cognitive skill

The Text Split skill breaks text into chunks. You can specify whether to break the text into sentences or into pages of a particular length. This skill is especially useful when other skills downstream have maximum text length requirements.

Note

This skill is not bound to a Cognitive Services API, and you are not charged for using it. You should still attach a Cognitive Services resource, however, to override the Free resource option, which limits you to a small number of enrichments per day.

@odata.type

Microsoft.Skills.Text.SplitSkill

Skill parameters

Parameters are case-sensitive.

Parameter name Description
textSplitMode Either "pages" or "sentences".
maximumPageLength If textSplitMode is set to "pages", this is the maximum page length as measured by String.Length. The minimum value is 300. The algorithm tries to split the text into chunks of at most maximumPageLength characters, and it does its best to break the text on a sentence boundary, so a chunk may be slightly shorter than maximumPageLength.
defaultLanguageCode (Optional) One of the following language codes: da, de, en, es, fi, fr, it, ko, pt. The default is English (en). A few things to consider:
  • If you pass a languagecode-countrycode format, only the languagecode part of the format is used.
  • If the language is not in the previous list, the split skill breaks the text at character boundaries.
  • Providing a language code is useful to avoid cutting a word in half for non-whitespace languages such as Chinese, Japanese, and Korean.
  • If you do not know the language (for example, you need to split the text for input into the LanguageDetectionSkill), the default of English (en) should be sufficient.
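
The page-splitting behavior described above can be illustrated with a minimal Python sketch. This is not the skill's actual implementation, which is not documented here: the function names, the regular expression used to find sentence boundaries, and the greedy strategy are all assumptions made for illustration.

```python
import re


def normalize_language_code(code: str = "en") -> str:
    """Keep only the languagecode part of a languagecode-countrycode value."""
    return code.split("-")[0].lower()


def split_into_pages(text: str, maximum_page_length: int = 1000) -> list[str]:
    """Greedy split that prefers to end each page on a sentence boundary.

    Hypothetical helper: a page never exceeds maximum_page_length characters
    and may be slightly shorter when a sentence boundary is found.
    """
    if maximum_page_length < 300:
        raise ValueError("maximumPageLength must be at least 300")
    pages = []
    start = 0
    while start < len(text):
        end = min(start + maximum_page_length, len(text))
        if end < len(text):
            # Find the last sentence-ending punctuation followed by whitespace
            # inside the current window; otherwise cut at the window edge.
            window = text[start:end]
            boundaries = list(re.finditer(r"[.!?](?=\s)", window))
            if boundaries:
                end = start + boundaries[-1].end()
        pages.append(text[start:end].strip())
        start = end
    return [page for page in pages if page]
```

For example, `split_into_pages("A sentence here. " * 100, 300)` yields pages of at most 300 characters that each end with a period, and `normalize_language_code("en-US")` returns `"en"`.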

Skill inputs

Parameter name Description
text The text to split into substrings.
languageCode (Optional) Language code for the document. If you do not know the language (for example, you need to split the text for input into the LanguageDetectionSkill), it is safe to remove this input.

Skill outputs

Parameter name Description
textItems An array of the substrings that were extracted.

Sample definition

{
    "@odata.type": "#Microsoft.Skills.Text.SplitSkill",
    "textSplitMode" : "pages", 
    "maximumPageLength": 1000,
    "defaultLanguageCode": "en",
    "inputs": [
        {
            "name": "text",
            "source": "/document/content"
        },
        {
            "name": "languageCode",
            "source": "/document/language"
        }
    ],
    "outputs": [
        {
            "name": "textItems",
            "targetName": "mypages"
        }
    ]
}
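
Because parameters are case-sensitive and maximumPageLength has a minimum of 300, it can help to check a definition before submitting it. The following Python sketch is one way to do that. The function `validate_split_skill` and its messages are hypothetical and not part of any SDK, and treating "text" as a required input is an assumption based on languageCode being the only input marked optional.

```python
KNOWN_PARAMETERS = ("textSplitMode", "maximumPageLength", "defaultLanguageCode")


def validate_split_skill(skill: dict) -> list[str]:
    """Return a list of problems found in a Text Split skill definition."""
    problems = []
    if skill.get("@odata.type") != "#Microsoft.Skills.Text.SplitSkill":
        problems.append("unexpected @odata.type")

    # Parameters are case-sensitive, so flag keys that differ only by case.
    for key in skill:
        for known in KNOWN_PARAMETERS:
            if key != known and key.lower() == known.lower():
                problems.append(f'"{key}" should be spelled "{known}"')

    if skill.get("textSplitMode") not in ("pages", "sentences"):
        problems.append('textSplitMode must be "pages" or "sentences"')

    # Only check the length when it is present in the definition.
    length = skill.get("maximumPageLength")
    if length is not None and (not isinstance(length, int) or length < 300):
        problems.append("maximumPageLength must be an integer of at least 300")

    # Assumption: "text" is the one required input.
    input_names = {item.get("name") for item in skill.get("inputs", [])}
    if "text" not in input_names:
        problems.append('missing "text" input')
    return problems
```

An empty list means the sketch found nothing to flag. It does not guarantee that the service accepts the definition.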

Sample input

{
    "values": [
        {
            "recordId": "1",
            "data": {
                "text": "This is the loan application for Joe Romero, a Microsoft employee who was born in Chile and who then moved to Australia…",
                "languageCode": "en"
            }
        },
        {
            "recordId": "2",
            "data": {
                "text": "This is the second document, which will be broken into several pages...",
                "languageCode": "en"
            }
        }
    ]
}

Sample output

{
    "values": [
        {
            "recordId": "1",
            "data": {
                "textItems": [
                    "This is the loan…",
                    "On the second page we…"
                ]
            }
        },
        {
            "recordId": "2",
            "data": {
                "textItems": [
                    "This is the second document...",
                    "On the second page of the second doc…"
                ]
            }
        }
    ]
}
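
A downstream consumer typically needs the pages grouped by record. The Python sketch below shows one way to read a response shaped like the sample output. The helper name `pages_by_record` is hypothetical, and the dictionary simply mirrors the sample above.

```python
# A response shaped like the sample output above.
sample_output = {
    "values": [
        {
            "recordId": "1",
            "data": {"textItems": ["This is the loan…", "On the second page we…"]},
        },
        {
            "recordId": "2",
            "data": {
                "textItems": [
                    "This is the second document...",
                    "On the second page of the second doc…",
                ]
            },
        },
    ]
}


def pages_by_record(response: dict) -> dict[str, list[str]]:
    """Map each recordId to its list of extracted substrings."""
    return {
        value["recordId"]: value["data"]["textItems"]
        for value in response["values"]
    }


for record_id, pages in pages_by_record(sample_output).items():
    print(record_id, len(pages))
```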

Error cases

If a language is not supported, a warning is generated and the text is split at character boundaries.

See also