PII Detection cognitive skill

Important

This skill is currently in public preview. Preview functionality is provided without a service level agreement and is not recommended for production workloads. There is currently no portal or .NET SDK support.

The PII Detection skill extracts personally identifiable information (PII) from an input text and gives you the option to mask it in that text in various ways. This skill uses the machine learning models provided by Text Analytics in Cognitive Services.

Note

As you expand scope by increasing the frequency of processing, adding more documents, or adding more AI algorithms, you will need to attach a billable Cognitive Services resource. Charges accrue when calling APIs in Cognitive Services, and for image extraction as part of the document-cracking stage in Azure Cognitive Search. There are no charges for text extraction from documents.

Execution of built-in skills is charged at the existing Cognitive Services pay-as-you-go price. Image extraction pricing is described on the Azure Cognitive Search pricing page.

@odata.type

Microsoft.Skills.Text.PIIDetectionSkill

Data limits

The maximum size of a record should be 50,000 characters as measured by String.Length. If you need to break up your data before sending it to the skill, consider using the Text Split skill.
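As a rough illustration of the limit, the sketch below measures a string the way .NET's String.Length does, in UTF-16 code units, which can differ from Python's len() for characters outside the Basic Multilingual Plane. The helper names and the naive fixed-size split are assumptions made for illustration only; inside a skillset, the Text Split skill is the intended way to break up text.

```python
MAX_CHARS = 50_000  # record limit stated above, in String.Length units


def utf16_length(text: str) -> int:
    """Length in UTF-16 code units, which is what .NET String.Length reports."""
    return len(text.encode("utf-16-le")) // 2


def split_for_skill(text: str, limit: int = MAX_CHARS) -> list[str]:
    """Naively cut text into chunks whose UTF-16 length is at most `limit`.

    Hypothetical helper: it ignores sentence and word boundaries.
    """
    chunks, current, size = [], [], 0
    for ch in text:
        units = utf16_length(ch)  # 1 for BMP characters, 2 otherwise
        if current and size + units > limit:
            chunks.append("".join(current))
            current, size = [], 0
        current.append(ch)
        size += units
    if current:
        chunks.append("".join(current))
    return chunks
```

Splitting on code units rather than Python characters keeps each chunk under the limit even when the text contains emoji or other supplementary characters.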

Skill parameters

Parameters are case-sensitive and all are optional.

Parameter name    Description
defaultLanguageCode    Language code of the input text. For now, only en is supported.
minimumPrecision    A value between 0.0 and 1.0. If the confidence score (in the piiEntities output) is lower than the set minimumPrecision value, the entity is not returned or masked. The default is 0.0.
maskingMode    A parameter that provides various ways to mask the detected PII in the input text. The following options are supported:
  • none (default): No masking is performed and the maskedText output is not returned.
  • redact: Removes the detected entities from the input text and does not replace them with anything. Note that in this case, the offset in the piiEntities output is relative to the original text, not the masked text.
  • replace: Replaces the detected entities with the character given in the maskingCharacter parameter. The character is repeated to the length of the detected entity so that the offsets correctly correspond to both the input text and the output maskedText.
maskingCharacter    The character used to mask the text if the maskingMode parameter is set to replace. The following options are supported: * (default), #, X. This parameter can only be null if maskingMode is not set to replace.
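To make the interaction between minimumPrecision, maskingMode, and maskingCharacter concrete, here is a minimal local sketch of the behavior described above. It is not the service's implementation: the function name apply_masking and its Python-style argument names are hypothetical, and it assumes the detected entities do not overlap.

```python
def apply_masking(text, entities, masking_mode="none",
                  masking_character="*", minimum_precision=0.0):
    """Return the masked string, or None when masking_mode is 'none'.

    Illustrative only; mirrors the documented parameter semantics.
    """
    if masking_mode == "none":
        return None  # no maskedText output in this mode
    if masking_mode not in ("redact", "replace"):
        raise ValueError(f"unsupported maskingMode: {masking_mode!r}")
    if masking_mode == "replace" and masking_character not in ("*", "#", "X"):
        raise ValueError(f"unsupported maskingCharacter: {masking_character!r}")

    # Entities scoring below the threshold are neither returned nor masked.
    kept = [e for e in entities if e["score"] >= minimum_precision]

    result = text
    # Edit from right to left so earlier offsets stay valid.
    for e in sorted(kept, key=lambda e: e["offset"], reverse=True):
        start, end = e["offset"], e["offset"] + e["length"]
        if masking_mode == "replace":
            filler = masking_character * e["length"]  # same length as the entity
        else:
            filler = ""  # redact: remove without replacement
        result = result[:start] + filler + result[end:]
    return result
```

With replace, the output has the same length as the input, so entity offsets apply to both strings. With redact, the output is shorter, which is why the offsets only describe the original text.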

Skill inputs

Input name    Description
languageCode    Optional. Default is en.
text    The text to analyze.

Skill outputs

Output name    Description
piiEntities    An array of complex types that contains the following fields:
  • text (the actual PII as extracted)
  • type
  • subType
  • score (a higher value means it's more likely to be a real entity)
  • offset (into the input text)
  • length

Possible types and subTypes can be found here.
maskedText    If maskingMode is set to a value other than none, this output is the string that results from masking the input text as described by the selected maskingMode. If maskingMode is set to none, this output is not present.

Sample definition

  {
    "@odata.type": "#Microsoft.Skills.Text.PIIDetectionSkill",
    "defaultLanguageCode": "en",
    "minimumPrecision": 0.5,
    "maskingMode": "replace",
    "maskingCharacter": "*",
    "inputs": [
      {
        "name": "text",
        "source": "/document/content"
      }
    ],
    "outputs": [
      {
        "name": "piiEntities"
      },
      {
        "name": "maskedText"
      }
    ]
  }

Sample input

{
    "values": [
      {
        "recordId": "1",
        "data":
           {
             "text": "Microsoft employee with ssn 859-98-0987 is using our awesome API's."
           }
      }
    ]
}

Sample output

{
  "values": [
    {
      "recordId": "1",
      "data" : 
      {
        "piiEntities":[ 
           { 
              "text":"859-98-0987",
              "type":"U.S. Social Security Number (SSN)",
              "subtype":"",
              "offset":28,
              "length":11,
              "score":0.65
           }
        ],
        "maskedText": "Microsoft employee with ssn *********** is using our awesome API's."
      }
    }
  ]
}
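The offset and length fields in the sample output can be checked against the sample input with a few lines of code. The sketch below is only an illustration: the helper name check_entity_spans is hypothetical, and Python string indexing matches the reported offsets here because the sample text is plain ASCII.

```python
sample_input_text = (
    "Microsoft employee with ssn 859-98-0987 is using our awesome API's."
)

sample_output = {
    "values": [
        {
            "recordId": "1",
            "data": {
                "piiEntities": [
                    {
                        "text": "859-98-0987",
                        "type": "U.S. Social Security Number (SSN)",
                        "subtype": "",
                        "offset": 28,
                        "length": 11,
                        "score": 0.65,
                    }
                ],
                # Eleven masking characters, one per character of the entity.
                "maskedText": "Microsoft employee with ssn " + "*" * 11
                              + " is using our awesome API's.",
            },
        }
    ]
}


def check_entity_spans(input_text, entities):
    """Return the entities whose offset/length do not reproduce their text."""
    return [
        e for e in entities
        if input_text[e["offset"]:e["offset"] + e["length"]] != e["text"]
    ]


data = sample_output["values"][0]["data"]
mismatches = check_entity_spans(sample_input_text, data["piiEntities"])
same_length = len(data["maskedText"]) == len(sample_input_text)
```

Because maskingMode is replace in the sample definition, the masked text has the same length as the input and the same offsets locate the masked span in both strings.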

Error and warning cases

If the language code for the document is unsupported, a warning is returned and no entities are extracted. If your text is empty, a warning is produced. If your text is larger than 50,000 characters, only the first 50,000 characters are analyzed and a warning is issued.

If the skill returns a warning, the output maskedText may be empty. This means that if you expect that output to exist as input to later skills, it will not work as intended. Keep this in mind when writing your skillset definition.
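One way to account for this in code that consumes the skill's output is to treat a missing or empty maskedText as "no usable result" rather than passing it on. The sketch below assumes the record shape shown in the sample output; the function names are hypothetical and the routing policy (skip the record) is just one possible choice.

```python
def masked_text_or_none(record):
    """Return maskedText from one output record, or None if missing or empty."""
    data = record.get("data") or {}
    masked = data.get("maskedText")
    return masked if masked else None


def route_records(response):
    """Split records into usable masked text and record IDs to skip."""
    usable, skipped = {}, []
    for record in response.get("values", []):
        masked = masked_text_or_none(record)
        if masked is None:
            # Do not fall back to the unmasked input: it may contain PII.
            skipped.append(record.get("recordId"))
        else:
            usable[record.get("recordId")] = masked
    return usable, skipped
```

Skipping the record is the conservative option, since substituting the original text would send unmasked PII to later stages.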

See also