How to detect and redact personally identifiable information (PII) in conversations

Article
2024/12/20

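The conversational PII feature evaluates conversations to extract sensitive information (PII) across several predefined categories and redact it. The API operates on both transcribed text (referred to as transcripts) and chats. For transcripts, the API also returns audio timing information for the audio segments that contain PII, so that those segments can be redacted.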

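Determine how to process the data (optional)

Specify the PII detection model

By default, this feature uses the latest available AI model on your input. You can also configure your API requests to use a specific model version.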
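As a minimal sketch of pinning a model version, the following Python helper builds one entry for the "tasks" array of a request. The field names (taskName, kind, parameters, modelVersion) and the version string come from the request examples later in this article; the function name build_pii_task is hypothetical, and omitting modelVersion to get the default model is an assumption based on the statement above.

```python
import json


def build_pii_task(model_version=None, task_name="analyze 1"):
    """Build one ConversationalPIITask entry for the "tasks" array."""
    parameters = {}
    if model_version is not None:
        # Pin a specific model version; leave the field out to let the
        # service pick its default (assumption, see lead-in).
        parameters["modelVersion"] = model_version
    return {
        "taskName": task_name,
        "kind": "ConversationalPIITask",
        "parameters": parameters,
    }


print(json.dumps(build_pii_task("2023-04-15-preview"), indent=2))
```

Leaving the version out of the payload, rather than sending an empty value, keeps the request valid whichever default the service applies.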

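Language support

See the PII language support page for details. Currently, the conversational PII GA model supports only English. The preview model and API support the same list of languages as the other Language service features.

Region support

The conversational PII API is available in all Azure regions supported by the Language service.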

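Submitting data

Note

See the Language Studio article for information on formatting conversational text to submit using Language Studio.

You can submit the input to the API as a list of conversation items. Analysis is performed upon receipt of the request. Because the API is asynchronous, there might be a delay between sending an API request and receiving the results. For information on the size and number of requests you can send per minute and per second, see the data limits below.

When you use the asynchronous feature, the API results are available for 24 hours from the time the request was ingested, and this is indicated in the response. After this time period, the results are purged and can no longer be retrieved.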

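When you submit data to conversational PII, you can send one conversation (chat or spoken) per request.

The API attempts to detect all the defined entity categories for a given conversation input. To specify which entities are detected and returned, use the optional piiCategories parameter with the appropriate entity categories.

For spoken transcripts, the detected entities are returned for the redactionSource parameter value you provide. Currently, the supported values for redactionSource are text, lexical, itn, and maskedItn (which map to the Speech to text REST API's display\displayText, lexical, itn, and maskedItn formats, respectively). For spoken transcript input, the API also provides audio timing information to support audio redaction. To use the audioRedaction feature, set the optional includeAudioRedaction flag to true. Audio redaction is performed based on the lexical input format.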
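The optional parameters described above can be assembled with a small helper. This is a sketch under stated assumptions, not the service's client library: the function name build_pii_parameters is hypothetical, the parameter names and the four redactionSource values come from this article, and the "all" category value is taken from the transcript request example later in this article.

```python
ALLOWED_REDACTION_SOURCES = ("text", "lexical", "itn", "maskedItn")


def build_pii_parameters(pii_categories=None, redaction_source=None,
                         include_audio_redaction=False):
    """Build the "parameters" object of a ConversationalPIITask."""
    params = {}
    if pii_categories:
        # Restrict detection to the listed entity categories.
        params["piiCategories"] = list(pii_categories)
    if redaction_source is not None:
        # Reject values the article does not list as supported.
        if redaction_source not in ALLOWED_REDACTION_SOURCES:
            raise ValueError("unsupported redactionSource: " + repr(redaction_source))
        params["redactionSource"] = redaction_source
    if include_audio_redaction:
        # Ask for audio timing information for spoken transcripts.
        params["includeAudioRedaction"] = True
    return params


print(build_pii_parameters(["all"], "lexical", True))
```

Validating redactionSource locally catches a misspelled value before the asynchronous job is submitted.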

Note

Conversational PII now supports a document size of up to 40,000 characters.
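A local pre-check can flag oversized input before you submit it. The sketch below assumes that the limit applies to the combined length of the text fields of one conversation and that characters are counted the way Python counts them in a str; the article doesn't state either detail, so treat the result as an estimate. The function names are hypothetical.

```python
MAX_CHARACTERS = 40_000  # document size limit stated in the note above


def conversation_length(conversation):
    """Sum the length of the "text" field of every conversation item."""
    return sum(len(item.get("text", "")) for item in conversation["conversationItems"])


def fits_size_limit(conversation, limit=MAX_CHARACTERS):
    """Return True if the conversation is within the assumed size limit."""
    return conversation_length(conversation) <= limit


conversation = {
    "conversationItems": [
        {"participantId": "agent_1", "id": "1", "text": "Good morning."},
        {"participantId": "agent_1", "id": "2", "text": "Can I have your name?"},
        {"participantId": "customer_1", "id": "3", "text": "Sure that is John Doe."},
    ]
}
print(conversation_length(conversation), fits_size_limit(conversation))
```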

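Getting PII results

When you get results from PII detection, you can stream them to an application or save the output to a file on the local system. The API response includes the recognized entities, with their categories and subcategories, and confidence scores. The text string with the PII entities redacted is also returned.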
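The following sketch shows one way to pull the redacted text and the entities out of a single conversation item of a result. The article doesn't show a response body, so the field names used here (redactedContent, entities, category, subcategory, confidenceScore, offset, length) are assumptions about the response shape, and the sample item is written by hand for illustration only.

```python
def summarize_item(item):
    """Return (redacted_text, entities) for one conversation item of a result."""
    redacted_text = item.get("redactedContent", {}).get("text")
    entities = [
        {
            "text": e.get("text"),
            "category": e.get("category"),
            "subcategory": e.get("subcategory"),
            "confidence": e.get("confidenceScore"),
        }
        for e in item.get("entities", [])
    ]
    return redacted_text, entities


# Hypothetical result item; not actual service output.
sample_item = {
    "id": "3",
    "redactedContent": {"text": "Sure that is ********."},
    "entities": [
        {"text": "John Doe", "category": "Name", "offset": 13,
         "length": 8, "confidenceScore": 0.9},
    ],
}

redacted, found = summarize_item(sample_item)
print(redacted)
for entity in found:
    print(entity["category"], entity["text"], entity["confidence"])
```

Using dict.get throughout means a missing field yields None instead of raising, which is useful while the exact response shape is unconfirmed.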

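1. In the Azure portal, go to your resource's overview page.
2. In the left menu, select "Keys and Endpoint". You need one of the keys and the endpoint to authenticate your API requests.
3. Download and install the client library package for your language of choice:

Language	Package version
.NET	1.0.0
Python	1.0.0

4. For more information on the client and the returned objects, see the following reference documentation:
- C#
- Python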

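Redaction policy (version 2024-11-15-preview only)

In version 2024-11-15-preview, you can define the redactionPolicy parameter to set the redaction policy that is used when the document is redacted in the response. The policy field supports three policy types:

noMask
characterMask (default)
entityMask

The noMask policy returns the response without the redactedText field.

The characterMask policy masks the redactedText with a character, preserving the length and offsets of the original text. This is the existing behavior.

There's also an optional field named redactionCharacter, where you can specify the character to use for redaction when you use the characterMask policy.

The entityMask policy masks the detected PII entity text with the detected entity type.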
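To make the difference between the three policies concrete, the sketch below applies them locally to a string, given entity spans. It only illustrates the behavior described above and isn't the service's implementation: the function name redact is hypothetical, the bracketed "[Name]" placeholder format for entityMask is an assumption, and overlapping entities aren't handled.

```python
def redact(text, entities, policy_kind="characterMask", redaction_character="*"):
    """Apply a redaction policy to text.

    entities is a list of dicts with "offset", "length", and "category".
    """
    if policy_kind == "noMask":
        return None  # no redacted text is produced
    if policy_kind not in ("characterMask", "entityMask"):
        raise ValueError("unknown policyKind: " + repr(policy_kind))
    pieces = []
    cursor = 0
    for entity in sorted(entities, key=lambda e: e["offset"]):
        start = entity["offset"]
        end = start + entity["length"]
        pieces.append(text[cursor:start])
        if policy_kind == "characterMask":
            # Same length as the original span, so offsets are preserved.
            pieces.append(redaction_character * entity["length"])
        else:
            # Placeholder format is an assumption for this sketch.
            pieces.append("[" + entity["category"] + "]")
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


sentence = "Sure that is John Doe."
spans = [{"offset": 13, "length": 8, "category": "Name"}]
print(redact(sentence, spans))                 # Sure that is ********.
print(redact(sentence, spans, "entityMask"))   # Sure that is [Name].
print(redact(sentence, spans, "noMask"))       # None
```

Note that only characterMask keeps the output the same length as the input; with entityMask, offsets into the original text no longer line up with the redacted text.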

Use the following example if you want to change the redaction policy.

curl -i -X POST "https://your-language-endpoint-here/language/analyze-conversations/jobs?api-version=2024-11-15-preview" \
-H "Content-Type: application/json" \
-H "Ocp-Apim-Subscription-Key: your-key-here" \
-d \
'
{ 
    "displayName": "Analyze conversations from xxx", 
    "analysisInput": { 
        "conversations": [ 
            { 
                "id": "23611680-c4eb-4705-adef-4aa1c17507b5", 
                "language": "en", 
                "modality": "text", 
                "conversationItems": [ 
                    { 
                        "participantId": "agent_1", 
                        "id": "1", 
                        "text": "Good morning." 
                    }, 
                    { 
                        "participantId": "agent_1", 
                        "id": "2", 
                        "text": "Can I have your name?" 
                    }, 
                    { 
                        "participantId": "customer_1", 
                        "id": "3", 
                        "text": "Sure that is John Doe." 
                    } 
                ] 
            } 
        ] 
    }, 
    "tasks": [ 
        { 
            "taskName": "analyze 1", 
            "kind": "ConversationalPIITask", 
            "parameters": { 
                "modelVersion": "2023-04-15-preview", 
                "redactionPolicy": {
                    "policyKind": "characterMask",
                    "redactionCharacter": "*"
                } 
            } 
        } 
    ] 
} 
'

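Submit transcripts using speech to text

Use the following example if you have conversations transcribed using the Speech service's speech to text feature: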

curl -i -X POST "https://your-language-endpoint-here/language/analyze-conversations/jobs?api-version=2024-05-01" \
-H "Content-Type: application/json" \
-H "Ocp-Apim-Subscription-Key: your-key-here" \
-d \
' 
{
    "displayName": "Analyze conversations from xxx",
    "analysisInput": {
        "conversations": [
            {
                "id": "23611680-c4eb-4705-adef-4aa1c17507b5",
                "language": "en",
                "modality": "transcript",
                "conversationItems": [
                    {
                        "participantId": "agent_1",
                        "id": "8074caf7-97e8-4492-ace3-d284821adacd",
                        "text": "Good morning.",
                        "lexical": "good morning",
                        "itn": "good morning",
                        "maskedItn": "good morning",
                        "audioTimings": [
                            {
                                "word": "good",
                                "offset": 11700000,
                                "duration": 2100000
                            },
                            {
                                "word": "morning",
                                "offset": 13900000,
                                "duration": 3100000
                            }
                        ]
                    },
                    {
                        "participantId": "agent_1",
                        "id": "0d67d52b-693f-4e34-9881-754a14eec887",
                        "text": "Can I have your name?",
                        "lexical": "can i have your name",
                        "itn": "can i have your name",
                        "maskedItn": "can i have your name",
                        "audioTimings": [
                            {
                                "word": "can",
                                "offset": 44200000,
                                "duration": 2200000
                            },
                            {
                                "word": "i",
                                "offset": 46500000,
                                "duration": 800000
                            },
                            {
                                "word": "have",
                                "offset": 47400000,
                                "duration": 1500000
                            },
                            {
                                "word": "your",
                                "offset": 49000000,
                                "duration": 1500000
                            },
                            {
                                "word": "name",
                                "offset": 50600000,
                                "duration": 2100000
                            }
                        ]
                    },
                    {
                        "participantId": "customer_1",
                        "id": "08684a7a-5433-4658-a3f1-c6114fcfed51",
                        "text": "Sure that is John Doe.",
                        "lexical": "sure that is john doe",
                        "itn": "sure that is john doe",
                        "maskedItn": "sure that is john doe",
                        "audioTimings": [
                            {
                                "word": "sure",
                                "offset": 5400000,
                                "duration": 6300000
                            },
                            {
                                "word": "that",
                                "offset": 13600000,
                                "duration": 2300000
                            },
                            {
                                "word": "is",
                                "offset": 16000000,
                                "duration": 1300000
                            },
                            {
                                "word": "john",
                                "offset": 17400000,
                                "duration": 2500000
                            },
                            {
                                "word": "doe",
                                "offset": 20000000,
                                "duration": 2700000
                            }
                        ]
                    }
                ]
            }
        ]
    },
    "tasks": [
        {
            "taskName": "analyze 1",
            "kind": "ConversationalPIITask",
            "parameters": {
                "modelVersion": "2023-04-15-preview",
                "redactionSource": "text",
                "includeAudioRedaction": true,
                "piiCategories": [
                    "all"
                ]
            }
        }
    ]
}
'
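The audioTimings values in the request above can be turned into time windows for silencing audio. The sketch below assumes that offset and duration are expressed in 100-nanosecond units, as in Speech service output; the article doesn't state the unit. The function name redaction_windows is hypothetical, and matching words by exact lexical form is a simplification for illustration. It works on the request's own timings, not on a service response.

```python
TICKS_PER_SECOND = 10_000_000  # assumption: offsets are in 100-nanosecond units


def redaction_windows(audio_timings, redacted_words):
    """Return (start_seconds, end_seconds) for each word that must be silenced."""
    windows = []
    for timing in audio_timings:
        if timing["word"] in redacted_words:
            start = timing["offset"] / TICKS_PER_SECOND
            end = (timing["offset"] + timing["duration"]) / TICKS_PER_SECOND
            windows.append((start, end))
    return windows


# Timings copied from the third conversation item of the request above.
timings = [
    {"word": "sure", "offset": 5400000, "duration": 6300000},
    {"word": "that", "offset": 13600000, "duration": 2300000},
    {"word": "is", "offset": 16000000, "duration": 1300000},
    {"word": "john", "offset": 17400000, "duration": 2500000},
    {"word": "doe", "offset": 20000000, "duration": 2700000},
]
print(redaction_windows(timings, {"john", "doe"}))
```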

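Submit text chats

Use the following example if you have conversations that originated in text, for example, conversations through a text-based chat client.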

curl -i -X POST "https://your-language-endpoint-here/language/analyze-conversations/jobs?api-version=2024-05-01" \
-H "Content-Type: application/json" \
-H "Ocp-Apim-Subscription-Key: your-key-here" \
-d \
' 
{
    "displayName": "Analyze conversations from xxx",
    "analysisInput": {
        "conversations": [
            {
                "id": "23611680-c4eb-4705-adef-4aa1c17507b5",
                "language": "en",
                "modality": "text",
                "conversationItems": [
                    {
                        "participantId": "agent_1",
                        "id": "8074caf7-97e8-4492-ace3-d284821adacd",
                        "text": "Good morning."
                    },
                    {
                        "participantId": "agent_1",
                        "id": "0d67d52b-693f-4e34-9881-754a14eec887",
                        "text": "Can I have your name?"
                    },
                    {
                        "participantId": "customer_1",
                        "id": "08684a7a-5433-4658-a3f1-c6114fcfed51",
                        "text": "Sure that is John Doe."
                    }
                ]
            }
        ]
    },
    "tasks": [
        {
            "taskName": "analyze 1",
            "kind": "ConversationalPIITask",
            "parameters": {
                "modelVersion": "2023-04-15-preview"
            }
        }
    ]
}
'
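If you prefer to build the same request in Python without a client library, a sketch using only the standard library might look like the following. The URL path, headers, and body fields mirror the cURL example above; the helper names build_chat_body and build_submit_request are hypothetical, and the sequential item IDs are a simplification. The code only constructs the request and doesn't send it.

```python
import json
import urllib.request

API_VERSION = "2024-05-01"  # same version as the cURL example above


def build_chat_body(conversation_id, turns, language="en"):
    """Build the request body from (participantId, text) pairs."""
    items = [
        {"participantId": participant, "id": str(index), "text": text}
        for index, (participant, text) in enumerate(turns, start=1)
    ]
    return {
        "displayName": "Analyze conversations from xxx",
        "analysisInput": {
            "conversations": [
                {
                    "id": conversation_id,
                    "language": language,
                    "modality": "text",
                    "conversationItems": items,
                }
            ]
        },
        "tasks": [
            {
                "taskName": "analyze 1",
                "kind": "ConversationalPIITask",
                "parameters": {"modelVersion": "2023-04-15-preview"},
            }
        ],
    }


def build_submit_request(endpoint, key, body):
    """Create (but do not send) the POST request for an analysis job."""
    url = (endpoint.rstrip("/")
           + "/language/analyze-conversations/jobs?api-version=" + API_VERSION)
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Ocp-Apim-Subscription-Key": key,
        },
    )


body = build_chat_body(
    "23611680-c4eb-4705-adef-4aa1c17507b5",
    [("agent_1", "Good morning."), ("customer_1", "Sure that is John Doe.")],
)
request = build_submit_request("https://your-language-endpoint-here", "your-key-here", body)
print(request.get_method(), request.full_url)
```

To send the request, pass it to urllib.request.urlopen and read the operation-location header from the response.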

Get the result

Get the operation-location from the response header. The value looks similar to the following URL:

https://your-language-endpoint/language/analyze-conversations/jobs/12345678-1234-1234-1234-12345678
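The job ID is the last path segment of that URL. As a small sketch (the function name is hypothetical), it can be extracted with the standard library, which also ignores any query string the header value might carry:

```python
from urllib.parse import urlparse


def job_id_from_operation_location(url):
    """Return the last path segment of an operation-location URL."""
    path = urlparse(url).path.rstrip("/")
    return path.rsplit("/", 1)[-1]


location = ("https://your-language-endpoint/language/analyze-conversations/"
            "jobs/12345678-1234-1234-1234-12345678")
print(job_id_from_operation_location(location))
```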

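To get the results of the request, use the following cURL command. Be sure to replace my-job-id with the job ID value you received in the previous operation-location response header: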

curl -X GET "https://your-language-endpoint/language/analyze-conversations/jobs/my-job-id" \
-H "Content-Type: application/json" \
-H "Ocp-Apim-Subscription-Key: your-key-here"
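Because the API is asynchronous, the job might not be finished when you first request the result. The sketch below polls until the job reaches a final state. The status values ("succeeded", "failed", "cancelled") and the top-level "status" field are assumptions about the response body, which this article doesn't show; the function name wait_for_job and the injected fetch_status callable are hypothetical, so the loop can be tried without network access.

```python
import time

FINAL_STATES = ("succeeded", "failed", "cancelled")  # assumed status values


def wait_for_job(fetch_status, poll_seconds=2.0, max_attempts=30, sleep=time.sleep):
    """Call fetch_status() until the job reports a final state.

    fetch_status must return the parsed JSON body of the GET request as a dict.
    """
    for _ in range(max_attempts):
        job = fetch_status()
        if job.get("status") in FINAL_STATES:
            return job
        sleep(poll_seconds)
    raise TimeoutError("job did not reach a final state in time")


# Stand-in for the GET request, so the example runs offline.
fake_responses = iter([{"status": "running"}, {"status": "succeeded"}])
result = wait_for_job(lambda: next(fake_responses), sleep=lambda seconds: None)
print(result["status"])
```

In real use, fetch_status would send the GET request shown above with the Ocp-Apim-Subscription-Key header and parse the JSON body. Results remain available for 24 hours after the request was ingested, so polling after that window can't succeed.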

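Service and data limits

For information on the size and number of requests you can send per minute and per second, see the service limits article.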
