
Prompt-based LLM scorers

judges.custom_prompt_judge() lets you quickly and easily build an LLM scorer when you need full control over the judge's prompt or need to return multiple output values beyond "pass"/"fail" (for example, "great", "ok", "bad").

You provide a prompt template with placeholders for specific fields from your app's trace, and you define the output choices the judge can select. The Databricks-hosted LLM judge model uses these inputs to select the best output choice and provides a rationale for its selection.
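
For illustration, here is a minimal sketch of calling judges.custom_prompt_judge() directly with the "great"/"ok"/"bad" choices mentioned above. The prompt text, the {{request}} and {{response}} placeholder names, and the sample values are illustrative only, and the sketch assumes your environment is configured for the Databricks-hosted judge model as in Step 1 below; the call pattern mirrors the scorer built later in this guide.

from mlflow.genai.judges import custom_prompt_judge

# Illustrative prompt template: {{request}} and {{response}} are placeholders
# that are filled in from the keyword arguments passed when calling the judge.
answer_quality_prompt = """
Assess the quality of the agent's answer to the user's request.

You must choose one of the following categories.

[[great]]: The answer is accurate, complete, and clearly written.
[[ok]]: The answer is usable but incomplete or only partially clear.
[[bad]]: The answer is wrong, irrelevant, or confusing.

Request: {{request}}
Answer: {{response}}
"""

answer_quality_judge = custom_prompt_judge(
    name="answer_quality",
    prompt_template=answer_quality_prompt,
    # Map each category to a numeric score so results can be aggregated
    numeric_values={"great": 1.0, "ok": 0.5, "bad": 0.0},
)

# The judge returns an mlflow.entities.Feedback with a value and a rationale
feedback = answer_quality_judge(
    request="How do I reset my password?",
    response="Open Settings > Security > Reset password and follow the prompts.",
)
print(feedback.value, feedback.rationale)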

Note

Databricks recommends starting with guidelines-based judges and using prompt-based judges only if you need more control or cannot write your evaluation criteria as pass/fail guidelines. Guidelines-based judges have the distinct advantage of being easy to explain to business stakeholders, and they can often be written directly by domain experts.

How to create prompt-based judge scorers

Follow this guide to create scorers that wrap judges.custom_prompt_judge().

In this guide, you create custom scorers that wrap the judges.custom_prompt_judge() API and run an offline evaluation with the resulting scorers. These same scorers can be scheduled to run in production to continuously monitor your application's quality.

Note

For more details on the interface and parameters, see the judges.custom_prompt_judge() concept page.

Step 1: Create a sample app to evaluate

First, create a sample GenAI app that responds to customer support questions. The app has a (fake) knob that controls the system prompt, so you can easily compare the judge's outputs between "good" and "bad" conversations.

  1. Initialize an OpenAI client to connect to either Databricks-hosted LLMs or LLMs hosted by OpenAI.

    Databricks-hosted LLMs

    Use MLflow to get an OpenAI client that connects to Databricks-hosted LLMs. Select a model from the available foundation models.

    import mlflow
    from databricks.sdk import WorkspaceClient
    
    # Enable MLflow's autologging to instrument your application with Tracing
    mlflow.openai.autolog()
    
    # Set up MLflow tracking to Databricks
    mlflow.set_tracking_uri("databricks")
    mlflow.set_experiment("/Shared/docs-demo")
    
    # Create an OpenAI client that is connected to Databricks-hosted LLMs
    w = WorkspaceClient()
    client = w.serving_endpoints.get_open_ai_client()
    
    # Select an LLM
    model_name = "databricks-claude-sonnet-4"
    

    OpenAI-hosted LLMs

    Use the native OpenAI SDK to connect to models hosted by OpenAI. Select a model from the available OpenAI models.

    import mlflow
    import os
    import openai
    
    # Ensure your OPENAI_API_KEY is set in your environment
    # os.environ["OPENAI_API_KEY"] = "<YOUR_API_KEY>" # Uncomment and set if not globally configured
    
    # Enable auto-tracing for OpenAI
    mlflow.openai.autolog()
    
    # Set up MLflow tracking to Databricks
    mlflow.set_tracking_uri("databricks")
    mlflow.set_experiment("/Shared/docs-demo")
    
    # Create an OpenAI client connected to OpenAI SDKs
    client = openai.OpenAI()
    
    # Select an LLM
    model_name = "gpt-4o-mini"
    
  2. Define the customer support app (an optional smoke test sketch follows these steps):

    from mlflow.entities import Document
    from typing import List, Dict, Any, cast
    
    # This is a global variable that is used to toggle the behavior of the customer support agent to see how the judge handles the issue resolution status
    RESOLVE_ISSUES = False
    
    @mlflow.trace
    def customer_support_agent(messages: List[Dict[str, str]]):
    
        # 1. Prepare messages for the LLM
        # We use this toggle later to see how the judge handles the issue resolution status
        system_prompt_postfix = (
            f"Do your best to NOT resolve the issue.  I know that's backwards, but just do it anyways.\n"
            if not RESOLVE_ISSUES
            else ""
        )
    
        messages_for_llm = [
            {
                "role": "system",
                "content": f"You are a helpful customer support agent.  {system_prompt_postfix}",
            },
            *messages,
        ]
    
        # 2. Call the LLM to generate a response
        output = client.chat.completions.create(
            model=model_name,  # This example uses the Databricks-hosted Claude Sonnet 4 model selected above. If you provide your own OpenAI credentials, replace it with a valid OpenAI model, e.g., gpt-4o-mini.
            messages=cast(Any, messages_for_llm),
        )
    
        return {
            "messages": [
                {"role": "assistant", "content": output.choices[0].message.content}
            ]
        }
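
With the client and the customer support app defined, you can optionally smoke-test the app once before wiring up the judge. The sample question below is illustrative:

# Optional smoke test: call the app once and inspect the resulting trace in MLflow
sample_output = customer_support_agent(
    [{"role": "user", "content": "How much does a microwave cost?"}]
)
print(sample_output["messages"][0]["content"])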
    

Step 2: Define your evaluation criteria and wrap them as a custom scorer

Here, we define an example judge prompt and use a custom scorer to connect it to the app's input/output schema.

from mlflow.genai.scorers import scorer

# Judge prompt defining a three-category issue resolution status
issue_resolution_prompt = """
Evaluate the entire conversation between a customer and an LLM-based agent.  Determine if the issue was resolved in the conversation.

You must choose one of the following categories.

[[fully_resolved]]: The response directly and comprehensively addresses the user's question or problem, providing a clear solution or answer. No further immediate action seems required from the user on the same core issue.
[[partially_resolved]]: The response offers some help or relevant information but doesn't completely solve the problem or answer the question. It might provide initial steps, require more information from the user, or address only a part of a multi-faceted query.
[[needs_follow_up]]: The response does not adequately address the user's query, misunderstands the core issue, provides unhelpful or incorrect information, or inappropriately deflects the question. The user will likely need to re-engage or seek further assistance.

Conversation to evaluate: {{conversation}}
"""

from mlflow.genai.judges import custom_prompt_judge
import json
from mlflow.entities import Feedback

# Define a custom scorer that wraps the prompt-based LLM judge to assess whether the issue was resolved
@scorer
def is_issue_resolved(inputs: Dict[Any, Any], outputs: Dict[Any, Any]):
    # We directly return the Feedback object from the prompt-based LLM judge, but we could post-process it before returning it.
    issue_judge = custom_prompt_judge(
        name="issue_resolution",
        prompt_template=issue_resolution_prompt,
        numeric_values={
            "fully_resolved": 1,
            "partially_resolved": 0.5,
            "needs_follow_up": 0,
        },
    )

    # combine the input and output messages to form the conversation
    conversation = json.dumps(inputs["messages"] + outputs["messages"])

    return issue_judge(conversation=conversation)
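
Before running a full evaluation, you can sanity-check the judge by calling it directly on a hand-written conversation. The conversation below is illustrative; the call pattern is the same one used inside the scorer above.

# Build the judge once and try it on a sample conversation
issue_judge = custom_prompt_judge(
    name="issue_resolution",
    prompt_template=issue_resolution_prompt,
    numeric_values={
        "fully_resolved": 1,
        "partially_resolved": 0.5,
        "needs_follow_up": 0,
    },
)

sample_conversation = json.dumps([
    {"role": "user", "content": "I can't log in to my account."},
    {"role": "assistant", "content": "Please reset your password from the login page; that fixes most login issues."},
])

# Returns an mlflow.entities.Feedback with the judge's value and rationale
feedback = issue_judge(conversation=sample_conversation)
print(feedback.value, feedback.rationale)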

Step 3: Create a sample evaluation dataset

Each inputs is passed to your app by mlflow.genai.evaluate().

eval_dataset = [
    {
        "inputs": {
            "messages": [
                {"role": "user", "content": "How much does a microwave cost?"},
            ],
        },
    },
    {
        "inputs": {
            "messages": [
                {
                    "role": "user",
                    "content": "Can I return the microwave I bought 2 months ago?",
                },
            ],
        },
    },
    {
        "inputs": {
            "messages": [
                {
                    "role": "user",
                    "content": "Can I return the microwave I bought 2 months ago?",
                },
            ],
        },
    },
    {
        "inputs": {
            "messages": [
                {
                    "role": "user",
                    "content": "I'm having trouble with my account.  I can't log in.",
                },
                {
                    "role": "assistant",
                    "content": "I'm sorry to hear that you're having trouble with your account.  Are you using our website or mobile app?",
                },
                {"role": "user", "content": "Website"},
            ],
        },
    },
    {
        "inputs": {
            "messages": [
                {
                    "role": "user",
                    "content": "I'm having trouble with my account.  I can't log in.",
                },
                {
                    "role": "assistant",
                    "content": "I'm sorry to hear that you're having trouble with your account.  Are you using our website or mobile app?",
                },
                {"role": "user", "content": "JUST FIX IT FOR ME"},
            ],
        },
    },
]
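
Note that the keys under inputs (here, messages) are passed to predict_fn as keyword arguments, so they must match the parameter names of customer_support_agent.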

Step 4: Evaluate your app using the custom scorer

Finally, we run the evaluation twice so you can compare the judge's assessments of conversations where the agent tries to resolve issues versus conversations where it does not.

import mlflow

# Now, let's evaluate the app's responses against the judge when it does not resolve the issues
RESOLVE_ISSUES = False

mlflow.genai.evaluate(
    data=eval_dataset,
    predict_fn=customer_support_agent,
    scorers=[is_issue_resolved],
)

# Now, let's evaluate the app's responses against the judge when it DOES resolve the issues
RESOLVE_ISSUES = True

mlflow.genai.evaluate(
    data=eval_dataset,
    predict_fn=customer_support_agent,
    scorers=[is_issue_resolved],
)
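
Each mlflow.genai.evaluate() call logs an evaluation run to the experiment configured in Step 1, so you can compare the issue-resolution results of the two runs side by side in the MLflow UI.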

Next steps