Align judges with humans

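Judge alignment teaches LLM judges to match human evaluation standards through systematic feedback. This process turns a generic evaluator into a domain-specific expert that understands your particular quality criteria, improving agreement with human evaluations by 30% to 50% compared with baseline judges.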

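The same alignment workflow applies to built-in judges (such as RelevanceToQuery, Safety, and Correctness) and to custom judges created with make_judge(). Align a built-in judge to adapt its general-purpose criteria to your domain, or align a custom judge to refine specialized evaluation logic.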

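Judge alignment follows a three-step workflow:

  1. Generate initial assessments: Use a built-in or custom judge to evaluate traces and establish a baseline.
  2. Collect human feedback: Domain experts review and correct the judge's assessments.
  3. Align and deploy: Call the judge's align() method to create a new judge that agrees more closely with the human feedback.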

Alignment supports the optimizers provided in the mlflow.genai.judges.optimizers package.

Requirements

  • MLflow 3.4.0 or later, which is required for judge alignment

    %pip install --upgrade "mlflow[databricks]>=3.4.0" databricks_openai dspy
    dbutils.library.restartPython()
    
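  • A judge to align. This can be a built-in judge (for example, RelevanceToQuery or Correctness) or a custom judge created with make_judge().

  • The human feedback assessment name must exactly match the judge's name attribute. For built-in judges, this is the default snake_case name (for example, RelevanceToQuery is named relevance_to_query) unless you override it by passing name= when you instantiate the class. For custom judges, it is the name you pass to make_judge() (for example, product_quality).

  • Alignment is not supported for session-level (multi-turn) judges such as ConversationCompleteness.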
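
The naming rule above can be illustrated with a small sketch. The helper default_judge_name below is a hypothetical illustration of how a CamelCase class name maps to a snake_case name; it is not MLflow's implementation, and the authoritative value is always the judge's own name attribute.

```python
import re


def default_judge_name(class_name: str) -> str:
    """Convert a CamelCase class name to snake_case (illustrative helper only)."""
    # Insert an underscore before every capital letter except the first one
    return re.sub(r"(?<!^)(?=[A-Z])", "_", class_name).lower()


def feedback_name_matches(feedback_name: str, judge_name: str) -> bool:
    """Alignment needs an exact, case-sensitive match between the two names."""
    return feedback_name == judge_name


print(default_judge_name("RelevanceToQuery"))
print(feedback_name_matches("Relevance_To_Query", "relevance_to_query"))
```

In practice, reading the name from the judge instance (for example, initial_judge.name) avoids having to derive it at all.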

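Step 1: Set up the judge and generate traces

Set up the initial judge and generate traces with assessments. At least 10 traces are needed for a reasonable alignment, but 50 to 100 traces give better results.

Built-in judge

Instantiate the built-in judge directly. Built-in judges expose a name attribute (by default a snake_case string such as relevance_to_query), which you use when logging human feedback in step 2.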

from mlflow.genai.scorers import RelevanceToQuery
import mlflow

# Create or set an MLflow experiment for alignment.
# Use a workspace path such as /Shared/<name> or /Users/<your-email>/<name>.
experiment = mlflow.set_experiment("/Shared/relevance-alignment")
experiment_id = experiment.experiment_id

# Use a built-in judge
initial_judge = RelevanceToQuery()

Custom judge

Create a custom judge with make_judge(). The name parameter is the same name that you use when logging human feedback in step 2.

from mlflow.genai.judges import make_judge
import mlflow

# Create or set an MLflow experiment for alignment.
# Use a workspace path such as /Shared/<name> or /Users/<your-email>/<name>.
experiment = mlflow.set_experiment("/Shared/product-quality-alignment")
experiment_id = experiment.experiment_id

# Create initial judge with template-based evaluation
initial_judge = make_judge(
    name="product_quality",
    instructions=(
        "Evaluate if the product description in {{ outputs }} "
        "is accurate and helpful for the query in {{ inputs }}. "
        "Rate as: excellent, good, fair, or poor"
    ),
    model="databricks:/databricks-gpt-oss-120b",
)

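Define your application logic. The following example uses a Databricks-hosted foundation model to generate product descriptions from queries. Replace this code with your own application code: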

import mlflow
from databricks_openai import DatabricksOpenAI

# Enable automatic tracing of OpenAI calls
mlflow.openai.autolog()

# Create an OpenAI client connected to Databricks-hosted LLMs
client = DatabricksOpenAI()
model_name = "databricks-claude-sonnet-4"

def generate_product_description(query: str) -> str:
    response = client.chat.completions.create(
        model=model_name,
        messages=[
            {
                "role": "system",
                "content": "You write concise, accurate product descriptions.",
            },
            {"role": "user", "content": query},
        ],
    )
    return response.choices[0].message.content

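Generate traces and run the judge. Use the judge's name attribute (for example, relevance_to_query for the built-in judge above, or product_quality for the custom judge above) as the feedback name: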

# Generate traces for alignment (minimum 10, recommended 50+)
for i in range(50):
    query = f"Tell me about product {i}"
    description = generate_product_description(query)

    # Retrieve the ID of the most recent finished trace
    trace_id = mlflow.get_last_active_trace_id()
    trace = mlflow.get_trace(trace_id)

    # Generate judge assessment
    judge_result = initial_judge(trace=trace)

    # Log judge feedback to the trace using the judge's name
    mlflow.log_feedback(
        trace_id=trace_id,
        name=initial_judge.name,
        value=judge_result.value,
        rationale=judge_result.rationale,
    )

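Step 2: Collect human feedback

Collect human feedback to teach the judge your quality standards. Choose one of the following approaches:

Databricks UI review

Collect human feedback through the UI when:

  • You need domain experts to review outputs.
  • You want to iteratively refine the feedback criteria.
  • You are working with a smaller dataset (< 100 examples).

Use the MLflow UI to review traces manually and provide feedback:

  1. Go to the MLflow experiment in your Databricks workspace.
  2. Click the Traces tab to view the traces.
  3. Review each trace and its assessments.
  4. Use the feedback interface in the UI to add human feedback.
  5. Make sure the feedback name exactly matches your judge's name attribute (for example, relevance_to_query for a built-in RelevanceToQuery instance, or product_quality for the custom judge above).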

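Programmatic feedback

Use programmatic feedback when:

  • You have pre-existing ground truth labels.
  • You are working with a large dataset (100+ examples).
  • You need reproducible feedback collection.

If you have existing ground truth labels, log them programmatically: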

from mlflow.entities import AssessmentSource, AssessmentSourceType

# Your ground truth data
ground_truth_data = [
    {"trace_id": "<trace_id_1>", "label": "excellent", "rationale": "Comprehensive and accurate description"},
    {"trace_id": "<trace_id_2>", "label": "poor", "rationale": "Missing key product features"},
    {"trace_id": "<trace_id_3>", "label": "good", "rationale": "Accurate but could be more detailed"},
    # ... more ground truth labels
]

# Log human feedback for each trace
for item in ground_truth_data:
    mlflow.log_feedback(
        trace_id=item["trace_id"],
        name=initial_judge.name,  # Must match judge name (built-in or custom)
        value=item["label"],
        rationale=item.get("rationale", ""),
        source=AssessmentSource(
            source_type=AssessmentSourceType.HUMAN,
            source_id="ground_truth_dataset"
        ),
    )

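Best practices for feedback collection

  • Diverse reviewers: Include multiple domain experts to capture a range of perspectives.
  • Balanced examples: Include at least 30% negative examples (poor or fair ratings).
  • Clear rationales: Provide a detailed explanation for each rating.
  • Representative examples: Cover both edge cases and common scenarios.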
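
Before logging feedback, it can help to check a label set against the practices above. The following is a minimal sketch in plain Python; the function name summarize_feedback, the reviewer field, and the choice of poor and fair as the negative labels are assumptions made for illustration and are not part of MLflow.

```python
from collections import Counter

# Assumption: these labels count as negative examples for the balance check
NEGATIVE_LABELS = {"poor", "fair"}


def summarize_feedback(items: list[dict]) -> dict:
    """Summarize label balance, reviewer diversity, and missing rationales."""
    if not items:
        raise ValueError("No feedback items provided")
    labels = Counter(item["label"] for item in items)
    negatives = sum(n for label, n in labels.items() if label in NEGATIVE_LABELS)
    reviewers = {item["reviewer"] for item in items}
    return {
        "negative_share": negatives / len(items),
        "num_reviewers": len(reviewers),
        "missing_rationale": sum(1 for item in items if not item.get("rationale")),
    }


# Hypothetical feedback records
feedback = [
    {"label": "excellent", "reviewer": "alice", "rationale": "Accurate and complete"},
    {"label": "poor", "reviewer": "bob", "rationale": "Missing key features"},
    {"label": "good", "reviewer": "alice", "rationale": ""},
    {"label": "fair", "reviewer": "carol", "rationale": "Too vague"},
]
summary = summarize_feedback(feedback)
print(summary)
```

A negative_share below 0.3, a single reviewer, or a nonzero missing_rationale count each suggests collecting more feedback before alignment.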

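Step 3: Align and register the judge

After you have collected enough human feedback, align the judge. Built-in and custom judges use the same align() method.

When you call align() without specifying an optimizer, the MemAlign optimizer is used automatically: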

# Retrieve traces with both judge and human assessments
traces_for_alignment = mlflow.search_traces(
    experiment_ids=[experiment_id],
    max_results=100,
    return_type="list"
)

if len(traces_for_alignment) >= 10:
    # Align the judge based on human feedback using the default optimizer
    aligned_judge = initial_judge.align(traces_for_alignment)

    # Register the aligned judge for production use.
    # Use a new name to distinguish it from the original judge.
    aligned_judge.register(
        experiment_id=experiment_id,
        name=f"{initial_judge.name}_aligned",
        tags={"alignment_date": "2025-10-23", "num_traces": str(len(traces_for_alignment))}
    )

    print(f"Successfully aligned judge using {len(traces_for_alignment)} traces")
else:
    print(f"Insufficient traces for alignment. Found {len(traces_for_alignment)}, need at least 10")
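
A search over the whole experiment may also return traces that nobody has reviewed yet, so counting traces alone can overstate how much human feedback is available. The sketch below shows one way to keep only traces that carry human feedback under the judge's name. It relies on the same search_assessments(type="feedback") call and source.source_type comparison used elsewhere in this article; FakeTrace and make_feedback are hypothetical stand-ins so the example runs without an MLflow backend.

```python
from types import SimpleNamespace


def has_human_feedback(trace, judge_name: str) -> bool:
    """Return True if the trace has human feedback logged under judge_name."""
    feedbacks = trace.search_assessments(type="feedback")
    return any(
        f.name == judge_name and f.source.source_type == "HUMAN" for f in feedbacks
    )


# --- Hypothetical stand-ins for MLflow traces, for demonstration only ---
class FakeTrace:
    def __init__(self, assessments):
        self._assessments = assessments

    def search_assessments(self, type=None):
        return list(self._assessments)


def make_feedback(name: str, source_type: str):
    return SimpleNamespace(name=name, source=SimpleNamespace(source_type=source_type))


traces = [
    # Judge assessment plus matching human feedback: usable for alignment
    FakeTrace([make_feedback("product_quality", "LLM_JUDGE"),
               make_feedback("product_quality", "HUMAN")]),
    # Judge assessment only: not yet reviewed
    FakeTrace([make_feedback("product_quality", "LLM_JUDGE")]),
    # Human feedback under a different name: does not match the judge
    FakeTrace([make_feedback("other_metric", "HUMAN")]),
]

usable = [t for t in traces if has_human_feedback(t, "product_quality")]
print(f"{len(usable)} of {len(traces)} traces have matching human feedback")
```

With real traces, the same filter could be applied to traces_for_alignment before checking the minimum of 10.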

Explicit optimizer

from mlflow.genai.judges.optimizers import MemAlignOptimizer

# Retrieve traces with both judge and human assessments
traces_for_alignment = mlflow.search_traces(
    experiment_ids=[experiment_id], max_results=15, return_type="list"
)

# Align the judge using human corrections (minimum 10 traces recommended)
if len(traces_for_alignment) >= 10:
    # Explicitly specify optimizer with custom model configuration
    optimizer = MemAlignOptimizer(model="databricks:/databricks-gpt-oss-120b")
    aligned_judge = initial_judge.align(traces_for_alignment, optimizer)

    # Register the aligned judge
    aligned_judge.register(experiment_id=experiment_id)
    print("Judge aligned successfully with human feedback")
else:
    print(f"Need at least 10 traces for alignment, have {len(traces_for_alignment)}")

Enable detailed logging

To monitor the alignment process, enable debug logging for the optimizer:

import logging

# Enable detailed logging
logging.getLogger("mlflow.genai.judges.optimizers.memalign").setLevel(logging.DEBUG)

# Run alignment with verbose output
aligned_judge = initial_judge.align(traces_for_alignment)

Validate alignment

Verify that alignment improved the judge:


def test_alignment_improvement(
    original_judge, aligned_judge, test_traces: list
) -> dict:
    """Compare judge performance before and after alignment."""

    original_correct = 0
    aligned_correct = 0
    evaluated = 0  # Number of traces that carry human feedback

    for trace in test_traces:
        # Get human ground truth from trace assessments
        feedbacks = trace.search_assessments(type="feedback")
        human_feedback = next(
            (f for f in feedbacks if f.source.source_type == "HUMAN"), None
        )

        # Skip traces without a human label so they don't count as misses
        if not human_feedback:
            continue
        evaluated += 1

        # Get judge evaluations
        # Judges can evaluate entire traces instead of individual inputs/outputs
        original_eval = original_judge(trace=trace)
        aligned_eval = aligned_judge(trace=trace)

        # Check agreement with human
        if original_eval.value == human_feedback.value:
            original_correct += 1
        if aligned_eval.value == human_feedback.value:
            aligned_correct += 1

    # Use only traces with human feedback as the denominator
    if evaluated == 0:
        raise ValueError("No test traces contain human feedback")

    return {
        "original_accuracy": original_correct / evaluated,
        "aligned_accuracy": aligned_correct / evaluated,
        "improvement": (aligned_correct - original_correct) / evaluated,
    }
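
Measuring improvement on the same traces that were used for alignment can overstate the gain, so a held-out set is a common safeguard. The following is a minimal sketch of a reproducible split; split_traces, holdout_fraction, and seed are hypothetical names, and the integers stand in for trace objects.

```python
import random


def split_traces(traces: list, holdout_fraction: float = 0.2, seed: int = 0):
    """Shuffle traces reproducibly and split them into (alignment, holdout) lists."""
    if not 0 < holdout_fraction < 1:
        raise ValueError("holdout_fraction must be between 0 and 1")
    if len(traces) < 2:
        raise ValueError("Need at least two traces to split")
    shuffled = list(traces)  # Copy so the caller's list is not modified
    random.Random(seed).shuffle(shuffled)
    n_holdout = max(1, round(len(shuffled) * holdout_fraction))
    return shuffled[n_holdout:], shuffled[:n_holdout]


# Integers stand in for the 50 traces generated in step 1
alignment_set, holdout_set = split_traces(list(range(50)))
print(len(alignment_set), len(holdout_set))
```

The alignment set would then be passed to align(), and the holdout set to test_alignment_improvement as test_traces.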

Create a custom alignment optimizer

Extend the base class to implement a specialized alignment strategy.

from mlflow.genai.judges.base import AlignmentOptimizer, Judge
from mlflow.entities.trace import Trace

class MyCustomOptimizer(AlignmentOptimizer):
    """Custom optimizer implementation for judge alignment."""

    def __init__(self, model: str = None, **kwargs):
        """Initialize your optimizer with custom parameters."""
        self.model = model
        # Add any custom initialization logic

    def align(self, judge: Judge, traces: list[Trace]) -> Judge:
        """
        Implement your alignment algorithm.

        Args:
            judge: The judge to be optimized
            traces: List of traces containing human feedback

        Returns:
            A new Judge instance with improved alignment
        """
        # Your custom alignment logic here
        # 1. Extract feedback from traces
        # 2. Analyze disagreements between judge and human
        # 3. Generate improved instructions
        # 4. Return new judge with better alignment

        # Example: Return judge with modified instructions
        from mlflow.genai.judges import make_judge

        improved_instructions = self._optimize_instructions(judge.instructions, traces)

        return make_judge(
            name=judge.name,
            instructions=improved_instructions,
            model=judge.model,
        )

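    def _optimize_instructions(self, instructions: str, traces: list[Trace]) -> str:
        """Your custom optimization logic."""
        # Implement your optimization strategy here.
        # Placeholder: return the instructions unchanged so align() still
        # receives a valid string instead of None.
        return instructions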

# Create your custom optimizer
custom_optimizer = MyCustomOptimizer(model="your-model")

# Use it for alignment with the traces retrieved in step 3
aligned_judge = initial_judge.align(traces_for_alignment, custom_optimizer)
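
As one possible body for _optimize_instructions, the sketch below appends human corrections to the original instructions as extra guidance. This is a deliberately naive strategy chosen for illustration, not how MemAlign or any other built-in optimizer works. The function name append_corrections and the dictionary keys judge_value, human_value, and rationale are assumptions; in a real optimizer these values would first be extracted from the traces.

```python
def append_corrections(
    instructions: str, corrections: list[dict], max_examples: int = 3
) -> str:
    """Append cases where the human disagreed with the judge as guidance."""
    disagreements = [
        c for c in corrections if c["judge_value"] != c["human_value"]
    ]
    if not disagreements:
        return instructions  # Nothing to learn from; keep instructions as-is

    lines = ["", "Guidance from human reviewers:"]
    for c in disagreements[:max_examples]:
        lines.append(
            f"- Rated {c['judge_value']} but should be {c['human_value']}: "
            f"{c['rationale']}"
        )
    # The leading empty string puts the guidance on a new line
    return instructions + "\n".join(lines)


base = "Evaluate if the product description in {{ outputs }} is accurate."
corrections = [
    {"judge_value": "good", "human_value": "poor",
     "rationale": "Missing key product features"},
    {"judge_value": "excellent", "human_value": "excellent",
     "rationale": "Comprehensive and accurate"},
]
improved = append_corrections(base, corrections)
print(improved)
```

The original template variables are left untouched, so the resulting string can still be passed to make_judge() as instructions.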

Limitations

  • Judge alignment does not support agent-based or expectation-based evaluation.

Next steps