Evaluate and compare prompt versions

Important

This feature is in Beta.

This guide shows how to systematically evaluate different prompt versions to identify the most effective one for your agents and GenAI applications. You will learn how to create prompt versions, build an evaluation dataset with expected facts, and use MLflow's evaluation framework to compare performance.

All of the code on this page is included in the example notebook.

Prerequisites

This guide requires:

  • MLflow 3.1.0 or above.
  • OpenAI API access or Databricks Model Serving.
  • A Unity Catalog schema on which you have the CREATE FUNCTION, EXECUTE, and MANAGE privileges (an illustrative grant statement follows this list).
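If you need to set up these privileges yourself, the following is a minimal sketch that a catalog or schema owner could run from a Databricks notebook (where spark is predefined). The catalog, schema, and group names are placeholders, and it assumes the MANAGE privilege is available in your workspace; adjust the grants to fit your governance setup.

# Illustrative only (run by a catalog/schema owner): grant the privileges this guide needs.
# Replace the catalog, schema, and group names with your own values.
spark.sql("GRANT USE CATALOG ON CATALOG main TO `your-group`")
spark.sql("GRANT USE SCHEMA, CREATE FUNCTION, EXECUTE, MANAGE ON SCHEMA main.default TO `your-group`")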

What you'll learn

  • Create and register multiple prompt versions for comparison.
  • Build an evaluation dataset with expected facts to test prompt effectiveness.
  • Use MLflow's built-in scorers to evaluate factual accuracy.
  • Create custom prompt-based judges for specific metrics.
  • Compare results across versions to select the best-performing prompt.

Step 1: Set up your environment

Note

You need the CREATE FUNCTION, EXECUTE, and MANAGE privileges on both the catalog and the schema to create prompts and evaluation datasets.

First, set up your Unity Catalog schema and install the required packages:

# Install required packages
%pip install --upgrade "mlflow[databricks]>=3.1.0" openai
dbutils.library.restartPython()

# Configure your Unity Catalog schema
import mlflow
import pandas as pd
from openai import OpenAI
import uuid

CATALOG = "main"        # Replace with your catalog name
SCHEMA = "default"      # Replace with your schema name

# Create unique names for the prompt and dataset
SUFFIX = uuid.uuid4().hex[:8]  # Short unique suffix
PROMPT_NAME = f"{CATALOG}.{SCHEMA}.summary_prompt_{SUFFIX}"
EVAL_DATASET_NAME = f"{CATALOG}.{SCHEMA}.summary_eval_{SUFFIX}"

print(f"Prompt name: {PROMPT_NAME}")
print(f"Evaluation dataset: {EVAL_DATASET_NAME}")

# Set up OpenAI client
client = OpenAI()
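
The prerequisites allow either the OpenAI API or Databricks Model Serving. If you use Model Serving instead, a sketch along the following lines (assuming the Databricks SDK's OpenAI-compatible client) could replace the client setup above; in later calls you would pass the name of your serving endpoint as the model argument instead of gpt-4o-mini.

# Alternative client setup (sketch): use Databricks Model Serving via the Databricks SDK.
# Assumes the databricks-sdk package is available in your notebook environment.
from databricks.sdk import WorkspaceClient

client = WorkspaceClient().serving_endpoints.get_open_ai_client()
# In later chat completion calls, set model to the name of your serving endpoint
# (for example, a pay-per-token foundation model endpoint) instead of "gpt-4o-mini".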

Step 2: Create prompt versions

Register different prompt versions that represent different approaches to the task.

# Version 1: Basic prompt
prompt_v1 = mlflow.genai.register_prompt(
    name=PROMPT_NAME,
    template="Summarize this text: {{content}}",
    commit_message="v1: Basic summarization prompt"
)

print(f"Created prompt version {prompt_v1.version}")

# Version 2: Improved with comprehensive guidelines
prompt_v2 = mlflow.genai.register_prompt(
    name=PROMPT_NAME,
    template="""You are an expert summarizer. Create a summary of the following content in *exactly* 2 sentences (no more, no less - be very careful about the number of sentences).

Guidelines:
- Include ALL core facts and key findings
- Use clear, concise language
- Maintain factual accuracy
- Cover all main points mentioned
- Write for a general audience
- Use exactly 2 sentences

Content: {{content}}

Summary:""",
    commit_message="v2: Added comprehensive fact coverage with 2-sentence requirement"
)

print(f"Created prompt version {prompt_v2.version}")

Step 3: Create an evaluation dataset

Build a dataset containing the expected facts that a good summary should include:

# Create evaluation dataset
eval_dataset = mlflow.genai.datasets.create_dataset(
    uc_table_name=EVAL_DATASET_NAME
)

# Add summarization examples with expected facts
evaluation_examples = [
    {
        "inputs": {
            "content": """Remote work has fundamentally changed how teams collaborate and communicate. Companies have adopted new digital tools for video conferencing, project management, and file sharing. While productivity has remained stable or increased in many cases, challenges include maintaining company culture, ensuring work-life balance, and managing distributed teams across time zones. The shift has also accelerated digital transformation initiatives and changed hiring practices, with many companies now recruiting talent globally rather than locally."""
        },
        "expectations": {
            "expected_facts": [
                "remote work changed collaboration",
                "digital tools adoption",
                "productivity remained stable",
                "challenges with company culture",
                "work-life balance issues",
                "global talent recruitment"
            ]
        }
    },
    {
        "inputs": {
            "content": """Electric vehicles are gaining mainstream adoption as battery technology improves and charging infrastructure expands. Major automakers have committed to electrification with new models launching regularly. Government incentives and environmental regulations are driving consumer interest. However, challenges remain including higher upfront costs, limited charging stations in rural areas, and concerns about battery life and replacement costs. The market is expected to grow significantly over the next decade."""
        },
        "expectations": {
            "expected_facts": [
                "electric vehicles gaining adoption",
                "battery technology improving",
                "charging infrastructure expanding",
                "government incentives",
                "higher upfront costs",
                "limited rural charging",
                "market growth expected"
            ]
        }
    },
    {
        "inputs": {
            "content": """Artificial intelligence is transforming healthcare through diagnostic imaging, drug discovery, and personalized treatment plans. Machine learning algorithms can now detect diseases earlier and more accurately than traditional methods. AI-powered robots assist in surgery and patient care. However, concerns exist about data privacy, algorithm bias, and the need for regulatory oversight. Healthcare providers must balance innovation with patient safety and ethical considerations."""
        },
        "expectations": {
            "expected_facts": [
                "AI transforming healthcare",
                "diagnostic imaging improvements",
                "drug discovery acceleration",
                "personalized treatment",
                "earlier disease detection",
                "data privacy concerns",
                "algorithm bias issues",
                "regulatory oversight needed"
            ]
        }
    }
]

eval_dataset.merge_records(evaluation_examples)
print(f"Added {len(evaluation_examples)} summarization examples to evaluation dataset")

Step 4: Create evaluation functions and custom metrics

Define a function that uses your prompt versions, then create custom evaluation metrics:

def create_summary_function(prompt_name: str, version: int):
    """Create a summarization function for a specific prompt version."""

    @mlflow.trace
    def summarize_content(content: str) -> dict:
        # Load the prompt version
        prompt = mlflow.genai.load_prompt(
            name_or_uri=f"prompts:/{prompt_name}/{version}"
        )

        # Format and call the LLM
        formatted_prompt = prompt.format(content=content)

        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": formatted_prompt}],
            temperature=0.1
        )

        return {"summary": response.choices[0].message.content}

    return summarize_content
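
Before running the full evaluation, you can optionally sanity-check the factory function on a short, made-up example. This sketch only uses names defined above:

# Optional sanity check: generate a summary with prompt version 1.
summarize_v1 = create_summary_function(PROMPT_NAME, version=1)
result = summarize_v1("Solar panel efficiency has improved steadily while manufacturing costs have fallen, making rooftop installations more common.")
print(result["summary"])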

Custom prompt-based judges

Create a custom judge to evaluate specific criteria:

from mlflow.genai.judges import custom_prompt_judge
from mlflow.genai.scorers import scorer

# Create a custom prompt judge
sentence_count_judge = custom_prompt_judge(
    name="sentence_count_compliance",
    prompt_template="""Evaluate if this summary follows the 2-sentence requirement:

Summary: {{summary}}

Count the sentences carefully and choose the appropriate rating:

[[correct]]: Exactly 2 sentences - follows instructions correctly
[[incorrect]]: Not exactly 2 sentences - does not follow instructions""",
    numeric_values={
        "correct": 1.0,
        "incorrect": 0.0
    }
)

# Wrap the judge in a scorer
@scorer
def sentence_compliance_scorer(inputs, outputs, trace) -> bool:
    """Custom scorer that evaluates sentence count compliance."""
    result = sentence_count_judge(summary=outputs.get("summary", ""))
    return result.value == 1.0  # Convert to boolean
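
You can also exercise the judge directly on hand-written summaries to confirm the ratings behave as expected. This is an optional sketch; the judge calls an LLM under the hood, so behavior may vary:

# Optional: try the custom judge on hand-written summaries.
good = sentence_count_judge(summary="Electric vehicles are gaining adoption. Charging access remains uneven in rural areas.")
bad = sentence_count_judge(summary="Electric vehicles are gaining adoption.")
print(good.value, bad.value)  # Expect 1.0 for the two-sentence summary, 0.0 otherwise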

Step 5: Run the comparison evaluation

Evaluate each prompt version using both the built-in and custom scorers:

from mlflow.genai.scorers import Correctness

# Define scorers
scorers = [
    Correctness(),  # Checks expected facts
    sentence_compliance_scorer,  # Custom sentence count metric
]

# Evaluate each version
results = {}

for version in [1, 2]:
    print(f"\nEvaluating version {version}...")

    with mlflow.start_run(run_name=f"summary_v{version}_eval"):
        mlflow.log_param("prompt_version", version)

        # Run evaluation
        eval_results = mlflow.genai.evaluate(
            predict_fn=create_summary_function(PROMPT_NAME, version),
            data=eval_dataset,
            scorers=scorers,
        )

        results[f"v{version}"] = eval_results
        print(f"  Correctness score: {eval_results.metrics.get('correctness/mean', 0):.2f}")
        print(f"  Sentence compliance: {eval_results.metrics.get('sentence_compliance_scorer/mean', 0):.2f}")

Step 6: Compare results and select the best version

Analyze the results to determine the best-performing prompt.

# Compare versions across all metrics
print("=== Version Comparison ===")
for version, result in results.items():
    correctness_score = result.metrics.get('correctness/mean', 0)
    compliance_score = result.metrics.get('sentence_compliance_scorer/mean', 0)
    print(f"{version}:")
    print(f"  Correctness: {correctness_score:.2f}")
    print(f"  Sentence compliance: {compliance_score:.2f}")
    print()

# Calculate composite scores
print("=== Composite Scores ===")
composite_scores = {}
for version, result in results.items():
    correctness = result.metrics.get('correctness/mean', 0)
    compliance = result.metrics.get('sentence_compliance_scorer/mean', 0)
    # Weight correctness more heavily (70%) than compliance (30%)
    composite = 0.7 * correctness + 0.3 * compliance
    composite_scores[version] = composite
    print(f"{version}: {composite:.2f}")

# Find best version
best_version = max(composite_scores.items(), key=lambda x: x[1])
print(f"\nBest performing version: {best_version[0]} (score: {best_version[1]:.2f})")

# Show why this version is best
best_results = results[best_version[0]]
print(f"\nWhy {best_version[0]} is best:")
print(f"- Captures {best_results.metrics.get('correctness/mean', 0):.0%} of expected facts")
print(f"- Follows sentence requirements {best_results.metrics.get('sentence_compliance_scorer/mean', 0):.0%} of the time")

Once evaluation has identified the best-performing prompt version, you can deploy it. To learn how to use aliases for production deployment, see Use prompts in deployed apps.
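
As a preview of that workflow, a minimal sketch might look like the following. It assumes the mlflow.genai.set_prompt_alias API and the prompts:/<name>@<alias> URI form, and the alias name is just an example:

# Sketch: point a "production" alias at the winning version, then resolve it at serving time.
best_version_number = int(best_version[0].lstrip("v"))  # e.g. "v2" -> 2
mlflow.genai.set_prompt_alias(name=PROMPT_NAME, alias="production", version=best_version_number)

# In the deployed app, load the prompt by alias instead of hard-coding a version number.
production_prompt = mlflow.genai.load_prompt(f"prompts:/{PROMPT_NAME}@production")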

Best practices

  1. Start simple: Begin with a basic prompt and iterate based on evaluation results.

  2. Use a consistent dataset: Evaluate every version against the same data so comparisons are fair.

  3. Track everything: Log prompt versions, evaluation results, and deployment decisions.

  4. Test edge cases: Include challenging examples in your evaluation dataset.

  5. Monitor production: Keep evaluating prompts after deployment to catch and prevent performance regressions.

  6. Document changes: Use meaningful commit messages to record why each change was made.

Example notebook

For a complete working example, see the following notebook.

Evaluating GenAI apps quickstart notebook

Get notebook

Next steps