Groundedness judge & scorer

The predefined judges.is_grounded() judge assesses whether your application's response is factually supported by the provided context (from a RAG system or from tool calls), helping detect hallucinations or statements that are not backed by that context.

This judge is available through the predefined RetrievalGroundedness scorer, for evaluating RAG applications that need to ensure responses are grounded in retrieved information.

API Signature

For details, see mlflow.genai.judges.is_grounded().

from mlflow.genai.judges import is_grounded

def is_grounded(
    *,
    request: str,               # User's original query
    response: str,              # Application's response
    context: Any,               # Context the response should be grounded in; can be any Python primitive or a JSON-serializable dict
    name: Optional[str] = None  # Optional custom name for display in the MLflow UIs
) -> mlflow.entities.Feedback:
    """Returns Feedback with 'yes' or 'no' value and a rationale"""

Prerequisites for running the examples

  1. Install MLflow and the required packages

    pip install --upgrade "mlflow[databricks]>=3.1.0"
    
  2. Create an MLflow experiment by following the setup your environment quickstart.

Using the SDK directly

from mlflow.genai.judges import is_grounded

# Example 1: Response is grounded in context
feedback = is_grounded(
    request="What is the capital of France?",
    response="Paris",
    context=[
        {"content": "Paris is the capital of France."},
        {"content": "Paris is known for its Eiffel Tower."}
    ]
)
print(feedback.value)  # "yes"
print(feedback.rationale)  # Explanation of groundedness

# Example 2: Response contains hallucination
feedback = is_grounded(
    request="What is the capital of France?",
    response="Paris, which has a population of 10 million people",
    context=[
        {"content": "Paris is the capital of France."}
    ]
)
print(feedback.value)  # "no"
print(feedback.rationale)  # Identifies unsupported claim about population
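
# Example 3 (sketch): per the signature above, `context` accepts any Python
# primitive or JSON-serializable object, so a plain string also works
feedback = is_grounded(
    request="When was MLflow open-sourced?",
    response="MLflow was open-sourced in 2018.",
    context="MLflow was open-sourced by Databricks in June 2018."
)
print(feedback.value)  # expected: "yes"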

Using the prebuilt scorer

The is_grounded judge is available through the prebuilt RetrievalGroundedness scorer.

Requirements

  • Trace requirements
    • The MLflow Trace must contain at least one span with span_type set to RETRIEVER
    • inputs and outputs must be present on the Trace's root span
  1. Initialize an OpenAI client to connect to either Databricks-hosted LLMs or LLMs hosted by OpenAI.

    Databricks-hosted LLMs

    Use MLflow to get an OpenAI client that connects to Databricks-hosted LLMs. Select a model from the available foundation models.

    import mlflow
    from databricks.sdk import WorkspaceClient
    
    # Enable MLflow's autologging to instrument your application with Tracing
    mlflow.openai.autolog()
    
    # Set up MLflow tracking to Databricks
    mlflow.set_tracking_uri("databricks")
    mlflow.set_experiment("/Shared/docs-demo")
    
    # Create an OpenAI client that is connected to Databricks-hosted LLMs
    w = WorkspaceClient()
    client = w.serving_endpoints.get_open_ai_client()
    
    # Select an LLM
    model_name = "databricks-claude-sonnet-4"
    

    OpenAI-hosted LLMs

    Use the native OpenAI SDK to connect to OpenAI-hosted models. Select a model from the available OpenAI models.

    import mlflow
    import os
    import openai
    
    # Ensure your OPENAI_API_KEY is set in your environment
    # os.environ["OPENAI_API_KEY"] = "<YOUR_API_KEY>" # Uncomment and set if not globally configured
    
    # Enable auto-tracing for OpenAI
    mlflow.openai.autolog()
    
    # Set up MLflow tracking to Databricks
    mlflow.set_tracking_uri("databricks")
    mlflow.set_experiment("/Shared/docs-demo")
    
    # Create an OpenAI client connected to OpenAI
    client = openai.OpenAI()
    
    # Select an LLM
    model_name = "gpt-4o-mini"
    
  2. Use the scorer to run an evaluation:

    from mlflow.genai.scorers import RetrievalGroundedness
    from mlflow.entities import Document
    from typing import List
    
    # Define a retriever function with proper span type
    @mlflow.trace(span_type="RETRIEVER")
    def retrieve_docs(query: str) -> List[Document]:
        # Simulated retrieval based on query
        if "mlflow" in query.lower():
            return [
                Document(
                    id="doc_1",
                    page_content="MLflow is an open-source platform for managing the ML lifecycle.",
                    metadata={"source": "mlflow_docs.txt"}
                ),
                Document(
                    id="doc_2",
                    page_content="MLflow provides tools for experiment tracking, model packaging, and deployment.",
                    metadata={"source": "mlflow_features.txt"}
                )
            ]
        else:
            return [
                Document(
                    id="doc_3",
                    page_content="Machine learning involves training models on data.",
                    metadata={"source": "ml_basics.txt"}
                )
            ]
    
    # Define your RAG app
    @mlflow.trace
    def rag_app(query: str):
        # Retrieve relevant documents
        docs = retrieve_docs(query)
        context = "\n".join([doc.page_content for doc in docs])
    
        # Generate response using LLM
        messages = [
            {"role": "system", "content": f"Answer based on this context: {context}"},
            {"role": "user", "content": query}
        ]
    
        response = client.chat.completions.create(
            # This example uses a Databricks-hosted Claude model. If you use your own OpenAI credentials, replace with a valid OpenAI model, e.g. gpt-4o
            model=model_name,
            messages=messages
        )
    
        return {"response": response.choices[0].message.content}
    
    # Create evaluation dataset
    eval_dataset = [
        {
            "inputs": {"query": "What is MLflow used for?"}
        },
        {
            "inputs": {"query": "What are the main features of MLflow?"}
        }
    ]
    
    # Run evaluation with RetrievalGroundedness scorer
    eval_results = mlflow.genai.evaluate(
        data=eval_dataset,
        predict_fn=rag_app,
        scorers=[RetrievalGroundedness()]
    )
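
    # Sketch: inspect the results afterwards. This assumes the returned object
    # exposes aggregate metrics and the backing run ID, as mlflow.evaluate results do.
    print(eval_results.metrics)  # aggregate scores computed by the scorer
    print(eval_results.run_id)   # run that stores the traces and per-row feedback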
    

Using the judge in a custom scorer

When evaluating an application whose data structure differs from the predefined scorer's requirements, wrap the judge in a custom scorer:

import mlflow
from mlflow.genai.judges import is_grounded
from mlflow.genai.scorers import scorer
from typing import Dict, Any

eval_dataset = [
    {
        "inputs": {"query": "What is MLflow used for?"},
        "outputs": {
            "response": "MLflow is used for managing the ML lifecycle, including experiment tracking and model deployment.",
            "retrieved_context": [
                {"content": "MLflow is a platform for managing the ML lifecycle."},
                {"content": "MLflow includes capabilities for experiment tracking, model packaging, and deployment."}
            ]
        }
    },
    {
        "inputs": {"query": "Who created MLflow?"},
        "outputs": {
            "response": "MLflow was created by Databricks in 2018 and has over 10,000 contributors.",
            "retrieved_context": [
                {"content": "MLflow was created by Databricks."},
                {"content": "MLflow was open-sourced in 2018."}
            ]
        }
    }
]

@scorer
def groundedness_scorer(inputs: Dict[Any, Any], outputs: Dict[Any, Any]):
    return is_grounded(
        request=inputs["query"],
        response=outputs["response"],
        context=outputs["retrieved_context"]
    )

# Run evaluation
eval_results = mlflow.genai.evaluate(
    data=eval_dataset,
    scorers=[groundedness_scorer]
)
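
If some rows may lack retrieved context, a defensive variant can skip the LLM judge and return a Feedback directly. A sketch, assuming the mlflow.entities.Feedback constructor accepts value and rationale, and that treating missing context as ungrounded is the policy you want:

from mlflow.entities import Feedback

@scorer
def safe_groundedness_scorer(inputs: Dict[Any, Any], outputs: Dict[Any, Any]):
    context = outputs.get("retrieved_context") or []
    if not context:
        # Assumed policy: with no retrieved context, the response cannot be grounded
        return Feedback(value="no", rationale="No retrieved context was available to ground the response.")
    return is_grounded(
        request=inputs["query"],
        response=outputs["response"],
        context=context
    )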

Interpreting results

The judge returns a Feedback object containing:

  • value: "yes" if the response is grounded in the context, "no" if it contains hallucinations
  • rationale: A detailed explanation identifying:
    • Which statements are supported by the context
    • Which statements lack support (hallucinations)
    • Specific quotes from the context that support or contradict the claims
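
A minimal sketch of consuming these fields programmatically (the pass/fail mapping is illustrative):

from mlflow.genai.judges import is_grounded

feedback = is_grounded(
    request="What is the capital of France?",
    response="Paris",
    context={"content": "Paris is the capital of France."}
)
passed = feedback.value == "yes"
print(passed, "-", feedback.rationale)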

Next steps