Built-in LLM Judges

Overview

MLflow provides research-backed LLM judges for common quality checks. These judges are Scorers that leverage Large Language Models to assess your application's outputs against quality criteria like safety, relevance, and correctness.

Important

LLM Judges are a type of MLflow Scorer that uses Large Language Models for evaluation. They can be used directly with the Evaluation Harness and with the production monitoring service.

| Judge | Arguments | Requires ground truth | What it evaluates |
|---|---|---|---|
| RelevanceToQuery | inputs, outputs | No | Is the response directly relevant to the user's request? |
| RetrievalRelevance | inputs, outputs | No | Is the retrieved context directly relevant to the user's request? |
| Safety | inputs, outputs | No | Is the content free from harmful, offensive, or toxic material? |
| RetrievalGroundedness | inputs, outputs | No | Is the response grounded in the information provided in the context (e.g., the app is not hallucinating)? |
| Guidelines | inputs, outputs | No | Does the response meet specified natural language criteria? |
| ExpectationsGuidelines | inputs, outputs, expectations | No (but needs guidelines in expectations) | Does the response meet per-example natural language criteria? |
| Correctness | inputs, outputs, expectations | Yes | Is the response correct as compared to the provided ground truth? |
| RetrievalSufficiency | inputs, outputs, expectations | Yes | Does the context provide all necessary information to generate a response that includes the ground truth facts? |
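
To make the configuration differences concrete, the sketch below instantiates a judge that needs no setup (Safety) next to one driven by your own criteria (Guidelines). The judge name and guideline text are illustrative placeholders:

from mlflow.genai.scorers import Guidelines, Safety

# Safety requires no configuration: it checks outputs for harmful or toxic content.
safety_judge = Safety()

# Guidelines is configured with pass/fail criteria written in natural language.
# The name and guideline text below are placeholders; supply your own.
tone_judge = Guidelines(
    name="tone",
    guidelines="The response must be polite and professional in tone.",
)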

Prerequisites for running the examples

  1. Install MLflow and required packages

    pip install --upgrade "mlflow[databricks]>=3.1.0"
    
  2. Create an MLflow experiment by following the Set up your environment quickstart.
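
The code examples below assume your environment is connected to that experiment. A minimal setup sketch, assuming a Databricks workspace and a placeholder experiment path:

import mlflow

# Point MLflow at the Databricks workspace and the experiment created above.
# The experiment path is a placeholder; replace it with your own.
mlflow.set_tracking_uri("databricks")
mlflow.set_experiment("/Shared/llm-judge-examples")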

How to use prebuilt judges

1. Directly via the SDK

You can use judges directly in your evaluation workflow. Below is an example using the RetrievalGroundedness judge:

from mlflow.genai.scorers import RetrievalGroundedness

# Instantiate the built-in judge; it requires no configuration.
groundedness_judge = RetrievalGroundedness()

# Case 1: the context states the fact in the response, so the judge
# should rate the response as grounded.
grounded_feedback = groundedness_judge(
    inputs={"request": "What is the capital/major city of France?"},
    outputs={"response": "Paris", "context": "Paris is the capital/major city of France."},
)

# Case 2: the context never says Paris is the capital, so the judge
# should flag the response as not grounded (a hallucination).
ungrounded_feedback = groundedness_judge(
    inputs={"request": "What is the capital/major city of France?"},
    outputs={"response": "Paris", "context": "Paris is known for its Eiffel Tower."},
)
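
Each call returns an MLflow Feedback object. A minimal way to inspect the verdict and the judge's reasoning, assuming the standard value and rationale fields on the Feedback entity:

# Print the judge's verdict and its natural-language explanation.
print(grounded_feedback.value, grounded_feedback.rationale)
print(ungrounded_feedback.value, ungrounded_feedback.rationale)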

2. Usage with mlflow.genai.evaluate()

You can use judges directly with MLflow's evaluation framework.

import mlflow
from mlflow.genai.scorers import Correctness

# A single-record evaluation dataset; `expected_facts` provides the ground
# truth that the Correctness judge compares the response against.
eval_dataset = [
    {
        "inputs": {"query": "What is the capital of France?"},
        "outputs": {
            "response": "Paris is the magnificent capital city of France, a stunning metropolis known worldwide for its iconic Eiffel Tower, rich cultural heritage, beautiful architecture, world-class museums like the Louvre, and its status as one of Europe's most important political and economic centers. As the capital city, Paris serves as the seat of France's government and is home to numerous important national institutions."
        },
        "expectations": {
            "expected_facts": ["Paris is the capital of France."],
        },
    },
]

# Built-in judges are passed as instances in the `scorers` list.
eval_results = mlflow.genai.evaluate(data=eval_dataset, scorers=[Correctness()])
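
Multiple built-in judges can be combined in a single evaluation pass, with each contributing its own metric to the results. A minimal sketch (this particular combination of judges is illustrative):

from mlflow.genai.scorers import Correctness, RelevanceToQuery, Safety

# Run several judges over the same dataset in one call; each judge
# produces a separate assessment per record.
eval_results = mlflow.genai.evaluate(
    data=eval_dataset,
    scorers=[Correctness(), RelevanceToQuery(), Safety()],
)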

Next Steps