Built-in LLM Judges

Overview

MLflow provides research-backed LLM judges for common quality checks. These judges are Scorers that leverage Large Language Models to assess your application's outputs against quality criteria like safety, relevance, and correctness.

Important

LLM Judges are a type of MLflow Scorer that uses Large Language Models for evaluation. They can be used directly with the Evaluation Harness and with the production monitoring service.

| Judge | Arguments | Requires ground truth | What it evaluates |
|---|---|---|---|
| RelevanceToQuery | inputs, outputs | No | Is the response directly relevant to the user's request? |
| RetrievalRelevance | inputs, outputs | No | Is the retrieved context directly relevant to the user's request? |
| Safety | inputs, outputs | No | Is the content free from harmful, offensive, or toxic material? |
| RetrievalGroundedness | inputs, outputs | No | Is the response grounded in the information provided in the context (e.g., the app is not hallucinating)? |
| Guidelines | inputs, outputs | No | Does the response meet specified natural language criteria? |
| ExpectationsGuidelines | inputs, outputs, expectations | No (but needs guidelines in expectations) | Does the response meet per-example natural language criteria? |
| Correctness | inputs, outputs, expectations | Yes | Is the response correct as compared to the provided ground truth? |
| RetrievalSufficiency | inputs, outputs, expectations | Yes | Does the context provide all necessary information to generate a response that includes the ground truth facts? |
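
To make the configuration differences concrete, the sketch below instantiates a judge that needs no setup (Safety) next to one driven by your own criteria (Guidelines). The judge name and guideline text are illustrative placeholders:

from mlflow.genai.scorers import Guidelines, Safety

# Safety requires no configuration: it checks outputs for harmful or toxic content.
safety_judge = Safety()

# Guidelines is configured with pass/fail criteria written in natural language.
# The name and guideline text below are placeholders; supply your own.
tone_judge = Guidelines(
    name="tone",
    guidelines="The response must be polite and professional in tone.",
)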

Prerequisites for running the examples

  1. Install MLflow and required packages

    pip install --upgrade "mlflow[databricks]>=3.1.0"
    
  2. Create an MLflow experiment by following the Set up your environment quickstart.
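
The code examples below assume your environment is connected to that experiment. A minimal setup sketch, assuming a Databricks workspace and a placeholder experiment path:

import mlflow

# Point MLflow at the Databricks workspace and the experiment created above.
# The experiment path is a placeholder; replace it with your own.
mlflow.set_tracking_uri("databricks")
mlflow.set_experiment("/Shared/llm-judge-examples")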

How to use prebuilt judges

1. Directly via the SDK

You can use judges directly in your evaluation workflow. Below is an example using the RetrievalGroundedness judge:

from mlflow.genai.scorers import RetrievalGroundedness

# Instantiate the built-in judge; it requires no configuration.
groundedness_judge = RetrievalGroundedness()

# Case 1: the context states the fact in the response, so the judge
# should rate the response as grounded.
grounded_feedback = groundedness_judge(
    inputs={"request": "What is the capital/major city of France?"},
    outputs={"response": "Paris", "context": "Paris is the capital/major city of France."},
)

# Case 2: the context never says Paris is the capital, so the judge
# should flag the response as not grounded (a hallucination).
ungrounded_feedback = groundedness_judge(
    inputs={"request": "What is the capital/major city of France?"},
    outputs={"response": "Paris", "context": "Paris is known for its Eiffel Tower."},
)
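
Each call returns an MLflow Feedback object. A minimal way to inspect the verdict and the judge's reasoning, assuming the standard value and rationale fields on the Feedback entity:

# Print the judge's verdict and its natural-language explanation.
print(grounded_feedback.value, grounded_feedback.rationale)
print(ungrounded_feedback.value, ungrounded_feedback.rationale)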

2. Usage with mlflow.genai.evaluate()

You can use judges directly with MLflow's evaluation framework.

import mlflow
from mlflow.genai.scorers import Correctness

# A single-record evaluation dataset; `expected_facts` provides the ground
# truth that the Correctness judge compares the response against.
eval_dataset = [
    {
        "inputs": {"query": "What is the capital of France?"},
        "outputs": {
            "response": "Paris is the magnificent capital city of France, a stunning metropolis known worldwide for its iconic Eiffel Tower, rich cultural heritage, beautiful architecture, world-class museums like the Louvre, and its status as one of Europe's most important political and economic centers. As the capital city, Paris serves as the seat of France's government and is home to numerous important national institutions."
        },
        "expectations": {
            "expected_facts": ["Paris is the capital of France."],
        },
    },
]

# Built-in judges are passed as instances in the `scorers` list.
eval_results = mlflow.genai.evaluate(data=eval_dataset, scorers=[Correctness()])
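
Multiple built-in judges can be combined in a single evaluation pass, with each contributing its own metric to the results. A minimal sketch (this particular combination of judges is illustrative):

from mlflow.genai.scorers import Correctness, RelevanceToQuery, Safety

# Run several judges over the same dataset in one call; each judge
# produces a separate assessment per record.
eval_results = mlflow.genai.evaluate(
    data=eval_dataset,
    scorers=[Correctness(), RelevanceToQuery(), Safety()],
)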

Next Steps