Overview
scorers.Guidelines() and scorers.ExpectationGuidelines() are scorers that wrap the Databricks-provided judges.meets_guidelines() LLM judge. They are designed to let you quickly and easily customize evaluation by defining natural language criteria that are framed as pass/fail conditions. They are ideal for checking compliance with rules, style guides, or information inclusion/exclusion.
Guidelines have the distinct advantage of being easy to explain to business stakeholders ("we are evaluating if the app delivers upon this set of rules") and, as such, can often be directly written by ___domain experts.
You can use the guidelines LLM judge in two ways:
- If your guidelines only consider the app's inputs and outputs and your app's trace only has simple inputs (e.g., only the user query) and outputs (e.g., only app response), use the prebuilt guidelines scorer.
- If your guidelines consider additional data (for example, the retrieved documents or tool calls), or the trace has complex inputs/outputs that contain fields you want to exclude from evaluation (for example, a user_id), create a custom scorer that wraps the judges.meets_guidelines() API. A short sketch contrasting the two approaches follows the note below.
Note
For more details on how the prebuilt guidelines scorer parses your trace, visit the guidelines prebuilt scorer concept page.
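To make the difference concrete, here is a minimal sketch of both call shapes using a single tone guideline. It is illustrative only: the custom scorer assumes an app whose outputs include a message field, as in the second guide below, so adapt the context keys to your own app's schema.

from mlflow.genai.scorers import Guidelines, scorer
from mlflow.genai.judges import meets_guidelines

# Option 1: prebuilt scorer - MLflow parses the trace's inputs and outputs for you
tone_prebuilt = Guidelines(
    name="tone",
    guidelines="The response must maintain a courteous, respectful tone throughout.",
)

# Option 2: custom scorer - you choose exactly which fields the LLM judge sees
@scorer
def tone_custom(inputs, outputs):
    return meets_guidelines(
        name="tone",
        guidelines="The response must maintain a courteous, respectful tone throughout.",
        # Only the keys passed here are shown to the judge (e.g., a user_id can be left out)
        context={"request": str(inputs), "response": outputs["message"]},
    )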
1. Use the prebuilt guidelines scorer
In this guide, you will add custom evaluation criteria to the prebuilt scorer and run an offline evaluation with the resulting scorers. These same scorers can be scheduled to run in production to continuously monitor your application's quality.
Step 1: Create the sample app to evaluate
First, let's create a sample GenAI app that responds to customer support questions. The app has a (fake) knob that controls the system prompt so we can easily compare the guideline judge's outputs between "good" and "bad" responses.
import os
import mlflow
from openai import OpenAI
from typing import List, Dict, Any

mlflow.openai.autolog()

# Connect to a Databricks LLM via OpenAI using the same credentials as MLflow
# Alternatively, you can use your own OpenAI credentials here
mlflow_creds = mlflow.utils.databricks_utils.get_databricks_host_creds()
client = OpenAI(
    api_key=mlflow_creds.token,
    base_url=f"{mlflow_creds.host}/serving-endpoints"
)

# This is a global variable that will be used to toggle the behavior of the customer support agent to see how the guidelines scorers handle rude and verbose responses
BE_RUDE_AND_VERBOSE = False

@mlflow.trace
def customer_support_agent(messages: List[Dict[str, str]]):

    # 1. Prepare messages for the LLM
    system_prompt_postfix = (
        "Be super rude and very verbose in your responses."
        if BE_RUDE_AND_VERBOSE
        else ""
    )
    messages_for_llm = [
        {
            "role": "system",
            "content": f"You are a helpful customer support agent. {system_prompt_postfix}",
        },
        *messages,
    ]

    # 2. Call LLM to generate a response
    return client.chat.completions.create(
        model="databricks-claude-3-7-sonnet",  # This example uses Databricks hosted Claude 3.7 Sonnet. If you provide your own OpenAI credentials, replace with a valid OpenAI model e.g., gpt-4o, etc.
        messages=messages_for_llm,
    )

result = customer_support_agent(
    messages=[
        {"role": "user", "content": "How much does a microwave cost?"},
    ]
)
print(result)
Step 2: Define your evaluation criteria
Normally, you'd work with your business stakeholders to define the guidelines. Here, we define a few sample guidelines. When writing guidelines, you refer to the app's inputs as the request and the app's outputs as the response. Refer to the section on how inputs and outputs are parsed by the prebuilt guidelines scorer to understand what data is passed to the LLM judge.
tone = "The response must maintain a courteous, respectful tone throughout. It must show empathy for customer concerns."
structure = "The response must use clear, concise language and structures responses logically. It must avoids jargon or explains technical terms when used."
banned_topics = "If the request is a question about product pricing, the response must politely decline to answer and refer the user to the pricing page."
relevance = "The response must be relevant to the user's request. Only consider the relevance and nothing else. If the request is not clear, the response must ask for more information."
Note
Guidelines can be as long or as short as you wish. Conceptually, you can think of a guideline as a "mini prompt" that defines the passing criteria. They can optionally include markdown formatting (such as a bulleted list).
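For example, a multi-part guideline could be written with markdown bullets. The criteria below are illustrative placeholders; substitute your own rules.

# An illustrative guideline that uses a markdown bulleted list for multiple criteria
return_handling = """The response must follow all of these rules:
- Acknowledge the customer's concern before explaining next steps.
- Mention the applicable return window at most once.
- End by asking whether the customer needs anything else."""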
Step 3: Create a sample evaluation dataset
Each inputs dictionary will be passed to our app by mlflow.genai.evaluate(...).
eval_dataset = [
    {
        "inputs": {
            "messages": [
                {"role": "user", "content": "How much does a microwave cost?"},
            ]
        },
    },
    {
        "inputs": {
            "messages": [
                {
                    "role": "user",
                    "content": "I'm having trouble with my account. I can't log in.",
                },
                {
                    "role": "assistant",
                    "content": "I'm sorry to hear that you're having trouble with your account. Are you using our website or mobile app?",
                },
                {"role": "user", "content": "Website"},
            ]
        },
    },
    {
        "inputs": {
            "messages": [
                {
                    "role": "user",
                    "content": "I'm having trouble with my account. I can't log in.",
                },
                {
                    "role": "assistant",
                    "content": "I'm sorry to hear that you're having trouble with your account. Are you using our website or mobile app?",
                },
                {"role": "user", "content": "JUST FIX IT FOR ME"},
            ]
        },
    },
]
print(eval_dataset)
Step 4: Evaluate your app using the custom scorers
Finally, we run evaluation twice so you can compare the guideline scorers' judgments on the polite/concise and rude/verbose versions of the app.
from mlflow.genai.scorers import Guidelines
import mlflow
# First, let's evaluate the app's responses against the guidelines when it is not rude and verbose
BE_RUDE_AND_VERBOSE = False
mlflow.genai.evaluate(
    data=eval_dataset,
    predict_fn=customer_support_agent,
    scorers=[
        Guidelines(name="tone", guidelines=tone),
        Guidelines(name="structure", guidelines=structure),
        Guidelines(name="banned_topics", guidelines=banned_topics),
        Guidelines(name="relevance", guidelines=relevance),
    ],
)
# Next, let's evaluate the app's responses against the guidelines when it IS rude and verbose
BE_RUDE_AND_VERBOSE = True
mlflow.genai.evaluate(
    data=eval_dataset,
    predict_fn=customer_support_agent,
    scorers=[
        Guidelines(name="tone", guidelines=tone),
        Guidelines(name="structure", guidelines=structure),
        Guidelines(name="banned_topics", guidelines=banned_topics),
        Guidelines(name="relevance", guidelines=relevance),
    ],
)
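If you prefer to inspect results programmatically instead of (or in addition to) the Evaluation UI, you can capture the return value of mlflow.genai.evaluate(). This is a sketch only: the run_id and metrics attributes are assumptions, so confirm them against the mlflow.genai.evaluate API reference for your MLflow version.

# Sketch only: attribute names below are assumptions - verify against the
# mlflow.genai.evaluate reference for your MLflow version
results = mlflow.genai.evaluate(
    data=eval_dataset,
    predict_fn=customer_support_agent,
    scorers=[Guidelines(name="tone", guidelines=tone)],
)
print(results.run_id)   # the MLflow run that stores the traces and scores
print(results.metrics)  # aggregate scores, e.g., per-guideline pass rates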
2. Create a custom scorer that wraps the guidelines judge
In this guide, you will create custom scorers that wrap the judges.meets_guidelines() API and run an offline evaluation with the resulting scorers. These same scorers can be scheduled to run in production to continuously monitor your application's quality.
Step 1: Create the sample app to evaluate
First, let's create a sample GenAI app that responds to customer support questions. The app has a few (fake) knobs that control the system prompt so we can easily compare the guideline judge's outputs between "good" and "bad" responses.
import os
import mlflow
from openai import OpenAI
from typing import List, Dict

mlflow.openai.autolog()

# Connect to a Databricks LLM via OpenAI using the same credentials as MLflow
# Alternatively, you can use your own OpenAI credentials here
mlflow_creds = mlflow.utils.databricks_utils.get_databricks_host_creds()
client = OpenAI(
    api_key=mlflow_creds.token,
    base_url=f"{mlflow_creds.host}/serving-endpoints"
)

# This global variable toggles whether the customer support agent follows the (fake) user policies so we can see how the follows_policies scorer responds
FOLLOW_POLICIES = False

# This global variable toggles the behavior of the customer support agent so we can see how the guidelines scorers handle rude and verbose responses
BE_RUDE_AND_VERBOSE = False

@mlflow.trace
def customer_support_agent(user_messages: List[Dict[str, str]], user_id: int):

    # 1. Fake policies to follow.
    @mlflow.trace
    def get_policies_for_user(user_id: int):
        if user_id == 1:
            return [
                "All returns must be processed within 30 days of purchase, with a valid receipt.",
            ]
        else:
            return [
                "All returns must be processed within 90 days of purchase, with a valid receipt.",
            ]

    policies_to_follow = get_policies_for_user(user_id)

    # 2. Prepare messages for the LLM
    # We will use these toggles later to see how the scorers handle rude, verbose, and non-compliant responses
    system_prompt_postfix = (
        f"Follow the following policies: {policies_to_follow}. Do not refer to the specific policies in your response.\n"
        if FOLLOW_POLICIES
        else ""
    )
    system_prompt_postfix = (
        f"{system_prompt_postfix}Be super rude and very verbose in your responses.\n"
        if BE_RUDE_AND_VERBOSE
        else system_prompt_postfix
    )

    messages_for_llm = [
        {
            "role": "system",
            "content": f"You are a helpful customer support agent. {system_prompt_postfix}",
        },
        *user_messages,
    ]

    # 3. Call LLM to generate a response
    output = client.chat.completions.create(
        model="databricks-claude-3-7-sonnet",  # This example uses Databricks hosted Claude 3.7 Sonnet. If you provide your own OpenAI credentials, replace with a valid OpenAI model e.g., gpt-4o, etc.
        messages=messages_for_llm,
    )

    return {
        "message": output.choices[0].message.content,
        "policies_followed": policies_to_follow,
    }

result = customer_support_agent(
    user_messages=[
        {"role": "user", "content": "How much does a microwave cost?"},
    ],
    user_id=1
)
print(result)
Step 2: Define your evaluation criteria and wrap as custom scorers
Normally, you'd work with your business stakeholders to define the guidelines. Here, we define a few sample guidelines and use custom scorers to wire them up to our app's input / output schema.
from mlflow.genai.scorers import scorer
from mlflow.genai.judges import meets_guidelines
import json
from typing import Dict, Any
tone = "The response must maintain a courteous, respectful tone throughout. It must show empathy for customer concerns."
structure = "The response must use clear, concise language and structures responses logically. It must avoids jargon or explains technical terms when used."
banned_topics = "If the request is a question about product pricing, the response must politely decline to answer and refer the user to the pricing page."
relevance = "The response must be relevant to the user's request. Only consider the relevance and nothing else. If the request is not clear, the response must ask for more information."
# Note in this guideline how we refer to `provided_policies` - we will make the meets_guidelines LLM judge aware of this data.
follows_policies_guideline = "If the provided_policies is relevant to the request and response, the response must adhere to the provided_policies."
# Define a custom scorer that wraps the guidelines LLM judge to check if the response follows the policies
@scorer
def follows_policies(inputs: Dict[Any, Any], outputs: Dict[Any, Any]):
    # We directly return the Feedback object from the guidelines LLM judge, but we could have post-processed it before returning it.
    return meets_guidelines(
        name="follows_policies",
        guidelines=follows_policies_guideline,
        context={
            # Here we make meets_guidelines aware of the policies the app was instructed to follow
            "provided_policies": outputs["policies_followed"],
            "response": outputs["message"],
            "request": json.dumps(inputs["user_messages"]),
        },
    )
# Define a custom scorer that wraps the guidelines LLM judge to pass the custom keys from the inputs/outputs to the guidelines LLM judge
@scorer
def check_guidelines(inputs: Dict[Any, Any], outputs: Dict[Any, Any]):
    feedbacks = []

    request = json.dumps(inputs["user_messages"])
    response = outputs["message"]

    feedbacks.append(
        meets_guidelines(
            name="tone",
            guidelines=tone,
            # Note: While we used request and response as keys, we could have used any key as long as our guideline referred to that key by name (e.g., if we had used output instead of response, we would have changed our guideline to be "The output must be polite")
            context={"response": response},
        )
    )
    feedbacks.append(
        meets_guidelines(
            name="structure",
            guidelines=structure,
            context={"response": response},
        )
    )
    feedbacks.append(
        meets_guidelines(
            name="banned_topics",
            guidelines=banned_topics,
            context={"request": request, "response": response},
        )
    )
    feedbacks.append(
        meets_guidelines(
            name="relevance",
            guidelines=relevance,
            context={"request": request, "response": response},
        )
    )

    # A scorer can return a list of Feedback objects OR a single Feedback object.
    return feedbacks
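As the comment in follows_policies notes, you can also post-process the judge's output instead of returning the Feedback object directly. The sketch below converts the judgment into a boolean; it assumes the Feedback returned by meets_guidelines() exposes value and rationale fields and that the value stringifies to "yes" or "no", so verify against the Feedback entity reference before relying on it.

from typing import Any, Dict
import json

from mlflow.genai.judges import meets_guidelines
from mlflow.genai.scorers import scorer

@scorer
def follows_policies_as_bool(inputs: Dict[Any, Any], outputs: Dict[Any, Any]):
    # Reuses the follows_policies_guideline string defined above
    feedback = meets_guidelines(
        name="follows_policies_as_bool",
        guidelines=follows_policies_guideline,
        context={
            "provided_policies": outputs["policies_followed"],
            "response": outputs["message"],
            "request": json.dumps(inputs["user_messages"]),
        },
    )
    # Assumption: feedback.value is a pass/fail rating that stringifies to "yes"/"no",
    # and feedback.rationale explains the judge's reasoning
    return str(feedback.value).lower() == "yes"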
Note
Guidelines can be as long or as short as you wish. Conceptually, you can think of a guideline as a "mini prompt" that defines the passing criteria. They can optionally include markdown formatting (such as a bulleted list).
Step 3: Create a sample evaluation dataset
Each inputs dictionary will be passed to our app by mlflow.genai.evaluate(...).
eval_dataset = [
    {
        "inputs": {
            # Note that these keys match the **kwargs of our application.
            "user_messages": [
                {"role": "user", "content": "How much does a microwave cost?"},
            ],
            "user_id": 3,
        },
    },
    {
        "inputs": {
            "user_messages": [
                {
                    "role": "user",
                    "content": "Can I return the microwave I bought 2 months ago?",
                },
            ],
            "user_id": 1,  # the bot should say no if the policies are followed for this user
        },
    },
    {
        "inputs": {
            "user_messages": [
                {
                    "role": "user",
                    "content": "Can I return the microwave I bought 2 months ago?",
                },
            ],
            "user_id": 2,  # the bot should say yes if the policies are followed for this user
        },
    },
    {
        "inputs": {
            "user_messages": [
                {
                    "role": "user",
                    "content": "I'm having trouble with my account. I can't log in.",
                },
                {
                    "role": "assistant",
                    "content": "I'm sorry to hear that you're having trouble with your account. Are you using our website or mobile app?",
                },
                {"role": "user", "content": "Website"},
            ],
            "user_id": 3,
        },
    },
    {
        "inputs": {
            "user_messages": [
                {
                    "role": "user",
                    "content": "I'm having trouble with my account. I can't log in.",
                },
                {
                    "role": "assistant",
                    "content": "I'm sorry to hear that you're having trouble with your account. Are you using our website or mobile app?",
                },
                {"role": "user", "content": "JUST FIX IT FOR ME"},
            ],
            "user_id": 1,
        },
    },
]
print(eval_dataset)
Step 4: Evaluate your app using the guidelines
Finally, we run evaluation twice so you can compare the scorers' judgments between the version of the app that is polite and follows the policies and the version that is rude and ignores them.
import mlflow
# Now, let's evaluate the app's responses against the guidelines when it is NOT rude and verbose and DOES follow policies
BE_RUDE_AND_VERBOSE = False
FOLLOW_POLICIES = True
mlflow.genai.evaluate(
    data=eval_dataset,
    predict_fn=customer_support_agent,
    scorers=[follows_policies, check_guidelines],
)
# Now, let's evaluate the app's responses against the guidelines when it IS rude and verbose and does NOT follow policies
BE_RUDE_AND_VERBOSE = True
FOLLOW_POLICIES = False
mlflow.genai.evaluate(
    data=eval_dataset,
    predict_fn=customer_support_agent,
    scorers=[follows_policies, check_guidelines],
)
Next Steps
- Create prompt-based scorers - Build more complex judges with custom prompts and multiple output choices
- Run evaluations with your scorers - Use your custom guidelines scorers in comprehensive evaluations
- Guidelines concept reference - Understand how the guidelines judge works under the hood