Vibes are not a Metric: A Guide to LLM Evals in Python

Written by Karan Gathani
2020-10-30
A man with glasses sits at a desk writing on a paper while a robot stands beside him, looking down at his work.

LLMs (Large Language Models) have transformed how we write code—but “it looks right” isn’t a testing strategy. When your fraud detection model confidently approves a $50K suspicious claim, you need more than vibes.

A tale of 2 eras

The Problem: Rules vs. Prompts

Consider insurdetect, a fraud detection app. The old way meant writing explicit rules:

def evaluate_claim(claim, today_date):
    amount, policy_start, claim_date = parse_claim(claim)

    severity, threshold = (("high", 50_000)
                           if amount >= 50_000
                           else (("medium", 25_000) if amount > 25_000 else (None, None)))
    if severity:
        op = ">=" if severity == "high" else ">"
        flags.append({
            "reason": f"Claim amount ${amount:,.2f} {op} ${threshold:,}",
            "severity": severity,
        })

    if policy_start and claim_date and 0 <= (delta := (claim_date - policy_start).days) <= 30:
        flags.append({
            "reason": f"Claim date {claim_date.isoformat()} is within {delta} day(s) after policy start {policy_start.isoformat()}",
            "severity": "medium",
        })

Now? A prompt does the heavy lifting:

You are a senior insurance fraud investigator. Follow these guardrails:
FRAUD RISK RULES:
1. Claims over $25,000 are medium risk
2. Claims $50,000+ are high risk
3. Claims filed within 30 days of policy start are medium risk

The catch: Models change frequently — sometimes improving, sometimes regressing. How can you be sure it still works?

Spoiler Alert You need Evals!

Integrating LLM with a Shiny app

Let’s first build a fraud detection app using Shiny for Python and chatlas. This app is an AI-powered fraud detection app that analyzes insurance claims for risk indicators and automatically recommends approval or manual review. It supports both single-claim analysis and batch processing via CSV upload with results sorted by risk level.

View Shiny App Code
app.py
import pandas as pd
from shiny_app.chat_utils import create_chat
from great_tables import GT
from shiny import reactive
from shiny.express import input, render, ui


SAMPLE_INPUT = """Claim C-2001 for $28,000 filed on Feb 1, 2025 (policy P-001 started Dec 15, 2024).
Auto claim: Rear-end collision; police report filed; staged accident suspected."""


ui.page_opts(title="insurdetect, a fraud detection app", fillable=True)

with ui.layout_sidebar():
    with ui.sidebar(width=400):
        ui.h4("Analyze Claims")

        ui.input_radio_buttons(
            "input_mode",
            "Input Method:",
            {"single": "Single Claim", "batch": "Batch Upload (CSV)"},
            selected="single"
        )

        @render.ui
        def input_section():
            if input.input_mode() == "single":
                return ui.TagList(
                    ui.input_text_area(
                        "claims_input",
                        "Claim Description:",
                        value=SAMPLE_INPUT,
                        rows=8,
                        width="100%",
                    ),
                    ui.input_action_button("analyze", "Analyze Claim", class_="btn-primary w-100")
                )
            else:
                return ui.TagList(
                    ui.input_file(
                        "claims_file",
                        "Upload CSV file:",
                        accept=[".csv"],
                        placeholder="Select a CSV file with 'input' column"
                    ),
                    ui.input_action_button("analyze_batch", "Analyze All Claims", class_="btn-primary w-100 mt-2")
                )

    with ui.card():
        ui.card_header("Fraud Analysis Results")

        @render.ui
        @reactive.event(input.analyze)
        async def analysis_result():
            claim_text = input.claims_input().strip()
            if not claim_text:
                return ui.p(
                    "Please enter a claim description.", class_="text-warning p-3"
                )

            chat = create_chat()

            response = await chat.chat_async(
                f"Analyze this insurance claim:\n{claim_text}", stream=False
            )
            result_text = await response.get_content()

            upper_text = result_text.upper()
            risk_level, severity_class = next(
                (
                    (level, cls)
                    for level, cls in [
                        ("HIGH", "bg-danger"),
                        ("MEDIUM", "bg-warning text-dark"),
                        ("LOW", "bg-info"),
                        ("NONE", "bg-success"),
                    ]
                    if level in upper_text
                ),
                ("UNKNOWN", "bg-secondary"),
            )

            payment_decision, payment_class = next(
                (
                    (dec, cls)
                    for dec, cls in [
                        ("PENDING", "bg-warning text-dark"),
                        ("AUTO-APPROVE", "bg-success"),
                    ]
                    if dec in upper_text
                ),
                ("UNKNOWN", "bg-secondary"),
            )

            df = pd.DataFrame(
                {
                    "Field": ["Risk Level", "Decision", "Analysis"],
                    "Value": [
                        f'<span class="badge {severity_class}">{risk_level}</span>',
                        f'<span class="badge {payment_class}">{payment_decision}</span>',
                        result_text,
                    ],
                }
            )

            gt_table = (
                GT(df)
                .fmt_markdown(columns="Value")
                .tab_options(
                    table_font_size="14px",
                    heading_background_color="#f8f9fa",
                    column_labels_background_color="#e9ecef",
                )
            )

            return ui.HTML(gt_table.as_raw_html())

        @render.ui
        @reactive.event(input.analyze_batch)
        async def batch_analysis_result():
            file_info = input.claims_file()
            if not file_info:
                return ui.p("Please upload a CSV file.", class_="text-warning p-3")

            try:
                # Read the CSV file
                csv_content = file_info[0]["datapath"]
                df_claims = pd.read_csv(csv_content)

                if "input" not in df_claims.columns:
                    return ui.p(
                        "CSV file must contain an 'input' column with claim descriptions.",
                        class_="text-danger p-3"
                    )

                results = []

                for i, (_, row) in enumerate(df_claims.iterrows()):
                    claim_text = str(row["input"]).strip()
                    if not claim_text or claim_text == "nan":
                        continue

                    chat = create_chat()
                    response = await chat.chat_async(
                        f"Analyze this insurance claim:\n{claim_text}", stream=False
                    )
                    result_text = await response.get_content()

                    upper_text = result_text.upper()

                    risk_level = next(
                        (level for level in ["HIGH", "MEDIUM", "LOW", "NONE"] if level in upper_text),
                        "UNKNOWN"
                    )

                    payment_decision = next(
                        (dec for dec in ["PENDING", "AUTO-APPROVE"] if dec in upper_text),
                        "UNKNOWN"
                    )

                    claim_id = f"Claim {i + 1}"
                    if "Claim C-" in claim_text:
                        claim_id = claim_text.split()[0] + " " + claim_text.split()[1]

                    results.append({
                        "Claim ID": claim_id,
                        "Risk Level": risk_level,
                        "Decision": payment_decision,
                        "Claim Description": claim_text[:100] + "..." if len(claim_text) > 100 else claim_text,
                        "Full Analysis": result_text
                    })

                risk_order = {"HIGH": 0, "MEDIUM": 1, "LOW": 2, "NONE": 3, "UNKNOWN": 4}
                results_df = pd.DataFrame(results)
                results_df["_sort_order"] = results_df["Risk Level"].map(risk_order)
                results_df = results_df.sort_values("_sort_order").drop("_sort_order", axis=1)

                display_df = results_df.copy()

                # Apply badges to Risk Level
                display_df["Risk Level"] = display_df["Risk Level"].apply(
                    lambda x: {
                        "HIGH": f'<span class="badge bg-danger">{x}</span>',
                        "MEDIUM": f'<span class="badge bg-warning text-dark">{x}</span>',
                        "LOW": f'<span class="badge bg-info">{x}</span>',
                        "NONE": f'<span class="badge bg-success">{x}</span>',
                        "UNKNOWN": f'<span class="badge bg-secondary">{x}</span>',
                    }.get(x, x)
                )

                display_df["Decision"] = display_df["Decision"].apply(
                    lambda x: {
                        "PENDING": f'<span class="badge bg-warning text-dark">{x}</span>',
                        "AUTO-APPROVE": f'<span class="badge bg-success">{x}</span>',
                        "UNKNOWN": f'<span class="badge bg-secondary">{x}</span>',
                    }.get(x, x)
                )

                risk_counts = results_df["Risk Level"].value_counts().to_dict()
                summary_html = f"""
                <div class="alert alert-info mb-3">
                    <h5>Analysis Summary</h5>
                    <p><strong>Total Claims Analyzed:</strong> {len(results_df)}</p>
                    <p><strong>Risk Distribution:</strong></p>
                    <ul>
                        <li>HIGH: {risk_counts.get('HIGH', 0)}</li>
                        <li>MEDIUM: {risk_counts.get('MEDIUM', 0)}</li>
                        <li>LOW: {risk_counts.get('LOW', 0)}</li>
                        <li>NONE: {risk_counts.get('NONE', 0)}</li>
                    </ul>
                </div>
                """

                # Create the table
                table_df = display_df[["Claim ID", "Risk Level", "Decision", "Claim Description"]].copy()

                gt_table = (
                    GT(table_df)
                    .fmt_markdown(columns=["Risk Level", "Decision"])
                    .tab_options(
                        table_font_size="13px",
                        heading_background_color="#f8f9fa",
                        column_labels_background_color="#e9ecef",
                        table_width="100%"
                    )
                )

                return ui.HTML(summary_html + gt_table.as_raw_html())

            except Exception as e:
                return ui.p(
                    f"Error processing file: {str(e)}",
                    class_="text-danger p-3"
                )
View Chat Utility Code
chat_utils.py
from typing import Optional
from datetime import datetime, date

from chatlas import ChatOpenAI


DEFAULT_SYSTEM_PROMPT = """You are a senior insurance fraud investigator. Review each insurance claim and assess the fraud risk. Be concise and natural.

IMPORTANT: This system is designed ONLY for insurance claim fraud analysis. If the input:
- Is not an insurance claim (e.g., general questions, unrelated topics)
- Is incomplete or missing key details (claim amount, dates, description)
- Is a test message or greeting

Respond politely: "I'm sorry, but I can only analyze insurance claims. Please provide a claim with: Claim ID, claim amount, policy start date, claim date, and claim description. For example: 'Claim C-2001 for $28,000 filed on Feb 1, 2025 (policy started Dec 15, 2024). Auto claim: Rear-end collision.'"

FRAUD RISK RULES:
1. Claims over $25,000 are medium risk
2. Claims $50,000+ are high risk
3. Claims filed within 30 days of policy start are medium risk
4. Suspicious keywords ("staged", "duplicate", "exaggerated", "repeat", "police report pending") = high risk
5. Future-dated claims = high risk
6. Vague language ("repair estimate high", "few witnesses", "no witnesses") = high risk
7. If multiple medium flags exist (2+), escalate to high risk

INPUT: You'll receive a claim with these details:
- Claim ID, policy start date, claim date, amount, type, and description

OUTPUT: Provide your assessment in natural language:
- Briefly describe any red flags you found (or say "No red flags")
- State the overall risk level: LOW, MEDIUM, HIGH, or NONE
- State the payment decision: AUTO-APPROVE or PENDING
- Keep it to 2-3 sentences max

EXAMPLE INPUT:
Claim C-2001 for $28,000 filed on Feb 1, 2025 (policy started Dec 15, 2024).
Auto claim: Rear-end collision; police report filed; staged accident suspected.

EXAMPLE OUTPUT:
Red flags: High claim amount ($28K) and suspicious term "staged" in description.
Risk: HIGH.
Decision: PENDING review."""


def check_if_future_date(claim_date: str, today_date: str | None = None) -> bool:
    """Check if claim_date is in the future. If today_date is not provided,
    compute it server-side to avoid model-supplied wrong values.

    This tool is safe to register with the chat client and call from the model
    if needed.
    """
    try:
        claim_dt = datetime.fromisoformat(claim_date).date()
    except Exception:
        claim_dt = datetime.strptime(claim_date[:10], "%Y-%m-%d").date()

    if today_date:
        try:
            today_dt = datetime.fromisoformat(today_date).date()
        except Exception:
            today_dt = date.today()
    else:
        today_dt = date.today()

    return claim_dt > today_dt


def create_chat(
    system_prompt: Optional[str] = None,
    model: str = "gpt-5-nano-2025-08-07",
) -> ChatOpenAI:
    """Create and return a configured chat client.

    Args:
        system_prompt: If omitted, uses DEFAULT_SYSTEM_PROMPT.
        model: Model identifier string.

    Returns:
        A configured `ChatOpenAI` instance with the system prompt set
    """
    if system_prompt is None:
        system_prompt = DEFAULT_SYSTEM_PROMPT

    chat = ChatOpenAI(
        model=model,
        system_prompt=system_prompt,
    )

    chat.register_tool(check_if_future_date)

    return chat

We keep the LLM code in its own module chat_utils.py separate from the Shiny app. This isolates the UI, business logic, and LLM logic, making each component clearer and more focused, and it also simplifies evaluation and testing of the LLM functionality.

The app works great—but how do we know the LLM is actually following our rules? That’s where Evals come in.

The Core Idea: Trust Through Testing

Instead of guessing if your LLM is working, use Evals with Inspect AI. These are systematic tests that act as quality control checks to prove your LLM integration is reliable and to monitor it over time.

Validating System Reliability

flowchart TD
    %% Styling
    classDef pass fill:#e8f5e9,stroke:#2e7d32,stroke-width:2px;
    classDef fail fill:#ffebee,stroke:#c62828,stroke-width:2px;

    %% Nodes
    App["<b>LLM App</b><br/>(System under test)"]
    Eval["<b>Run Evals</b><br/>(Compare Outputs vs. Rules)"]
    Decision{"<b>Pass?</b>"}
    Deploy["<b>Deploy</b>"]:::pass
    Fix["<b>Refine</b>"]:::fail

    %% Flow
    App --> Eval
    Eval --> Decision
    Decision -->|Yes| Deploy
    Decision -->|No| Fix

Benchmarking Model Candidates

flowchart TD
    %% Styling
    classDef baseline fill:#f9f9f9,stroke:#333,stroke-width:2px;
    classDef candidate fill:#e1f5fe,stroke:#0277bd,stroke-width:2px;
    classDef decision fill:#fff9c4,stroke:#fbc02d,stroke-width:2px,stroke-dasharray: 5 5;
    classDef pass fill:#e8f5e9,stroke:#2e7d32,stroke-width:2px;
    classDef fail fill:#ffebee,stroke:#c62828,stroke-width:2px;

    %% Nodes
    Base("<b>The Baseline</b><br/>GPT-5.1<br/>98% Accuracy"):::baseline
    Cand("<b>The Goal</b><br/>GPT-5 nano<br/>(Cheaper)"):::candidate
    Rule{"<b>The Acceptance Rule</b><br/>Must be ≥ 95%<br/>accurate"}:::decision
    Pass("<b>PASS</b><br/>Result: 96%<br/>(Safe)"):::pass
    Fail("<b>FAIL</b><br/>Result: 90%<br/>(Not safe)"):::fail

    %% Connections
    Base -->|Sets Standard| Rule
    Cand -->|Run Eval| Rule
    Rule -->|If it scores 96%| Pass
    Rule -->|If it scores 90%| Fail

Step 1: Isolate Your LLM Logic

First, extract the LLM interaction into a standalone module that can be imported by both your app and your Evals. See the chat utils code example above.

Note

Inspect AI can work with any LLM frameworks, but in this example, we have used chatlas.

Step 2: Create a Test Dataset

Build a CSV with two columns: input and target.

Understanding the columns:

  • input: The claim description in natural language that your LLM will analyze (just like what a user would type into your app)
  • target: The expected correct answer describing the risk assessment, red flags, and payment decision

Example:

Column Content
input Claim C-2001 for $28,000 filed on Feb 1, 2025 (policy P-001 started Dec 15, 2024). Auto claim: Rear-end collision; police report filed; staged accident suspected.
target The claim should be flagged as HIGH risk due to the high amount and suspicious ‘staged’ language. Payment decision: PENDING. This requires manual review due to multiple red flags.

Creating your own dataset:

  1. Write realistic claim scenarios in the input column (include claim ID, amount, dates, and description)
  2. Write the expected fraud analysis in the target column (describe red flags, risk level, and decision)
  3. Use natural language for both columns.
  4. Start with 5-10 test cases covering different risk scenarios (low, medium, high)

Learn More About Dataset Collection

For best practices on building evaluation datasets, see the chatlas documentation on collecting datasets.

View Full Dataset
claims_list.csv
input,target
"Claim C-2001 for $28,000 filed on Feb 1, 2025 (policy P-001 started Dec 15, 2024). Auto claim: Rear-end collision; police report filed; staged accident suspected.","The claim should be flagged as HIGH risk due to the high amount and suspicious 'staged' language. Payment decision: PENDING. This requires manual review due to multiple red flags."
"Claim C-2002 for $12,500 filed on Jan 15, 2025 (policy P-002 started Jan 10, 2025). Auto claim: Minor fender bender with comprehensive coverage.","This is a straightforward claim with no fraud indicators. Risk level is LOW or NONE. Payment: AUTO-APPROVE without further review."
"Claim C-2003 for $75,000 filed on Feb 5, 2025 (policy P-003 started Feb 3, 2025). Property damage: House fire with limited documentation and few witnesses reported.","High risk claim: exceeds $50K threshold, filed within 30 days of policy start, and contains vague language ('few witnesses'). Risk: HIGH. Decision: PENDING review."

Step 3: Define Your Eval Task

Connect your dataset, LLM, and scoring method:

@task
def insurance_fraud_detection():
    return Task(
        dataset=csv_dataset("claims_list.csv"),
        solver=chat.to_solver(),
        scorer=model_graded_qa(partial_credit=True),
        name="insurance_fraud_detection_february_claims",
        model="openai/gpt-5-nano-2025-08-07",
    )

The scorer compares model output to target and returns a numeric grade. Using model_graded_qa(partial_credit=True) lets a second LLM evaluate responses against natural-language descriptions rather than exact patterns, rewarding correct reasoning even when details are missing.

Write your target column as guidance: describe the red flags, risk levels, and payment decisions you expect. The grader evaluates against this intent, not exact wording.

View Full Eval Script
detect_fraud_eval.py
from inspect_ai import Task, task
from inspect_ai.dataset import csv_dataset
from inspect_ai.scorer import model_graded_qa
from shiny_app.chat_utils import create_chat

chat = create_chat()


@task
def insurance_fraud_detection():
    """Evaluate insurance claims for fraud detection."""
    return Task(
        dataset=csv_dataset("claims_list.csv"),
        solver=chat.to_solver(),
        scorer=model_graded_qa(partial_credit=True),
        name="insurance_fraud_detection_february_claims",
        model="openai/gpt-5-nano-2025-08-07",
    )

Key components:

  • dataset: Your CSV with input/target pairs
  • solver: The model generating the response to the input prompt
  • scorer: Use another LLM to grade responses (allows partial credit)

Learn More About Tasks

For a deeper dive into creating eval tasks, see the chatlas documentation on creating tasks.

Step 4: Run evals and view results

With our eval script saved to a file, we can use the Inspect CLI to run the eval and view the results:

inspect eval detect_fraud_eval.py
inspect view bundle --output-dir eval_report

Eval HTML report

This report can be deployed to Posit Connect or any static hosting platform, giving your team visibility into model performance over time.

Key takeaways

  • Isolate LLM logic so it can be tested and updated independently.
  • Start with a small, realistic dataset of input + target pairs that cover expected behaviors and edge cases.
  • Define clear eval tasks and scoring rules, and automate runs so regressions are detected early.
  • Publish reports and integrate with CI so evals act as quality gates for model changes.

Next-level considerations: observability and monitoring are crucial in production – especially for AI applications where insight into how users are actually using the app can help you build a better product. Stay tuned for future posts on this topic, but in the meantime, you can learn more about these topics on the chatlas website.