Why Your RAG Metrics Are Lying to You

Measuring What Actually Matters

Feb 10, 2026

Introduction

If you're building with RAG, you've likely done the responsible thing: assembled a golden dataset, run your retrieval evaluation, and checked the numbers before shipping. And the numbers probably looked fine — maybe even great. MRR near 1.0, solid recall, respectable nDCG. By every standard measure, your retriever is doing its job.

So why are your users still getting confidently wrong answers?

The issue isn't that you skipped evaluation — it's that the metrics you used were designed for a different world. nDCG, MRR, and Recall all emerged from decades of web search research, built on a foundational assumption: the end consumer is a human who scans results, picks what looks right, and ignores the rest. That assumption breaks completely when the consumer is an LLM that reads everything you give it and tries to synthesize it all into a single answer.

This creates a dangerous blind spot. Documents that a human would glance at and dismiss — a training log about squats when you asked about deadlifts, a YC talk about pivoting when you asked about B2B sales — become active liabilities in an LLM's context window. They're not just unhelpful; they're distractors that inject plausible-sounding but wrong information into the model's reasoning. And traditional metrics score them the same as any other irrelevant result: zero. Neutral. Harmless.

They're anything but harmless.

This post lays out the problem in detail, then introduces UDCG (Utility and Distraction-aware Cumulative Gain) — a metric built specifically for evaluating retrieval in RAG systems. The core idea is straightforward: distractors don't have zero utility, they have negative utility. By extending DCG to allow negative scores, we can finally quantify the hidden harm that traditional metrics miss, compare ranking strategies on what actually matters for LLM performance, and make informed decisions about how many documents to retrieve.

The Problem: Perfect Metrics, Broken RAG

If you've built a RAG system, you've probably evaluated it properly before shipping. You created a golden dataset with labeled queries and relevant documents, ran your retrieval evaluation, and got results like this:

MRR:       1.0000  ✓ Perfect!

Recall@5: 0.73    ✓ Finding 73% of relevant docs

nDCG@5:   0.77    ✓ Good ranking quality

These numbers tell a compelling story: your retriever consistently places relevant documents at the top, finds most of what's needed, and ranks results well. Your evaluation passed. You ship to production.

But then something puzzling happens. Despite passing your retrieval evaluation, users start complaining:

“The AI keeps giving confident but wrong answers.”

“It mixes up information from different topics.”

“Sometimes it gives me details I never asked for.”

Here's the uncomfortable truth: your evaluation was measuring the wrong thing.

Traditional retrieval metrics (MRR, nDCG, Recall) answer the question: “Did we find the relevant documents?”

But for RAG, the real question is: “Will this context help or hurt the LLM's answer?”

These are not the same question. A retrieval result can find all the relevant documents (high recall) while simultaneously including documents that will confuse the LLM (distractors). Traditional metrics don't penalize distractors — they treat them the same as any other non-relevant document.

The gap isn't in whether you evaluate — it's in WHAT you evaluate.

The Hidden Danger: Distractors

Let's make this concrete with an example. Imagine you're building a personal knowledge assistant, and a user asks:

Query: “What was my 500 lb deadlift PR?”

Your retriever returns 5 documents:

Retrieved Results

[1] ✓ strength_training_log.md: "2025-05-06: Deadlift 500 lbs (RPE 10) - GOAL HIT!"

[2] ⚠ squat_training_log.md (DISTRACTOR): "2025-05-02: Squat 405 lbs (RPE 9.5) - New PR!"

[3] ⚠ deadlift_form_check.md (DISTRACTOR): "Working on form at 315 lbs, need to fix hip hinge"

[4] ⚠ mock_meet_log.md (DISTRACTOR): "Mock meet: Deadlift opened at 455 lbs"

[5] ✗ nutrition_log.md: "Protein intake: 187g average"

From a traditional IR perspective, this looks great — MRR = 1.0, the correct answer is at position 1. The system found what the user needed.

But here's what actually happens when this context reaches the LLM. The model reads ALL 5 documents and sees: “500 lbs” (correct), “405 lbs” (squat), “315 lbs” (form practice), “455 lbs” (mock meet opener). The LLM might now confidently answer with any of the wrong numbers.

These aren't random irrelevant documents — they're worse. They're plausible distractors: documents that look related, contain similar terminology, but provide wrong or confusing information.

Why Traditional Metrics Fail

To understand why traditional metrics fail for RAG, we need to understand what they were designed for and the assumptions they make.

Traditional IR metrics like nDCG, MRR, and MAP were developed in the 1990s–2000s for evaluating web search engines. They were designed to answer one question: “How well does this ranked list serve a human user?”

These metrics are built on three key assumptions:

Assumption 1: Users examine results sequentially and stop early. When you Google something, you don't read all 10 results on the first page. You scan from top to bottom, click on something promising, and stop. This is why MRR focuses on the position of the FIRST relevant result.

Assumption 2: Irrelevant results are simply ignored. If result #4 is off-topic, a human just skips it. It doesn't harm them. This is why nDCG assigns zero utility to irrelevant results — they're neutral, not harmful.

Assumption 3: Value decays predictably by position. A relevant result at position 1 is more valuable than at position 5, because humans are less likely to scroll down. The logarithmic discount in DCG models this “patience decay.”

Why These Assumptions Break for LLMs

Now consider how an LLM agent uses retrieved documents. Every single assumption breaks:

Assumption 1 breaks: LLMs don't stop early. An LLM doesn't “click” on one result and stop. It receives ALL retrieved documents concatenated into its context window. It processes everything you give it.

Assumption 2 breaks: Irrelevant results are NOT ignored. This is the critical failure. When a human sees an off-topic result, they recognize it and skip it. When an LLM sees plausible-but-wrong information in its context, it often incorporates it into its reasoning. LLMs are trained to synthesize information from their context — they're designed to find connections and use all available information.

Assumption 3 partially breaks: Position matters less. While some LLMs show slight primacy/recency biases, they generally attend to all context. A distractor at position 5 can be just as harmful as one at position 2.

The Core Problem: Distractors Have Negative Value

Here's the key insight that traditional metrics miss entirely:

| Document type | Value for humans | Value for LLMs |
| --- | --- | --- |
| Relevant result | Positive (they click and find what they need) | Positive (correct info in context) |
| Off-topic irrelevant | Zero (they ignore it) | ~Zero (clearly off-topic; the LLM can usually ignore it) |
| Plausible distractor | Zero (they ignore it too) | NEGATIVE (gets incorporated, causes errors) |

Traditional metrics treat that last category as zero. But for RAG, a plausible distractor is worse than having no result at all. It actively injects false premises into the LLM's reasoning.

Human user: query "500 lb deadlift PR?" → sees 5 results → scans titles, clicks #1, ignores the rest. Correct!

LLM agent: same query → sees 5 results → reads ALL documents → gets confused by the conflicting numbers → confident wrong answer.

This is why the article “Mutually Assured Distraction” makes such a powerful point:

“You cannot reason your way out of bad context.”

The research shows something counterintuitive: Chain-of-thought prompting actually DEGRADES performance when distractors are present. The more the model “thinks,” the more it incorporates the misleading information. Even worse, in agent loops where the LLM makes multiple decisions: 90% per-step accuracy → 53% accuracy after just 6 steps. The errors compound.
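The compounding claim is just multiplication: a chain of agent decisions is only right if every step is right. A quick check of the figures quoted above:

```python
# Per-step accuracy compounds multiplicatively across an agent loop:
# the whole chain is correct only if every individual step is correct.
per_step_accuracy = 0.90
steps = 6
chain_accuracy = per_step_accuracy ** steps
print(f"{chain_accuracy:.2f}")  # → 0.53
```

So a retriever that poisons even 10% of steps quietly halves end-to-end accuracy within a handful of hops.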

The Evaluation Gap

Most teams already do the right thing — they evaluate with golden datasets before shipping. The problem isn't lack of evaluation, it's that standard retrieval metrics have a blind spot.

What Traditional Metrics MeasureWhat RAG Actually Needs
“Did we find relevant docs?”“Will this context help the LLM?”
Relevant docs = good, others = neutralRelevant = good, distractors = harmful
More recall = betterMore recall might = more distractors
Position matters for clicksPosition matters less (LLM reads all)

We need metrics designed for RAG, not for human searchers.

The Solution: UDCG (Utility and Distraction-aware Cumulative Gain)

If traditional metrics don't capture the harm of distractors, we need a new metric that does. Enter UDCG: Utility and Distraction-aware Cumulative Gain.

The key insight is simple but powerful: distractors don't have zero utility — they have NEGATIVE utility.

Utility Scoring

Traditional DCG assigns utility scores where irrelevant documents get zero. Both “completely off-topic” and “plausible but wrong” get the same score.

UDCG introduces negative utilities:

| Document type | UDCG utility | Why? |
| --- | --- | --- |
| Relevant | +1.0 | Helps the LLM answer correctly |
| Partially relevant | +0.5 | Provides some useful context |
| Irrelevant (off-topic) | 0.0 | LLM can usually ignore it |
| Hard negative | -0.5 | Similar enough to confuse |
| Plausible distractor | -1.0 | Actively causes wrong answers |

Traditional nDCG: Relevant +1.0, Irrelevant 0.0, Distractor 0.0 (same as irrelevant!)

UDCG (distraction-aware): Relevant +1.0, Irrelevant 0.0, Distractor -0.5 (PENALIZED!)

This simple change has profound implications. A retrieval system that returns 3 relevant docs and 2 distractors now scores worse than one that returns 3 relevant docs and 2 off-topic docs.
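That comparison is easy to make concrete. Below is a minimal UDCG helper (the `udcg` name and the specific utility values are illustrative): two rankings with identical relevant documents, differing only in whether the tail holds distractors or harmless off-topic docs. nDCG scores both tails as zero; UDCG separates them.

```python
import math

def udcg(utilities):
    """Cumulative gain with the standard log2 position discount;
    unlike DCG, utilities are allowed to be negative."""
    return sum(u / math.log2(i + 2) for i, u in enumerate(utilities))

# Same 3 relevant docs at the top; only the tail differs.
with_distractors = [1.0, 1.0, 1.0, -0.5, -0.5]  # 2 plausible distractors
with_offtopic    = [1.0, 1.0, 1.0,  0.0,  0.0]  # 2 clearly off-topic docs

print(round(udcg(with_distractors), 3))  # → 1.722
print(round(udcg(with_offtopic), 3))     # → 2.131
```

Traditional nDCG would assign both lists the same score, since both tails carry zero relevance.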

The Math: How UDCG Works

The formula is identical to standard DCG — the only difference is that utilities can be negative:

DCG@k = Σ (utility_i / log₂(i + 1))   for i = 1 to k

Each document's utility is divided by a logarithmic discount based on its position. UDCG uses the exact same formula, but because utilities can now be negative, distractors subtract from the cumulative score. A distractor at rank 1 hurts more than one at rank 5 due to the position discount.

Worked example — 5 documents:

Position 1: Relevant    → utility = +1.0

Position 2: Partial     → utility = +0.5

Position 3: Distractor  → utility = -0.5 ← This hurts!

Position 4: Relevant    → utility = +1.0

Position 5: Irrelevant  → utility = 0.0

UDCG@5 calculation:

= 1.0/log₂(2) + 0.5/log₂(3) + (-0.5)/log₂(4) + 1.0/log₂(5) + 0.0/log₂(6)

= 1.0/1.0 + 0.5/1.585 + (-0.5)/2.0 + 1.0/2.322 + 0.0/2.585

= 1.0 + 0.315 + (-0.25) + 0.431 + 0.0

= 1.496

Without the distractor (position 3 becomes 0.0):

= 1.0 + 0.315 + 0.0 + 0.431 + 0.0 = 1.746

Distractor cost: 1.746 - 1.496 = 0.25 utility points (~14% of total score)
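The worked example fits in a few lines of Python (the `udcg_at_k` name is ours; results match the figures above to rounding):

```python
import math

def udcg_at_k(utilities, k=None):
    """UDCG@k: the standard DCG formula, except utilities may be negative,
    so a distractor subtracts from the cumulative score."""
    k = len(utilities) if k is None else k
    return sum(u / math.log2(i + 2) for i, u in enumerate(utilities[:k]))

ranked = [1.0, 0.5, -0.5, 1.0, 0.0]        # the 5-document example above
print(round(udcg_at_k(ranked), 3))          # → 1.496

no_distractor = [1.0, 0.5, 0.0, 1.0, 0.0]   # position 3 neutralized
print(round(udcg_at_k(no_distractor), 3))   # → 1.746
```

Note the position discount working as intended: moving the same -0.5 distractor from rank 3 to rank 1 would cost 0.5 points instead of 0.25.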

Architecture: The Evaluation Pipeline

Now that we understand the concept, let's see how to build an evaluation system around it. The pipeline has four main steps:

RAG Quality Evaluation Pipeline

Queries + Labels

Step 1: Retrieval

For each test query, run your retrieval system and collect the top-k results with their relevance scores.

Step 2: Utility Scoring

For each retrieved document, assign a utility score: relevant (+1.0), known distractor (-1.0), high-scoring but not relevant (-0.5), top 3 but not relevant (-0.5), otherwise neutral (0.0).

Step 3: Metric Calculation

Calculate both traditional metrics (MRR, nDCG) and UDCG metrics. The gap between nDCG and UDCG reveals how much hidden distractor harm your traditional metrics are missing.

Step 4: Dynamic-K Analysis

Find the optimal number of documents to retrieve. More documents ≠ better for RAG. Calculate: score = cumulative_utility - distractor_harm at each k.
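A minimal sketch of the dynamic-k step, using the net-utility idea from Step 4 (function name and example utilities are illustrative; the full pipeline would presumably also apply the position discount):

```python
def best_k(utilities, max_k=10):
    """Pick the k that maximizes net utility: positive utilities accumulate,
    but every distractor admitted into the context subtracts harm."""
    best, best_score = 1, float("-inf")
    for k in range(1, min(max_k, len(utilities)) + 1):
        gain = sum(u for u in utilities[:k] if u > 0)   # cumulative utility
        harm = -sum(u for u in utilities[:k] if u < 0)  # distractor harm
        if gain - harm > best_score:
            best, best_score = k, gain - harm
    return best

# Utilities from the deadlift example: 1 relevant doc, then 3 distractors.
print(best_k([1.0, -0.5, -0.5, -0.5, 0.0]))  # → 1
```

For that query, retrieving anything beyond the first hit only adds harm, which is exactly the "more results ≠ better" effect.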

Distractor Detection: How It Works

One of the trickiest parts of UDCG evaluation is identifying which documents are distractors. Here's the decision flow:

For each retrieved doc (id, score, rank):

1. In relevant_ids? → utility = +1.0 (type: relevant)
2. Known distractor in the blacklist? → utility = -1.0
3. Score > 70% of the max score? → utility = -0.5
4. Rank ≤ 3? → utility = -0.5
5. Otherwise → utility = 0.0 (neutral)

Why these heuristics?

  1. Known distractors (-1.0): If you've manually labeled documents as distractors from previous error analysis, they get the maximum penalty.
  2. High relative score (-0.5): If a document scores within 70% of the top result but isn't relevant, it's suspicious. The retriever thinks it's good, but it's not — classic hard negative.
  3. Top 3 position (-0.5): Documents in the top 3 positions are almost always included in LLM context. A non-relevant document here is dangerous regardless of its score.
  4. Everything else (0.0): Low-ranked, low-scoring irrelevant documents are neutral. They're easy for the LLM to ignore because they're clearly off-topic.
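The heuristics above translate directly into a scoring function. A sketch, with the post's default thresholds (the function name and signature are ours):

```python
def assign_utility(doc_id, score, rank, relevant_ids, known_distractors,
                   max_score, rel_threshold=0.7):
    """Heuristic utility scoring per the decision flow above.
    Thresholds are defaults and should be tuned per domain."""
    if doc_id in relevant_ids:
        return 1.0                       # relevant: full positive utility
    if doc_id in known_distractors:
        return -1.0                      # labeled distractor: maximum penalty
    if score > rel_threshold * max_score:
        return -0.5                      # hard negative: the retriever is fooled
    if rank <= 3:
        return -0.5                      # non-relevant doc in the top 3: dangerous
    return 0.0                           # low-ranked and off-topic: neutral

# The squat log scores within 70% of the top hit, so it gets flagged.
print(assign_utility("squat_log", score=0.029, rank=2,
                     relevant_ids={"deadlift_log"}, known_distractors=set(),
                     max_score=0.031))  # → -0.5
```

Note the relative threshold (`rel_threshold * max_score`): this is precisely what avoids the absolute-threshold bug described later in the post.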

These thresholds are configurable. What counts as a “distractor” and how severely to penalize it depends on your domain:

| Use case | Distractor tolerance | Recommended settings |
| --- | --- | --- |
| Medical/legal RAG | Very low | Strict thresholds (50%), harsh penalties (-1.0) |
| Customer support | Medium | Default thresholds work well |
| Creative writing | High | Relaxed thresholds (90%), light penalties (-0.2) |
| Code generation | Low | Strict on code-related distractors |

Real Results: Before vs After

Let's see how this works in practice. We evaluated a RAG system with 20 queries and 100 documents.

Before (with a bug in distractor detection)

Our initial implementation had a bug — it wasn't properly detecting distractors because we used an absolute score threshold (0.5) when our ranking profile returned scores around 0.03.

Misleading Results

MRR:          1.0000

Precision@5:  0.3300

Recall@5:     0.7283

nDCG@5:       0.7728 ✓ “Looks great!”

Distractor Rate: 0.00% ← BUG! Should be much higher

UDCG@5:       1.3373 ← Artificially inflated

Assessment: “SAFE” — WRONG!

After (with fixed distractor detection)

After switching to relative score thresholds, the true picture emerged:

Accurate Results

MRR:          1.0000 (unchanged)

Precision@5:  0.3300 (unchanged)

Recall@5:     0.7283 (unchanged)

nDCG@5:       0.2515 ← Now properly penalized!

Distractor Rate: 67.00% ← 2/3 of results are distractors!

UDCG@5:       0.5317 ← Real utility is MUCH lower

Distractor Harm: 1.6750

Optimal k:     2.8 ← Should use k=3, not k=5!

Assessment: “RISKY” — Now we know the truth!

What changed? The traditional metrics (MRR, Precision, Recall) stayed the same. But UDCG revealed that 67% of our top-5 results were distractors — documents that would confuse the LLM.

Example of detected distractors

Query: “What were the key takeaways from YC B2B sales?”

[1] ✓ yc_b2b_sales_workshop_notes.md: ICP definition, outbound strategy, sales process

[2] ✓ yc_b2b_sales_followup.md: follow-up action items from the workshop

[3] ⚠ yc_talk_pivoting_effectively.md (DISTRACTOR): also YC content, also about startups, but about pivoting, NOT B2B sales

[4] ⚠ yc_office_hours_user_acquisition.md (DISTRACTOR): YC content, growth-related, but user acquisition ≠ B2B sales

[5] ⚠ yc_w25_welcome_email.md (DISTRACTOR): YC content, same batch, but an administrative email, not educational

These distractors are insidious because they're semantically similar to the query. They all relate to YC, startups, and business growth. A traditional retriever sees them as “close matches.” But for RAG, they're poison.

Comparing Ranking Profiles

One practical application of UDCG is comparing different ranking strategies. We tested multiple Vespa ranking profiles:

| Profile | MRR | Recall@5 | Dist. rate | Harm | Safety |
| --- | --- | --- | --- | --- | --- |
| rrf-hybrid | 1.00 | 72.8% | 0.0% | 0.00 | SAFE |
| match-only | 1.00 | 79.5% | 0.0% | 0.00 | SAFE |
| gbdt-only | 1.00 | 91.5% | 49.0% | 1.23 | RISKY |
| simple-hybrid | 1.00 | 83.0% | 62.0% | 1.55 | DANGEROUS |

The counterintuitive insight: simple-hybrid has the best recall at 83% — by traditional metrics, it's a great choice. But it also has the highest distractor rate at 62%: for every 5 documents retrieved, about 3 are distractors. For RAG, this makes it the WORST choice, not the best.

Meanwhile, rrf-hybrid has lower recall (72.8%) but zero distractors. For RAG applications, this is far superior.

For traditional search:

High Recall = Good ✓

For RAG:

High Recall + High Distractors = DANGEROUS ✗

Lower Recall + Zero Distractors = SAFE ✓

Applying This at Runtime

UDCG is an offline evaluation metric — you run it on test queries to understand your retrieval quality. But the insights directly inform runtime decisions:

From Evaluation to Production

Offline (evaluation):

- Run UDCG evaluation on test queries
- Find the optimal k = 3 (not 5 or 10!)
- Find the score-drop pattern at 70%
- Collect hard negatives for model improvement

Runtime (production):

- User query arrives
- Retrieve k=3 docs (not more!)
- Filter: keep only docs > 70% of the top score
- Build a clean context for the LLM

| Approach | When to use | How |
| --- | --- | --- |
| Set optimal k | Always | Use evaluation to find the best k, configure the retriever |
| Score threshold | High distractor rate | Filter out docs below X% of the top score |
| Reranking | Critical applications | Add a cross-encoder to filter distractors |

The general pattern: use the optimal k discovered during evaluation, apply a relative score threshold to filter candidates, and cap the final result set. This treats UDCG evaluation as a calibration step — you run it offline to discover the right parameters, then apply those parameters at query time.
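The runtime side of that pattern is small enough to sketch (function name, the (doc_id, score) hit format, and the example scores are illustrative; k and the threshold come from your offline calibration):

```python
def build_context(hits, k=3, rel_threshold=0.7):
    """Runtime filter calibrated offline: cap the context at the optimal k
    and drop candidates scoring below a fraction of the top score."""
    if not hits:
        return []
    top_score = hits[0][1]  # hits = [(doc_id, score), ...], sorted descending
    kept = [(d, s) for d, s in hits if s >= rel_threshold * top_score]
    return kept[:k]

# Scores in the ~0.03 range, as in the evaluation above: the relative
# threshold keeps the top cluster and drops the long tail of near-misses.
hits = [("a", 0.031), ("b", 0.029), ("c", 0.018), ("d", 0.016), ("e", 0.012)]
print(build_context(hits))  # → [('a', 0.031), ('b', 0.029)]
```

Here the score gap below the second hit does the work: even though k=3 would allow a third document, the relative threshold refuses to pad the context with a likely distractor.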

Key Takeaways

1. Traditional metrics lie for RAG

nDCG, MRR, and Recall were designed for human users who can ignore bad results. LLMs read EVERYTHING and get confused.

2. Distractors have negative utility

Plausible-but-wrong documents don't just fail to help — they actively cause confident wrong answers. Score them as negative.

3. Use UDCG for RAG evaluation

Same formula as DCG, but utilities can be negative. This reveals hidden problems that traditional metrics completely miss.

4. More results ≠ better

Use dynamic-k analysis to find the sweet spot. Often k=3 beats k=10 for RAG because you avoid accumulating distractor harm.

5. High recall can be dangerous

A ranking profile with 90% recall but 50% distractor rate is WORSE than one with 70% recall and 0% distractor rate. Choose your ranking strategy with RAG in mind.
