Introduction
If you're building with RAG, you've likely done the responsible thing: assembled a golden dataset, run your retrieval evaluation, and checked the numbers before shipping. And the numbers probably looked fine — maybe even great. MRR near 1.0, solid recall, respectable nDCG. By every standard measure, your retriever is doing its job.
So why are your users still getting confidently wrong answers?
The issue isn't that you skipped evaluation — it's that the metrics you used were designed for a different world. nDCG, MRR, and Recall all emerged from decades of web search research, built on a foundational assumption: the end consumer is a human who scans results, picks what looks right, and ignores the rest. That assumption breaks completely when the consumer is an LLM that reads everything you give it and tries to synthesize it all into a single answer.
This creates a dangerous blind spot. Documents that a human would glance at and dismiss — a training log about squats when you asked about deadlifts, a YC talk about pivoting when you asked about B2B sales — become active liabilities in an LLM's context window. They're not just unhelpful; they're distractors that inject plausible-sounding but wrong information into the model's reasoning. And traditional metrics score them the same as any other irrelevant result: zero. Neutral. Harmless.
They're anything but harmless.
This post lays out the problem in detail, then introduces UDCG (Utility and Distraction-aware Cumulative Gain) — a metric built specifically for evaluating retrieval in RAG systems. The core idea is straightforward: distractors don't have zero utility, they have negative utility. By extending DCG to allow negative scores, we can finally quantify the hidden harm that traditional metrics miss, compare ranking strategies on what actually matters for LLM performance, and make informed decisions about how many documents to retrieve.
The Problem: Perfect Metrics, Broken RAG
If you've built a RAG system, you've probably evaluated it properly before shipping. You created a golden dataset with labeled queries and relevant documents, ran your retrieval evaluation, and got results like this:
MRR: 1.0000 ✓ Perfect!
Recall@5: 0.73 ✓ Finding 73% of relevant docs
nDCG@5: 0.77 ✓ Good ranking quality
These numbers tell a compelling story: your retriever consistently places relevant documents at the top, finds most of what's needed, and ranks results well. Your evaluation passed. You ship to production.
But then something puzzling happens. Despite passing your retrieval evaluation, users start complaining:
“The AI keeps giving confident but wrong answers.”
“It mixes up information from different topics.”
“Sometimes it gives me details I never asked for.”
Here's the uncomfortable truth: your evaluation was measuring the wrong thing.
Traditional retrieval metrics (MRR, nDCG, Recall) answer the question: “Did we find the relevant documents?”
But for RAG, the real question is: “Will this context help or hurt the LLM's answer?”
These are not the same question. A retrieval result can find all the relevant documents (high recall) while simultaneously including documents that will confuse the LLM (distractors). Traditional metrics don't penalize distractors — they treat them the same as any other non-relevant document.
The gap isn't in whether you evaluate — it's in WHAT you evaluate.
The Hidden Danger: Distractors
Let's make this concrete with an example. Imagine you're building a personal knowledge assistant, and a user asks:
Query: “What was my 500 lb deadlift PR?”
Your retriever returns 5 documents:
Retrieved Results
1. strength_training_log.md: “2025-05-06: Deadlift 500 lbs (RPE 10) - GOAL HIT!”
2. squat_training_log.md (DISTRACTOR): “2025-05-02: Squat 405 lbs (RPE 9.5) - New PR!”
3. deadlift_form_check.md (DISTRACTOR): “Working on form at 315 lbs, need to fix hip hinge”
4. mock_meet_log.md (DISTRACTOR): “Mock meet: Deadlift opened at 455 lbs”
5. nutrition_log.md: “Protein intake: 187g average”
From a traditional IR perspective, this looks great — MRR = 1.0, the correct answer is at position 1. The system found what the user needed.
But here's what actually happens when this context reaches the LLM. The model reads ALL 5 documents and sees: “500 lbs” (correct), “405 lbs” (squat), “315 lbs” (form practice), “455 lbs” (mock meet opener). The LLM might now confidently answer with any of the wrong numbers.
These aren't random irrelevant documents — they're worse. They're plausible distractors: documents that look related, contain similar terminology, but provide wrong or confusing information.
Why Traditional Metrics Fail
To understand why traditional metrics fail for RAG, we need to understand what they were designed for and the assumptions they make.
The History: Metrics Built for Human Search
Traditional IR metrics like nDCG, MRR, and MAP were developed in the 1990s–2000s for evaluating web search engines. They were designed to answer one question: “How well does this ranked list serve a human user?”
These metrics are built on three key assumptions:
Assumption 1: Users examine results sequentially and stop early. When you Google something, you don't read all 10 results on the first page. You scan from top to bottom, click on something promising, and stop. This is why MRR focuses on the position of the FIRST relevant result.
Assumption 2: Irrelevant results are simply ignored. If result #4 is off-topic, a human just skips it. It doesn't harm them. This is why nDCG assigns zero utility to irrelevant results — they're neutral, not harmful.
Assumption 3: Value decays predictably by position. A relevant result at position 1 is more valuable than at position 5, because humans are less likely to scroll down. The logarithmic discount in DCG models this “patience decay.”
Why These Assumptions Break for LLMs
Now consider how an LLM agent uses retrieved documents. Every single assumption breaks:
Assumption 1 breaks: LLMs don't stop early. An LLM doesn't “click” on one result and stop. It receives ALL retrieved documents concatenated into its context window. It processes everything you give it.
Assumption 2 breaks: Irrelevant results are NOT ignored. This is the critical failure. When a human sees an off-topic result, they recognize it and skip it. When an LLM sees plausible-but-wrong information in its context, it often incorporates it into its reasoning. LLMs are trained to synthesize information from their context — they're designed to find connections and use all available information.
Assumption 3 partially breaks: Position matters less. While some LLMs show slight primacy/recency biases, they generally attend to all context. A distractor at position 5 can be just as harmful as one at position 2.
The Core Problem: Distractors Have Negative Value
Here's the key insight that traditional metrics miss entirely:
| Document Type | Value for Humans | Value for LLMs |
|---|---|---|
| Relevant result | Positive (they click and find what they need) | Positive (correct info in context) |
| Off-topic irrelevant | Zero (they ignore it) | ~Zero (clearly off-topic, LLM might ignore) |
| Plausible distractor | Zero (they ignore it too) | NEGATIVE (gets incorporated, causes errors) |
Traditional metrics treat that last category as zero. But for RAG, a plausible distractor is worse than having no result at all. It actively injects false premises into the LLM's reasoning.
This is why the article “Mutually Assured Distraction” makes such a powerful point:
“You cannot reason your way out of bad context.”
The research shows something counterintuitive: chain-of-thought prompting actually DEGRADES performance when distractors are present. The more the model “thinks,” the more it incorporates the misleading information. Even worse, errors compound in agent loops where the LLM makes multiple decisions: 90% per-step accuracy drops to just 53% after 6 steps (0.90⁶ ≈ 0.53).
The Evaluation Gap
Most teams already do the right thing — they evaluate with golden datasets before shipping. The problem isn't lack of evaluation, it's that standard retrieval metrics have a blind spot.
| What Traditional Metrics Measure | What RAG Actually Needs |
|---|---|
| “Did we find relevant docs?” | “Will this context help the LLM?” |
| Relevant docs = good, others = neutral | Relevant = good, distractors = harmful |
| More recall = better | More recall might = more distractors |
| Position matters for clicks | Position matters less (LLM reads all) |
We need metrics designed for RAG, not for human searchers.
The Solution: UDCG (Utility and Distraction-aware Cumulative Gain)
If traditional metrics don't capture the harm of distractors, we need a new metric that does. Enter UDCG: Utility and Distraction-aware Cumulative Gain.
The key insight is simple but powerful: distractors don't have zero utility — they have NEGATIVE utility.
Utility Scoring
Traditional DCG assigns utility scores where irrelevant documents get zero. Both “completely off-topic” and “plausible but wrong” get the same score.
UDCG introduces negative utilities:
| Document Type | UDCG Utility | Why? |
|---|---|---|
| Relevant | +1.0 | Helps the LLM answer correctly |
| Partially relevant | +0.5 | Provides some useful context |
| Irrelevant (off-topic) | 0.0 | LLM can usually ignore it |
| Hard negative | -0.5 | Similar enough to confuse |
| Plausible distractor | -1.0 | Actively causes wrong answers |
Side by side: traditional nDCG scores a distractor the same as any irrelevant document (zero), while distraction-aware UDCG penalizes it with negative utility.
This simple change has profound implications. A retrieval system that returns 3 relevant docs and 2 distractors now scores worse than one that returns 3 relevant docs and 2 off-topic docs.
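To make that contrast concrete, here is a quick sketch. The `udcg` helper name is my own and the utility values come from the table above: two rankings that binary-relevance metrics score identically diverge once distractors carry negative utility.

```python
import math

def udcg(utilities):
    """Sum of position-discounted utilities; negative values subtract."""
    return sum(u / math.log2(i + 1) for i, u in enumerate(utilities, 1))

# Three relevant docs (+1.0) followed by two plausible distractors (-1.0)...
with_distractors = udcg([1.0, 1.0, 1.0, -1.0, -1.0])
# ...versus the same relevant docs followed by two off-topic docs (0.0).
with_offtopic = udcg([1.0, 1.0, 1.0, 0.0, 0.0])
assert with_distractors < with_offtopic  # binary nDCG sees no difference
```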
The Math: How UDCG Works
The formula is identical to standard DCG — the only difference is that utilities can be negative:
UDCG@k = Σ (utility_i / log₂(i + 1)) for i = 1 to k
Each document's utility is divided by a logarithmic discount based on its position. UDCG uses the exact same formula, but because utilities can now be negative, distractors subtract from the cumulative score. A distractor at rank 1 hurts more than one at rank 5 due to the position discount.
Worked example — 5 documents:
Position 1: Relevant → utility = +1.0
Position 2: Partial → utility = +0.5
Position 3: Hard-negative distractor → utility = -0.5 ← This hurts!
Position 4: Relevant → utility = +1.0
Position 5: Irrelevant → utility = 0.0
UDCG@5 calculation:
= 1.0/log₂(2) + 0.5/log₂(3) + (-0.5)/log₂(4) + 1.0/log₂(5) + 0.0/log₂(6)
= 1.0/1.0 + 0.5/1.58 + (-0.5)/2.0 + 1.0/2.32 + 0.0/2.58
= 1.0 + 0.316 + (-0.25) + 0.431 + 0.0
= 1.497
Without the distractor (position 3 becomes 0.0):
= 1.0 + 0.316 + 0.0 + 0.431 + 0.0 = 1.747
Distractor cost: 1.747 - 1.497 = 0.25 utility points (~14% of total score)
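The worked example can be verified in a few lines of Python. This is a minimal sketch: the `udcg_at_k` name and the label strings are my own, and the utilities come from the table earlier.

```python
import math

# Label-to-utility mapping from the UDCG table (label names are illustrative).
UTILITY = {"relevant": 1.0, "partial": 0.5, "irrelevant": 0.0,
           "hard_negative": -0.5, "distractor": -1.0}

def udcg_at_k(labels, k):
    """UDCG@k: the standard DCG formula, except utilities may be negative."""
    return sum(UTILITY[label] / math.log2(i + 1)
               for i, label in enumerate(labels[:k], start=1))

ranking = ["relevant", "partial", "hard_negative", "relevant", "irrelevant"]
print(udcg_at_k(ranking, 5))  # ≈ 1.496 (the post's 1.497 rounds intermediates)
```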
Architecture: The Evaluation Pipeline
Now that we understand the concept, let's see how to build an evaluation system around it. The pipeline has four main steps:
RAG Quality Evaluation Pipeline
Step 1: Retrieval
For each test query, run your retrieval system and collect the top-k results with their relevance scores.
Step 2: Utility Scoring
For each retrieved document, assign a utility score: relevant (+1.0), known distractor (-1.0), non-relevant but scoring above 70% of the top score (-0.5), non-relevant but ranked in the top 3 (-0.5), otherwise neutral (0.0).
Step 3: Metric Calculation
Calculate both traditional metrics (MRR, nDCG) and UDCG metrics. The gap between nDCG and UDCG reveals how much hidden distractor harm your traditional metrics are missing.
Step 4: Dynamic-K Analysis
Find the optimal number of documents to retrieve. More documents ≠ better for RAG. Calculate: score = cumulative_utility - distractor_harm at each k.
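Step 4 can be sketched as a running net-utility maximization. Assuming distractors already carry negative utilities, “cumulative utility minus distractor harm” reduces to a running sum over the ranked list; the `optimal_k` name is my own.

```python
def optimal_k(utilities, max_k=10):
    """Dynamic-k analysis: find the cutoff k that maximizes net utility.

    `utilities` holds per-document utility scores in rank order. Because
    distractors are already negative, the net score at each k is simply
    the running sum of utilities up to that position.
    """
    best_k, best_score, running = 1, float("-inf"), 0.0
    for k, u in enumerate(utilities[:max_k], start=1):
        running += u
        if running > best_score:
            best_score, best_k = running, k
    return best_k, best_score
```

For example, `optimal_k([1.0, 0.5, 1.0, -1.0, -0.5])` recommends stopping at k=3: positions 4 and 5 only subtract value.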
Distractor Detection: How It Works
One of the trickiest parts of UDCG evaluation is identifying which documents are distractors. Here's the decision flow:
Is the retrieved document in the golden relevant set?
- YES → utility = +1.0 (type: relevant)
- NO → check the distractor conditions in order:
  1. Known distractor in blacklist? → utility = -1.0
  2. Score > 70% of max score? → utility = -0.5
  3. Rank ≤ 3? → utility = -0.5
  4. Otherwise → utility = 0.0 (neutral)
Why these heuristics?
- Known distractors (-1.0): If you've manually labeled documents as distractors from previous error analysis, they get the maximum penalty.
- High relative score (-0.5): If a document scores within 70% of the top result but isn't relevant, it's suspicious. The retriever thinks it's good, but it's not — classic hard negative.
- Top 3 position (-0.5): Documents in the top 3 positions are almost always included in LLM context. A non-relevant document here is dangerous regardless of its score.
- Everything else (0.0): Low-ranked, low-scoring irrelevant documents are neutral. They're easy for the LLM to ignore because they're clearly off-topic.
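The heuristics above fold into one scoring function. This is a sketch, not a library API: every name is illustrative, and the thresholds (70% of the top score, top-3 rank) are the defaults from the text.

```python
def score_utility(doc_id, score, rank, top_score, relevant_ids, known_distractors):
    """Assign a UDCG utility via the decision flow above.

    `relevant_ids` and `known_distractors` are sets of labeled document IDs;
    `rank` is 1-based position in the retrieved list.
    """
    if doc_id in relevant_ids:
        return 1.0       # relevant: full positive utility
    if doc_id in known_distractors:
        return -1.0      # blacklisted from earlier error analysis
    if top_score > 0 and score > 0.7 * top_score:
        return -0.5      # hard negative: retriever is confident but wrong
    if rank <= 3:
        return -0.5      # non-relevant doc almost certain to enter context
    return 0.0           # low-ranked and off-topic: neutral
```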
These thresholds are configurable. What counts as a “distractor” and how severely to penalize it depends on your domain:
| Use Case | Distractor Tolerance | Recommended Settings |
|---|---|---|
| Medical/Legal RAG | Very low | Strict thresholds (50%), harsh penalties (-1.0) |
| Customer support | Medium | Default thresholds work well |
| Creative writing | High | Relaxed thresholds (90%), light penalties (-0.2) |
| Code generation | Low | Strict on code-related distractors |
Real Results: Before vs After
Let's see how this works in practice. We evaluated a RAG system with 20 queries and 100 documents.
Before (with a bug in distractor detection)
Our initial implementation had a bug — it wasn't properly detecting distractors because we used an absolute score threshold (0.5) when our ranking profile returned scores around 0.03.
Misleading Results
MRR: 1.0000 ✓
Precision@5: 0.3300
Recall@5: 0.7283
nDCG@5: 0.7728 ✓ “Looks great!”
Distractor Rate: 0.00% ← BUG! Should be much higher
UDCG@5: 1.3373 ← Artificially inflated
Assessment: “SAFE” — WRONG!
After (with fixed distractor detection)
After switching to relative score thresholds, the true picture emerged:
Accurate Results
MRR: 1.0000 ✓ (unchanged)
Precision@5: 0.3300 (unchanged)
Recall@5: 0.7283 (unchanged)
nDCG@5: 0.2515 ← Now properly penalized!
Distractor Rate: 67.00% ← 2/3 of results are distractors!
UDCG@5: 0.5317 ← Real utility is MUCH lower
Distractor Harm: 1.6750
Optimal k: 2.8 ← Should use k=3, not k=5!
Assessment: “RISKY” — Now we know the truth!
What changed? The traditional metrics (MRR, Precision, Recall) stayed the same. But UDCG revealed that 67% of our top-5 results were distractors — documents that would confuse the LLM.
Example of detected distractors
Query: “What were the key takeaways from YC B2B sales?”
1. yc_b2b_sales_workshop_notes.md: ICP definition, outbound strategy, sales process
2. yc_b2b_sales_followup.md: Follow-up action items from the workshop
3. yc_talk_pivoting_effectively.md (DISTRACTOR): also YC content, also about startups, but about pivoting, NOT B2B sales
4. yc_office_hours_user_acquisition.md (DISTRACTOR): YC content, growth-related, but user acquisition ≠ B2B sales
5. yc_w25_welcome_email.md (DISTRACTOR): YC content, same batch, but an administrative email, not educational
These distractors are insidious because they're semantically similar to the query. They all relate to YC, startups, and business growth. A traditional retriever sees them as “close matches.” But for RAG, they're poison.
Comparing Ranking Profiles
One practical application of UDCG is comparing different ranking strategies. We tested multiple Vespa ranking profiles:
| Profile | MRR | Recall@5 | Dist. Rate | Harm | Safety |
|---|---|---|---|---|---|
| rrf-hybrid | 1.00 | 72.8% | 0.0% | 0.00 | SAFE |
| match-only | 1.00 | 79.5% | 0.0% | 0.00 | SAFE |
| gbdt-only | 1.00 | 91.5% | 49.0% | 1.23 | RISKY |
| simple-hybrid | 1.00 | 83.0% | 62.0% | 1.55 | DANGEROUS |
The counterintuitive insight: Look at simple-hybrid: it has the best recall at 83% — by traditional metrics, it's a great choice. But it also has the highest distractor rate at 62%. For every 5 documents retrieved, about 3 are distractors. For RAG, this is the WORST choice, not the best.
Meanwhile, rrf-hybrid has lower recall (72.8%) but zero distractors. For RAG applications, this is far superior.
For traditional search:
High Recall = Good ✓
For RAG:
High Recall + High Distractors = DANGEROUS ✗
Lower Recall + Zero Distractors = SAFE ✓
Applying This at Runtime
UDCG is an offline evaluation metric — you run it on test queries to understand your retrieval quality. But the insights directly inform runtime decisions:
From Evaluation to Production
| Approach | When to Use | How |
|---|---|---|
| Set optimal k | Always | Use evaluation to find best k, configure retriever |
| Score threshold | High distractor rate | Filter out docs below X% of top score |
| Reranking | Critical applications | Add cross-encoder to filter distractors |
The general pattern: use the optimal k discovered during evaluation, apply a relative score threshold to filter candidates, and cap the final result set. This treats UDCG evaluation as a calibration step — you run it offline to discover the right parameters, then apply those parameters at query time.
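As a sketch of that pattern (names and defaults are mine, not from any framework): keep only hits above a relative score threshold, then cap the context at the calibrated k.

```python
def filter_for_rag(hits, k=3, rel_threshold=0.5):
    """Apply offline-calibrated parameters at query time.

    `hits`: (doc_id, score) pairs sorted by score descending. `k` and
    `rel_threshold` come from UDCG calibration; the defaults are examples.
    """
    if not hits:
        return []
    top_score = hits[0][1]
    # Drop candidates far below the best hit, then cap the context size.
    kept = [(d, s) for d, s in hits if s >= rel_threshold * top_score]
    return kept[:k]
```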
Key Takeaways
1. Traditional metrics lie for RAG
nDCG, MRR, and Recall were designed for human users who can ignore bad results. LLMs read EVERYTHING and get confused.
2. Distractors have negative utility
Plausible-but-wrong documents don't just fail to help — they actively cause confident wrong answers. Score them as negative.
3. Use UDCG for RAG evaluation
Same formula as DCG, but utilities can be negative. This reveals hidden problems that traditional metrics completely miss.
4. More results ≠ better
Use dynamic-k analysis to find the sweet spot. Often k=3 beats k=10 for RAG because you avoid accumulating distractor harm.
5. High recall can be dangerous
A ranking profile with 90% recall but 50% distractor rate is WORSE than one with 70% recall and 0% distractor rate. Choose your ranking strategy with RAG in mind.