Part 2 of a series. Part 1 covered why AI agents need a Context Graph, not a vector store. This part covers how you actually build one — from zero documents to a self-populating graph that gets smarter with every event.
The Reasoning Evaporation Problem
Part 1 described a Context Graph as a system that remembers decisions, tracks outcomes, and learns from experience. If you read that and thought “great, but how does the decision get into the graph in the first place?” — this post is the answer.
The harder question is not how to query a context graph. It is how to populate one without turning it into a second job for the people it is meant to help.
Consider what happens right now in a typical engineering organisation on a typical day:
A developer opens a conversation with an AI coding assistant. They spend forty minutes working through a tricky authentication refactor — exploring three approaches, ruling out two for good reasons, settling on one, implementing it, fixing an edge case, and committing the result. The commit message says “refactor auth token handling.” That is all that survives. The forty minutes of reasoning — the approaches considered, the tradeoffs made, the edge case found — evaporates the moment the session closes.
Multiply that by every developer, every day, across every tool they use: coding assistants, Slack threads, design discussions, incident post-mortems, sprint planning calls. An enormous volume of reasoning happens continuously, and almost none of it persists anywhere searchable.
What gets preserved today is the output: the commit message, the diff, the closed ticket. What you need six months later is the reasoning behind it:
“Why is it like this?”
“Who decided this and what were the alternatives?”
“What went wrong last time we tried a different approach?”
None of these are answered by the output alone.
This is the reasoning evaporation problem. It is not a storage problem — disk is cheap. It is a capture friction problem. Asking developers to write structured decision records after every session is asking them to do the work twice. Nobody does it consistently. So the reasoning evaporates.
The implication is that a context graph which requires manual population will never reach the density needed to be useful. The capture has to be automatic. The developer has to be able to do exactly what they already do, and the reasoning has to find its own way into the graph.
Three Paths Into the Graph
In production, a context graph is populated through three distinct mechanisms. They complement each other — each covers gaps the others leave.
Population mechanisms:
- Path 1, Backfill: historical data replayed from existing systems (Jira: last 2 years; GitHub: merged PRs; PagerDuty: incidents). Runs once at setup and gives immediate depth.
- Path 2, Live webhooks: real-time events from operational systems as they happen (PagerDuty alerts, Zendesk tickets, GitHub PR merges, Slack threads). Runs continuously and keeps the graph current.
- Path 3, Capture bridge: LLM sessions and interactions auto-captured on trigger (git commit, meeting end, notebook save, PR merge). Runs on each trigger and captures the why.
Path 1 — Backfill solves the cold start problem. An empty graph cannot give useful answers. By replaying historical data from existing systems before going live, you give the precedent search something to work with from day one. A two-year Jira backfill might yield six hundred decision traces. That is enough for the system to start returning genuinely useful results.
Path 2 — Live webhooks keep the graph current. Operational systems (ticketing, monitoring, project management, CRM) already emit events when things happen. Registering webhooks against these systems means every new incident, ticket, PR, and decision flows into the graph automatically, with no human intervention.
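As a concrete sketch of the webhook path: the handler only has to normalise the incoming payload into a fragment before writing it to the graph. The payload fields below are illustrative, not the real PagerDuty schema, and `write_fragment` stands in for the `POST /capture/fragment` call.

```python
# Hypothetical sketch: normalising a PagerDuty-style webhook payload into a
# context fragment before it is written to the graph. Field names are
# illustrative, not the actual PagerDuty webhook schema.

def normalise_pagerduty_event(event: dict) -> dict:
    """Map a raw webhook payload onto the graph's fragment shape."""
    incident = event.get("incident", {})
    return {
        "source_system": "pagerduty",
        "source_id": incident.get("id", ""),
        "content": f'{incident.get("title", "")}\n{incident.get("description", "")}'.strip(),
        "entity_mentions": [incident.get("service", {}).get("name", "")],
    }

def handle_webhook(event: dict, write_fragment) -> dict:
    """Entry point called by the HTTP layer; write_fragment posts to
    POST /capture/fragment in the real system."""
    fragment = normalise_pagerduty_event(event)
    write_fragment(fragment)
    return fragment
```

Each source system gets its own small normaliser, but they all funnel into the same fragment-write call, which is what keeps the webhook path cheap to extend.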
Path 3 — Capture bridge solves the reasoning evaporation problem. This is the path that captures what the other two cannot: the why behind decisions made with AI assistance. It hooks into the tools where reasoning happens — coding sessions, meetings, notebooks — and auto-captures the conversation before it disappears.
The three paths together mean the graph populates on three timescales: historical depth from day one, continuous present-tense updates from live systems, and persistent reasoning from every assisted interaction.
The Cold Start Problem
A context graph starts empty. An empty graph returns no results. No results means no value. No value means no adoption. This is the cold start problem, and it is the most common reason these systems fail before they start.
Day 0: empty graph state. Vespa is running and the schema is deployed, but every document type is empty:
- entity: 0 documents
- context_fragment: 0 documents
- decision_trace: 0 documents
- edge: 0 documents
Search returns nothing. Precedent search returns nothing. The graph is structurally ready but useless.
The solution is to treat the cold start as a data engineering problem, not a product problem. Before exposing the graph to users, backfill it with historical data from the systems they already use.
The backfill script follows the same pattern for every source system:
Backfill pattern (per historical record):
1. Extract text: title, description, comments, resolution notes.
2. Resolve entities: who is mentioned? What service? Which team? Map to canonical entity IDs (create if new).
3. Write context fragment: source_system, source_id, content, entity_ids.
4. Write decision trace (for resolved/closed records): summary, reasoning, outcome, tags.
5. Write edges: trace --informed_by→ fragment, trace --involves→ entity.
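The per-record steps above can be sketched as a single function. The `graph` object here is a hypothetical stand-in for the real API calls (`POST /capture/fragment`, `POST /record/decision`, and the edge writes); the record fields are illustrative Jira-style attributes.

```python
# Minimal sketch of the per-record backfill pattern, assuming a hypothetical
# graph client whose methods wrap the real HTTP API.

def backfill_record(record: dict, graph) -> None:
    # 1. Extract text from the historical record.
    text = "\n".join(filter(None, [
        record.get("title"), record.get("description"),
        *record.get("comments", []), record.get("resolution"),
    ]))

    # 2. Resolve mentions to canonical entity IDs (created if new).
    entity_ids = [graph.resolve_entity(m) for m in record.get("mentions", [])]

    # 3. Write the raw context fragment.
    fragment_id = graph.write_fragment(
        source_system=record["source_system"],
        source_id=record["source_id"],
        content=text, entity_ids=entity_ids,
    )

    # 4. For resolved/closed records, also write a decision trace.
    if record.get("status") in ("resolved", "closed"):
        trace_id = graph.write_trace(
            summary=record.get("title", ""),
            reasoning=record.get("resolution", ""),
            outcome="successful", tags=record.get("labels", []),
        )
        # 5. Link the trace to its evidence and participants.
        graph.write_edge(trace_id, "informed_by", fragment_id)
        for eid in entity_ids:
            graph.write_edge(trace_id, "involves", eid)
```

Running this in a loop over an export of closed tickets is the entire backfill job; the same function shape works for GitHub PRs or PagerDuty incidents with different field mappings.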
After a Jira backfill covering two years, a typical engineering organisation might have a few thousand context fragments and several hundred decision traces. That is enough. Precedent search now returns real answers to real questions. “Has this kind of auth bug been seen before?” returns actual tickets with actual resolutions, not an empty list.
The critical insight about backfill is that it converts existing organisational knowledge — already written, already resolved, sitting unused in Jira — into a searchable precedent graph. Nothing new needs to be created. The knowledge already exists. It just needs to be structured and indexed.
The Generic Ingestion Pattern
Once you look at several source systems — Jira, PagerDuty, GitHub, Slack, Zoom, Jupyter notebooks, coding assistant sessions — a pattern emerges. Every ingestion path, regardless of source, does the same three things:
Generic ingestion pipeline:
- Parser: knows the source format and converts it to a flat list of {role, text} turns. ~20–30 lines of code per source.
- Extractor: always Claude Haiku, with the prompt tuned per domain. Input: turns plus metadata hints (commit message, ticket title, ...). Output: decision_summary, reasoning, decision_type, tags.
- Writer: always the same two API calls. POST /capture/fragment stores raw content; POST /record/decision stores extracted reasoning. Never changes; domain-agnostic.
The parser is the only thing that changes between source systems. Every format looks different on the wire:
Source formats (all produce the same output):
- Claude Code session (JSONL)
- Zoom transcript (VTT)
- Slack thread (JSON)
- Jupyter notebook (.ipynb)
But every parser produces the same output: a flat list of turns. Once you have turns, everything downstream — Haiku extraction, fragment write, decision write, edge creation — is identical regardless of where the data came from.
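As an illustration, a parser for a JSONL-style session log might look like the following. The on-disk event format shown is an assumption, but the output shape, a flat list of {role, text} turns, is the contract that matters.

```python
import json

# Illustrative parser for a Claude Code-style JSONL session log. The exact
# event format is an assumption; the point is that every parser, whatever
# its input, emits the same flat list of {role, text} turns.

def parse_jsonl_session(raw: str) -> list[dict]:
    turns = []
    for line in raw.splitlines():
        if not line.strip():
            continue
        event = json.loads(line)
        # Keep only conversational events; skip tool output, metadata, etc.
        if event.get("role") in ("user", "assistant") and event.get("text"):
            turns.append({"role": event["role"], "text": event["text"]})
    return turns
```

A VTT or .ipynb parser is the same size: a format-specific loop that ends in the identical `turns` list.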
The prompt tweak is smaller than it sounds. The extractor prompt has one domain-specific section: the vocabulary.
Domain vocabulary (the only thing that changes in the extractor prompt)

# Engineering session
"decision_type: code_change | bug_fix | feature_add | refactor | config_change"
"workflow: development"

# Support ticket
"decision_type: escalation | workaround | refund | resolution"
"workflow: support"

# Architecture review meeting
"decision_type: adopt | reject | defer | spike | revisit"
"workflow: architecture"

# Data science notebook
"decision_type: hypothesis_confirmed | hypothesis_rejected | investigation | model_selection"
"workflow: data_science"
Everything else in the prompt — the instruction to extract summary, reasoning, tags, confidence — stays the same. Haiku already knows what those concepts mean. You are just telling it which vocabulary to use for your specific domain.
This means adding a new source system to an existing context graph is not a large engineering project. It is a parser (the format-specific piece, ~20–30 lines) and a prompt section (the vocabulary, ~5 lines). The graph, the API, the search, the ranking — none of that changes.
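A sketch of how that assembly might look, with the shared instructions and the per-domain vocabulary as the only moving parts. The exact prompt wording here is illustrative, not the production prompt.

```python
# Sketch of extractor prompt assembly: a shared instruction body plus a
# per-domain vocabulary section. Vocabulary strings mirror the examples
# above; the surrounding wording is an illustrative assumption.

VOCABULARIES = {
    "development": "decision_type: code_change | bug_fix | feature_add | refactor | config_change",
    "support": "decision_type: escalation | workaround | refund | resolution",
    "architecture": "decision_type: adopt | reject | defer | spike | revisit",
    "data_science": "decision_type: hypothesis_confirmed | hypothesis_rejected | investigation | model_selection",
}

SHARED_INSTRUCTIONS = (
    "Extract from the conversation: decision_summary, reasoning, tags, confidence.\n"
    "Respond as JSON."
)

def build_extractor_prompt(workflow: str, turns: list[dict]) -> str:
    # Flatten the turns into a plain transcript for the model.
    transcript = "\n".join(f'{t["role"]}: {t["text"]}' for t in turns)
    return (
        f"{SHARED_INSTRUCTIONS}\n"
        f"{VOCABULARIES[workflow]}\n"
        f"workflow: {workflow}\n\n"
        f"{transcript}"
    )
```

Adding a new domain is one new entry in `VOCABULARIES`; everything else in the pipeline is untouched.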
Entity Resolution: The Piece That Holds Everything Together
Here is a problem that sounds minor and turns out to be central: the same real-world thing has a different name in every system.
The same payment service might be “Payment Service” in Jira, payments-prod in PagerDuty alerts, the payment-service repo on GitHub, “the payments svc” in Slack threads, and PaymentService in coding sessions. If each of these creates a separate entity in the graph, you have five disconnected islands. A query about “the payment service” returns fragments from Jira, but not the incidents from PagerDuty, not the PRs from GitHub, not the Slack discussions. The graph has the data. It just cannot connect it.
This is the entity identity problem. It is one of the most underappreciated problems in enterprise AI, and it is the primary reason cross-source queries fail in practice.
Entity resolution is the process of mapping raw mentions to canonical entity IDs:
Entity resolution in practice
The resolution uses Reciprocal Rank Fusion over two signals: BM25 keyword similarity and E5 semantic embedding similarity. The score threshold (default 0.3) determines whether a mention is resolved to an existing entity or creates a new one.
Resolution flow for the incoming mention “Payment Service”:
1. BM25(“Payment Service” vs all entity names)
2. E5 embedding similarity
3. Scores fused via RRF
4. Best match: “payment-service”, score 0.71
5. 0.71 > 0.3 threshold → reuse ent-payment-svc
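A minimal sketch of that flow, assuming standard Reciprocal Rank Fusion over the two ranked candidate lists. The rank constant k=1 is illustrative (the real system's constant may differ); the 0.3 threshold is the default quoted in the text.

```python
# Sketch of RRF-based entity resolution. bm25_ranked and embedding_ranked
# are candidate entity IDs already ordered by each signal; k=1 is an
# illustrative rank constant, not the system's actual value.

def rrf_fuse(bm25_ranked: list[str], embedding_ranked: list[str], k: int = 1) -> dict:
    scores: dict[str, float] = {}
    for ranked in (bm25_ranked, embedding_ranked):
        for rank, entity_id in enumerate(ranked, start=1):
            # Standard RRF contribution: 1 / (k + rank) per signal.
            scores[entity_id] = scores.get(entity_id, 0.0) + 1.0 / (k + rank)
    return scores

def resolve_mention(mention, bm25_ranked, embedding_ranked,
                    create_entity, threshold: float = 0.3) -> str:
    scores = rrf_fuse(bm25_ranked, embedding_ranked)
    if scores:
        best, score = max(scores.items(), key=lambda kv: kv[1])
        if score > threshold:
            return best            # reuse the existing canonical entity
    return create_entity(mention)  # no good match: mint a new entity
```

The threshold is the whole behaviour knob: raise it and the system mints more new entities; lower it and it merges more aggressively.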
Without resolution: 5 separate entity nodes, 5 disconnected subgraphs, and cross-source queries fail.
With resolution: 1 entity node, 1 connected subgraph, and cross-source queries work.
The downstream effect is significant. With resolution working correctly, a query for “what happened with the payment service last quarter?” returns Jira tickets, PagerDuty incidents, GitHub PRs, and Claude Code sessions — all as results from a single search, all linked through a single canonical entity ID. Without it, the same query returns only whichever source system happened to use the exact keyword you searched for.
Correctness is also important here. Resolution edges are first-class graph objects with their own metadata: confidence score, match method, and source system. When a human corrects a wrong resolution, the old edge is expired and a new one is written. The full audit trail of every resolution decision — right and wrong — is preserved in the graph.
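That expire-and-replace behaviour can be sketched directly. The field names on the edge object are illustrative, not the production schema.

```python
from dataclasses import dataclass

# Sketch of resolution edges as first-class objects with an audit trail.
# Correcting a wrong resolution expires the old edge and appends a new one;
# nothing is ever deleted. Field names are illustrative.

@dataclass
class ResolutionEdge:
    mention: str
    entity_id: str
    confidence: float
    method: str              # e.g. "rrf_auto" or "human_correction"
    expired: bool = False

def correct_resolution(edges: list, mention: str, new_entity_id: str) -> None:
    for e in edges:
        if e.mention == mention and not e.expired:
            e.expired = True  # expire, never delete: the audit trail survives
    edges.append(ResolutionEdge(mention, new_entity_id, 1.0, "human_correction"))
```

Because the wrong edge is expired rather than removed, the graph retains the history of what the resolver believed and when a human overruled it.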
What Actually Happens When You Ask a Question
Part 1 described hybrid search as “Reciprocal Rank Fusion combining multiple signals.” That description is accurate but abstract. Here is what actually happens, step by step, when you ask “what is the status of the auth implementation?”
Step 1: The query becomes a vector
Before anything reaches the search index, the query text is embedded:
“what is the status of the auth implementation?”
q = [0.021, -0.147, 0.309, 0.088, -0.211, ...]
384 floats representing the meaning of the query in embedding space. This vector will be used to find documents that are semantically similar, regardless of whether they share exact keywords.
Step 2: Two searches run in parallel
A single YQL query triggers two independent retrieval paths simultaneously:
BM25 keyword search tokenises the query into [“auth”, “implementation”, “status”] and scores every document containing these terms using term frequency and inverse document frequency: “auth” is rare, so it gets a high IDF; “status” is common, so it gets a lower one. It returns keyword matches.
HNSW vector search takes q from Step 1 and walks the HNSW graph by cosine distance, finding the 20 nearest neighbours to q in embedding space. It finds documents about “token refresh”, “identity layer”, and “OAuth2” even when those words never appear in the query. It returns semantic matches.
This is why two signals matter. BM25 catches exact matches — if a trace explicitly mentions “auth implementation,” BM25 finds it. HNSW catches semantic matches — if a trace discusses “token validation” or “OAuth2 flow,” the vector similarity finds it even though those words are not in the query. Each signal finds things the other misses.
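In Vespa, this kind of hybrid retrieval can be expressed as a single YQL query. The document type, field name, and query tensor name below are assumptions, not the system's actual schema:

```sql
select * from decision_trace where
    userQuery() or
    ({targetHits: 20}nearestNeighbor(decision_embedding, q))
```

The query text arrives as the standard query parameter (feeding `userQuery()`), the embedded vector is bound to `input.query(q)`, and Vespa runs both retrieval arms before handing the union of candidates to the ranking phases.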
Step 3: First-phase ranking
All ~60 candidates are scored cheaply. No model calls, just arithmetic over stored attributes:
score = bm25(decision_summary) + bm25(reasoning) + closeness(field, decision_embedding)
trace-g7h8 "Implemented OAuth2 token refresh" → 5.02
trace-j1k2 "Payment 401 spike — rolled back" → 2.79
Top 10 pass through. The other 50 are dropped.
Step 4: Second-phase ranking with quality signals
This is where the context graph diverges fundamentally from a vector store. The top 10 survivors are re-scored using outcome history:
final_score = first_phase_score × confidence × outcome_multiplier
The outcome_multiplier rewards decisions whose outcomes were tracked as successful and sharply penalises those that failed:
trace-g7h8 “Implemented OAuth2 token refresh”: 5.02 × 0.89 × 1.5 = 6.70 ← ranks #1
trace-j1k2 “Payment 401 spike — rolled back”: 2.79 × 0.72 × 0.12 = 0.24 ← ranks last
The failed attempt had reasonable keyword overlap with the query. Without quality scoring, it would rank competitively. With quality scoring, it is pushed to the bottom where it belongs — while still being visible in the results as a cautionary example.
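The arithmetic is easy to reproduce. The multiplier values below are assumptions read off the worked example (1.5 for a successful outcome, 0.12 for the failed, rolled-back trace), not a documented table.

```python
# Reproducing the second-phase arithmetic above. The multiplier table is an
# assumption consistent with the worked numbers in the text.

OUTCOME_MULTIPLIER = {
    "successful": 1.5,          # confirmed good outcome: boosted
    "failed_rolled_back": 0.12, # failed and rolled back: heavily penalised
    "pending": 1.0,             # no outcome tracked yet: neutral
}

def second_phase_score(first_phase: float, confidence: float, outcome: str) -> float:
    return first_phase * confidence * OUTCOME_MULTIPLIER[outcome]
```

Because the multiplier scales rather than filters, the failed trace is still returned, just ranked last, which is exactly the "visible cautionary example" behaviour described above.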
Step 5: Fields projected, embeddings dropped
Vespa returns the top 5 hits as JSON. The embedding vectors — used for retrieval, never needed by the caller — are stripped before the response leaves Vespa. Only human-readable fields travel over the wire.
{
  "trace_id": "trace-g7h8",
  "decision_summary": "Implemented OAuth2 token refresh in payment service",
  "reasoning": "Following PROJ-892 decision, pre-emptive refresh chosen to avoid 401 storms on token expiry...",
  "outcome_status": "successful",
  "tags": ["payment-service", "oauth2", "commit:3f9a1c2"],
  "relevance": 6.70
}
The Missing Synthesis Layer
Here is where most context graph implementations stop — and where the user experience falls short.
The search returns accurate, well-ranked results. But it returns them as a JSON array of decision traces. The user asked a question in plain English. They received a data structure.
They now have to read three reasoning fields and mentally assemble an answer. That is not a great experience, and it is not necessary. The synthesis step — turning ranked results back into a direct answer — is one additional model call:
Without synthesis
User question
→ Vespa search
→ JSON results
→ User reads 3 records
→ User figures it out
With synthesis
User question
→ Vespa search
→ top-k traces
→ Haiku prompt
→ Direct answer in plain English with source IDs
The synthesis prompt is straightforward:
The user asked: “what is the status of the auth implementation?”
Here are the relevant decisions from the context graph, ranked by relevance and outcome quality:
1. [decision_summary, reasoning, outcome_status, decided_at]
2. [decision_summary, reasoning, outcome_status, decided_at]
3. [decision_summary, reasoning, outcome_status, decided_at]
Answer the user's question directly and concisely.
Cite which traces support your answer.
If outcomes conflict, say so explicitly.
The output the user actually reads:
The auth implementation is complete and working.
The core decision (PROJ-892) consolidated three parallel auth paths into a single OAuth2 flow. This was implemented using pre-emptive token refresh — a deliberate choice made after a failed rollout midway through migration, where old tokens still in rotation caused a 401 spike that required a rollback.
The final implementation has been stable since the fix. Outcome: successful. Not overridden.
Sources: trace-g7h8, trace-d4e5, trace-j1k2
The context graph did the hard work: retrieving the right traces, ranking them correctly, surfacing the outcome history. The synthesis layer is a thin pass over that result that costs one fast model call and converts data into an answer.
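The whole synthesis step fits in one small function: format the ranked traces into the prompt, make one model call. The `call_model` function below stands in for the fast model the text names (Claude Haiku); its signature is an assumption.

```python
# Sketch of the synthesis step: ranked traces in, one model call out.
# call_model is a hypothetical stand-in for the real model client.

def synthesise_answer(question: str, traces: list[dict], call_model) -> str:
    lines = [
        f'The user asked: "{question}"',
        "Here are the relevant decisions from the context graph, "
        "ranked by relevance and outcome quality:",
    ]
    for i, t in enumerate(traces, start=1):
        lines.append(
            f'{i}. {t["decision_summary"]} | {t["reasoning"]} '
            f'(outcome: {t["outcome_status"]}, decided: {t["decided_at"]}, id: {t["trace_id"]})'
        )
    lines += [
        "Answer the user's question directly and concisely.",
        "Cite which traces support your answer.",
        "If outcomes conflict, say so explicitly.",
    ]
    return call_model("\n".join(lines))
```

Note that only human-readable fields go into the prompt; the embeddings were already stripped in Step 5, so there is nothing wasteful left to filter out here.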
The reason synthesis belongs in the architecture is that it closes the loop on the user experience. A system that requires users to read and interpret raw JSON has friction at every query. A system that answers questions in plain English, with source citations, gets used continuously. Continuous use means continuous capture. Continuous capture means the graph keeps growing. The synthesis layer is what makes the whole thing worth doing.
The Accumulation Effect
There is a property of context graphs that is easy to state but takes time to appreciate: the graph at month six is categorically different from the graph at day one — not just larger, but qualitatively more useful in a way that has no equivalent in traditional RAG.
With traditional RAG, answer quality over time is flat: once the corpus is indexed, month six looks like day one. With a context graph, quality over time keeps climbing, because every month adds outcomes, corrections, and precedent links.
The quality signals only exist through real usage over real time. You cannot manufacture them. A 0.3x penalty on a failed decision only applies if someone tracked the failure as an outcome. A 1.5x boost on a successful decision only applies if someone confirmed the success. You cannot pre-populate these — they accumulate naturally as decisions are made and as humans correct and confirm results.
The precedent chains compound in a specific way. When decision C cites decisions A and B as precedent, and later decision D cites C, decision D implicitly inherits the lessons of A and B — even if the engineer making decision D has never heard of A or B. The knowledge propagates through the citation graph without anyone explicitly transferring it.
Month 1:
A
no chains yet
Month 3:
A ──preceded_by──▶ B ──preceded_by──▶ C
"we learned from A and B"
Month 6:
A ──preceded_by──▶ B ──preceded_by──▶ C ──preceded_by──▶ D
D inherits lessons of A, B, C without D knowing about them
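The inheritance claim can be made concrete with a small traversal over preceded_by edges. The `edges` adjacency map here is hypothetical; in the real graph these would be edge documents queried from the store.

```python
# Sketch: walking preceded_by edges to collect the full precedent lineage a
# decision inherits. edges maps each trace ID to the traces it directly cites.

def precedent_lineage(trace_id: str, edges: dict) -> list:
    seen, order = set(), []
    stack = list(edges.get(trace_id, []))
    while stack:
        t = stack.pop()
        if t in seen:
            continue
        seen.add(t)
        order.append(t)
        # Follow the chain: what this precedent itself cited.
        stack.extend(edges.get(t, []))
    return order
```

This is why decision D "inherits" A and B: the lineage query surfaces them even though D only ever cited C directly.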
There is a second accumulation effect that is less obvious: entity density. At month one, entity resolution has seen a limited vocabulary of names from a limited number of source systems. By month six, it has processed thousands of mentions across Jira, PagerDuty, GitHub, Slack, and coding sessions. The entity model has seen every variation of every service name, every team alias, every customer shorthand. Resolution confidence improves simply because the system has seen the patterns before.
This means the graph at month six is not just quantitatively larger. It is qualitatively smarter at connecting the right things, surfacing the right precedent, and suppressing the wrong answers — all without any model retraining, any explicit rule writing, or any human curation. The improvement is a structural property of accumulation.
What This Looks Like End to End
Pull it all together into a single example. A developer joins a team that has been running a context graph for six months.
On their first day, they open a Claude Code session and ask: “We need to add rate limiting to the payment API. What should I know before I start?”
Top results:
#1 trace-r4s5 "Implemented rate limiting on payment API"
source: claude_code · outcome: successful · 6 months ago
Token bucket chosen over sliding window after testing. Redis-backed. 1000 req/min per customer tier.
#2 trace-f2g3 "Rate limiter caused false positives in batch jobs"
source: pagerduty · outcome: failed · 5 months ago
Batch job exceeded per-customer limit. Fixed with separate limit bucket for service accounts.
#3 trace-h6i7 "Extended rate limiter to cover payment webhooks"
source: jira · outcome: successful · 3 months ago
Webhook endpoints exempt from customer rate limits, separate service-level cap at 10k/min applied.
“Rate limiting on the payment API is already implemented using a Redis-backed token bucket (1000 req/min per customer tier). One known issue: batch jobs hit the per-customer limit and caused alerts — fixed by giving service accounts a separate bucket. Webhook endpoints are exempt from customer limits with their own cap.
Before adding more rate limiting, check whether your use case involves batch processing or webhooks — both have special handling already in place.
Sources: trace-r4s5, trace-f2g3, trace-h6i7”
The developer just received six months of institutional knowledge about this exact topic — including the failure mode that was discovered and fixed — in seconds, without asking anyone, without reading through ticket histories, without knowing those decisions existed.
That answer came from three source systems (Claude Code, PagerDuty, Jira), was connected through a single canonical entity (ent-payment-api), was ranked by actual outcomes, and was synthesised into plain English by one fast model call.
None of that required any special instrumentation after the initial setup. The Jira webhook was running. The PagerDuty webhook was running. The capture bridge was running. The graph accumulated those decisions automatically. The developer's question just surfaced what was already there.
Conclusion
Part 1 made the case that AI agents need a different kind of memory than traditional RAG provides — one that remembers decisions, tracks outcomes, and learns from experience.
This post has been about the less glamorous half of that story: how the memory actually gets populated, how the data holds together across source systems, and what happens at each step between a natural language question and a useful answer.
A few things are worth restating plainly.
The capture problem is harder than the query problem. A well-designed context graph query is straightforward to implement. Getting the data into the graph without adding friction for the people who generate it is where most implementations stall. The capture bridge pattern — hooking into existing tools and triggers rather than asking people to write structured records — is the answer to this. If capture is frictionless, the graph fills itself.
Entity resolution is not optional. Every source system names things differently. Without resolution, you have a collection of per-system silos with a shared API. With resolution, you have a unified graph where a question about a service returns everything that ever mentioned it, regardless of how it was spelled. The difference in search quality is enormous.
Quality signals only exist through real usage. The outcome multipliers that push successful decisions up and failed decisions down do not come from anywhere except real humans tracking real results over real time. You cannot bootstrap them. You cannot simulate them. They accumulate. This is why a context graph at month six is not just quantitatively larger than at day one — it is genuinely smarter in a way that requires the passage of time.
The synthesis layer is what makes it usable. Search that returns ranked JSON is a developer tool. Search that returns a plain English answer with citations is a product. The additional Haiku call that converts results into prose is not an implementation detail — it is the difference between a system that gets queried once and abandoned and one that becomes part of how people work.
The deepest property of this architecture is one that has no equivalent in traditional RAG: the system's value compounds with use. Every decision recorded is precedent for the next similar situation. Every outcome tracked is a quality signal that improves future ranking. Every entity resolved is a connection that makes cross-source queries more complete. None of this requires any explicit human curation after the initial setup. It accumulates as a natural consequence of the system being used.
That is the difference between a system that stores information and a system that accumulates institutional wisdom.