Part 2 of a series. Part 1 covered why AI agents need a Context Graph, not a vector store. This part covers how you actually build one — from zero documents to a self-populating graph that gets smarter with every event.
The Reasoning Evaporation Problem
Part 1 described a Context Graph as a system that remembers decisions, tracks outcomes, and learns from experience. If you read that and thought “great, but how does the decision get into the graph in the first place?” — this post is the answer.
The harder question is not how to query a context graph. It is how to populate one without turning it into a second job for the people it is meant to help.
Consider what happens right now in a typical engineering organisation on a typical day:
A developer opens a conversation with an AI coding assistant. They spend forty minutes working through a tricky authentication refactor — exploring three approaches, ruling out two for good reasons, settling on one, implementing it, fixing an edge case, and committing the result. The commit message says “refactor auth token handling.” That is all that survives. The forty minutes of reasoning — the approaches considered, the tradeoffs made, the edge case found — evaporates the moment the session closes.
Multiply that by every developer, every day, across every tool they use: coding assistants, Slack threads, design discussions, incident post-mortems, sprint planning calls. An enormous volume of reasoning happens continuously, and almost none of it persists anywhere searchable.
What gets preserved today is the output: the commit message, the diff, the closed ticket. What you need six months later is the reasoning behind it:
“Why is it like this?”
“Who decided this and what were the alternatives?”
“What went wrong last time we tried a different approach?”
None of these are answered by the output alone.
This is the reasoning evaporation problem. It is not a storage problem — disk is cheap. It is a capture friction problem. Asking developers to write structured decision records after every session is asking them to do the work twice. Nobody does it consistently. So the reasoning evaporates.
The implication is that a context graph which requires manual population will never reach the density needed to be useful. The capture has to be automatic. The developer has to be able to do exactly what they already do, and the reasoning has to find its own way into the graph.
Three Paths Into the Graph
In production, a context graph is populated through three distinct mechanisms. They complement each other — each covers gaps the others leave.
Population mechanisms:
- Path 1, Backfill: historical data replayed from existing systems (Jira: last 2 years; GitHub: merged PRs; PagerDuty: incidents). Runs once at setup and gives immediate depth.
- Path 2, Live webhooks: real-time events from operational systems as they happen (PagerDuty alerts, Zendesk tickets, GitHub PR merges, Slack threads). Runs continuously and keeps the graph current.
- Path 3, Capture bridge: LLM sessions and interactions auto-captured on trigger (git commit, meeting end, notebook save, PR merge). Runs on each trigger and captures the why.
Path 1 — Backfill solves the cold start problem. An empty graph cannot give useful answers. By replaying historical data from existing systems before going live, you give the precedent search something to work with from day one. A two-year Jira backfill might yield six hundred decision traces. That is enough for the system to start returning genuinely useful results.
Path 2 — Live webhooks keep the graph current. Operational systems (ticketing, monitoring, project management, CRM) already emit events when things happen. Registering webhooks against these systems means every new incident, ticket, PR, and decision flows into the graph automatically, with no human intervention.
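As a concrete sketch of the webhook path: the handler only has to normalise the incoming payload into a fragment before writing it to the graph. The payload fields below are illustrative, not the real PagerDuty schema, and `write_fragment` stands in for the `POST /capture/fragment` call.

```python
# Hypothetical sketch: normalising a PagerDuty-style webhook payload into a
# context fragment before it is written to the graph. Field names are
# illustrative, not the actual PagerDuty webhook schema.

def normalise_pagerduty_event(event: dict) -> dict:
    """Map a raw webhook payload onto the graph's fragment shape."""
    incident = event.get("incident", {})
    return {
        "source_system": "pagerduty",
        "source_id": incident.get("id", ""),
        "content": f'{incident.get("title", "")}\n{incident.get("description", "")}'.strip(),
        "entity_mentions": [incident.get("service", {}).get("name", "")],
    }

def handle_webhook(event: dict, write_fragment) -> dict:
    """Entry point called by the HTTP layer; write_fragment posts to
    POST /capture/fragment in the real system."""
    fragment = normalise_pagerduty_event(event)
    write_fragment(fragment)
    return fragment
```

Each source system gets its own small normaliser, but they all funnel into the same fragment-write call, which is what keeps the webhook path cheap to extend.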
Path 3 — Capture bridge solves the reasoning evaporation problem. This is the path that captures what the other two cannot: the why behind decisions made with AI assistance. It hooks into the tools where reasoning happens — coding sessions, meetings, notebooks — and auto-captures the conversation before it disappears.
The three paths together mean the graph populates on three timescales: historical depth from day one, continuous present-tense updates from live systems, and persistent reasoning from every assisted interaction.
The Cold Start Problem
A context graph starts empty. An empty graph returns no results. No results means no value. No value means no adoption. This is the cold start problem, and it is the most common reason these systems fail before they start.
Day 0: empty graph state. Vespa is running and the schema is deployed, but every document type is empty:
- entity: 0 documents
- context_fragment: 0 documents
- decision_trace: 0 documents
- edge: 0 documents
Search returns nothing. Precedent search returns nothing. The graph is structurally ready but useless.
The solution is to treat the cold start as a data engineering problem, not a product problem. Before exposing the graph to users, backfill it with historical data from the systems they already use.
The backfill script follows the same pattern for every source system:
Backfill pattern (per historical record):
1. Extract text: title, description, comments, resolution notes.
2. Resolve entities: who is mentioned? What service? Which team? Map to canonical entity IDs (create if new).
3. Write context fragment: source_system, source_id, content, entity_ids.
4. Write decision trace (for resolved/closed records): summary, reasoning, outcome, tags.
5. Write edges: trace --informed_by→ fragment, trace --involves→ entity.
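The per-record steps above can be sketched as a single function. The `graph` object here is a hypothetical stand-in for the real API calls (`POST /capture/fragment`, `POST /record/decision`, and the edge writes); the record fields are illustrative Jira-style attributes.

```python
# Minimal sketch of the per-record backfill pattern, assuming a hypothetical
# graph client whose methods wrap the real HTTP API.

def backfill_record(record: dict, graph) -> None:
    # 1. Extract text from the historical record.
    text = "\n".join(filter(None, [
        record.get("title"), record.get("description"),
        *record.get("comments", []), record.get("resolution"),
    ]))

    # 2. Resolve mentions to canonical entity IDs (created if new).
    entity_ids = [graph.resolve_entity(m) for m in record.get("mentions", [])]

    # 3. Write the raw context fragment.
    fragment_id = graph.write_fragment(
        source_system=record["source_system"],
        source_id=record["source_id"],
        content=text, entity_ids=entity_ids,
    )

    # 4. For resolved/closed records, also write a decision trace.
    if record.get("status") in ("resolved", "closed"):
        trace_id = graph.write_trace(
            summary=record.get("title", ""),
            reasoning=record.get("resolution", ""),
            outcome="successful", tags=record.get("labels", []),
        )
        # 5. Link the trace to its evidence and participants.
        graph.write_edge(trace_id, "informed_by", fragment_id)
        for eid in entity_ids:
            graph.write_edge(trace_id, "involves", eid)
```

Running this in a loop over an export of closed tickets is the entire backfill job; the same function shape works for GitHub PRs or PagerDuty incidents with different field mappings.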
After a Jira backfill covering two years, a typical engineering organisation might have a few thousand context fragments and several hundred decision traces. That is enough. Precedent search now returns real answers to real questions. “Has this kind of auth bug been seen before?” returns actual tickets with actual resolutions, not an empty list.
The critical insight about backfill is that it converts existing organisational knowledge — already written, already resolved, sitting unused in Jira — into a searchable precedent graph. Nothing new needs to be created. The knowledge already exists. It just needs to be structured and indexed.
The Generic Ingestion Pattern
Once you look at several source systems — Jira, PagerDuty, GitHub, Slack, Zoom, Jupyter notebooks, coding assistant sessions — a pattern emerges. Every ingestion path, regardless of source, does the same three things:
Generic ingestion pipeline:
- Parser: knows the source format and converts it to a flat list of {role, text} turns. ~20–30 lines of code per source.
- Extractor: always Claude Haiku, with the prompt tuned per domain. Input: turns plus metadata hints (commit message, ticket title, ...). Output: decision_summary, reasoning, decision_type, tags.
- Writer: always the same two API calls. POST /capture/fragment stores raw content; POST /record/decision stores extracted reasoning. Never changes; domain-agnostic.
The parser is the only thing that changes between source systems. Every format looks different on the wire:
Source formats (all produce the same output):
- Claude Code session (JSONL)
- Zoom transcript (VTT)
- Slack thread (JSON)
- Jupyter notebook (.ipynb)
But every parser produces the same output: a flat list of turns. Once you have turns, everything downstream — Haiku extraction, fragment write, decision write, edge creation — is identical regardless of where the data came from.
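As an illustration, a parser for a JSONL-style session log might look like the following. The on-disk event format shown is an assumption, but the output shape, a flat list of {role, text} turns, is the contract that matters.

```python
import json

# Illustrative parser for a Claude Code-style JSONL session log. The exact
# event format is an assumption; the point is that every parser, whatever
# its input, emits the same flat list of {role, text} turns.

def parse_jsonl_session(raw: str) -> list[dict]:
    turns = []
    for line in raw.splitlines():
        if not line.strip():
            continue
        event = json.loads(line)
        # Keep only conversational events; skip tool output, metadata, etc.
        if event.get("role") in ("user", "assistant") and event.get("text"):
            turns.append({"role": event["role"], "text": event["text"]})
    return turns
```

A VTT or .ipynb parser is the same size: a format-specific loop that ends in the identical `turns` list.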
The prompt tweak is smaller than it sounds. The extractor prompt has one domain-specific section: the vocabulary.
Domain vocabulary (the only thing that changes in the extractor prompt)

# Engineering session
"decision_type: code_change | bug_fix | feature_add | refactor | config_change"
"workflow: development"

# Support ticket
"decision_type: escalation | workaround | refund | resolution"
"workflow: support"

# Architecture review meeting
"decision_type: adopt | reject | defer | spike | revisit"
"workflow: architecture"

# Data science notebook
"decision_type: hypothesis_confirmed | hypothesis_rejected | investigation | model_selection"
"workflow: data_science"
Everything else in the prompt — the instruction to extract summary, reasoning, tags, confidence — stays the same. Haiku already knows what those concepts mean. You are just telling it which vocabulary to use for your specific domain.
This means adding a new source system to an existing context graph is not a large engineering project. It is a parser (the format-specific piece, ~20–30 lines) and a prompt section (the vocabulary, ~5 lines). The graph, the API, the search, the ranking — none of that changes.
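A sketch of how that assembly might look, with the shared instructions and the per-domain vocabulary as the only moving parts. The exact prompt wording here is illustrative, not the production prompt.

```python
# Sketch of extractor prompt assembly: a shared instruction body plus a
# per-domain vocabulary section. Vocabulary strings mirror the examples
# above; the surrounding wording is an illustrative assumption.

VOCABULARIES = {
    "development": "decision_type: code_change | bug_fix | feature_add | refactor | config_change",
    "support": "decision_type: escalation | workaround | refund | resolution",
    "architecture": "decision_type: adopt | reject | defer | spike | revisit",
    "data_science": "decision_type: hypothesis_confirmed | hypothesis_rejected | investigation | model_selection",
}

SHARED_INSTRUCTIONS = (
    "Extract from the conversation: decision_summary, reasoning, tags, confidence.\n"
    "Respond as JSON."
)

def build_extractor_prompt(workflow: str, turns: list[dict]) -> str:
    # Flatten the turns into a plain transcript for the model.
    transcript = "\n".join(f'{t["role"]}: {t["text"]}' for t in turns)
    return (
        f"{SHARED_INSTRUCTIONS}\n"
        f"{VOCABULARIES[workflow]}\n"
        f"workflow: {workflow}\n\n"
        f"{transcript}"
    )
```

Adding a new domain is one new entry in `VOCABULARIES`; everything else in the pipeline is untouched.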
Entity Resolution: The Piece That Holds Everything Together
Here is a problem that sounds minor and turns out to be central: the same real-world thing has a different name in every system.
The same payment service might be “Payment Service” in Jira, payments-prod in PagerDuty alerts, the payment-service repo on GitHub, “the payments svc” in Slack threads, and PaymentService in coding sessions. If each of these creates a separate entity in the graph, you have five disconnected islands. A query about “the payment service” returns fragments from Jira, but not the incidents from PagerDuty, not the PRs from GitHub, not the Slack discussions. The graph has the data. It just cannot connect it.
This is the entity identity problem. It is one of the most underappreciated problems in enterprise AI, and it is the primary reason cross-source queries fail in practice.
Entity resolution is the process of mapping raw mentions to canonical entity IDs:
Entity resolution in practice
The resolution uses Reciprocal Rank Fusion over two signals: BM25 keyword similarity and E5 semantic embedding similarity. The score threshold (default 0.3) determines whether a mention is resolved to an existing entity or creates a new one.
Resolution flow for the incoming mention “Payment Service”:
1. BM25(“Payment Service” vs all entity names)
2. E5 embedding similarity
3. Scores fused via RRF
4. Best match: “payment-service”, score 0.71
5. 0.71 > 0.3 threshold → reuse ent-payment-svc
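A minimal sketch of that flow, assuming standard Reciprocal Rank Fusion over the two ranked candidate lists. The rank constant k=1 is illustrative (the real system's constant may differ); the 0.3 threshold is the default quoted in the text.

```python
# Sketch of RRF-based entity resolution. bm25_ranked and embedding_ranked
# are candidate entity IDs already ordered by each signal; k=1 is an
# illustrative rank constant, not the system's actual value.

def rrf_fuse(bm25_ranked: list[str], embedding_ranked: list[str], k: int = 1) -> dict:
    scores: dict[str, float] = {}
    for ranked in (bm25_ranked, embedding_ranked):
        for rank, entity_id in enumerate(ranked, start=1):
            # Standard RRF contribution: 1 / (k + rank) per signal.
            scores[entity_id] = scores.get(entity_id, 0.0) + 1.0 / (k + rank)
    return scores

def resolve_mention(mention, bm25_ranked, embedding_ranked,
                    create_entity, threshold: float = 0.3) -> str:
    scores = rrf_fuse(bm25_ranked, embedding_ranked)
    if scores:
        best, score = max(scores.items(), key=lambda kv: kv[1])
        if score > threshold:
            return best            # reuse the existing canonical entity
    return create_entity(mention)  # no good match: mint a new entity
```

The threshold is the whole behaviour knob: raise it and the system mints more new entities; lower it and it merges more aggressively.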
Without resolution: 5 separate entity nodes, 5 disconnected subgraphs, and cross-source queries fail.
With resolution: 1 entity node, 1 connected subgraph, and cross-source queries work.
The downstream effect is significant. With resolution working correctly, a query for “what happened with the payment service last quarter?” returns Jira tickets, PagerDuty incidents, GitHub PRs, and Claude Code sessions — all as results from a single search, all linked through a single canonical entity ID. Without it, the same query returns only whichever source system happened to use the exact keyword you searched for.
Correctness is also important here. Resolution edges are first-class graph objects with their own metadata: confidence score, match method, and source system. When a human corrects a wrong resolution, the old edge is expired and a new one is written. The full audit trail of every resolution decision — right and wrong — is preserved in the graph.
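That expire-and-replace behaviour can be sketched directly. The field names on the edge object are illustrative, not the production schema.

```python
from dataclasses import dataclass

# Sketch of resolution edges as first-class objects with an audit trail.
# Correcting a wrong resolution expires the old edge and appends a new one;
# nothing is ever deleted. Field names are illustrative.

@dataclass
class ResolutionEdge:
    mention: str
    entity_id: str
    confidence: float
    method: str              # e.g. "rrf_auto" or "human_correction"
    expired: bool = False

def correct_resolution(edges: list, mention: str, new_entity_id: str) -> None:
    for e in edges:
        if e.mention == mention and not e.expired:
            e.expired = True  # expire, never delete: the audit trail survives
    edges.append(ResolutionEdge(mention, new_entity_id, 1.0, "human_correction"))
```

Because the wrong edge is expired rather than removed, the graph retains the history of what the resolver believed and when a human overruled it.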
What Actually Happens When You Ask a Question
Part 1 described hybrid search as “Reciprocal Rank Fusion combining multiple signals.” That description is accurate but abstract. Here is what actually happens, step by step, when you ask “what is the status of the auth implementation?”
Step 1: The query becomes a vector
Before anything reaches the search index, the query text is embedded:
“what is the status of the auth implementation?”
q = [0.021, -0.147, 0.309, 0.088, -0.211, ...]
384 floats representing the meaning of the query in embedding space. This vector will be used to find documents that are semantically similar, regardless of whether they share exact keywords.
Step 2: Two searches run in parallel
A single YQL query triggers two independent retrieval paths simultaneously:
BM25 keyword search tokenises the query into [“auth”, “implementation”, “status”] and scores every document containing these terms using term frequency and inverse document frequency: “auth” is rare, so it gets a high IDF; “status” is common, so it gets a lower one. It returns keyword matches.
HNSW vector search takes q from Step 1 and walks the HNSW graph by cosine distance, finding the 20 nearest neighbours to q in embedding space. It finds documents about “token refresh”, “identity layer”, and “OAuth2” even when those words never appear in the query. It returns semantic matches.
This is why two signals matter. BM25 catches exact matches — if a trace explicitly mentions “auth implementation,” BM25 finds it. HNSW catches semantic matches — if a trace discusses “token validation” or “OAuth2 flow,” the vector similarity finds it even though those words are not in the query. Each signal finds things the other misses.
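In Vespa, this kind of hybrid retrieval can be expressed as a single YQL query. The document type, field name, and query tensor name below are assumptions, not the system's actual schema:

```sql
select * from decision_trace where
    userQuery() or
    ({targetHits: 20}nearestNeighbor(decision_embedding, q))
```

The query text arrives as the standard query parameter (feeding `userQuery()`), the embedded vector is bound to `input.query(q)`, and Vespa runs both retrieval arms before handing the union of candidates to the ranking phases.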
Step 3: First-phase ranking
All ~60 candidates are scored cheaply. No model calls, just arithmetic over stored attributes:
score = bm25(decision_summary) + bm25(reasoning) + closeness(field, decision_embedding)
trace-g7h8 "Implemented OAuth2 token refresh" → 5.02
trace-j1k2 "Payment 401 spike — rolled back" → 2.79
Top 10 pass through. The other 50 are dropped.
Step 4: Second-phase ranking with quality signals
This is where the context graph diverges fundamentally from a vector store. The top 10 survivors are re-scored using outcome history:
final_score = first_phase_score × confidence × outcome_multiplier
The outcome_multiplier rewards decisions whose outcomes were tracked as successful and sharply penalises those that failed:
trace-g7h8 “Implemented OAuth2 token refresh”: 5.02 × 0.89 × 1.5 = 6.70 ← ranks #1
trace-j1k2 “Payment 401 spike — rolled back”: 2.79 × 0.72 × 0.12 = 0.24 ← ranks last
The failed attempt had reasonable keyword overlap with the query. Without quality scoring, it would rank competitively. With quality scoring, it is pushed to the bottom where it belongs — while still being visible in the results as a cautionary example.
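The arithmetic is easy to reproduce. The multiplier values below are assumptions read off the worked example (1.5 for a successful outcome, 0.12 for the failed, rolled-back trace), not a documented table.

```python
# Reproducing the second-phase arithmetic above. The multiplier table is an
# assumption consistent with the worked numbers in the text.

OUTCOME_MULTIPLIER = {
    "successful": 1.5,          # confirmed good outcome: boosted
    "failed_rolled_back": 0.12, # failed and rolled back: heavily penalised
    "pending": 1.0,             # no outcome tracked yet: neutral
}

def second_phase_score(first_phase: float, confidence: float, outcome: str) -> float:
    return first_phase * confidence * OUTCOME_MULTIPLIER[outcome]
```

Because the multiplier scales rather than filters, the failed trace is still returned, just ranked last, which is exactly the "visible cautionary example" behaviour described above.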
Step 5: Fields projected, embeddings dropped
Vespa returns the top 5 hits as JSON. The embedding vectors — used for retrieval, never needed by the caller — are stripped before the response leaves Vespa. Only human-readable fields travel over the wire.
{
  "trace_id": "trace-g7h8",
  "decision_summary": "Implemented OAuth2 token refresh in payment service",
  "reasoning": "Following PROJ-892 decision, pre-emptive refresh chosen to avoid 401 storms on token expiry...",
  "outcome_status": "successful",
  "tags": ["payment-service", "oauth2", "commit:3f9a1c2"],
  "relevance": 6.70
}
The Missing Synthesis Layer
Here is where most context graph implementations stop — and where the user experience falls short.
The search returns accurate, well-ranked results. But it returns them as a JSON array of decision traces. The user asked a question in plain English. They received a data structure.
They now have to read three reasoning fields and mentally assemble an answer. That is not a great experience, and it is not necessary. The synthesis step — turning ranked results back into a direct answer — is one additional model call:
Without synthesis
User question
→ Vespa search
→ JSON results
→ User reads 3 records
→ User figures it out
With synthesis
User question
→ Vespa search
→ top-k traces
→ Haiku prompt
→ Direct answer in plain English with source IDs
The synthesis prompt is straightforward:
The user asked: “what is the status of the auth implementation?”
Here are the relevant decisions from the context graph, ranked by relevance and outcome quality:
1. [decision_summary, reasoning, outcome_status, decided_at]
2. [decision_summary, reasoning, outcome_status, decided_at]
3. [decision_summary, reasoning, outcome_status, decided_at]
Answer the user's question directly and concisely.
Cite which traces support your answer.
If outcomes conflict, say so explicitly.
The output the user actually reads:
The auth implementation is complete and working.
The core decision (PROJ-892) consolidated three parallel auth paths into a single OAuth2 flow. This was implemented using pre-emptive token refresh — a deliberate choice made after a failed rollout midway through migration, where old tokens still in rotation caused a 401 spike that required a rollback.
The final implementation has been stable since the fix. Outcome: successful. Not overridden.
Sources: trace-g7h8, trace-d4e5, trace-j1k2
The context graph did the hard work: retrieving the right traces, ranking them correctly, surfacing the outcome history. The synthesis layer is a thin pass over that result that costs one fast model call and converts data into an answer.
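The whole synthesis step fits in one small function: format the ranked traces into the prompt, make one model call. The `call_model` function below stands in for the fast model the text names (Claude Haiku); its signature is an assumption.

```python
# Sketch of the synthesis step: ranked traces in, one model call out.
# call_model is a hypothetical stand-in for the real model client.

def synthesise_answer(question: str, traces: list[dict], call_model) -> str:
    lines = [
        f'The user asked: "{question}"',
        "Here are the relevant decisions from the context graph, "
        "ranked by relevance and outcome quality:",
    ]
    for i, t in enumerate(traces, start=1):
        lines.append(
            f'{i}. {t["decision_summary"]} | {t["reasoning"]} '
            f'(outcome: {t["outcome_status"]}, decided: {t["decided_at"]}, id: {t["trace_id"]})'
        )
    lines += [
        "Answer the user's question directly and concisely.",
        "Cite which traces support your answer.",
        "If outcomes conflict, say so explicitly.",
    ]
    return call_model("\n".join(lines))
```

Note that only human-readable fields go into the prompt; the embeddings were already stripped in Step 5, so there is nothing wasteful left to filter out here.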
The reason synthesis belongs in the architecture is that it closes the loop on the user experience. A system that requires users to read and interpret raw JSON has friction at every query. A system that answers questions in plain English, with source citations, gets used continuously. Continuous use means continuous capture. Continuous capture means the graph keeps growing. The synthesis layer is what makes the whole thing worth doing.
The Accumulation Effect
There is a property of context graphs that is easy to state but takes time to appreciate: the graph at month six is categorically different from the graph at day one — not just larger, but qualitatively more useful in a way that has no equivalent in traditional RAG.
With traditional RAG, answer quality over time is flat: once the corpus is indexed, month six looks like day one. With a context graph, quality over time keeps climbing, because every month adds outcomes, corrections, and precedent links.
The quality signals only exist through real usage over real time. You cannot manufacture them. A 0.3x penalty on a failed decision only applies if someone tracked the failure as an outcome. A 1.5x boost on a successful decision only applies if someone confirmed the success. You cannot pre-populate these — they accumulate naturally as decisions are made and as humans correct and confirm results.
The precedent chains compound in a specific way. When decision C cites decisions A and B as precedent, and later decision D cites C, decision D implicitly inherits the lessons of A and B — even if the engineer making decision D has never heard of A or B. The knowledge propagates through the citation graph without anyone explicitly transferring it.
Month 1:
A
no chains yet
Month 3:
A ──preceded_by──▶ B ──preceded_by──▶ C
"we learned from A and B"
Month 6:
A ──preceded_by──▶ B ──preceded_by──▶ C ──preceded_by──▶ D
D inherits lessons of A, B, C without D knowing about them
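The inheritance claim can be made concrete with a small traversal over preceded_by edges. The `edges` adjacency map here is hypothetical; in the real graph these would be edge documents queried from the store.

```python
# Sketch: walking preceded_by edges to collect the full precedent lineage a
# decision inherits. edges maps each trace ID to the traces it directly cites.

def precedent_lineage(trace_id: str, edges: dict) -> list:
    seen, order = set(), []
    stack = list(edges.get(trace_id, []))
    while stack:
        t = stack.pop()
        if t in seen:
            continue
        seen.add(t)
        order.append(t)
        # Follow the chain: what this precedent itself cited.
        stack.extend(edges.get(t, []))
    return order
```

This is why decision D "inherits" A and B: the lineage query surfaces them even though D only ever cited C directly.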
There is a second accumulation effect that is less obvious: entity density. At month one, entity resolution has seen a limited vocabulary of names from a limited number of source systems. By month six, it has processed thousands of mentions across Jira, PagerDuty, GitHub, Slack, and coding sessions. The entity model has seen every variation of every service name, every team alias, every customer shorthand. Resolution confidence improves simply because the system has seen the patterns before.
This means the graph at month six is not just quantitatively larger. It is qualitatively smarter at connecting the right things, surfacing the right precedent, and suppressing the wrong answers — all without any model retraining, any explicit rule writing, or any human curation. The improvement is a structural property of accumulation.
What This Looks Like End to End
Pull it all together into a single example. A developer joins a team that has been running a context graph for six months.
On their first day, they open a Claude Code session and ask: “We need to add rate limiting to the payment API. What should I know before I start?”
Top results:
#1 trace-r4s5 "Implemented rate limiting on payment API"
source: claude_code · outcome: successful · 6 months ago
Token bucket chosen over sliding window after testing. Redis-backed. 1000 req/min per customer tier.
#2 trace-f2g3 "Rate limiter caused false positives in batch jobs"
source: pagerduty · outcome: failed · 5 months ago
Batch job exceeded per-customer limit. Fixed with separate limit bucket for service accounts.
#3 trace-h6i7 "Extended rate limiter to cover payment webhooks"
source: jira · outcome: successful · 3 months ago
Webhook endpoints exempt from customer rate limits, separate service-level cap at 10k/min applied.
“Rate limiting on the payment API is already implemented using a Redis-backed token bucket (1000 req/min per customer tier). One known issue: batch jobs hit the per-customer limit and caused alerts — fixed by giving service accounts a separate bucket. Webhook endpoints are exempt from customer limits with their own cap.
Before adding more rate limiting, check whether your use case involves batch processing or webhooks — both have special handling already in place.
Sources: trace-r4s5, trace-f2g3, trace-h6i7”
The developer just received six months of institutional knowledge about this exact topic — including the failure mode that was discovered and fixed — in seconds, without asking anyone, without reading through ticket histories, without knowing those decisions existed.
That answer came from three source systems (Claude Code, PagerDuty, Jira), was connected through a single canonical entity (ent-payment-api), was ranked by actual outcomes, and was synthesised into plain English by one fast model call.
None of that required any special instrumentation after the initial setup. The Jira webhook was running. The PagerDuty webhook was running. The capture bridge was running. The graph accumulated those decisions automatically. The developer's question just surfaced what was already there.
Conclusion
Part 1 made the case that AI agents need a different kind of memory than traditional RAG provides — one that remembers decisions, tracks outcomes, and learns from experience.
This post has been about the less glamorous half of that story: how the memory actually gets populated, how the data holds together across source systems, and what happens at each step between a natural language question and a useful answer.
A few things are worth restating plainly.
The capture problem is harder than the query problem. A well-designed context graph query is straightforward to implement. Getting the data into the graph without adding friction for the people who generate it is where most implementations stall. The capture bridge pattern — hooking into existing tools and triggers rather than asking people to write structured records — is the answer to this. If capture is frictionless, the graph fills itself.
Entity resolution is not optional. Every source system names things differently. Without resolution, you have a collection of per-system silos with a shared API. With resolution, you have a unified graph where a question about a service returns everything that ever mentioned it, regardless of how it was spelled. The difference in search quality is enormous.
Quality signals only exist through real usage. The outcome multipliers that push successful decisions up and failed decisions down do not come from anywhere except real humans tracking real results over real time. You cannot bootstrap them. You cannot simulate them. They accumulate. This is why a context graph at month six is not just quantitatively larger than at day one — it is genuinely smarter in a way that requires the passage of time.
The synthesis layer is what makes it usable. Search that returns ranked JSON is a developer tool. Search that returns a plain English answer with citations is a product. The additional Haiku call that converts results into prose is not an implementation detail — it is the difference between a system that gets queried once and abandoned and one that becomes part of how people work.
The deepest property of this architecture is one that has no equivalent in traditional RAG: the system's value compounds with use. Every decision recorded is precedent for the next similar situation. Every outcome tracked is a quality signal that improves future ranking. Every entity resolved is a connection that makes cross-source queries more complete. None of this requires any explicit human curation after the initial setup. It accumulates as a natural consequence of the system being used.
That is the difference between a system that stores information and a system that accumulates institutional wisdom.