Similarity is a guess. Provenance is a proof.
Vector similarity is the default answer for retrieval because it is cheap, easy, and good enough at demo time. It is also the wrong primitive to put underneath an agent that has to defend its answers.
Cosine distance tells you that two pieces of text resemble one another. It does not tell you which one the agent should trust, when each was last verified, or whether the same source was the basis for a different answer yesterday. When the only thing your retrieval can rank by is "looks similar," the model has nothing to lean on when two passages disagree — and disagreement is the normal state of any non-trivial corpus.
Provenance is the structural answer. A provenance graph records where a fact came from, when it was learned, when it was valid, who signed off on it, and what other facts depend on it. That gives the recall layer something to rank by that semantic similarity cannot supply: verifiability. The agent stops choosing the nearest neighbour and starts choosing the most defensible one.
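To make the shape of such a record concrete, here is a minimal sketch of one provenance node and a walk over the "what depends on this?" edges. The field names and the `dependentsOf` helper are illustrative assumptions, not Contexta's published schema.

```typescript
// Hypothetical shape of one node in a provenance graph (assumed field names).
interface ProvenanceNode {
  factId: string;
  sourceId: string;        // where the fact came from
  learnedAt: string;       // when the system ingested it (ISO 8601)
  validFrom: string;       // start of the period in which the fact held
  validTo: string | null;  // null means "still valid"
  signedBy: string | null; // who signed off on it, if anyone
  dependsOn: string[];     // factIds this fact was derived from
}

// Collect every fact that directly or transitively depends on `rootId`.
function dependentsOf(nodes: ProvenanceNode[], rootId: string): string[] {
  // Invert the dependsOn edges: parent factId -> facts derived from it.
  const reverse = new Map<string, string[]>();
  for (const node of nodes) {
    for (const dep of node.dependsOn) {
      const derived = reverse.get(dep) ?? [];
      derived.push(node.factId);
      reverse.set(dep, derived);
    }
  }
  // Breadth-first walk; `seen` also guards against cycles.
  const seen = new Set<string>();
  const queue: string[] = [rootId];
  while (queue.length > 0) {
    const current = queue.shift()!;
    for (const child of reverse.get(current) ?? []) {
      if (!seen.has(child)) {
        seen.add(child);
        queue.push(child);
      }
    }
  }
  return Array.from(seen);
}

// Made-up data: f2 is derived from f1, and f3 is derived from f2.
const sampleGraph: ProvenanceNode[] = [
  { factId: 'f1', sourceId: 'contract.pdf', learnedAt: '2025-01-10T00:00:00Z',
    validFrom: '2025-01-01T00:00:00Z', validTo: null, signedBy: 'legal', dependsOn: [] },
  { factId: 'f2', sourceId: 'summary-job', learnedAt: '2025-01-11T00:00:00Z',
    validFrom: '2025-01-01T00:00:00Z', validTo: null, signedBy: null, dependsOn: ['f1'] },
  { factId: 'f3', sourceId: 'report-job', learnedAt: '2025-01-12T00:00:00Z',
    validFrom: '2025-01-01T00:00:00Z', validTo: null, signedBy: null, dependsOn: ['f2'] },
];

const affectedByF1 = dependentsOf(sampleGraph, 'f1');
```

If `f1` is later retracted, `affectedByF1` lists the downstream facts that need re-verification, which is a question a similarity index has no way to answer.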
What the architecture audit found
The 2026-05 architecture audit re-ran every coverage probe against the design and reached a sharper version of the same conclusion. The audit registered 29 unique gaps across 36 experiments — and the cluster of gaps that most directly mattered to recall quality was about provenance, not embeddings.
Two findings stood out:
- Confidence calibration is unmeasured (GAP-28). Without explicit calibration, the similarity score the model sees is a number with no operating curve attached. The recall layer cannot tell a 0.82 cosine match in a clean corpus from a 0.82 match in a noisy one, and the agent inherits that ambiguity.
- Graph + provenance export is missing from the surface (GAP-33). Customers operating in regulated contexts need to be able to walk the provenance graph offline — for audit, for legal hold, for portability. The design now treats the provenance export as a first-class deliverable, not a follow-up.
These are not embedding bugs. They are grounding bugs. The audit's prescription was the same one we had been moving toward: blend three signals at recall time — semantic, structural, provenance — and tune the blend per namespace.
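One way to picture what "an operating curve" would add is a reliability table: bucket labelled matches by score and record how often each bucket was actually correct. The sketch below is an assumption about how such a measurement could look, not the audit's method; `LabeledMatch`, `reliabilityBins`, and the sample data are all hypothetical.

```typescript
// A retrieved match plus a human (or gold-set) judgement of whether it was right.
interface LabeledMatch {
  score: number;    // raw similarity in [0, 1]
  correct: boolean; // did the match actually answer the query?
}

interface ReliabilityBin {
  lower: number;
  upper: number;
  count: number;
  accuracy: number | null; // null when the bin holds no samples
}

// Bucket matches by score and compute the observed accuracy per bucket.
function reliabilityBins(samples: LabeledMatch[], binCount: number): ReliabilityBin[] {
  const counts: number[] = new Array<number>(binCount).fill(0);
  const hits: number[] = new Array<number>(binCount).fill(0);
  for (const sample of samples) {
    // Clamp so a score of exactly 1.0 lands in the last bin.
    const idx = Math.min(binCount - 1, Math.max(0, Math.floor(sample.score * binCount)));
    counts[idx] += 1;
    if (sample.correct) hits[idx] += 1;
  }
  return counts.map((count, i) => ({
    lower: i / binCount,
    upper: (i + 1) / binCount,
    count,
    accuracy: count > 0 ? hits[i] / count : null,
  }));
}

// Made-up samples: identical score range, very different hit rates.
const cleanCorpus: LabeledMatch[] = [
  { score: 0.82, correct: true }, { score: 0.85, correct: true },
  { score: 0.88, correct: true }, { score: 0.81, correct: false },
];
const noisyCorpus: LabeledMatch[] = [
  { score: 0.82, correct: false }, { score: 0.85, correct: false },
  { score: 0.88, correct: true }, { score: 0.81, correct: false },
];

// Bin index 8 covers scores in [0.8, 0.9) when there are ten bins.
const cleanAccuracy = reliabilityBins(cleanCorpus, 10)[8].accuracy;
const noisyAccuracy = reliabilityBins(noisyCorpus, 10)[8].accuracy;
```

In this toy data a score around 0.82 is right three times out of four in the clean corpus and one time in four in the noisy one. That per-corpus difference is what an uncalibrated similarity score hides.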
The hybrid recall score
Contexta's recall does not return the nearest neighbour. It returns the best Context Packet — a structured response whose ranking score is the weighted blend of three independent signals:
    score(packet) =
          w_sem    * semantic(packet, query)
        + w_struct * structural(packet, query)
        + w_prov   * provenance(packet)
Each weight is tuned per namespace. Three concrete examples:
- A regulated-industry namespace pushes w_prov toward 0.6. Cosine matches in the right region of vector space are necessary but not sufficient; the recall ranks signed, in-date sources above unsigned ones, and refuses to return anything below the namespace's confidence floor.
- A consumer-support namespace leans on w_sem because the corpus is broad and the cost of a near-miss is low; provenance still acts as a tiebreaker between equally-similar candidates.
- A graph-heavy namespace (think trading or supply-chain) pushes w_struct up so multi-hop reachability matters more than raw embedding closeness. The query "which of our suppliers' suppliers exposed us to that bankruptcy?" is a structural question; semantic similarity cannot answer it.
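The suppliers'-suppliers question can be sketched as a bounded path search. This is a toy in-memory version with made-up company names; `hopChainsTo` is a hypothetical helper, and a real structural signal would presumably run inside the graph store rather than in application code.

```typescript
// Adjacency list: company -> its direct suppliers (made-up data).
const supplierEdges = new Map<string, string[]>([
  ['us', ['acme', 'bolt']],
  ['acme', ['zenith']],
  ['bolt', ['zenith', 'delta']],
  ['zenith', []],
  ['delta', []],
]);

// Return every cycle-free path from `start` to `target` of at most `maxHops` edges.
function hopChainsTo(
  edges: Map<string, string[]>,
  start: string,
  target: string,
  maxHops: number,
): string[][] {
  const chains: string[][] = [];
  const walk = (path: string[]): void => {
    const current = path[path.length - 1];
    if (current === target && path.length > 1) {
      chains.push(path);
      return;
    }
    // path.length - 1 is the number of edges walked so far.
    if (path.length - 1 >= maxHops) return;
    for (const next of edges.get(current) ?? []) {
      // Skip nodes already on the path so the walk cannot loop.
      if (path.indexOf(next) === -1) walk(path.concat(next));
    }
  };
  walk([start]);
  return chains;
}

// "Which of our suppliers' suppliers exposed us to zenith?" -> two hops out.
const exposureChains = hopChainsTo(supplierEdges, 'us', 'zenith', 2);
```

Here `exposureChains` holds two paths, one through `acme` and one through `bolt`. No embedding of the word "bankruptcy" would surface either intermediary; the answer lives in the edges.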
The blend is the wedge. Pure-vector systems cannot do this because they have nothing else to rank by. Graph-only systems cannot do this because they have no smooth semantic axis. Contexta's substrate carries all three, so the blend is a tunable lever rather than an architectural fight.
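As a rough illustration of the blend, the sketch below scores two candidates under two weight profiles. Only the 0.6 provenance weight comes from the text above; every other weight, the signal values, and the names `PacketSignals`, `BlendWeights`, `blendScore`, and `rankPackets` are assumptions made for the example.

```typescript
// Hypothetical per-packet signals, each assumed to be normalised to [0, 1].
interface PacketSignals {
  id: string;
  semantic: number;
  structural: number;
  provenance: number;
}

interface BlendWeights {
  sem: number;
  struct: number;
  prov: number;
}

// score = w_sem * semantic + w_struct * structural + w_prov * provenance
function blendScore(p: PacketSignals, w: BlendWeights): number {
  return w.sem * p.semantic + w.struct * p.structural + w.prov * p.provenance;
}

// Sort a copy of the candidates from highest to lowest blended score.
function rankPackets(packets: PacketSignals[], w: BlendWeights): PacketSignals[] {
  return packets.slice().sort((a, b) => blendScore(b, w) - blendScore(a, w));
}

// Assumed weight profiles; a real deployment would tune these per namespace.
const regulatedWeights: BlendWeights = { sem: 0.3, struct: 0.1, prov: 0.6 };
const supportWeights: BlendWeights = { sem: 0.8, struct: 0.1, prov: 0.1 };

const candidatePackets: PacketSignals[] = [
  { id: 'unsigned-near-match', semantic: 0.9, structural: 0.2, provenance: 0.2 },
  { id: 'signed-in-date', semantic: 0.7, structural: 0.2, provenance: 0.9 },
];

const regulatedOrder = rankPackets(candidatePackets, regulatedWeights).map((p) => p.id);
const supportOrder = rankPackets(candidatePackets, supportWeights).map((p) => p.id);
```

With these numbers the regulated profile puts `signed-in-date` first and the support profile puts `unsigned-near-match` first. Same candidates, same signals, different defensible answer, and the only thing that changed is the namespace's weights.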
What a Packet looks like
A Context Packet is the only contract the LLM consumes. It is the unit that recall returns and that Reflex firings emit. Every Packet carries its own provenance, so the model never has to ask "where did this come from?" — the answer is already in the payload.
    import { Contexta } from '@contexta/sdk';

    const ctx = new Contexta({ apiKey: process.env.CONTEXTA_API_KEY });

    const packet = await ctx.recall({
      query: 'Where did Alice work in March 2025?',
      userId: 'alice',
      asOf: '2025-03-15T00:00:00Z',
      minConfidence: 0.85,
    });

    // packet.citations[i] = {
    //   source_id, signed_by, valid_at, learned_at,
    //   confidence, hop_chain, motif_id
    // }
Three things are doing work here:
- asOf is a bi-temporal lens. The recall returns what was known at that time, not what the system now believes. This is the difference between "what did we know on the day we signed?" and "what do we believe today?"
- minConfidence is the calibration knob. Anything below the floor is dropped, not surfaced with a low rank, because a low-rank answer in production is still an answer the agent will quote.
- citations are the audit trail. Each citation is a node in the provenance graph; the hop_chain is the traversal that led from the query to the source. A regulator can replay it.
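The first two knobs can be approximated with a plain filter over stored facts. This is a sketch under stated assumptions, not the SDK's implementation: `StoredFact` and `recallAsOf` are hypothetical, the validity interval (`validFrom`/`validTo`) is an assumed representation, and the sketch assumes asOf pins both the "known by then" and the "valid at that moment" axes.

```typescript
// Hypothetical stored fact with both time axes and a calibrated confidence.
interface StoredFact {
  id: string;
  claim: string;
  validFrom: string;      // when the fact started being true
  validTo: string | null; // when it stopped being true; null means open-ended
  learnedAt: string;      // when the system recorded it
  confidence: number;
}

// Keep facts that were already known at `asOf`, were valid at `asOf`,
// and clear the confidence floor. Everything else is dropped, not down-ranked.
function recallAsOf(facts: StoredFact[], asOf: string, minConfidence: number): StoredFact[] {
  const t = Date.parse(asOf);
  return facts.filter((f) =>
    Date.parse(f.learnedAt) <= t &&
    Date.parse(f.validFrom) <= t &&
    (f.validTo === null || t < Date.parse(f.validTo)) &&
    f.confidence >= minConfidence,
  );
}

// Made-up facts about where Alice worked.
const aliceFacts: StoredFact[] = [
  { id: 'f1', claim: 'Alice works at Initech', validFrom: '2024-01-01T00:00:00Z',
    validTo: '2025-06-01T00:00:00Z', learnedAt: '2024-01-10T00:00:00Z', confidence: 0.95 },
  { id: 'f2', claim: 'Alice works at Globex', validFrom: '2025-06-01T00:00:00Z',
    validTo: null, learnedAt: '2025-06-02T00:00:00Z', confidence: 0.97 },
  { id: 'f3', claim: 'Alice works at Hooli', validFrom: '2025-01-01T00:00:00Z',
    validTo: null, learnedAt: '2025-02-01T00:00:00Z', confidence: 0.4 },
];

const knownInMarch = recallAsOf(aliceFacts, '2025-03-15T00:00:00Z', 0.85).map((f) => f.id);
```

Only `f1` survives: `f2` had not been learned yet, and `f3` falls under the floor, so it never reaches the agent even as a low-ranked candidate.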
Why this is the moat
Models will keep getting cheaper. Embeddings will keep getting better. Vector databases will keep commoditising. None of those improvements close the grounding gap, because the gap is structural — it is about what the substrate can prove, not how cleverly it can rank.
The hybrid score is the smallest possible primitive that captures the real shape of the problem:
    semantic      structural      provenance
    (what         (how it         (how we
    it says)      connects)       know it)
        \             |              /
         \            |             /
          +---- Context Packet ----+
                      |
                      v
              defensible answer
Three signals, one substrate, weighted per namespace, surfaced through a Packet whose citations a regulator can walk. That is what verifiable AI looks like on the inside.
Similarity is a guess about meaning. Provenance is a proof about origin. Production agents need both — but if you have to pick which one to architect for first, pick the one you can defend.