Provenance over similarity

Contexta Team · 8 min read
  • provenance
  • recall
  • audit

Similarity is a guess. Provenance is a proof.

Vector similarity is the default answer for retrieval because it is cheap, easy, and good enough at demo time. It is also the wrong primitive to put underneath an agent that has to defend its answers.

Cosine similarity tells you that two pieces of text resemble one another. It does not tell you which one the agent should trust, when each was last verified, or whether the same source was the basis for a different answer yesterday. When the only thing your retrieval can rank by is "looks similar," the model has nothing to lean on when two passages disagree, and disagreement is the normal state of any non-trivial corpus.

Provenance is the structural answer. A provenance graph records where a fact came from, when it was learned, when it was valid, who signed off on it, and what other facts depend on it. Provenance gives the recall layer something to rank by that semantic similarity cannot — verifiability. The agent stops choosing the nearest neighbour and starts choosing the most defensible one.
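As a rough illustration, the fields a provenance graph records can be pictured as a typed node, with the "what other facts depend on it" edge walked in reverse. This is a sketch for exposition only: `ProvenanceNode`, its field names, `dependentsOf`, and the sample `facts` are hypothetical and are not Contexta's schema or API.

```typescript
// Hypothetical node shape; field names are assumptions, not the Contexta schema.
interface ProvenanceNode {
  factId: string;
  sourceId: string;         // where the fact came from
  learnedAt: string;        // ISO 8601: when the system learned it
  validFrom: string;        // ISO 8601: when the fact became valid
  validTo: string | null;   // null means "still valid"
  signedBy: string | null;  // who signed off, if anyone
  dependsOn: string[];      // ids of the facts this one was derived from
}

// Return every fact that depends on `rootId`, directly or transitively,
// in breadth-first order.
function dependentsOf(rootId: string, nodes: ProvenanceNode[]): string[] {
  // Invert the dependsOn edges: fact id -> facts derived from it.
  const reverse = new Map<string, string[]>();
  for (const node of nodes) {
    for (const dep of node.dependsOn) {
      const list = reverse.get(dep) ?? [];
      list.push(node.factId);
      reverse.set(dep, list);
    }
  }
  const seen = new Set<string>();
  const queue: string[] = [rootId];
  while (queue.length > 0) {
    const current = queue.shift() as string;
    for (const child of reverse.get(current) ?? []) {
      if (!seen.has(child)) {
        seen.add(child);
        queue.push(child);
      }
    }
  }
  return Array.from(seen);
}

// Illustrative data only.
const base = {
  sourceId: 'hr-system',
  learnedAt: '2025-01-10T00:00:00Z',
  validFrom: '2025-01-01T00:00:00Z',
  validTo: null,
  signedBy: null,
};
const facts: ProvenanceNode[] = [
  { ...base, factId: 'f1', dependsOn: [] },
  { ...base, factId: 'f2', dependsOn: ['f1'] },
  { ...base, factId: 'f3', dependsOn: ['f2'] },
  { ...base, factId: 'f4', dependsOn: [] },
];
```

If `f1` were later retracted, `dependentsOf('f1', facts)` would list the facts whose grounding needs to be re-checked, which is the kind of question similarity alone has no way to answer.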

What the architecture audit found

The 2026-05 architecture audit re-ran every coverage probe against the design and reached a sharper version of the same conclusion. The audit registered 29 unique gaps across 36 experiments — and the cluster of gaps that most directly mattered to recall quality was about provenance, not embeddings.

Two findings stood out:

  • Confidence calibration is unmeasured (GAP-28). Without explicit calibration, the similarity score the model sees is a number with no operating curve attached. The recall layer cannot tell a 0.82 cosine match in a clean corpus from a 0.82 match in a noisy one, and the agent inherits that ambiguity.
  • Graph + provenance export is missing from the surface (GAP-33). Customers operating in regulated contexts need to be able to walk the provenance graph offline — for audit, for legal hold, for portability. The design now treats the provenance export as a first-class deliverable, not a follow-up.

These are not embedding bugs. They are grounding bugs. The audit's prescription was the same one we had been moving toward: blend three signals at recall time — semantic, structural, provenance — and tune the blend per namespace.
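To make the calibration gap concrete: one common way to attach an operating curve to a raw score is to bucket labelled retrieval results by score and measure how often each bucket was actually correct. The sketch below assumes a hand-labelled evaluation set exists; `ScoredHit`, `reliabilityBins`, and the sample data are hypothetical and are not part of the audit or of Contexta's tooling.

```typescript
// One labelled retrieval result: the raw similarity score and whether
// a human judged the retrieved passage correct. Hypothetical shape.
interface ScoredHit {
  score: number;   // expected in [0, 1]
  correct: boolean;
}

interface ReliabilityBin {
  lower: number;
  upper: number;
  count: number;
  precision: number | null; // null when the bin is empty
}

// Bucket hits into equal-width score bins and report observed precision per bin.
function reliabilityBins(hits: ScoredHit[], binCount: number): ReliabilityBin[] {
  const tallies = Array.from({ length: binCount }, (_, i) => ({
    lower: i / binCount,
    upper: (i + 1) / binCount,
    count: 0,
    correctCount: 0,
  }));
  for (const hit of hits) {
    const clamped = Math.min(Math.max(hit.score, 0), 1);
    // A score of exactly 1 belongs in the last bin.
    const index = Math.min(Math.floor(clamped * binCount), binCount - 1);
    tallies[index].count += 1;
    if (hit.correct) tallies[index].correctCount += 1;
  }
  return tallies.map((t) => ({
    lower: t.lower,
    upper: t.upper,
    count: t.count,
    precision: t.count === 0 ? null : t.correctCount / t.count,
  }));
}

// Illustrative data: the same score range, two different corpora.
const cleanCorpus: ScoredHit[] = [
  { score: 0.82, correct: true },
  { score: 0.85, correct: true },
  { score: 0.81, correct: true },
  { score: 0.88, correct: false },
];
const noisyCorpus: ScoredHit[] = [
  { score: 0.82, correct: false },
  { score: 0.83, correct: true },
  { score: 0.86, correct: false },
  { score: 0.84, correct: false },
];

const cleanTopBin = reliabilityBins(cleanCorpus, 5)[4];
const noisyTopBin = reliabilityBins(noisyCorpus, 5)[4];
```

With this made-up data, the 0.8 to 1.0 bin is right three times out of four in the clean corpus and one time out of four in the noisy one, even though the scores look alike. That per-corpus table is the piece a bare similarity score lacks.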

The hybrid recall score

Contexta's recall does not return the nearest neighbour. It returns the best Context Packet — a structured response whose ranking score is the weighted blend of three independent signals:

score(packet) =
  w_sem * semantic(packet, query)
+ w_struct * structural(packet, query)
+ w_prov * provenance(packet)

Each weight is tuned per namespace. Three concrete examples:

  • A regulated-industry namespace pushes w_prov toward 0.6. Cosine matches in the right region of vector space are necessary but not sufficient; the recall ranks signed, in-date sources above unsigned ones, and refuses to return anything below the namespace's confidence floor.
  • A consumer-support namespace leans on w_sem because the corpus is broad and the cost of a near-miss is low; provenance still acts as a tiebreaker between equally-similar candidates.
  • A graph-heavy namespace — think trading or supply-chain — pushes w_struct up so multi-hop reachability matters more than raw embedding closeness. The query "which of our suppliers' suppliers exposed us to that bankruptcy?" is a structural question; semantic similarity cannot answer it.
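The supplier question in the last example is a bounded graph traversal rather than a similarity lookup. A minimal sketch, assuming the supplier relationships are already available as an adjacency map; `pathsTo`, the company names, and the hop limit are illustrative and do not describe how Contexta computes its structural signal.

```typescript
// Hypothetical supplier graph: company -> its direct suppliers.
const suppliers = new Map<string, string[]>([
  ['us', ['acme', 'bolt']],
  ['acme', ['zinc']],
  ['bolt', ['zinc', 'ore']],
  ['zinc', []],
]);

// Find every simple path from `start` to `target` that uses at most `maxHops` edges.
function pathsTo(
  graph: Map<string, string[]>,
  start: string,
  target: string,
  maxHops: number,
): string[][] {
  const results: string[][] = [];
  const walk = (path: string[]): void => {
    const current = path[path.length - 1];
    if (current === target && path.length > 1) {
      results.push(path);
      return;
    }
    if (path.length - 1 >= maxHops) return; // hop budget spent
    for (const next of graph.get(current) ?? []) {
      // Skip nodes already on the path so cycles cannot loop forever.
      if (path.indexOf(next) === -1) walk(path.concat(next));
    }
  };
  walk([start]);
  return results;
}

// "Which of our suppliers' suppliers exposed us to zinc's bankruptcy?"
const exposure = pathsTo(suppliers, 'us', 'zinc', 2);
// The middle node of each two-hop path is the direct supplier that carried the exposure.
const exposedVia = exposure.map((path) => path[1]);
```

Each returned path is a chain of hops from the asker to the failing entity, so the answer arrives with the route that justifies it.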

The blend is the wedge. Pure-vector systems cannot do this because they have nothing else to rank by. Graph-only systems cannot do this because they have no smooth semantic axis. Contexta's substrate carries all three, so the blend is a tunable lever rather than an architectural fight.
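The weighted blend can be sketched in a few lines. This is an illustration under stated assumptions, not Contexta's implementation: the type names, `hybridScore`, `rank`, and every weight other than the 0.6 provenance weight mentioned for the regulated namespace are made up for the example, and each signal is assumed to be pre-normalised to the range 0 to 1.

```typescript
// Per-candidate signals, each assumed to be normalised to [0, 1].
interface Signals {
  semantic: number;
  structural: number;
  provenance: number;
}

interface Weights {
  wSem: number;
  wStruct: number;
  wProv: number;
}

interface Candidate extends Signals {
  id: string;
}

// Illustrative per-namespace weights. Only wProv = 0.6 for the regulated
// namespace comes from the text above; the rest are assumptions.
const namespaceWeights = {
  regulated: { wSem: 0.25, wStruct: 0.15, wProv: 0.6 },
  support: { wSem: 0.8, wStruct: 0.1, wProv: 0.1 },
  supplyChain: { wSem: 0.2, wStruct: 0.6, wProv: 0.2 },
};

// score = w_sem * semantic + w_struct * structural + w_prov * provenance
function hybridScore(signals: Signals, weights: Weights): number {
  return (
    weights.wSem * signals.semantic +
    weights.wStruct * signals.structural +
    weights.wProv * signals.provenance
  );
}

// Rank candidates from highest to lowest blended score.
function rank(candidates: Candidate[], weights: Weights): Array<Candidate & { score: number }> {
  return candidates
    .map((c) => ({ ...c, score: hybridScore(c, weights) }))
    .sort((a, b) => b.score - a.score);
}

// Two hypothetical candidates: one closer in embedding space but weakly
// sourced, one slightly further away but signed and in date.
const candidates: Candidate[] = [
  { id: 'similar-unsigned', semantic: 0.9, structural: 0.2, provenance: 0.1 },
  { id: 'signed-in-date', semantic: 0.7, structural: 0.2, provenance: 0.9 },
];

const regulatedTop = rank(candidates, namespaceWeights.regulated)[0].id;
const supportTop = rank(candidates, namespaceWeights.support)[0].id;
```

With these made-up numbers the regulated weights put the signed source first, while the support weights put the closer semantic match first. The candidates are identical in both runs; only the namespace's weights changed.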

What a Packet looks like

A Context Packet is the only contract the LLM consumes. It is the unit that recall returns and that Reflex firings emit. Every Packet carries its own provenance, so the model never has to ask "where did this come from?" — the answer is already in the payload.

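import { Contexta } from '@contexta/sdk';

// Read the API key from the environment.
const ctx = new Contexta({ apiKey: process.env.CONTEXTA_API_KEY });

const packet = await ctx.recall({
  query: 'Where did Alice work in March 2025?',
  userId: 'alice',
  asOf: '2025-03-15T00:00:00Z', // answer from what was known at this instant
  minConfidence: 0.85,          // drop anything below this confidence floor
});

// Every citation in the Packet carries its own provenance:
// packet.citations[i] = {
//   source_id, signed_by, valid_at, learned_at,
//   confidence, hop_chain, motif_id
// }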

Three things are doing work here:

  1. asOf is a bi-temporal lens. The recall returns what was known at that time, not what the system now believes. This is the difference between "what did we know on the day we signed?" and "what do we believe today?"
  2. minConfidence is the calibration knob. Anything below the floor is dropped, not surfaced with a low rank — because a low-rank answer in production is still an answer the agent will quote.
  3. citations are the audit trail. Each citation is a node in the provenance graph; the hop_chain is the traversal that led from the query to the source. A regulator can replay it.
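The first two points can be pictured as a filter over citation records. The sketch below is a local illustration of the semantics described above, not the SDK's internals: it reuses the `learned_at` and `confidence` field names from the citation comment, and it assumes that "known at that time" means `learned_at` is no later than `asOf`. `CitationRecord` and `visibleAt` are hypothetical names.

```typescript
// Subset of the citation fields listed above; the shape is an assumption.
interface CitationRecord {
  source_id: string;
  learned_at: string; // ISO 8601 timestamp
  confidence: number; // assumed to be in [0, 1]
}

// Keep only citations the system had already learned at `asOf`
// and whose confidence meets the floor. Everything else is dropped
// rather than returned with a low rank.
function visibleAt(
  citations: CitationRecord[],
  asOf: string,
  minConfidence: number,
): CitationRecord[] {
  const cutoff = Date.parse(asOf);
  return citations.filter(
    (c) => Date.parse(c.learned_at) <= cutoff && c.confidence >= minConfidence,
  );
}

// Illustrative records only.
const records: CitationRecord[] = [
  { source_id: 'contract-2025-02', learned_at: '2025-03-01T00:00:00Z', confidence: 0.9 },
  { source_id: 'linkedin-update', learned_at: '2025-04-01T00:00:00Z', confidence: 0.95 },
  { source_id: 'forum-post', learned_at: '2025-02-01T00:00:00Z', confidence: 0.6 },
];

const visible = visibleAt(records, '2025-03-15T00:00:00Z', 0.85);
```

Here only `contract-2025-02` survives. The later update is excluded because it was not yet known on 2025-03-15, however confident it is today, and the forum post is excluded by the floor.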

Why this is the moat

Models will keep getting cheaper. Embeddings will keep getting better. Vector databases will keep commoditising. None of those improvements close the grounding gap, because the gap is structural — it is about what the substrate can prove, not how cleverly it can rank.

The hybrid score is the smallest possible primitive that captures the real shape of the problem:

       semantic        structural       provenance
        (what          (how it          (how we
       it says)         connects)        know it)
           \             |              /
            \            |             /
             +---- Context Packet ----+
                        |
                        v
                  defensible answer

Three signals, one substrate, weighted per namespace, surfaced through a Packet whose citations a regulator can walk. That is what verifiable AI looks like on the inside.

Similarity is a guess about meaning. Provenance is a proof about origin. Production agents need both — but if you have to pick which one to architect for first, pick the one you can defend.

About the author

Contexta Team

The Contexta team ships the context harness for production AI agents — persistent memory, declarative Reflexes, and verifiable provenance, all in one substrate.
