Building an enterprise RAG system that survives an audit

Most retrieval-augmented generation systems are built to demo well and fail under model-risk review. The architecture that survives an audit is decided by four choices made before the first prompt: citation lineage, access control at retrieval time, the evaluation harness, and a design that anticipates the model being replaced.

Retrieval-augmented generation is the default architecture for enterprise question-answering, and most implementations are built to survive a demo rather than an audit. The demo shows a confident answer drawn from internal documents. The audit asks a different set of questions: which document produced that answer, who was allowed to see it, what the system did when retrieval returned nothing useful, and how you would know if accuracy had degraded since the model was last reviewed. A system that cannot answer those questions is not production-ready in any regulated function, whatever its demo looked like.

This essay sets out the four architectural choices that determine whether a RAG system holds up to model-risk and regulatory review. We have made these choices on deployments inside banks, insurers and pharmaceutical firms, and the pattern is consistent: the decisions that matter are made before the first prompt is written. Retrieval quality, citation lineage, access control and evaluation are not features you add later. They are properties of the architecture, and retrofitting them costs three to five times what designing them in would have.

Citations are a lineage problem, not a UI feature

Most teams treat citations as a presentation layer: the model generates an answer, then a second step finds plausible-looking sources to attach. This is backwards, and it will not survive scrutiny: the citation does not constrain the answer, it decorates it. An auditor who pulls ten answers and traces each back to its cited source will find a material share where the source does not actually support the claim. We have seen that rate run at 15 to 25 percent in systems built this way.

The architecture that survives review pins generation to retrieval. The model is given a fixed set of retrieved chunks, each with a stable identifier, and is instructed to answer only from those chunks and to mark any claim it cannot ground. The cited identifier resolves to the exact chunk, the exact document version, and the exact ingestion timestamp. When the underlying document is updated, the old version is retained and the answer remains traceable to the version that produced it. This is lineage, and it is the single property that distinguishes an auditable system from a plausible one.
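A minimal sketch of what resolving a citation to its lineage can look like. Everything named here is an assumption for illustration: the ChunkRef fields, the [chunk_id] citation syntax, and the sample identifiers are hypothetical, not a prescribed schema. The check only confirms that each cited identifier belongs to the retrieved set and maps to a document version and ingestion timestamp. Whether the chunk actually supports the claim is a separate question for the evaluation harness.

```python
import re
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class ChunkRef:
    """One retrieved chunk plus the lineage fields a citation resolves to."""
    chunk_id: str            # stable identifier shown to the model
    document_id: str
    document_version: str    # superseded versions are retained, not overwritten
    ingested_at: datetime
    text: str


# Assumed citation syntax: the model is instructed to cite as [chunk_id].
CITATION = re.compile(r"\[([A-Za-z0-9_.:-]+)\]")


def resolve_citations(answer: str, retrieved: list[ChunkRef]):
    """Split cited ids into chunks from the retrieved set and unknown ids."""
    by_id = {chunk.chunk_id: chunk for chunk in retrieved}
    resolved, unknown = [], []
    for cited_id in CITATION.findall(answer):
        chunk = by_id.get(cited_id)
        if chunk is None:
            unknown.append(cited_id)   # cited something the model was never given
        elif chunk not in resolved:
            resolved.append(chunk)
    return resolved, unknown


# Illustrative data only.
retrieved = [
    ChunkRef(
        chunk_id="policy-7:v3:c12",
        document_id="policy-7",
        document_version="v3",
        ingested_at=datetime(2024, 1, 15, tzinfo=timezone.utc),
        text="Claims must be filed within 30 days.",
    )
]
answer = (
    "Claims must be filed within 30 days [policy-7:v3:c12]. "
    "Appeals take 10 days [policy-9:v1:c2]."
)
resolved, unknown = resolve_citations(answer, retrieved)
```

In this example the second citation points at an identifier that was never retrieved, so it lands in unknown. A system built this way would treat a non-empty unknown list as grounds to reject or regenerate the answer rather than display it.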

The cost of this discipline is that the system says "I cannot answer that from the available sources" more often. Practitioners new to regulated deployment treat that as a failure. It is the opposite. An honest abstention is auditable and defensible. A confident answer with a decorative citation is a finding waiting to happen.

Access control belongs at retrieval, not at the answer

A common and serious mistake is to apply permissions after generation: retrieve from the full corpus, generate the answer, then check whether the user was entitled to see it. By that point the controlled content has already influenced the output. Even if you suppress the final answer, the model has reasoned over documents the user had no right to access, and a sufficiently probing series of questions can reconstruct their substance. This is a data-leakage path that standard application security review will miss because the leak happens inside the model, not over the wire.

Access control must sit at the retrieval boundary. Every chunk carries the access metadata of its source document, and the retrieval query is filtered by the requesting user's entitlements before any candidate reaches the model. The model only ever sees content the user is cleared for. This requires that your ingestion pipeline preserves document-level permissions as first-class metadata, which in turn requires deciding the entitlement model before you index anything. Teams that index first and bolt on permissions later usually end up re-indexing the entire corpus.
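The retrieval-boundary filter can be sketched as follows, under stated assumptions: entitlements are modelled as group names, each chunk carries the groups of its source document, and an in-memory list stands in for the index. The names IndexedChunk, allowed_groups and retrieve_for_user are hypothetical. In a real vector store the same filter would be expressed in the query itself, so that restricted chunks never leave the store.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexedChunk:
    chunk_id: str
    text: str
    allowed_groups: frozenset[str]  # copied from the source document at ingestion
    score: float                    # stand-in for a relevance score


def retrieve_for_user(candidates, user_groups, top_k=3):
    """Filter by entitlement first, then rank. The model sees only the result."""
    permitted = [c for c in candidates if c.allowed_groups & user_groups]
    return sorted(permitted, key=lambda c: c.score, reverse=True)[:top_k]


# Illustrative data only.
index = [
    IndexedChunk("hr-1:c1", "Leave policy summary.", frozenset({"all-staff"}), 0.71),
    IndexedChunk("deal-9:c4", "Draft acquisition terms.", frozenset({"m-and-a"}), 0.93),
    IndexedChunk("hr-2:c7", "Expense policy summary.", frozenset({"all-staff"}), 0.64),
]

visible = retrieve_for_user(index, user_groups=frozenset({"all-staff"}))
visible_ids = [c.chunk_id for c in visible]
```

The restricted chunk has the highest relevance score, yet it never appears in the result for a user outside its group. Ordering matters here: filtering after ranking and truncation would still keep the chunk away from the model, but it could crowd permitted chunks out of the top results.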

There is a second-order benefit. Once entitlements are enforced at retrieval, the same mechanism gives you per-user, per-document query logs that map cleanly onto existing data-governance reporting. The control you built for security doubles as the evidence trail the audit wants.

The evaluation harness is the deliverable

If you build only one thing properly, build the evaluation harness. Model-risk functions do not approve systems on the basis of a good demonstration. They approve systems that come with a defined test set, a measured baseline, and a documented process for detecting drift. A RAG system without an evaluation harness is, from a governance standpoint, unmeasured and therefore unapprovable.

A workable harness has three layers. First, a curated set of representative questions with known-good answers and known-good sources, built with the business function and refreshed quarterly. Second, automated scoring on each release covering retrieval relevance, answer faithfulness to retrieved content, and citation correctness, with thresholds agreed in advance. Third, a held-out adversarial set: questions the system should refuse, questions designed to surface hallucination, and questions probing the access boundary. We size the initial set at 300 to 500 questions for a single business domain and treat anything below 200 as insufficient for a regulated go-live.
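The second layer, automated scoring against thresholds agreed in advance, might look like the sketch below. The threshold values, metric names and CaseResult fields are placeholders chosen for illustration, not recommendations, and the per-case pass/fail judgements are assumed to come from whatever graders the harness uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CaseResult:
    """Per-question outcome, as judged by the harness's graders (assumed)."""
    question_id: str
    retrieval_relevant: bool
    answer_faithful: bool
    citation_correct: bool


# Hypothetical thresholds; the real values are agreed with model risk up front.
THRESHOLDS = {
    "retrieval_relevance": 0.90,
    "faithfulness": 0.95,
    "citation_correctness": 0.95,
}


def score_release(results):
    """Aggregate per-case outcomes into the three release-level metrics."""
    if not results:
        raise ValueError("cannot score an empty test set")
    n = len(results)
    return {
        "retrieval_relevance": sum(r.retrieval_relevant for r in results) / n,
        "faithfulness": sum(r.answer_faithful for r in results) / n,
        "citation_correctness": sum(r.citation_correct for r in results) / n,
    }


def failing_metrics(scores, thresholds):
    """Names of metrics that fall below their agreed minimum."""
    return sorted(name for name, minimum in thresholds.items()
                  if scores[name] < minimum)


# Illustrative results for a four-question run.
results = [
    CaseResult("q1", True, True, True),
    CaseResult("q2", True, True, True),
    CaseResult("q3", True, False, True),
    CaseResult("q4", True, True, True),
]
scores = score_release(results)
failures = failing_metrics(scores, THRESHOLDS)
```

A release with any entry in failures would be blocked. The adversarial set can be scored the same way with its own thresholds, kept separate so that refusals are not averaged into the main metrics.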

Run the harness on every model change, every prompt change and every significant corpus change, and store the results as part of the change record. When the regulator asks how you know the system still performs as approved, the answer is a versioned report rather than an assurance. This is also how you catch the quiet failure mode of RAG: a model upgrade that improves general benchmarks while degrading faithfulness on your specific corpus.
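One way to make the harness result part of the change record is to serialize it alongside the exact model, prompt and corpus it was run against. The sketch below assumes hypothetical field names and identifiers, and hashes the prompt so the record proves which text was tested without storing it inline.

```python
import hashlib
import json


def build_eval_record(change_id, model_id, prompt_text,
                      corpus_snapshot_id, scores, run_at):
    """Return a deterministic JSON report for one harness run."""
    record = {
        "change_id": change_id,
        "model_id": model_id,
        # Hash pins the record to the exact prompt text that was evaluated.
        "prompt_sha256": hashlib.sha256(prompt_text.encode("utf-8")).hexdigest(),
        "corpus_snapshot_id": corpus_snapshot_id,
        "scores": scores,
        "run_at": run_at,
    }
    # sort_keys keeps the output stable so two runs can be diffed directly.
    return json.dumps(record, sort_keys=True, indent=2)


# Illustrative values only.
report = build_eval_record(
    change_id="CHG-0001",
    model_id="candidate-model-a",
    prompt_text="Answer only from the supplied chunks.",
    corpus_snapshot_id="corpus-2024-06-01",
    scores={"faithfulness": 0.96, "citation_correctness": 0.97},
    run_at="2024-06-02T09:00:00Z",
)
```

Because the output is deterministic for identical inputs, a changed report signals that something in the model, prompt, corpus or scores actually changed.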

Design for the day the model is replaced

Underlying models will be deprecated and replaced on the vendor's timetable, not yours. We have seen production models given as little as six months of notice before retirement. An architecture that has hard-coded one model's behaviour into its prompts and thresholds will fail its evaluation harness on the day it is forced to migrate, and the migration will land as an emergency rather than a planned change.

The defensive design keeps the model behind an internal interface, keeps prompts and thresholds in versioned configuration rather than code, and treats the evaluation harness as the gate every candidate model must pass before promotion. When the harness, the lineage and the access boundary are all in place, swapping the model becomes a controlled exercise: run the new candidate against the same test set, compare faithfulness and citation correctness against the approved baseline, document the result, promote or reject. The system that survives an audit is, not coincidentally, the system that survives its own supply chain.
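A sketch of the two pieces this design relies on: an internal interface that hides the vendor model, and a promotion gate that compares a candidate against the approved baseline. The AnswerModel protocol, the StubModel class, the metric names and the tolerance value are all assumptions made for illustration.

```python
from typing import Protocol


class AnswerModel(Protocol):
    """Internal interface; application code never imports a vendor SDK directly."""

    def answer(self, question: str, chunks: list[str]) -> str: ...


class StubModel:
    """Placeholder implementation standing in for any vendor-backed adapter."""

    def answer(self, question: str, chunks: list[str]) -> str:
        if not chunks:
            return "I cannot answer that from the available sources."
        return chunks[0]


def promotion_decision(baseline, candidate, tolerance=0.01):
    """Reject the candidate if any baseline metric regresses beyond tolerance."""
    regressions = {
        metric: (approved, candidate.get(metric, 0.0))
        for metric, approved in baseline.items()
        if candidate.get(metric, 0.0) < approved - tolerance
    }
    return ("reject" if regressions else "promote"), regressions


# Illustrative scores only.
approved_baseline = {"faithfulness": 0.96, "citation_correctness": 0.97}
candidate_scores = {"faithfulness": 0.93, "citation_correctness": 0.97}

model: AnswerModel = StubModel()
abstention = model.answer("What is the filing deadline?", [])
decision, regressions = promotion_decision(approved_baseline, candidate_scores)
```

Here the candidate holds citation correctness but loses three points of faithfulness, so the gate rejects it and reports which metric regressed. A metric missing from the candidate's scores is treated as zero, which fails closed.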
