The Interpretability Stack: A Practitioner's Toolkit

The polite term for what most enterprise AI teams have, on the interpretability front, is “an aspiration.” The accurate term is “almost nothing.” There is a gap between the published research on interpretability and the working practice inside the firms that have deployed agentic systems at scale, and the gap is not closing as fast as the regulatory deadline is approaching.

This piece is an attempt to describe what an actual interpretability practice consists of in 2026, layer by layer. The intent is operational. A working CTO, a working head of AI risk, or a working auditor should be able to read this list and identify which layers their team has and which they do not. The piece is not a recommendation engine. It is a description of the stack as it exists in the firms that have built one.

I have organised the layers from bottom to top. Each layer answers a different question. Each layer has a different maturity level. I have noted the maturity level honestly.

Layer 0: Versioning and provenance

The question this layer answers. Which version of which component produced this output?

Maturity. Mature. There is no excuse for not having this layer. The tools are conventional software-engineering tools: a model registry, a prompt registry, a tool-schema registry, a retrieval-index registry, and an end-to-end identifier that ties a given output to its specific component versions.

The trap. Most firms have versioning for the model and not for the prompt. This is the layer where the prompt-versioning problem still defeats most teams, because the prompt has historically been treated as configuration rather than as code. The teams that have crossed this gap treat their prompts as versioned artefacts with code review and change history. The teams that have not crossed it cannot answer the layer-0 question even in principle.

If this layer is not working, every layer above it is unreliable. There is no point in interpreting the behaviour of a system whose components you cannot pin.
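A minimal sketch of the end-to-end identifier, assuming four registries that each hand back a version string. The class name, field names, and the choice of a truncated SHA-256 are illustrative assumptions, not a description of any particular registry product.

```python
import hashlib
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class ComponentVersions:
    """Pinned versions of everything that shaped one output (hypothetical fields)."""
    model: str
    prompt: str
    tool_schema: str
    retrieval_index: str


def prompt_version(prompt_text: str) -> str:
    """Content-address a prompt so that any edit to the text yields a new version."""
    return hashlib.sha256(prompt_text.encode("utf-8")).hexdigest()[:12]


def provenance_id(versions: ComponentVersions) -> str:
    """Derive one stable identifier from the full set of pinned versions."""
    canonical = json.dumps(asdict(versions), sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


# Illustrative usage: every value below is made up for the example.
versions = ComponentVersions(
    model="model-2026-01",
    prompt=prompt_version("You are a claims assistant."),
    tool_schema="tools-v7",
    retrieval_index="index-0412",
)
output_record = {"output": "claim referred", "provenance_id": provenance_id(versions)}
```

The point of hashing the prompt text rather than numbering it by hand is that a prompt edited outside the review process still produces a different identifier.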

Layer 1: Decision logging

The question this layer answers. What did the system do, when, on whose behalf, with what input, and to what external effect?

Maturity. Mature in principle, irregular in practice. The tooling exists — modern orchestration frameworks have audit-event hooks, modern observability platforms ingest them — and the engineering pattern is well understood. The irregularity is in what teams choose to log. The serious teams log the full decision chain: orchestrator plan, specialist call, tool invocation, external state changes, response composition. The unserious teams log a summary and discover, the first time an auditor asks, that the summary is not faithful.

The trap. Sampling. A team that logs only a sampled fraction of agentic decisions has, in effect, made the auditability of any individual decision a coin flip. Some sampling is reasonable for cost reasons; unversioned sampling regimes that change over time, with no record of the sampling rule, are not.
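One way to make sampling defensible is to version the rule and stamp it on every event, including the events that were sampled out. The sketch below assumes a hypothetical SamplingRule and DecisionLog; neither corresponds to a real framework's API. The keep-or-drop decision is derived from a hash of the decision identifier, so it can be re-derived later from the recorded rule.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SamplingRule:
    """A recorded, versioned sampling rule (hypothetical structure)."""
    version: str
    rate: float  # fraction of decisions logged in full, between 0.0 and 1.0

    def keeps(self, decision_id: str) -> bool:
        # Deterministic: hash the rule version and decision id into a 64-bit
        # bucket and compare against the rate scaled to the same range.
        digest = hashlib.sha256(f"{self.version}:{decision_id}".encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:8], "big")
        return bucket < int(self.rate * 2**64)


@dataclass
class DecisionLog:
    rule: SamplingRule
    events: list = field(default_factory=list)

    def record(self, decision_id: str, chain: dict) -> None:
        sampled = self.rule.keeps(decision_id)
        self.events.append({
            "decision_id": decision_id,
            "sampling_rule_version": self.rule.version,
            "sampled_in": sampled,
            # The full chain is stored only when sampled in; the stub is always written.
            "chain": chain if sampled else None,
        })


# Illustrative usage with the decision-chain fields named in the text.
log = DecisionLog(SamplingRule(version="rule-v3", rate=1.0))
log.record("decision-001", {
    "orchestrator_plan": "route to billing specialist",
    "specialist_call": "billing",
    "tool_invocation": "get_invoice",
    "external_state_changes": [],
    "response_composition": "summary sent to customer",
})
```

Because a stub is written for every decision, an auditor can at least establish that a decision happened and which rule decided it would not be logged in full.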

Layer 2: Replayability

The question this layer answers. Can the system reproduce the chain of internal calls that produced a past output?

Maturity. Emerging. The capability is technically achievable today and the orchestration frameworks that support it are public. The capability is not yet universal in production, because replay requires more careful state management than most early-generation agentic stacks were built with. The teams that have made the investment can replay; the teams that have not, cannot, and they tend to learn this the first time a regulator or general counsel asks.

The trap. Distinguishing replay from re-prompt. Re-prompt means “let’s give the model the same input again and see what happens.” Replay means “let’s reconstruct the exact context, tool state, and component versions that produced the original output.” The first is cheap and uninformative. The second is the actual audit primitive.
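The distinction can be sketched in a few lines. The Snapshot fields and the replay function below are hypothetical, and the sketch assumes the agent step is deterministic once its context and tool results are fixed, which in a real stack usually means recording model outputs or pinning sampling parameters as well.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Snapshot:
    """Everything recorded at decision time (hypothetical fields)."""
    context: str
    tool_results: dict  # tool name -> result returned at the time
    versions: dict      # component name -> pinned version
    output: str


def replay(snapshot: Snapshot, current_versions: dict,
           agent_step: Callable[[str, dict], str]) -> bool:
    """Re-run one step against recorded state; True if the output is reproduced."""
    if current_versions != snapshot.versions:
        raise RuntimeError("cannot replay: component versions differ from the recording")
    # Tools are not called live; the recorded results are fed back in.
    reproduced = agent_step(snapshot.context, snapshot.tool_results)
    return reproduced == snapshot.output


def toy_agent(context: str, tool_results: dict) -> str:
    """Stand-in for a deterministic agent step, used only for illustration."""
    return f"{context}: balance is {tool_results['get_balance']}"


snapshot = Snapshot(
    context="customer asked for balance",
    tool_results={"get_balance": 120},
    versions={"model": "m1", "prompt": "p4"},
    output="customer asked for balance: balance is 120",
)
reproduced = replay(snapshot, {"model": "m1", "prompt": "p4"}, toy_agent)
```

A re-prompt, by contrast, would call the live tool again and run whatever component versions happen to be deployed today, which is why it says little about the original decision.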

Layer 3: Behavioural monitoring

The question this layer answers. Is the system, in production, doing what we expected it to do, and on which distribution of inputs?

Maturity. Mixed. Behavioural monitoring is well developed for classical ML systems — drift detection, performance monitoring against ground-truth labels, distribution monitoring — and substantially less developed for agentic systems, because the unit of behaviour is harder to define. “What is the agent doing in production” is a more open question than “what is the classifier doing in production.”

The teams that have built useful behavioural monitoring for agentic systems have done it by defining a finite set of decision categories, monitoring the distribution of decisions across those categories, and alarming on drift. The teams that have not built it tend to discover their drift through customer complaints.
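The category-distribution approach can be sketched as follows. The category names, the threshold of 0.1, and the use of total variation distance as the drift measure are all assumptions made for illustration; a team might equally use a population stability index or a chi-squared test.

```python
from collections import Counter


def distribution(decisions: list[str], categories: list[str]) -> dict[str, float]:
    """Share of decisions falling in each of a fixed, finite set of categories."""
    if not decisions:
        raise ValueError("no decisions to summarise")
    unknown = set(decisions) - set(categories)
    if unknown:
        raise ValueError(f"decisions outside the defined categories: {sorted(unknown)}")
    counts = Counter(decisions)
    return {c: counts[c] / len(decisions) for c in categories}


def total_variation(p: dict[str, float], q: dict[str, float]) -> float:
    """Half the summed absolute difference between two distributions."""
    return 0.5 * sum(abs(p[c] - q[c]) for c in p)


def drift_alarm(baseline: list[str], window: list[str],
                categories: list[str], threshold: float = 0.1) -> bool:
    """True when the recent window has moved further from baseline than the threshold."""
    p = distribution(baseline, categories)
    q = distribution(window, categories)
    return total_variation(p, q) > threshold


# Illustrative usage with made-up categories and counts.
categories = ["approve", "refer", "decline"]
baseline = ["approve"] * 80 + ["refer"] * 15 + ["decline"] * 5
window = ["approve"] * 60 + ["refer"] * 15 + ["decline"] * 25
alarm = drift_alarm(baseline, window, categories)
```

Rejecting decisions that fall outside the defined categories is a deliberate choice in this sketch: an agent producing a kind of decision nobody enumerated is itself a monitoring finding.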

Layer 4: Component-level interpretability

The question this layer answers. Why did this specific model produce this specific output?

Maturity. Research-grade. This is the layer that most public discourse about “interpretability” is talking about, and it is the layer least likely to be a working part of an enterprise stack. Mechanistic interpretability research is producing important results; those results are not yet operationalisable by a typical working AI team.

What works in practice. Attention-pattern analysis on smaller models, feature attribution on classifier-style components, and, increasingly, sparse-autoencoder-derived feature dashboards on open-weight models that researchers have invested time in analysing. The serious teams treat this layer as a research collaboration with the labs publishing the underlying techniques, not as a vendor purchase. The vendors who claim to offer “interpretability” as a product, at this layer, are usually offering a re-packaging of techniques whose limits the underlying researchers are explicit about.
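Of the three, feature attribution on a classifier-style component is the easiest to illustrate. The sketch below uses occlusion, one simple attribution technique among several, on a toy logistic classifier; the weights, feature names, and zero baseline are invented for the example, and nothing here speaks to how attribution behaves on a large generative model.

```python
import math


def predict(weights: dict[str, float], bias: float, features: dict[str, float]) -> float:
    """Toy logistic classifier standing in for a classifier-style component."""
    z = bias + sum(weights[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


def occlusion_attribution(weights: dict[str, float], bias: float,
                          features: dict[str, float],
                          baseline: float = 0.0) -> dict[str, float]:
    """Score each feature by how far the prediction falls when it is set to a baseline."""
    full = predict(weights, bias, features)
    scores = {}
    for name in features:
        occluded = {**features, name: baseline}
        scores[name] = full - predict(weights, bias, occluded)
    return scores


# Illustrative usage: a positive score means the feature pushed the prediction up.
weights = {"amount": 2.0, "tenure": -1.0, "noise": 0.0}
features = {"amount": 1.0, "tenure": 1.0, "noise": 1.0}
scores = occlusion_attribution(weights, bias=0.0, features=features)
```

The scores depend on the baseline chosen, which is one of the limits the researchers behind these techniques are explicit about.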

The trap. Treating component-level interpretability as a substitute for orchestration-level auditability. They are different questions. A firm whose component-level interpretability is excellent and whose orchestration-level auditability is non-existent has built half a stack. The half it has built is the more research-glamorous half. The half it has not built is the half that an auditor or regulator will ask for.

Layer 5: Contestation

The question this layer answers. Can a person affected by a decision the system made challenge it and, if so, on what timeline, with what evidence, and to what redress?

Maturity. Institutional, not technical. This layer is not built by an engineering team. It is built by the operating company’s policy, contractual, and customer-operations functions. The engineering layers below it are necessary but not sufficient. A firm with perfect technical auditability and no contestation process has built a beautifully observable machine that no affected party can challenge.

The trap. Treating contestation as a customer-service function rather than a governance function. A contestation process that runs through the same channels as a billing dispute will produce the same outcomes as a billing dispute, which is not what an AI Act competent authority will be looking for when it asks.

Layer 6: External assurance

The question this layer answers. Can a third party — auditor, regulator, certification body — independently verify what the layers below say is true?

Maturity. Forming. The audit firms have begun to build product offerings here; the standards bodies have begun to develop the technical norms; the regulators have begun to define what they will and will not accept as evidence. None of these processes is finished. The firms that will be ready to satisfy external assurance demands in 2027 are the firms whose layer-0-through-5 stack is in place now, because the audit process can only verify what is being recorded, and the recording cannot be retrofitted.

How to read your own stack against this list

A useful exercise for a working AI team is to walk this list from layer 0 upward and stop at the first layer that does not work. That is the layer the team is publicly claiming and not actually shipping.

A useful exercise for a working buyer is to ask their vendor about each of these layers in turn, and to take a non-answer as an answer. The non-answer is informative.

A useful exercise for a working regulator is to specify which layers a high-risk system must implement, in what form, and to what evidence standard. The current AI Act implementation work is doing exactly this, slowly. The publications that treat the slowness as bureaucratic obstruction are missing the point. The slowness is the standards-development process running in real time.
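The team exercise above, walking from layer 0 upward and stopping at the first layer that does not work, is mechanical enough to write down. The layer names come from this piece; the function and the self-assessment dictionary are illustrative, and a layer with no recorded answer is treated as not working.

```python
LAYERS = [
    "Versioning and provenance",
    "Decision logging",
    "Replayability",
    "Behavioural monitoring",
    "Component-level interpretability",
    "Contestation",
    "External assurance",
]


def first_failing_layer(working: dict[int, bool]):
    """Return (number, name) of the lowest layer that is not working, or None."""
    for number, name in enumerate(LAYERS):
        # A missing entry counts as a non-answer, and a non-answer counts as a failure.
        if not working.get(number, False):
            return number, name
    return None


# Illustrative self-assessment: layers 0 and 1 work, replay does not.
assessment = {0: True, 1: True, 2: False, 3: True}
result = first_failing_layer(assessment)
```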

Note. This piece describes the stack at a moment. The layers above will look different in a year, principally because layer 4 (component interpretability) is the one moving fastest. We will revisit. The stack as a whole, however, is unlikely to change shape. Auditability is built bottom-up. The teams that try to start at layer 4 because it is the layer with the most published research find themselves doing the layer-0 work three quarters later, after losing a procurement gate or two.