Black Box Notes

On opacity, auditability, and the limits of trust in modern AI systems.

Methodology · 02

How we read interpretability claims

Interpretability is the field in which the most precise claims and the most rhetorical claims share the same vocabulary. This page describes how the publication reads interpretability claims when they appear in a vendor's transparency report, a research paper, a regulator's filing, or a marketing deck.

The two failure modes

Interpretability claims fail in two principal ways. The first is overreach: a method that demonstrates a narrow technical property is offered as a general explanation of model behaviour. The second is over-reduction: a generally useful method is dismissed because it does not deliver a property no interpretability method can deliver. The publication's reading practice is designed to resist both.

The two methodological families

We find it useful to keep two families of interpretability work distinct, even though working researchers move between them and the categories are not airtight.

  • Post-hoc explanation. Methods that produce an account of a model's behaviour after the fact, by inspecting the model's outputs against inputs (saliency maps, attention visualisations, feature attributions, contrastive examples, SHAP and its descendants). These methods are useful for surfacing patterns; they are unstable under adversarial inputs; they answer "what is correlated with the model's output" rather than "what caused the model to produce it."
  • Mechanistic interpretation. Methods that attempt to identify the internal computational structures a model uses — circuits, features, sparse autoencoder dictionaries, induction heads, polysemantic neurons. These methods are useful for surfacing causal structure; they are computationally expensive; they generalise poorly across model scales; they tend to surface mechanisms whose interpretability decreases as the mechanism's contribution to performance increases.
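
The gap between "correlated with the output" and "caused the output" can be illustrated with a toy case. The sketch below is not a real interpretability method: `toy_model`, `pearson`, and `mean_abs_change` are hypothetical names introduced for illustration, and the model is deliberately trivial. It assumes a model that reads only `x0`, and an input `x1` that happens to track `x0` closely in the data.

```python
import random

def toy_model(x0: float, x1: float) -> float:
    """Hypothetical model: the output depends on x0 only; x1 is ignored."""
    return 2.0 * x0

def pearson(a, b):
    """Plain Pearson correlation coefficient between two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / (var_a * var_b) ** 0.5

rng = random.Random(0)
x0s = [rng.gauss(0.0, 1.0) for _ in range(1000)]
# Assumption for illustration: x1 is a noisy copy of x0 in the data.
x1s = [v + rng.gauss(0.0, 0.1) for v in x0s]
outputs = [toy_model(a, b) for a, b in zip(x0s, x1s)]

# Correlational view: how strongly does x1 track the model's output?
corr_x1 = pearson(x1s, outputs)

def mean_abs_change(ablate: str) -> float:
    """Interventional view: mean output change when one input is zeroed."""
    total = 0.0
    for a, b in zip(x0s, x1s):
        base = toy_model(a, b)
        ablated = toy_model(0.0, b) if ablate == "x0" else toy_model(a, 0.0)
        total += abs(base - ablated)
    return total / len(x0s)

effect_x0 = mean_abs_change("x0")  # non-zero: the model uses x0
effect_x1 = mean_abs_change("x1")  # exactly zero: the model never reads x1

print(f"correlation of x1 with output: {corr_x1:.3f}")
print(f"ablation effect of x0: {effect_x0:.3f}, of x1: {effect_x1:.3f}")
```

In this toy setting, a correlational account would rank `x1` as highly relevant, while the ablation shows it has no effect on the output. Real models are far less tidy, but the distinction the two families draw is the same.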

Eight questions we ask

Whenever an interpretability claim appears in front of the publication — vendor report, paper, regulator filing, marketing material — we ask:

  1. What method, specifically? "Interpretability" is not a method. The claim either names a method or fails the first question.
  2. At what scale? A method demonstrated on a 124M-parameter model in a research lab does not necessarily apply to a production-scale deployment. Claims that extrapolate across scale without justifying the extrapolation are flagged.
  3. On which inputs? Interpretability results are typically reported on a curated input set. A claim that generalises to "production inputs" without characterising the distribution shift between curated and production is flagged.
  4. With what failure rate? Every interpretability method has a regime in which its outputs are unreliable. Claims that do not name the failure regime are flagged.
  5. Reproducible by an independent observer? A claim that rests on proprietary tooling and is not reproducible by an external party is a claim about internal practice rather than a verifiable property of the model.
  6. Demonstrated by intervention, or merely by correlation? A claim becomes substantially more credible when supported by an intervention experiment showing that the identified mechanism, when ablated, removes the behaviour it was claimed to explain. Correlational claims that do not include an intervention experiment remain correlational claims.
  7. What is the explanatory ceiling? Even a perfect interpretability method on a fixed input would not explain why the model's training data produced the mechanism the method identifies. Claims that elide the gap between "the mechanism is identifiable" and "the model's behaviour is explained" are flagged.
  8. Who benefits if this claim is believed? Securing an interpretability-favourable narrative is the most common reason for an interpretability claim to appear in a public document. We do not assume bad faith; we do read the document with the question in mind.
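
The eight questions can be treated as a simple checklist. The sketch below is one hypothetical way to record which questions a claim addresses and which remain open; it is not the publication's tooling. The short labels in `QUESTIONS`, the `unanswered` helper, and the example claim are all assumptions made for illustration.

```python
# Short labels for the eight questions (wording condensed for illustration).
QUESTIONS = {
    1: "method named",
    2: "scale justified",
    3: "input distribution characterised",
    4: "failure regime named",
    5: "independently reproducible",
    6: "supported by intervention",
    7: "explanatory ceiling acknowledged",
    8: "beneficiary considered",
}

def unanswered(answered):
    """Return the question numbers a claim leaves open, in order."""
    return [n for n in sorted(QUESTIONS) if n not in answered]

# Hypothetical claim: it names a method (1) and includes an ablation (6),
# but addresses nothing else.
example_claim = "Feature F drives behaviour B."
open_questions = unanswered({1, 6})

print(example_claim)
for n in open_questions:
    print(f"  flagged: question {n} ({QUESTIONS[n]})")
```

A record like this does not score a claim. It only makes explicit which questions a piece should raise when it reports the claim.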

Claims that survive scrutiny

Some interpretability claims survive most of the eight questions. The publication's working list:

  • Sparse-autoencoder feature dictionaries with intervention demonstrations. Where a published dictionary identifies features that, when ablated, predictably remove or modify a target behaviour, the claim has substantive support. The work generalises poorly across model families; the claim should be read as family-specific until demonstrated otherwise.
  • Mechanistic circuit identification with reproducible ablation. Where a research result identifies a circuit (a connected subgraph of model components implementing a behaviour) and demonstrates that ablating the circuit suppresses the behaviour, the claim has substantive support. The reading practice flags any extrapolation from "this circuit implements this behaviour on this distribution" to "this behaviour is implemented by this circuit in production."
  • Calibrated post-hoc attributions with characterised failure modes. A post-hoc attribution method whose failure regime is documented and whose calibration on the deployment distribution is reported can support audit-relevant claims, within the limits the calibration establishes.
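
What "calibration on the deployment distribution" might look like can be sketched on a toy case. The example below assumes a hypothetical saturating model and compares a gradient-times-input attribution against the measured effect of ablating each input, first on a curated input set and then on a wider one standing in for deployment. The function names, input ranges, and tolerance are illustrative assumptions, not a reference procedure.

```python
import random

def model(x0: float, x1: float) -> float:
    """Hypothetical saturating model: the output is capped at 1.0."""
    return min(x0 + x1, 1.0)

def grad_times_input(x0: float, x1: float):
    """Gradient-times-input attribution for the model above.

    The gradient is 1 for each input below the cap and 0 once saturated.
    """
    grad = 1.0 if x0 + x1 < 1.0 else 0.0
    return (grad * x0, grad * x1)

def ablation_effect(x0: float, x1: float):
    """Measured drop in output when each input is zeroed in turn."""
    base = model(x0, x1)
    return (base - model(0.0, x1), base - model(x0, 0.0))

def agreement_rate(inputs, tol: float = 1e-9) -> float:
    """Fraction of inputs where attribution matches ablation for both inputs."""
    hits = 0
    for x0, x1 in inputs:
        attr = grad_times_input(x0, x1)
        effect = ablation_effect(x0, x1)
        if all(abs(a - e) <= tol for a, e in zip(attr, effect)):
            hits += 1
    return hits / len(inputs)

rng = random.Random(0)
# Curated set: small inputs, so the model never reaches its cap.
curated = [(rng.uniform(0.0, 0.4), rng.uniform(0.0, 0.4)) for _ in range(500)]
# Stand-in for deployment: wider inputs, so the model is often saturated.
deployment = [(rng.uniform(0.0, 2.0), rng.uniform(0.0, 2.0)) for _ in range(500)]

curated_rate = agreement_rate(curated)
deployment_rate = agreement_rate(deployment)

print(f"agreement on curated inputs:    {curated_rate:.2f}")
print(f"agreement on deployment inputs: {deployment_rate:.2f}")
```

On the curated set the attribution agrees with ablation on every input; on the wider set it fails wherever the model saturates. Saturation is the failure regime here, and the agreement rate on the wider set is the kind of figure a calibrated claim would report alongside the attribution itself.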

Claims that do not survive scrutiny

  • "The model is interpretable because we use Grad-CAM / SHAP / attention visualisation." The claim names a method but does not address questions 2, 3, 4, or 7.
  • "Our model is glass-box." The metaphor is a metaphor. It does not name a method.
  • "Explainable AI." A trade-marketing category, not a methodological claim. The publication's prose treats it as such.
  • "Our model produces a step-by-step rationale." The rationale is itself a model output. Without evidence that the rationale faithfully describes the model's actual computation, the claim is a claim about output formatting.

Common rhetorical move

The publication's reading practice is most alert to a specific rhetorical move: a paper or a vendor produces a credible mechanistic finding on a small model, and the surrounding text generalises the finding to production behaviour at a different scale without justifying the generalisation. The finding stands; the generalisation does not. We mark the boundary in the piece.

How we cite interpretability work

Where a piece names an interpretability result, we cite the published paper and, where possible, the artefact (code repository, model checkpoint, intervention script) that permits replication. Where the work is in pre-print and the publication has not seen an artefact, we cite the pre-print and say so. We do not cite interpretability claims that appear only in vendor marketing material.

What this reading practice does not do

It does not produce a definitive verdict on whether a model is "interpretable." That is the wrong question. Interpretability is a property of the relationship between a method, a claim, an observer, and the kind of decision the observer is trying to make. A model that is interpretable to its engineering team for the purpose of debugging may not be interpretable to a regulator for the purpose of audit. The reading practice attempts to surface the relevant pairings of observer and purpose rather than deliver an absolute verdict.

Changelog

  • 2026-05-22. Initial publication.
