The Ten Hardest Auditability Problems in Agentic AI

A list of unsolved problems is a useful publication exercise. Done badly, it is a wish list. Done well, it forces the writer to be precise about what unsolved means: which problems are technically open, which are institutionally open, and which are open primarily because the people who could close them have not yet decided to.

The ten problems below are the ones the publication considers genuinely hard. They are not the ten problems most often listed in the popular-press write-ups of “AI risk.” Several of those are easier than they look; most of them have been addressed, in principle, by working engineering teams in regulated industries. The problems here are the ones we have read about in audit working papers, regulator correspondence, and the small body of published technical research on agentic-system evaluation, where the published material is honest about what does not yet work.

We have ordered them roughly from technical to institutional. The ordering is loose.

1. Faithful replay of stochastic agentic decisions

Replay, the ability to reconstruct the chain of internal calls that produced a past output, is technically straightforward when the system is deterministic. Most agentic systems are not. The orchestrator’s plan is stochastic; the specialist’s output is stochastic; the retrieval ranks are stochastic; the tool invocations are partially stochastic. A “faithful replay” of a past decision can reproduce the inputs and the component versions, but unless the random seeds were captured at the time it cannot reproduce the specific stochastic sample that produced the original output. The audit primitive that the procurement gate requires, “can you tell me what actually happened,” does not collapse to “can you re-run the conditions.” The two are close. They are not the same. Progress here would mean either widespread adoption of seed-pinning at every stochastic boundary, or the development of evidence standards that accept distributional rather than point-replay as the audit primitive. Neither is universal.
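To make the seed-pinning option concrete, here is a minimal sketch in Python. It assumes every stochastic boundary draws from a random generator the operator controls, which is often not true of hosted models. The `SeedLedger` class, the boundary names, and the stand-in plan and ranking steps are all hypothetical illustrations, not a description of any production system.

```python
import random
from dataclasses import dataclass, field


@dataclass
class SeedLedger:
    """Records the seed used at each named stochastic boundary (hypothetical helper)."""
    seeds: dict = field(default_factory=dict)

    def rng_for(self, boundary: str) -> random.Random:
        # First run: draw and record a fresh seed. Replay: reuse the recorded one.
        if boundary not in self.seeds:
            self.seeds[boundary] = random.SystemRandom().getrandbits(64)
        return random.Random(self.seeds[boundary])


def run_decision(ledger: SeedLedger) -> dict:
    # Stand-ins for the orchestrator's plan and the retrieval ranking.
    plan = ledger.rng_for("orchestrator.plan").choice(["lookup", "escalate", "answer"])
    ranking = ledger.rng_for("retrieval.rank").sample(["doc-a", "doc-b", "doc-c"], k=2)
    return {"plan": plan, "ranking": ranking}


# Original run: the ledger is stored alongside the audit log.
original_ledger = SeedLedger()
original = run_decision(original_ledger)

# Replay: a ledger initialised from the stored seeds reproduces the same samples.
replay_ledger = SeedLedger(seeds=dict(original_ledger.seeds))
replayed = run_decision(replay_ledger)
```

The sketch gives point-replay only for the boundaries that appear in the ledger. Any component that samples without consulting it falls back to the distributional case described above.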

2. Versioning of “prompt-as-program”

Prompts function as program code in modern agentic systems. They specify behaviour, encode policy, and are routinely modified. The engineering practice of treating them as code, with version control, code review, change history, and immutable references, is not yet universal. The teams that have built it won a hard internal political battle; the teams that have not cannot answer the basic question, “which prompt produced this output?” Progress here would mean the engineering convention is no longer optional. It has not yet become that.
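One way to give a prompt an immutable reference is to address it by the hash of its content, in the way version-control systems address objects. The sketch below is an assumption about how such a registry could look; `PromptRegistry`, the example prompts, and the audit-record fields are hypothetical.

```python
import hashlib


class PromptRegistry:
    """Content-addressed prompt store (hypothetical): the digest is the version."""

    def __init__(self) -> None:
        self._by_digest = {}

    def register(self, text: str) -> str:
        # Any edit to the prompt text yields a different digest.
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        self._by_digest[digest] = text
        return digest

    def resolve(self, digest: str) -> str:
        return self._by_digest[digest]


registry = PromptRegistry()
v1 = registry.register("You are a claims triage assistant. Escalate anything over 500.")
v2 = registry.register("You are a claims triage assistant. Escalate anything over 250.")

# The audit record stores the digest of the prompt that was actually used.
audit_record = {"output_id": "out-001", "prompt_sha256": v1}
prompt_used = registry.resolve(audit_record["prompt_sha256"])
```

With the digest written into every output record, the “which prompt produced this output” question becomes a lookup rather than an investigation.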

3. Silent model substitution behind APIs

A vendor that updates the model behind its API endpoint without notifying the downstream operator has substituted the inference engine under a system the operator believes is stable. The audit log of the operator’s system will report what the operator’s orchestrator thought it called. The actual model that responded may have been different. We have read several audit findings now in which this discrepancy was the central failure mode. The technical fix is straightforward (publish the model version in every response); the institutional fix is harder, because some vendors prefer the operator to remain uncertain. Progress here would mean contract language and procurement standards that make silent substitution untenable. The procurement teams that wrote such language exist. They are still in the minority.
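On the operator’s side, the corresponding check is small once the vendor reports a version. The sketch below assumes the response carries a model identifier that can be compared with the one requested; the `CallRecord` fields and the model names are hypothetical, and no particular vendor’s API is implied.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CallRecord:
    requested_model: str            # what the orchestrator believed it called
    reported_model: Optional[str]   # what the response said answered, if anything


def substitution_status(record: CallRecord) -> str:
    # A missing version is its own finding: the log cannot confirm what ran.
    if record.reported_model is None:
        return "unverifiable"
    if record.reported_model != record.requested_model:
        return "substituted"
    return "match"


statuses = [
    substitution_status(CallRecord("model-2024-01", "model-2024-01")),
    substitution_status(CallRecord("model-2024-01", "model-2024-03")),
    substitution_status(CallRecord("model-2024-01", None)),
]
```

Treating “unverifiable” as a distinct outcome keeps the audit log honest about the difference between a confirmed match and an absence of evidence.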

4. Cross-component evaluation in long-horizon agentic tasks

Evaluating a single model on a static benchmark is well understood. Evaluating an agentic system on a long-horizon task — many model calls, tool uses, partial failures, retries — is not. The evaluator has to decide which intermediate steps to score, which to skip, how to handle the agent’s recovery from its own errors, and how to compare runs that diverge after a stochastic branch. The published evaluation literature on this is thin. The UK AI Safety Institute’s published methodology work is one of the more honest treatments. Progress here would mean evaluation suites that treat the multi-step agentic task as the unit of evaluation, in published form, replicable across firms. Several research groups are working on this. None has produced the canonical version.

5. Provenance of retrieved context

Modern agentic systems condition their behaviour on retrieved context — documents pulled from the operator’s data, customer state from the operator’s systems, sometimes external-web content. The audit log needs to record which documents were retrieved, at which versions, and how they conditioned the output. The engineering for this is well understood in principle; the practice is irregular. Retrieval indices are routinely rebuilt without preserving the prior versions. Documents are routinely updated without preserving the version that was retrieved on a given date. An audit nine months later asking “what document conditioned this decision” cannot be satisfied if the document has been silently revised in the meantime. Progress here would mean immutable retrieval-time snapshots, accessible to the audit process. This is achievable. It is not common.
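A retrieval-time snapshot can be sketched as a write-once store keyed by content hash, with the hash written into the audit entry at the moment of retrieval. The `SnapshotStore` class, the document identifier, and the policy text below are hypothetical, and a real deployment would need durable storage and a retention policy that this sketch omits.

```python
import hashlib


class SnapshotStore:
    """Write-once store keyed by content hash (hypothetical illustration)."""

    def __init__(self) -> None:
        self._blobs = {}

    def put(self, content: str) -> str:
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        # setdefault never overwrites: the same digest always maps to the same bytes.
        self._blobs.setdefault(digest, content)
        return digest

    def get(self, digest: str) -> str:
        return self._blobs[digest]


live_documents = {"policy-7": "Refunds are allowed within 30 days."}
store = SnapshotStore()


def retrieve(doc_id: str) -> dict:
    # Snapshot the exact content that conditions the decision, at retrieval time.
    content = live_documents[doc_id]
    return {"doc_id": doc_id, "sha256": store.put(content)}


audit_entry = retrieve("policy-7")

# Months later the live document is revised in place.
live_documents["policy-7"] = "Refunds are allowed within 14 days."

# The audit can still recover what the system actually saw.
as_retrieved = store.get(audit_entry["sha256"])
```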

6. Tool-side state changes that the audit cannot capture

An agentic system that invokes external tools changes external state. The audit log can record the invocation. It cannot, in general, record what the external system did with the invocation, what other systems the external system in turn called, or what state was changed elsewhere as a downstream consequence. This is the blast radius problem. It is partially solvable through stronger tool-level logging and downstream-system cooperation, and partially unsolvable for the same reason that any complex distributed system has indirect consequences that are hard to trace. Progress here would mean industry-wide adoption of structured tool-invocation logging standards. The early standards work exists; the adoption is slow.
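The solvable part of the problem can be illustrated with a structured invocation record that carries a parent identifier, so that downstream calls link back to the invocation that caused them. This is a sketch under the assumption that downstream systems cooperate by propagating that identifier. The record fields, tool names, and `blast_radius` function are hypothetical and do not follow any published standard.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ToolInvocation:
    invocation_id: str
    parent_id: Optional[str]  # the invocation that caused this one, if known
    tool: str
    effect: str


def blast_radius(log: list, root_id: str) -> list:
    """Return every logged invocation that descends from root_id."""
    children = {}
    for record in log:
        children.setdefault(record.parent_id, []).append(record)
    found, stack = [], [root_id]
    while stack:
        current = stack.pop()
        for child in children.get(current, []):
            found.append(child)
            stack.append(child.invocation_id)
    return found


log = [
    ToolInvocation("inv-1", None, "payments.refund", "refund issued"),
    ToolInvocation("inv-2", "inv-1", "ledger.post", "ledger entry written"),
    ToolInvocation("inv-3", "inv-2", "notify.email", "customer emailed"),
    ToolInvocation("inv-4", None, "crm.update", "unrelated change"),
]

traced = [record.tool for record in blast_radius(log, "inv-1")]
```

The trace reaches only as far as the identifiers do. A downstream system that drops the parent reference produces records that look like `inv-4`: present in someone’s log, but not attributable to the agent’s original call.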

7. Multi-agent attribution

When several agents contribute to a single decision — the orchestrator decomposes, specialist A proposes, specialist B critiques, the orchestrator composes — attributing the responsibility for an error to a specific agent is non-trivial. The composition can fail because A was wrong, because B’s critique missed it, because the orchestrator weighted the inputs incorrectly, or because of an interaction effect across all three. Audit findings tend to flatten this into “the system failed.” The flattening is not wrong, but it is not informative. Progress here would mean attribution methods, validated on real production traces, that let an auditor reason about contribution rather than only about outcome.

8. Continuous-monitoring methodologies

A pre-deployment audit is a snapshot. A working audit regime is continuous. The methodology for continuous monitoring of agentic systems — what to monitor, at what cadence, with what alerting thresholds, against what definition of “drift” — is not yet a settled discipline. The financial-services audit firms are building product offerings here; the methodology is being written in proprietary form rather than in public. Progress here would mean the publication of validated continuous-monitoring methodologies, either by standards bodies or by audit firms willing to publish their methods rather than only their findings.

9. The compositional opacity of mixed open-and-closed deployments

A deployment that uses a closed-API model for one component and an open-weights model for another inherits the opacity of the closed component, even though the open component is fully inspectable. The audit conclusion is gated by the least-readable component. This is the weakest-link problem of mixed deployments. It does not have a clean technical fix; it has a procurement-and-contract fix, which the operators we have read about are slowly developing. Progress here would mean standard contractual undertakings that close the gap. Some early contracts exist. The standard does not.

10. Institutional contestation infrastructure

This is the institutional problem we have written about elsewhere on these pages. A technically auditable agentic system whose operator has not built a working contestation procedure — a path for an affected party to challenge a decision, with timelines, evidence standards, and an independent reviewer — is auditable in the procurement sense and unauditable in the consequential sense. The contestation infrastructure is not a technical artefact. It is built by the operator’s policy, contractual, and customer-operations functions, under the operating company’s general counsel. Progress here would mean published contestation procedures with documented track records, evaluated by external observers. We have read a small number of these now. The number is small.

What is not on this list

We have left off several problems that often appear on similar lists. They have been left off because we consider them substantially easier than the list above, or because they are tractable with engineering work that exists today.

Versioning of model weights, in itself, is solved. Logging of orchestrator decisions is, in principle, solved. Decision-provenance metadata is straightforward to record. Pre-deployment evaluation of a single model on a static benchmark is well developed. The general “the model is a black box” complaint, in its abstract form, is not on this list because we consider it under-specified in the sense we describe in the cornerstone essay elsewhere in this publication.

The ten above are the ones where the published research, the audit-firm methodology, and the regulatory implementation work all converge on “this is where progress is hardest.” A working researcher, a working operator, or a working regulator who has solved one of them has done substantive work. A working publication that pretends they are easy is doing the marketing work of the firms who would prefer the publication keep pretending.

Editorial note. This list will be revised. We will publish the revisions in the same column. We will not quietly amend.