Inside an Agentic Audit: A Hypothetical Walkthrough
A composite scenario, drawn from the patterns of real audit engagements. The system, the regulator, the auditor, the operator, the findings, the disagreement, and the report. Notes on what goes wrong when a real audit meets a stack that was not built to be read.
What follows is a composite scenario. The system, the operator, the regulator, and the audit firm are not real. The interaction patterns are. The scenario has been assembled from the working practice of audit engagements we have read about or had described to us, with names and identifying details removed and the timeline compressed. The publication’s standard non-fabrication rule applies: where a fact in the composite is drawn from real published material, we have cited it; the rest is hypothetical and labelled as such.
The point of the walkthrough is not to dramatise an audit. The point is to make visible the operational stress points an agentic-system audit puts on a real operating company. The literature on AI audit tends to describe the activity as if it were a procurement task. It is closer to a forensic engagement performed under time pressure, with the auditor’s findings shaping a regulator’s enforcement posture for years afterward.
The system
A composite mid-sized enterprise — call it the operator — runs an agentic system that handles customer-facing tier-one decisions across a regulated workflow. The decisions are not life-safety decisions but are economically consequential, and the regulator’s posture is that the operator is responsible for the system’s behaviour the way it would be responsible for an internal department’s behaviour.
The system is an orchestrator that decomposes incoming requests into one of about a dozen workflow types, calls a small set of specialist agents on each, retrieves customer state from internal systems, sometimes calls external tools, and emits a customer-facing decision. The orchestrator was built on a public agentic framework. The specialists are LLM calls against a commercial provider. The system has been in production for fourteen months.
The operator’s chief risk officer is the person who commissioned the audit, in response to a regulator’s expectation that an external review will be performed every two years. The audit firm is a mid-sized advisory whose practice has shifted, in the last three years, from financial controls to AI controls. The audit team is four people: a partner, a senior, and two associates.
The kickoff
The audit begins with a scoping meeting. The auditors arrive with a request for the system’s documentation. The operator’s AI risk team arrives with their published model card and a one-page architecture diagram.
The senior auditor — a former systems engineer who has lost patience with model cards — asks for the orchestration-layer audit log. The AI risk team explains that there is an orchestration-layer audit log, that it lives in the operator’s observability platform, and that they will arrange access. The senior asks for the prompt registry. The AI risk team is quieter for a moment; the prompt registry exists in a Notion workspace that has been edited by a number of people over fourteen months, and there is no immutable version history.
This is the first finding. It has not been written yet. It is already inevitable.
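What the senior auditor is looking for is a registry that can answer "which prompt was in effect at this timestamp?" without relying on anyone's memory. A minimal sketch of that idea follows, assuming an append-only store keyed by content hash. The class and field names (`PromptRegistry`, `PromptVersion`, `in_effect`), the prompt name, and the date are hypothetical, and an in-memory list stands in for what would need to be write-once storage in practice.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional


@dataclass(frozen=True)
class PromptVersion:
    name: str
    text: str
    sha256: str               # content hash, so later edits are detectable
    effective_from: datetime


class PromptRegistry:
    """Append-only: versions are added, never edited or deleted."""

    def __init__(self) -> None:
        self._versions: List[PromptVersion] = []

    def publish(self, name: str, text: str, effective_from: datetime) -> PromptVersion:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        version = PromptVersion(name, text, digest, effective_from)
        self._versions.append(version)
        return version

    def in_effect(self, name: str, at: datetime) -> Optional[PromptVersion]:
        """Return the latest version of `name` published at or before `at`."""
        candidates = [
            v for v in self._versions
            if v.name == name and v.effective_from <= at
        ]
        return max(candidates, key=lambda v: v.effective_from, default=None)


# Hypothetical usage: the prompt name and date are illustrative only.
registry = PromptRegistry()
registry.publish(
    "denial_specialist",
    "Assess the request against policy.",
    datetime(2024, 1, 10, tzinfo=timezone.utc),
)
```

The design choice that matters is that `publish` only appends. An edit becomes a new version with its own hash and effective date, which is the property a shared document workspace does not provide.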
The kickoff scope is set: the auditors will sample fifty production decisions across the period, will request the full decision chain for each, and will independently reconstruct the decision against the operator’s policy. The operator agrees. The auditors leave with credentials and a working folder.
Week two: the replay problem
The first finding crystallises in the second week. The auditors request replay for decision #17 in the sample, an automated denial from nine months ago. The operator’s team can produce the customer-facing decision, the input that triggered it, and the orchestrator’s plan. They cannot produce the prompt that was in effect at the time of the decision — the prompt registry has been edited since — and they cannot produce the retrieved context, because the retrieval index has been re-built twice in the intervening months and the prior versions were not preserved.
The auditors mark this as a major finding. The operator’s AI risk team objects, on grounds that the decision can be reconstructed approximately. The senior auditor’s position is that approximate reconstruction is not the audit primitive, and that the operator is required to demonstrate the actual chain.
The conversation continues for three meetings. The operator’s eventual position is that this is a known limitation of the platform, that it will be remediated in the next product cycle, and that compensating controls were in place during the period in question. The auditors record the position, accept the compensating-controls language, and proceed.
The finding is going in the report.
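The replay request amounts to a completeness check over the decision chain. The sketch below is one way to express it, assuming a per-decision record written at decision time. The record layout and the `missing_artefacts` helper are hypothetical and are not taken from any particular framework.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass(frozen=True)
class DecisionRecord:
    decision_id: str
    customer_decision: Optional[str]
    triggering_input: Optional[str]
    orchestrator_plan: Optional[str]
    prompt_sha256: Optional[str]       # hash of the prompt text in effect
    retrieved_context: Optional[str]   # verbatim snapshot, not an index pointer


def missing_artefacts(record: DecisionRecord) -> List[str]:
    """Names of the chain elements that were not captured for this decision."""
    return [
        f.name
        for f in fields(record)
        if f.name != "decision_id" and getattr(record, f.name) is None
    ]
```

Applied to a record shaped like decision #17, with the customer-facing decision, the input, and the plan present but the prompt and retrieved context absent, the helper returns `["prompt_sha256", "retrieved_context"]`. Storing the retrieved context verbatim, rather than as a pointer into the index, is what keeps the record replayable after the index is rebuilt.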
Week three: the policy enforcement problem
The auditors turn to policy enforcement. The operator’s published policy is that the system will not auto-deny in certain customer categories without a human-in-the-loop check. The auditors sample five denials in those categories.
In four of them, the human-in-the-loop check fired correctly. In the fifth, the check was skipped because of a misconfiguration in the orchestration layer’s routing logic, which existed for forty-eight hours in month seven. The misconfiguration was caught and fixed; the affected customer was contacted; the decision was overturned. The operator’s team is open about this. They show the auditors the change-history of the orchestration config, the incident retrospective, the customer communication, and the remediation.
This is one of the better moments in the engagement, paradoxically. The forty-eight-hour incident is the kind of finding an unauditable system would not even surface; an unauditable system would have had a similar incident, would not have known about it, and would not have remediated it. The auditors note the incident, note the remediation, note that the policy enforcement is configuration-driven and therefore depends on the change-management discipline around the configuration, and recommend that the policy enforcement migrate to a more deterministic enforcement point. The operator’s response is that the recommendation is reasonable and will be sequenced into the next two quarters.
This recommendation will land in the regulator’s hands. The regulator will read it. The next audit will check whether it has been implemented.
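A "more deterministic enforcement point" can be read as a check that runs in code on every outgoing decision, independent of how the routing configuration is set. The sketch below illustrates that reading. It is our assumption about what such a guard could look like, not the operator's design, and the category names and the `enforce_release_policy` function are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet

# Hypothetical category names; the operator's real policy would define these.
PROTECTED_CATEGORIES: FrozenSet[str] = frozenset({"category_a", "category_b"})


@dataclass(frozen=True)
class ProposedDecision:
    customer_category: str
    outcome: str            # "approve" or "deny"
    human_reviewed: bool


def enforce_release_policy(decision: ProposedDecision) -> str:
    """Return "release" or "hold_for_human".

    Intended to run on every decision just before it is emitted,
    so a routing misconfiguration upstream cannot bypass it.
    """
    if (
        decision.outcome == "deny"
        and decision.customer_category in PROTECTED_CATEGORIES
        and not decision.human_reviewed
    ):
        return "hold_for_human"
    return "release"
```

Under this arrangement, the routing error in month seven would still have occurred, but the unreviewed denial would have been held at the final gate rather than sent.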
Week four: the model-substitution surprise
The auditors discover, in the course of pulling sample decisions, that the commercial LLM provider underlying the specialists has silently upgraded the model behind the API endpoint twice during the audit period. The operator’s team knew about one of the upgrades and not the other. The audit-log capture of the model version, where it existed, reflected what the orchestrator believed it was calling, not the model that actually responded. This is a gap.
The discussion that follows is unusually productive. The operator’s AI risk lead has been arguing internally for model-version pinning for over a year and has been told the cost is not warranted. The audit finding is now the budgetary lever to win that argument. The audit firm understands this dynamic and treats it diplomatically. The finding goes in the report with a clear remediation path.
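One part of a remediation path is to log what the provider reports alongside what the orchestrator requested, and to flag any difference. The sketch below assumes the provider's response metadata carries a model identifier under a `model` key. That key, the function name, and the entry layout are hypothetical, and whether a reported identifier reflects the model actually served depends on the provider.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class ModelVersionEntry:
    requested_model: str
    reported_model: Optional[str]
    needs_review: bool


def log_model_version(
    requested_model: str, response_metadata: Mapping[str, str]
) -> ModelVersionEntry:
    """Record both identifiers; flag the entry when they differ or none is reported."""
    reported = response_metadata.get("model")  # hypothetical metadata key
    return ModelVersionEntry(
        requested_model=requested_model,
        reported_model=reported,
        needs_review=(reported != requested_model),
    )
```

A flagged entry does not prove a silent upgrade. It gives the operator a dated signal to investigate, which is what was missing for the second upgrade in the composite.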
Week six: the regulator-facing summary
The audit report goes through three drafts. The first draft is what the auditors actually saw. The second draft is what the operator’s general counsel will permit to be sent to the regulator. The third draft is the negotiated text.
The negotiation is mostly over framing, not over findings. The findings stand. The framing is over whether the operator’s compensating controls are described as “in place during the period” or “designed during the period and tested afterward.” The first phrasing is defensible. The second is closer to what actually happened in two of the four sub-findings. The negotiated language ends up in between.
The report goes to the regulator. The regulator reads it, sends a follow-up letter requesting evidence of remediation by a specific date, and files the report. The operator’s CRO presents the findings to the board. The board approves the remediation budget. The AI risk team has the budget it was asking for last year.
The audit firm is engaged for the next cycle.
What the walkthrough is meant to show
The composite is not meant to be inspiring or damning. It is meant to be ordinary. The findings in a real agentic-system audit are mostly not catastrophes. They are the slow accumulation of small auditability shortcomings that, separately, looked tolerable to the operator and, together, mean that the system as a whole cannot be independently verified.
There are three implications.
The first is that the audit primitive is reconstruction. The audit is not asking “is the system biased” in any abstract sense; it is asking “can you, the operator, reconstruct what happened on this specific date, with this specific user, under this specific configuration?” The operators who win the audit are the operators who built reconstruction into the stack from the start. The operators who lose the audit are the operators who treated the audit log as observability theatre.
The second is that the cost of remediation, after the audit, is always greater than the cost of building auditability into the platform up front. The operator in the composite will spend the next two quarters back-filling capabilities that the agentic framework’s authors could have built in. The cost is not a discretionary investment; it is a regulatory consequence of having shipped the original system without those capabilities.
The third is that the regulator is reading. The regulator’s letter is the part that most operators underestimate, because they imagine that the audit report is the final artefact. It is not. The audit report is the input. The regulator’s posture toward the operator’s category, for the next several years, is partly downstream of the report. This is what the auditors mean when they say the engagement is consequential.
The hypothetical walkthrough above is, in the publication’s view, the median experience an operator should plan for. The exceptional experience, the audit that goes smoothly because the stack was designed to be audited, is rare enough to deserve separate treatment, focused on the firms that have managed it. We will write about it.
Composite disclosure. This walkthrough is a composite scenario, not a description of any specific engagement. No quoted statements are attributed to real auditors, regulators, or operators. The scenario is illustrative of the patterns the publication has observed in published audit reports, regulator correspondence, and the working accounts of practitioners who have spoken with the editorial team on background.