Black Box Notes

On opacity, auditability, and the limits of trust in modern AI systems.

Open vs. Closed in 2026

The open-versus-closed debate has been treated, for the last several years, as a politics question. By 2026 it is a procurement question. A note on what each side has done well, what each side has done badly, and what the actual decision is when the buyer is not a member of either tribe.

The open-versus-closed debate has been one of the longer-running set pieces in this industry. For most of the 2023–2024 cycle, it was a politics debate — a clash of factional commitments about whether the technology should be controlled, by whom, and on what terms. The factions developed their own publications, their own conferences, and their own moral vocabulary. By the second half of 2024, the more candid participants on both sides had begun to acknowledge that the debate was, in practice, a fight over institutional position rather than a fight over the underlying technology.

By 2026 the political fight has mostly cooled. What remains is the procurement question. Both sides have shipped, both sides have stumbled, and the buyers (the actual deployers of these systems inside companies that are not themselves AI firms) have begun to evaluate the two on operational grounds rather than ideological ones. This piece is an attempt to record the state of the procurement question, with a particular eye to the opacity-and-auditability axis that is this publication’s beat.

What “open” actually means in 2026

The first thing to say is that the word “open” still does too much work. There is now a stable but fragmented vocabulary that serious buyers use. Open weights means the model parameters are available for download, under some licence. Open architecture means the model design is described in enough detail to be reproduced. Open training data means the dataset is, in some form, available. Open evaluation means the model’s evaluation suite is published. Open inference means the model can be run on infrastructure the buyer controls. Open licence means the buyer can deploy and modify under permissive terms.

Most “open” models in 2026 are open in some of these senses and not in others. The Llama family, the Mistral family, the Qwen family, DeepSeek, and the open-weight releases from other labs are each open along some subset of these axes. The procurement-relevant distinction is which subset.

The factional debate frequently conflates these. A model can have open weights and a non-commercial licence; that is not the same as an open model from a deployment standpoint. A model can have open architecture and closed weights; that is interesting to researchers and operationally meaningless to a buyer.

Useful procurement conversations specify which axes the buyer needs open. The conversations that go nowhere are the ones that use “open” as a single binary.
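One way to make the "which axes" question concrete is to treat the six senses of “open” as combinable flags and compute the gap between what a buyer requires and what a model offers. The following is a minimal sketch, not a standard taxonomy: the class name Openness, the function missing_axes, and both example profiles are hypothetical and chosen only for illustration.

```python
from enum import Flag, auto


class Openness(Flag):
    """The six senses of "open" described above, as combinable flags."""
    WEIGHTS = auto()
    ARCHITECTURE = auto()
    TRAINING_DATA = auto()
    EVALUATION = auto()
    INFERENCE = auto()
    LICENCE = auto()


def missing_axes(required: Openness, offered: Openness) -> Openness:
    """Return the axes the buyer requires that the model does not offer."""
    return required & ~offered


# Hypothetical buyer requirement and hypothetical model profile.
required = Openness.WEIGHTS | Openness.INFERENCE | Openness.LICENCE
candidate = Openness.WEIGHTS | Openness.ARCHITECTURE | Openness.INFERENCE

gap = missing_axes(required, candidate)
print(gap)  # the axes still to be resolved before purchase
```

An empty result means the model covers every axis the buyer asked for; anything else is the agenda for the procurement conversation.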

The auditability-axis comparison

On opacity and auditability specifically, the open-versus-closed axes intersect in ways the political debate has not always made clear. The interesting question is not “which is more interpretable” — neither family of models is currently interpretable in any deep sense — but “which deployment supports the audit primitives that procurement teams now require.”

The open-weights side has structural advantages on three of those primitives. Versioning is straightforward, because the buyer controls which weights are loaded. Replayability is achievable, because the inference stack runs in the buyer’s own environment and the buyer can pin the runtime. Decision-provenance is supportable, because the buyer’s own observability tooling can be wired into the inference pipeline.
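The three primitives can be sketched as a single record written at inference time. This is an illustration under stated assumptions, not a reference implementation: the field names, the runtime pin string, and the example prompt and output are all hypothetical, and the weights are hashed from an in-memory byte string where a real deployment would presumably stream a large file. Matching output hashes on replay also assumes deterministic decoding, which the sketch does not itself guarantee.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


def sha256_hex(data: bytes) -> str:
    """Hex SHA-256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


@dataclass(frozen=True)
class DecisionRecord:
    weights_sha256: str  # versioning: which weights were loaded
    runtime_pin: str     # replayability: the pinned inference runtime
    params_json: str     # replayability: decoding parameters, canonical JSON
    prompt_sha256: str   # provenance: what was asked
    output_sha256: str   # provenance: what was answered


def make_record(weights: bytes, runtime_pin: str, params: dict,
                prompt: str, output: str) -> DecisionRecord:
    """Build an audit record for one inference call."""
    return DecisionRecord(
        weights_sha256=sha256_hex(weights),
        runtime_pin=runtime_pin,
        params_json=json.dumps(params, sort_keys=True),
        prompt_sha256=sha256_hex(prompt.encode("utf-8")),
        output_sha256=sha256_hex(output.encode("utf-8")),
    )


def replay_matches(record: DecisionRecord, replayed_output: str) -> bool:
    """True if a replayed output hashes to the value stored in the record."""
    return sha256_hex(replayed_output.encode("utf-8")) == record.output_sha256


# Placeholder values standing in for a real inference call.
record = make_record(
    weights=b"stand-in for the weight file contents",
    runtime_pin="example-runtime==1.0.0",
    params={"temperature": 0.0, "max_tokens": 64},
    prompt="Is this claim eligible?",
    output="Eligible under clause 4.",
)
print(json.dumps(asdict(record), indent=2))
```

Storing hashes rather than raw text keeps the record small; a buyer who needs the full prompt and output for review would store those alongside it.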

The closed-side equivalents depend on the vendor. A buyer running a closed-source model behind a vendor’s API depends on the vendor for versioning (the model may be updated underneath the buyer’s API endpoint), for replayability (the runtime is the vendor’s), and for decision provenance (the audit trail depends on what the vendor exposes). The buyer can negotiate for these capabilities, and some of the closed vendors now offer them, but the negotiation is the work. The capabilities are not the default.

That is the operational asymmetry. It does not say one family of models is more auditable than the other in some abstract sense. It says the auditability work is structured differently. Open-weights deployments do more engineering up front. Closed-API deployments do more procurement-contract work up front. Both can produce auditable systems. Neither does so automatically.

What each side has done well

The open side has done two things well. First, it has driven down the cost of credible inference to the point that buyers in regulated industries can plausibly run their own infrastructure. Even buyers who do not actually want to run their own infrastructure now have a credible alternative position to negotiate from, which has rebalanced the closed-side commercial conversation. Second, the open side has produced a useful research substrate — open-weights models have been the canonical subject of the interpretability research published over the last two years, including the mechanistic-interpretability work whose results inform the rest of this publication’s coverage.

The closed side has done two things well. First, it has shipped the higher capability frontier on most evaluations that buyers actually care about. The gap is narrower than it was. It is still real. Second, the closed labs have produced the more substantive published material on systemic risk evaluation — pre-deployment evaluation, red-teaming, dangerous-capability assessment — which is downstream of their having the most resources and the most exposure to political pressure. Some of this work has been performative; some of it has been substantive. The substantive parts have, on the whole, been the closed labs’ work, not the open labs’.

What each side has done badly

The open side has done two things badly. The first is that release-cycle hygiene has been uneven; some open releases have shipped without the evaluation artefacts a serious buyer needs, on the theory that the community will produce them. The community sometimes does and sometimes does not. The second is that some of the most prominent open-side advocacy has framed openness as morally sufficient, rather than as one input to a more complicated procurement decision. Buyers learn to discount this register.

The closed side has done two things badly. The first is the silent-update problem: models behind APIs being changed without sufficient notice to buyers. The audit findings we have read suggest this is now caught and remediated more frequently than it was, but it remains a structural risk in closed-API deployment. The second is the persistent overstatement of internal interpretability work. Several closed labs have published, in the last two years, interpretability statements whose technical substance does not survive even moderate scrutiny by researchers in the same field. The pattern is recognisable and the damage is reputational.
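A buyer can watch for silent updates from the outside by keeping a fixed set of canary prompts and comparing response fingerprints over time. The sketch below is one hedged way to do that, with assumptions: the two model functions are local stand-ins rather than any vendor's API, the canary prompts are invented, and exact-hash comparison is only meaningful when the endpoint is called with deterministic settings. A changed hash is a prompt to investigate, not proof of an update.

```python
import hashlib
from typing import Callable, Dict, List


def fingerprint(model: Callable[[str], str], canaries: List[str]) -> Dict[str, str]:
    """Hash the model's response to each canary prompt."""
    return {
        prompt: hashlib.sha256(model(prompt).encode("utf-8")).hexdigest()
        for prompt in canaries
    }


def drifted(baseline: Dict[str, str], current: Dict[str, str]) -> List[str]:
    """Return the canary prompts whose response hash changed or is missing."""
    return sorted(p for p in baseline if current.get(p) != baseline[p])


# Hypothetical stand-ins for the same endpoint before and after an update.
def model_before(prompt: str) -> str:
    return prompt.upper()


def model_after(prompt: str) -> str:
    return "REVISED" if prompt == "refund policy" else prompt.upper()


canaries = ["refund policy", "shipping time", "warranty terms"]
baseline = fingerprint(model_before, canaries)

print(drifted(baseline, fingerprint(model_before, canaries)))
print(drifted(baseline, fingerprint(model_after, canaries)))
```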

What the buyer is actually choosing between

For the publication’s reader who is making the procurement decision rather than the political one, the choice is roughly as follows.

A buyer who has the engineering staff to run inference, who can budget for the operational discipline of maintaining their own model and orchestration stack, and who is comfortable taking responsibility for evaluation and red-teaming, can deploy open weights with a credible auditability story. The story is credible because the buyer owns every layer. The cost is the operational layer. The buyer eats it.

A buyer who does not have that staff, who would prefer to outsource the operational layer, and who is willing to do the procurement-contract work to bind a closed vendor to the audit primitives, can deploy a closed model with a credible auditability story. The story is credible to the extent the contract is enforceable and the vendor’s product actually supports the contracted primitives. The cost is the contractual and oversight layer. The buyer eats that.

There is a small set of buyers — increasingly visible — who run mixed deployments. The pattern is to use a closed-API model for the most capable orchestration layer, an open-weights model for the specialist layers that require pinning, and an internal evaluation pipeline that compares the two on the buyer’s actual production workload. This is operationally expensive. It is also, in our reading of audit reports, the deployment pattern that comes closest to satisfying the full set of audit primitives.
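The internal evaluation pipeline in that pattern can be sketched, at its simplest, as running the same workload through both models and scoring each one. The code below is a minimal illustration under assumptions: the two model functions are local stubs rather than real endpoints, the workload is invented, and exact-match scoring stands in for whatever metric the buyer's production task actually calls for.

```python
from typing import Callable, Dict, List, Tuple

Model = Callable[[str], str]


def compare(models: Dict[str, Model],
            workload: List[Tuple[str, str]]) -> Dict[str, float]:
    """Return each model's exact-match rate on (input, expected) pairs."""
    if not workload:
        raise ValueError("workload must not be empty")
    scores: Dict[str, float] = {}
    for name, model in models.items():
        hits = sum(
            1 for prompt, expected in workload
            if model(prompt).strip() == expected
        )
        scores[name] = hits / len(workload)
    return scores


# Hypothetical stubs standing in for a closed-API model and an open-weights model.
def closed_stub(prompt: str) -> str:
    return {"2+2": "4", "capital of France": "Paris", "3*3": "9"}.get(prompt, "")


def open_stub(prompt: str) -> str:
    return {"2+2": "4", "capital of France": "Paris", "3*3": "6"}.get(prompt, "")


workload = [("2+2", "4"), ("capital of France", "Paris"), ("3*3", "9")]
scores = compare({"closed_api": closed_stub, "open_weights": open_stub}, workload)
print(scores)
```

Running the comparison on the buyer's own workload, rather than on public benchmarks, is the point of the pattern: it measures the two models on the decisions the buyer will actually be audited on.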

What the debate has matured into

The open-versus-closed debate has, in 2026, become less politically heated and more operationally specific. Both sides ship. Both sides have failed. The buyers are buying from both. The factional vocabulary persists at the level of public discourse and has mostly evaporated at the level of procurement decisions.

This publication’s view is that the debate is no longer the most useful frame for thinking about opacity. The more useful frame is the one we have already proposed: technical opacity, institutional opacity, audit primitives, layer by layer. Models can be open or closed; deployments are either auditable or they are not, and the determinant is more often the institutional choice of the deployer than the licensing choice of the model.

Note. The publication’s editorial position on the open-versus-closed debate is no position. We cover deployments. We are interested in what an auditor can read and what a regulator can verify. Both ends of the licensing spectrum can produce stacks that satisfy that test; both ends can produce stacks that fail it. The licensing is not the gate.
