Every large organisation we work with has a graveyard of AI pilots. They are not failures in the usual sense of the word. Each one worked. The demo landed, the steering committee nodded, a slide said 'success', and the project quietly never shipped. The most-cited figures put the pilot-to-production conversion rate somewhere between one in ten and one in three, depending on whose survey you trust and how generously it defines production. The number matters less than the pattern, which is remarkably consistent: the thing that kills a pilot is almost never the model.
We have learned to predict the stall before it happens, because the causes arrive in a fixed order. First nobody owns the result. Then nobody can say whether it is good enough. Then the people whose work it changes were not consulted. Then someone in finance asks what it costs to run at full volume and the answer is uncomfortable. A pilot is engineered to avoid all four questions. Production exists only to answer them. That gap, not model quality, is the subject of this piece.
A pilot optimises for the wrong thing
A pilot is a demonstration that a capability is possible. Production is a commitment that the capability is reliable, owned, and affordable at volume. These are different engineering problems and they reward different behaviour. The pilot rewards a curated dataset, a forgiving demo path, and a sponsor's enthusiasm. None of those survives contact with the long tail of real inputs: the input that nobody anticipated, the edge case that is rare in a sample of two hundred and routine in a population of two million.
The clearest tell is the test set. Ask a pilot team for their evaluation set and you will frequently get a few dozen hand-picked examples that the system handles well. That is not an evaluation; it is a sales aid. A production-bound team has an adversarial set assembled from the cases that broke earlier versions, refreshed as new failures appear, and scored automatically on every change. The presence or absence of that artefact predicts conversion better than any other single signal we track.
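For readers who want to see what that artefact can look like, here is a minimal sketch in Python. Everything in it is an illustrative assumption rather than a description of any particular team's harness: the names EvalCase, score and passes_gate are hypothetical, the system under test is modelled as a plain function from text to text, and correctness is judged by exact match, which real systems will usually need to replace with a task-specific comparison.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class EvalCase:
    """One regression case, kept because it broke an earlier version."""
    case_id: str
    input_text: str
    expected: str
    origin: str  # hypothetical field: where the failure was first seen


def score(system: Callable[[str], str],
          cases: Sequence[EvalCase]) -> tuple[float, list[str]]:
    """Run every case through the system; return pass rate and failed ids."""
    if not cases:
        raise ValueError("evaluation set is empty")
    # Assumption: exact string match defines a pass.
    failed = [c.case_id for c in cases if system(c.input_text) != c.expected]
    pass_rate = (len(cases) - len(failed)) / len(cases)
    return pass_rate, failed


def passes_gate(pass_rate: float, baseline: float) -> bool:
    """A change is acceptable only if it does not score below the baseline."""
    return pass_rate >= baseline
```

Run on every change, for example in continuous integration, a gate of this kind turns 'refreshed as new failures appear' into a habit: each new production failure becomes one more EvalCase, and the baseline only moves when the owner agrees to move it.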
Ownership is the first thing to fix
Ask who owns the AI system in production and you want a person, not a committee and not 'the AI team'. The owner is accountable for the output the way a product manager is accountable for a feature: they decide what good looks like, they carry the pager when it degrades, and they have the authority to pull it. Pilots routinely have no such person because the pilot was a project with an end date, not a product with a roadmap. When the project ends, the orphaned system has no one to defend the budget it needs to keep running.
This is an organisational design decision that has to be made before the build, not discovered after it. The owner should sit on the business side, not in central engineering, because the questions that govern a production system are business questions: what error rate is acceptable, what is the cost of a false positive to a customer, when do we override the model. Engineering builds and runs the machine. The business owns what the machine is allowed to decide. Conflate the two and the system either never ships because no one will sign for it, or ships and quietly drifts because no one is watching the thing it actually does.
Evaluation, change management, run-cost
Once ownership exists, the remaining three causes become tractable. Evaluation means an offline scored set plus online monitoring of the metrics the owner cares about, with alerting when they move. A retrieval system needs grounding checks, not just latency dashboards. A classification system needs the confusion matrix broken out by the segments where errors are expensive. Without this you cannot answer 'is it still good', which means you cannot safely change anything, which means the system ossifies and is eventually switched off.
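As an illustration of breaking the confusion matrix out by segment, the following Python sketch counts outcomes per segment for a binary classifier and flags segments whose false-negative rate exceeds a limit. The function names, the record layout of (segment, actual, predicted), and the choice of false negatives as the expensive error are all assumptions made for the example; the owner decides which error matters and where.

```python
from collections import Counter, defaultdict
from typing import Iterable


def confusion_by_segment(
    records: Iterable[tuple[str, bool, bool]],
) -> dict[str, Counter]:
    """Count tp/fp/fn/tn per segment from (segment, actual, predicted) rows."""
    matrix: dict[str, Counter] = defaultdict(Counter)
    for segment, actual, predicted in records:
        if actual and predicted:
            key = "tp"
        elif actual and not predicted:
            key = "fn"
        elif predicted:
            key = "fp"
        else:
            key = "tn"
        matrix[segment][key] += 1
    return dict(matrix)


def false_negative_rate(counts: Counter) -> float:
    """Share of actual positives the system missed; 0.0 if there were none."""
    positives = counts["tp"] + counts["fn"]
    return counts["fn"] / positives if positives else 0.0


def segments_over_limit(
    records: Iterable[tuple[str, bool, bool]], limit: float
) -> list[str]:
    """Segments whose false-negative rate exceeds the owner's limit."""
    matrix = confusion_by_segment(records)
    return sorted(s for s, c in matrix.items() if false_negative_rate(c) > limit)
```

The same records can feed both the offline scored set and the online alerting, so that 'is it still good' is answered by one definition rather than two.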
Change management is the cause teams most often skip and most often regret. An AI system that changes how someone does their job will be quietly defeated by the people doing that job if they were not part of designing it. We have watched a perfectly accurate document-triage system fail because the clerks it was meant to help did not trust it and re-checked every output by hand, which was slower than before. The fix was not a better model. It was bringing the clerks in, showing them where it was weak, and letting them set the confidence threshold above which it acted without review.
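To make the threshold conversation concrete, a team can show the people doing the work what each candidate threshold would mean. The Python sketch below is one hypothetical way to do that, not a reconstruction of the system described above: given past outputs labelled as (confidence, was_correct), it reports for each threshold the share of documents that would be acted on without review and how accurate those unreviewed decisions would have been. The function names and the data layout are assumptions.

```python
from typing import Optional, Sequence


def threshold_tradeoff(
    scored: Sequence[tuple[float, bool]],
    thresholds: Sequence[float],
) -> list[tuple[float, float, Optional[float]]]:
    """For each threshold: (threshold, share automated, accuracy when automated).

    `scored` holds (confidence, was_correct) pairs from reviewed history.
    Accuracy is None when nothing clears the threshold.
    """
    if not scored:
        raise ValueError("no scored history to analyse")
    rows = []
    for t in thresholds:
        automated = [correct for conf, correct in scored if conf >= t]
        share = len(automated) / len(scored)
        accuracy = sum(automated) / len(automated) if automated else None
        rows.append((t, share, accuracy))
    return rows


def route(confidence: float, threshold: float) -> str:
    """Act without review at or above the agreed threshold, else send to a person."""
    return "auto" if confidence >= threshold else "review"
```

A table of this kind lets the reviewers choose the trade-off themselves: a higher threshold means more manual checks but fewer unreviewed mistakes.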
Run-cost is last because it is the one finance forces you to confront. A pilot's inference bill is a rounding error. The same workload at production volume, with retries, evaluation traffic, and the larger model the accuracy target turned out to require, can be ten to fifty times the pilot estimate. The teams that ship have modelled this from the start: cost per resolved task, not cost per token, measured against the human baseline. If you cannot state that number, the system is not ready for production regardless of how well it demos.
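A rough way to state that number is sketched below in Python. The model is deliberately simple and every field is an assumption to be replaced with your own figures: spend is attempts times cost per attempt, inflated by retries and by evaluation traffic, then divided by the tasks the system actually resolves. The names RunCostInputs, cost_per_resolved_task and beats_human_baseline are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunCostInputs:
    tasks_per_month: int
    cost_per_attempt: float    # model spend for one attempt at one task
    retry_rate: float          # average extra attempts per task, e.g. 0.25
    eval_traffic_share: float  # evaluation calls as a fraction of production calls
    resolution_rate: float     # fraction of tasks the system resolves
    human_cost_per_task: float # the baseline the system is measured against


def cost_per_resolved_task(x: RunCostInputs) -> float:
    """Monthly model spend divided by the number of tasks actually resolved."""
    resolved = x.tasks_per_month * x.resolution_rate
    if resolved <= 0:
        raise ValueError("no resolved tasks; unit cost is undefined")
    attempts = x.tasks_per_month * (1 + x.retry_rate)
    spend = attempts * x.cost_per_attempt * (1 + x.eval_traffic_share)
    return spend / resolved


def beats_human_baseline(x: RunCostInputs) -> bool:
    """True when a resolved task costs less than a person doing the same task."""
    return cost_per_resolved_task(x) < x.human_cost_per_task
```

Dividing by resolved tasks rather than attempted ones is the point of the exercise: a cheap model that resolves little can cost more per outcome than an expensive one that resolves most of what it sees. This sketch omits the cost of the human fallback for unresolved tasks, which a fuller model would add.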
How to beat the stall
The practical move is to refuse to start a pilot until you have named the production owner, defined the evaluation metric they will be held to, identified the people whose work changes, and produced a defensible cost-per-task estimate at full volume. This feels like bureaucracy and it is the opposite. It is the cheapest possible way to discover, on day one and on paper, that an idea will not survive production, before you have spent six months proving it can be demoed.
The good news hidden in the conversion statistic is that the surviving systems are durable. A production AI system with a real owner, a live evaluation harness, an aligned workforce, and a known unit cost does not stall, because every reason it might have stalled was answered before it shipped. The work is front-loaded and unglamorous. It is also the entire difference between an organisation with a portfolio of running systems and one with a graveyard of successful demos.
