Generative AI cost scales with tokens, calls and the human review still wrapped around the output, none of which the pilot stressed. The business case priced a controlled trial and assumed it would extrapolate linearly, which it does not. At full volume the largest model is doing work a smaller one could, retries and oversight pile on, and the cost curve outruns the value curve.
The original case was a one-off spreadsheet that priced the pilot, not production at full volume, and no one rebuilt it when the workload changed. There is usually no FinOps discipline around AI, so spend is unmonitored, models are oversized for the task, and the oversight cost stays invisible. Cost was treated as a launch-day estimate rather than an operating metric.
Margins erode quietly until a finance review forces the question, by which point the spend is embedded in production. The programme's credibility takes the hit even where the underlying value is real, because nobody can defend the unit economics.
Run-cost compounds every month it goes unmanaged, and an oversized model running at full volume can cost several times a right-sized one for identical output. Left unaddressed, a genuinely valuable use case gets cancelled on cost grounds it never needed to fail on.
The unit economics typically live in a single spreadsheet built to win pilot approval, priced for trial volume with the human-oversight cost left out entirely. It is never reconciled against the real production bill, so the gap between assumed and actual cost goes unseen until finance surfaces it. A one-off model cannot manage an operating cost that changes with every token, model choice and review step.
What controls run-cost is an AI FinOps practice: continuous cost measurement against a real unit, model right-sizing so each task runs on the smallest model that meets the bar, and model-agnostic routing that sends work to the cheapest capable model automatically. The governing metric becomes cost per resolved task, including oversight, rather than raw token spend. Cost stops being a launch estimate and becomes an operating discipline with a feedback loop.
Moweb stands up the FinOps practice, right-sizes models against your accuracy bar, and instruments cost per resolved task including the human-oversight cost most teams ignore. The model-agnostic routing layer we build sends each task to the cheapest capable model, and because the architecture is portable, you exploit price moves across providers without re-engineering. We deliver in 8 to 16 weeks on a fixed fee, partner-led, with an audit pack documenting the cost model and controls.
The pilot priced a small, controlled volume and assumed the cost would scale linearly, which generative AI does not. At full volume, token spend, retries and the human oversight wrapped around outputs all multiply, and the largest model is usually doing work a smaller one could handle. The economics did not change; the workload finally revealed what they always were.
The per-token price is, but the bill is not. Most run-cost is driven by choices you control: which model handles each task, how many retries and review steps you allow, and whether you can route to a cheaper capable model. Right-sizing and model-agnostic routing typically cut cost by a large margin without touching output quality.
It is the total cost, including human oversight, to complete one unit of real work the business cares about, such as a resolved ticket or a processed document. Token spend alone hides whether you are actually getting value, because a cheap call that still needs heavy review is not cheap. Cost per resolved task is the only figure finance can weigh against the value delivered.
No, because right-sizing means matching each task to the smallest model that meets your accuracy bar, not lowering the bar. Many tasks are over-served by the largest model and run identically well on a cheaper one. We measure quality against your threshold throughout, so cost comes down while output holds.