Generative AI unit economics: a five-quarter view

The unit cost of inference will fall, then plateau, then rise again as the workloads enterprises actually deploy get harder. We explain why, and what it means for procurement and architecture decisions in 2026 and 2027.

Frontier-model inference cost per million tokens has fallen by roughly an order of magnitude over the eighteen months to early 2026. Boards have noticed. Procurement teams have noticed more.

But the cost path of generative AI in your enterprise is not the same as the cost path on the frontier model price list. We see three regimes, and each demands a different procurement and architecture posture.

Regime one: the demo deflation

From mid-2023 to late 2025, each new generation of frontier model brought a meaningful drop in price per token on the easy workloads: single-turn prompts, simple summarisation and lightweight code suggestion.

Procurement teams that locked in contracts in 2024 watched 2025 spot prices fall 60 to 80 percent below their contracted rates. The lesson is now a baseline rule: do not commit for longer than 12 months on the easy workloads.
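
To make the lock-in penalty concrete, the sketch below converts a fall in spot price into the multiple of spot that a fixed-rate contract ends up paying. The function name is hypothetical and the calculation assumes the contract rate equalled the spot price at signing, which real contracts may not; only the 60 and 80 percent figures come from the text above.

```python
def overpayment_multiple(spot_price_drop):
    """Multiple of the current spot price paid under a fixed-rate contract.

    spot_price_drop is the fractional fall in spot price since signing,
    e.g. 0.60 for a 60 percent drop. Assumes (hypothetically) that the
    contract rate equalled the spot price on the day it was signed.
    """
    if not 0 <= spot_price_drop < 1:
        raise ValueError("spot_price_drop must be in the range [0, 1)")
    return 1 / (1 - spot_price_drop)


# The two ends of the 60 to 80 percent range quoted above.
for drop in (0.60, 0.80):
    multiple = overpayment_multiple(drop)
    print(f"{drop:.0%} drop -> contract pays {multiple:.1f}x spot")
```

Under that assumption, a 60 percent fall leaves the contract holder paying 2.5 times spot, and an 80 percent fall leaves them paying 5 times spot.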

Regime two: the workload mix shift

Through 2026, enterprise generative AI deployments are migrating from single-turn prompts to multi-turn agents, deep document understanding and long-horizon reasoning. These workloads cost three to twenty times more per task than the demos that anchored last year's price expectations.

Average enterprise cost per task is therefore rising even as cost per token continues to fall, because the tokens consumed per task are growing faster than the token price is dropping. Boards that expect the price-per-token curve to map directly onto budget impact are being surprised by overruns.
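
A small worked example shows how both trends can hold at once. Every number in the sketch below is a hypothetical assumption chosen for illustration (token prices, tokens per task and workload shares), not a figure from a vendor price list; the heavier workload is set at ten times the tokens of the lighter one, inside the three-to-twenty-times range described above.

```python
# Illustrative only: all prices, token counts and shares are hypothetical.

def blended_cost_per_task(price_per_mtok, mix):
    """Average cost per task, in dollars, for a mix of workloads.

    price_per_mtok: price in dollars per million tokens.
    mix: list of (share_of_tasks, tokens_per_task) pairs; shares sum to 1.
    """
    total_share = sum(share for share, _ in mix)
    if abs(total_share - 1.0) > 1e-9:
        raise ValueError("workload shares must sum to 1")
    avg_tokens_per_task = sum(share * tokens for share, tokens in mix)
    return price_per_mtok * avg_tokens_per_task / 1_000_000


# Last year: 90% single-turn prompts (2,000 tokens), 10% agentic tasks (20,000).
last_year = blended_cost_per_task(
    price_per_mtok=10.0,
    mix=[(0.9, 2_000), (0.1, 20_000)],
)

# This year: the token price halves, but agentic tasks are now 60% of volume.
this_year = blended_cost_per_task(
    price_per_mtok=5.0,
    mix=[(0.4, 2_000), (0.6, 20_000)],
)

print(f"last year: ${last_year:.3f} per task")
print(f"this year: ${this_year:.3f} per task")
```

With these assumed inputs, cost per task rises from about $0.038 to about $0.064 even though the price per token has halved. The mix shift alone more than offsets the price cut.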

Regime three: latency-bound deployments

The most strategically important AI workloads in 2026 are latency-bound: real-time copilots in trading, claims, clinical and customer-service workflows. Latency-bound workloads do not benefit from the price-per-token curve in the same way; speed dictates model choice, and the fastest production models are not the cheapest.

We expect a clear premium to emerge through 2026 and 2027 for sub-150ms inference, paid for by a workforce-intensity dividend. Architecture decisions made today on local model placement, edge inference and caching design will compound over five years.
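
One rough way to reason about placement and caching choices is to check mean response time against the 150 ms threshold. The sketch below is a simplified model under stated assumptions: the round-trip times, inference time, cache lookup time and hit rate are all hypothetical, and the function and placement names are invented for illustration.

```python
LATENCY_BUDGET_MS = 150  # the sub-150 ms threshold named in the text


def expected_latency_ms(network_rtt_ms, inference_ms,
                        cache_hit_rate=0.0, cache_lookup_ms=5.0):
    """Mean latency for a request that is either served from a local cache
    or sent over the network to a model. All inputs are hypothetical."""
    miss_latency = network_rtt_ms + inference_ms
    return (cache_hit_rate * cache_lookup_ms
            + (1 - cache_hit_rate) * miss_latency)


# Three hypothetical deployment options for the same 120 ms model.
placements = {
    "remote region, no cache": expected_latency_ms(80, 120),
    "edge node, no cache": expected_latency_ms(10, 120),
    "remote region, 40% cache hits": expected_latency_ms(
        80, 120, cache_hit_rate=0.4),
}

within_budget = {name: ms <= LATENCY_BUDGET_MS
                 for name, ms in placements.items()}

for name, ms in placements.items():
    verdict = "within" if within_budget[name] else "over"
    print(f"{name}: {ms:.0f} ms mean ({verdict} budget)")
```

This compares means only. In the cached remote option, every cache miss still takes the full 200 ms, so a deployment judged on tail latency rather than the average could still fail the threshold.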
