Measuring AI ROI: a CFO's framework for production AI

A pilot's headline number and a P&L line are different objects. This is a finance-grade method for baselining the metric, attributing the change, and netting out run-cost, so that an AI claim survives an audit committee.

The number a project team reports for an AI initiative and the number a CFO can defend to an audit committee are rarely the same number, and the gap is not dishonesty. It is method. A project team measures the improvement they observed in the conditions they controlled. A CFO has to measure the change in the company's financial position that would not have happened anyway, net of every cost, attributable to this specific intervention, in a way that survives scrutiny from people whose job is to disbelieve it. Those are different disciplines, and most AI ROI claims are built with the first when they need the second.

We have sat on both sides of this. What follows is the method we use to make an AI claim finance-grade: a clean baseline, honest attribution, a fully loaded run-cost, and a translation from operational metric to P&L line. None of it is exotic. It is the same evidentiary standard finance already applies to a capital project. The mistake is exempting AI from that standard because it is new, when novelty is precisely the reason to apply it harder.

Baseline before you build, or you have nothing to compare

The single most common failure is the absent baseline. A team ships an AI tool, the metric looks good, and there is no credible measurement of what the metric was before, under the same definition, in the same population. Without it the improvement is unfalsifiable, and unfalsifiable claims do not clear an audit committee. The baseline has to be captured before the intervention, using the exact metric you will report later, on a population comparable to the one that will use the system.

Baselining also forces a useful argument about what you are actually measuring. 'Handle time fell twenty per cent' is meaningless until you fix the denominator: which tickets, measured how, including or excluding the ones the AI deflected entirely. We insist on writing the metric definition down and having operations and finance both sign it before the build starts. Half the inflated ROI claims we have reviewed dissolve at this step, because the baseline turns out to have been measured differently, or on an easier population, than the post-intervention number it was compared against.
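A minimal sketch of why the denominator has to be fixed in writing. The `Ticket` type, its fields, and the sample numbers are hypothetical, chosen only to show that the same post-launch data yields two different "handle time" figures depending on whether deflected tickets are counted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ticket:
    handle_minutes: float  # agent time spent; 0.0 if the AI deflected the ticket
    deflected: bool        # True if the AI resolved it with no agent involved


def mean_handle_time(tickets, include_deflected):
    """Average handle time under an explicit, named denominator choice."""
    pool = [t for t in tickets if include_deflected or not t.deflected]
    if not pool:
        raise ValueError("no tickets in the chosen population")
    return sum(t.handle_minutes for t in pool) / len(pool)


# Hypothetical post-launch sample: two deflected tickets, three agent-handled.
post = [
    Ticket(0.0, True),
    Ticket(0.0, True),
    Ticket(10.0, False),
    Ticket(12.0, False),
    Ticket(14.0, False),
]

agent_only = mean_handle_time(post, include_deflected=False)   # 36 / 3
all_tickets = mean_handle_time(post, include_deflected=True)   # 36 / 5

print(f"agent-handled only: {agent_only:.1f} min")
print(f"all tickets:        {all_tickets:.1f} min")
```

Neither figure is wrong. The point is that the baseline and the post-intervention number must use the same one, and the signed definition is what records which.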

Attribution: what would have happened anyway

The hard question in any ROI calculation is the counterfactual. Revenue rose, but the sales team also changed its territory plan and the market moved. Costs fell, but a reorganisation removed headcount in the same quarter. Attributing the whole delta to the AI system is how project ROI reaches numbers no CFO will sign. The clean answer, where it is affordable, is a holdout: run the AI for one group and not for a second, comparable group, and measure the difference between them rather than the difference over time. A holdout controls for whatever moved both groups alike in the wider business, which is exactly what naive before-and-after cannot do.
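The gap between the two estimates can be sketched with a few lines of arithmetic. The figures below are invented for illustration (weekly revenue per rep, in thousands), and the function names are ours; the only assumption that matters is that the holdout group is genuinely comparable to the treated group.

```python
from statistics import mean


def before_after_delta(before, after):
    """Naive estimate: change over time in the group that received the AI."""
    return mean(after) - mean(before)


def holdout_delta(treated, holdout):
    """Holdout estimate: treated group vs comparable untreated group, same period."""
    return mean(treated) - mean(holdout)


# Hypothetical data. The market lifted everyone during the period.
treated_before = [100, 102, 98]
treated_after = [112, 110, 114]
holdout_after = [106, 108, 104]   # same period, no AI

naive = before_after_delta(treated_before, treated_after)
clean = holdout_delta(treated_after, holdout_after)

print(f"before-and-after says: +{naive}")
print(f"holdout says:          +{clean}")
```

In this made-up case half of the naive uplift belongs to the market, not the system. A real analysis would also want a significance test and a check that the groups were comparable before launch.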

Where a true holdout is impossible, the honest move is to report a wider range rather than a falsely precise point estimate. State the range, name the confounders you could not control, and present the conservative end as the number for planning. A CFO would far rather hear 'between two and four million, we are planning on two' than a single precise figure with no error bars, because the precise figure is the one that gets challenged and collapses. Attribution discipline is not pessimism. It is the difference between a number that survives and one that is quietly written off at the next review.
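One way to make that reporting habit hard to skip is to give the claim a shape that has no slot for a bare point estimate. This is a sketch only: `AttributedBenefit` and its fields are hypothetical names, and the figures reuse the two-to-four-million example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttributedBenefit:
    low: float                       # conservative end of the estimated range
    high: float                      # optimistic end of the estimated range
    uncontrolled_confounders: tuple  # named factors the analysis could not control

    def __post_init__(self):
        if self.low > self.high:
            raise ValueError("low must not exceed high")

    @property
    def planning_figure(self):
        """The number finance plans on: always the conservative end."""
        return self.low

    def summary(self):
        names = ", ".join(self.uncontrolled_confounders) or "none named"
        return (
            f"between {self.low:,.0f} and {self.high:,.0f}; "
            f"planning on {self.planning_figure:,.0f}; "
            f"not controlled for: {names}"
        )


claim = AttributedBenefit(
    low=2_000_000,
    high=4_000_000,
    uncontrolled_confounders=("territory plan change", "market movement"),
)
print(claim.summary())
```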

Run-cost is a recurring liability, not a line item

AI economics differ from traditional software in one structural way: the marginal cost of use is real and non-trivial. Every inference has a cost, and that cost recurs for as long as the system runs, roughly in proportion to usage. The fully loaded run-cost includes model inference, the retrieval and vector infrastructure, the evaluation and monitoring traffic, the human review that most production systems still need for their tail, and the engineering time to maintain the thing as models and data drift. The token bill is often the smallest of these.
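A minimal sketch of a fully loaded monthly run-cost, with one field per component named above. The class name and every figure are hypothetical; real proportions vary widely by system, so treat the numbers as placeholders rather than benchmarks.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class MonthlyRunCost:
    inference: float                # model and token spend
    retrieval_infra: float          # retrieval and vector infrastructure
    eval_and_monitoring: float      # evaluation and monitoring traffic
    human_review: float             # reviewers handling the tail
    maintenance_engineering: float  # engineering time as models and data drift

    def total(self):
        """Sum of every component: the fully loaded figure."""
        return sum(getattr(self, f.name) for f in fields(self))

    def inference_share(self):
        """Fraction of the fully loaded cost that is the token bill."""
        return self.inference / self.total()


# Illustrative figures only.
cost = MonthlyRunCost(
    inference=4_000,
    retrieval_infra=6_000,
    eval_and_monitoring=3_000,
    human_review=15_000,
    maintenance_engineering=12_000,
)

print(f"fully loaded: {cost.total():,.0f} per month")
print(f"token bill:   {cost.inference_share():.0%} of the total")
```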

The figure that matters is cost per unit of business value: cost per resolved ticket, per processed claim, per qualified lead, measured against the fully loaded cost of the human baseline doing the same work. Express it that way and the comparison becomes one a CFO already knows how to make. Express it as 'we spent X on AI this quarter' and you have given them a cost with no denominator, which reads as pure expense. We have seen genuinely positive programmes defunded because they were reported as a cost centre rather than as unit economics that beat the alternative.
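The comparison reduces to two divisions and a subtraction. The sketch below uses invented monthly figures for a ticket-resolution case; `cost_per_unit` is a hypothetical helper, and both cost inputs are assumed to be fully loaded.

```python
def cost_per_unit(fully_loaded_cost, units):
    """Fully loaded cost divided by units of business value delivered."""
    if units <= 0:
        raise ValueError("units must be positive")
    return fully_loaded_cost / units


# Illustrative monthly figures only.
ai_cost_per_ticket = cost_per_unit(fully_loaded_cost=40_000, units=20_000)
human_cost_per_ticket = cost_per_unit(fully_loaded_cost=90_000, units=15_000)
saving_per_ticket = human_cost_per_ticket - ai_cost_per_ticket

print(f"AI:     {ai_cost_per_ticket:.2f} per resolved ticket")
print(f"Human:  {human_cost_per_ticket:.2f} per resolved ticket")
print(f"Saving: {saving_per_ticket:.2f} per resolved ticket")
```

Reported this way, the same 40,000 of spend reads as a unit cost a third of the alternative instead of as an unexplained expense.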

From operational metric to P&L line

The last translation is the one most often skipped. A twenty per cent reduction in handle time is an operational fact. It becomes a P&L line only if it converts into something the income statement recognises: contractor hours that are actually not bought, more volume served by the same team so that growth needs no new hires, or revenue retained because response times improved measurably. Time saved that is reabsorbed into the working day with no headcount, contract, or revenue consequence is real for the employee and invisible to the P&L. Saying so out loud is what earns finance's trust for the claims that are genuinely cashable.
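A sketch of that split for the simplest case, contractor hours. The function name, the hourly rate, and the hour counts are hypothetical; the assumption is that only hours removed from an actual contract count as cashable, and everything else is reported separately at zero P&L value.

```python
def cashable_benefit(hours_saved, contractor_hours_cut, contractor_rate):
    """Split saved time into a cashable amount and reabsorbed hours.

    Only contractor hours that were actually not bought reach the P&L.
    """
    if contractor_hours_cut > hours_saved:
        raise ValueError("cannot cut more contractor hours than were saved")
    cashable = contractor_hours_cut * contractor_rate
    reabsorbed_hours = hours_saved - contractor_hours_cut
    return cashable, reabsorbed_hours


# Illustrative quarter: 1,000 hours saved, 400 of them removed from a contract.
cashable, reabsorbed = cashable_benefit(
    hours_saved=1_000,
    contractor_hours_cut=400,
    contractor_rate=50.0,
)

print(f"cashable this quarter: {cashable:,.0f}")
print(f"hours reabsorbed (no P&L effect): {reabsorbed}")
```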

This is the discipline in one sentence: an AI ROI claim is finance-grade when it has a pre-registered baseline, a counterfactual you can defend, a fully loaded recurring run-cost, and a named line on the income statement where the benefit lands. Claims with all four are rare and they get funded for years. Claims missing any one of them are the pilots that win applause and lose their budget at the first serious review, and the failure is almost always one of method rather than one of merit.
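The four conditions can be written down as a checklist that a review either passes or fails. This is a sketch with hypothetical names, not a standard; it simply reports which of the four is missing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoiClaim:
    has_preregistered_baseline: bool
    has_defensible_counterfactual: bool
    has_fully_loaded_run_cost: bool
    pnl_line: str  # named income statement line; empty string if none

    def missing(self):
        """Names of the conditions this claim does not yet meet."""
        checks = {
            "pre-registered baseline": self.has_preregistered_baseline,
            "defensible counterfactual": self.has_defensible_counterfactual,
            "fully loaded run-cost": self.has_fully_loaded_run_cost,
            "named P&L line": bool(self.pnl_line.strip()),
        }
        return [name for name, ok in checks.items() if not ok]

    def is_finance_grade(self):
        return not self.missing()


pilot = RoiClaim(True, False, True, "")
print(pilot.is_finance_grade(), pilot.missing())
```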
