Agentic AI, by which we mean systems that plan and execute multi-step tasks with tool access rather than answering a single prompt, is the most over-promised and under-specified category in the enterprise market. The honest position, from deployments across finance, procurement and operations, is that agents work well on a narrow and identifiable class of back-office work and fail on a wider class that vendors nonetheless demonstrate them on. The difference is not model capability and it does not move much when the model improves. It is the shape of the workflow.
The workflows where agents succeed share three properties: they are bounded, reversible and instrumented. Bounded means the space of valid actions is enumerable and the task has a clear definition of done. Reversible means a wrong action can be undone or caught before it has external effect. Instrumented means every step is logged and inspectable. Workflows that are open-ended, irreversible and opaque are where agents fail. Most of the back office is closer to the bounded, reversible, instrumented description than the marketing suggests, which is precisely why the category is worth taking seriously once the hype is set aside.
Where agents earn their place
Invoice-to-purchase-order matching is a strong case. The task is bounded: match an invoice to a purchase order and a goods-receipt note, flag discrepancies, route exceptions. It is instrumented because every match is logged against documents that already exist. It is reversible because the agent proposes a match for posting rather than moving money. We have seen well-scoped deployments handle 70 to 85 percent of routine matches without human touch while routing the genuine exceptions to a person, and the value is not the headline automation rate but the concentration of human attention on the cases that actually need judgement.
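A minimal sketch of the matching step makes the three properties concrete. It is an illustration under stated assumptions, not a description of any particular deployment: the Line record, the three_way_match function, the one-cent price tolerance and the status strings are all hypothetical names chosen for this example. The point it shows is that the function only ever returns a proposal or an exception list, and never posts anything.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class Line:
    """One document line. Hypothetical shape, for illustration only."""
    sku: str
    quantity: int
    unit_price: Decimal


def three_way_match(invoice, purchase_order, received_qty,
                    price_tolerance=Decimal("0.01")):
    """Compare invoice lines with the PO and the goods-receipt quantities.

    Returns (status, discrepancies). It proposes; it never posts or pays.
    received_qty maps sku -> quantity on the goods-receipt note.
    """
    po_by_sku = {line.sku: line for line in purchase_order}
    discrepancies = []
    for line in invoice:
        ordered = po_by_sku.get(line.sku)
        if ordered is None:
            discrepancies.append(f"{line.sku}: not on purchase order")
            continue
        if abs(line.unit_price - ordered.unit_price) > price_tolerance:
            discrepancies.append(
                f"{line.sku}: invoice price {line.unit_price} "
                f"vs PO price {ordered.unit_price}")
        received = received_qty.get(line.sku, 0)
        if line.quantity > received:
            discrepancies.append(
                f"{line.sku}: billed {line.quantity}, received {received}")
    # Clean matches become proposals; anything else goes to a person.
    status = "route_to_human" if discrepancies else "propose_match"
    return status, discrepancies
```

Every branch ends in one of two enumerable outcomes, which is what bounded means in practice, and the discrepancy strings are the log a reviewer would read.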
Procurement intake triage is similar: take an unstructured request, classify it, check it against existing contracts and preferred suppliers, assemble the information a buyer needs, and hand over a structured case. The agent does the assembly, which is bounded and reversible, and a human makes the commitment, which is neither. Reconciliation work, where the agent gathers and aligns records from multiple systems and presents the breaks for a person to resolve, fits the same mould. In each case the agent does the legwork and a human owns the decision with external consequence.
The common thread is that the agent operates inside a space of proposals, and the irreversible step, posting, paying, committing, ordering, sits behind a human or a deterministic rule. This is not a limitation to be engineered away as models improve. It is the design that makes the system safe to run, and the firms getting value are the ones that accepted it rather than fighting it.
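One way to picture that separation is a gate that every agent action passes through. The sketch below is hypothetical throughout: the Proposal and Gate names, the set of irreversible action names and the approve callback are assumptions made for the example. The approve callback stands in for either a human decision or a deterministic rule.

```python
from dataclasses import dataclass, field
from typing import Callable

# Assumed action names, taken from the irreversible steps named in the text.
IRREVERSIBLE = {"post", "pay", "commit", "order"}


@dataclass
class Proposal:
    action: str
    payload: dict
    rationale: str  # the agent's stated reasoning, kept for the audit trail


@dataclass
class Gate:
    approve: Callable[[Proposal], bool]  # a human or a deterministic rule
    log: list = field(default_factory=list)

    def submit(self, proposal: Proposal, execute: Callable[[Proposal], object]):
        """Run reversible actions directly; irreversible ones need approval."""
        if proposal.action in IRREVERSIBLE and not self.approve(proposal):
            self.log.append(("rejected", proposal))
            return None
        self.log.append(("executed", proposal))
        return execute(proposal)
```

A deterministic rule such as Gate(approve=lambda p: p.payload.get("amount", 0) <= 1000) would let small payments through and stop the rest, while an action like drafting a case summary never touches the approval path at all.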
Where they fail, and why the demo hid it
Agents fail on open-ended workflows with irreversible effects and weak instrumentation. An agent given autonomous authority to negotiate with suppliers, adjust pricing, or initiate payments is operating where the action space is unbounded, the effects are external and immediate, and there is no clean undo. The failure mode is not that the agent is wrong often; it is that when it is wrong the consequence has already happened. A 3 percent error rate on irreversible high-value actions is not an efficiency. It is a liability the finance function will eventually have to explain.
The reason these failures do not show in the demo is that demos are run on representative happy-path cases, and the cost of agent failure lives almost entirely in the long tail. An agent that handles 95 percent of cases beautifully and takes a damaging, silent, irreversible action on the remaining 5 percent is worse than no automation, because it has removed the human attention that would have caught the bad case while providing no mechanism to catch it itself. Multi-step compounding makes this sharper: a small error early in a chain is reasoned over as fact by later steps, so the agent does not fail loudly, it fails confidently. The longer the chain, the worse this gets.
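The compounding effect can be put in rough numbers. The sketch below rests on a simplifying assumption that is ours, not a measured property of any system: each step is correct independently with the same probability, and any single wrong step contaminates everything after it. The function name and the 97 percent figure are illustrative.

```python
def chain_success(step_accuracy: float, steps: int) -> float:
    """Probability that every step in a chain is correct.

    Assumes steps succeed independently with equal probability and that
    one wrong step spoils the result, since later steps treat it as fact.
    """
    if not 0.0 <= step_accuracy <= 1.0:
        raise ValueError("step_accuracy must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return step_accuracy ** steps


if __name__ == "__main__":
    for n in (1, 5, 10, 20):
        print(n, round(chain_success(0.97, n), 3))
```

Under those assumptions, a step that is right 97 percent of the time yields a clean ten-step chain roughly 74 percent of the time and a clean twenty-step chain roughly 54 percent of the time. Real steps are not independent, so the figures are indicative only, but the direction holds.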
The diagnostic question for any proposed agent deployment is therefore not "how often is it right?" but "what happens when it is wrong, and can we tell?" If the answer to the second part is "not yet" or "not reliably", the workflow is not ready for an agent, regardless of the average-case accuracy.
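The "can we tell" half of the question can be measured before go-live by having people re-review a sample of agent outputs and recording whether each error was flagged by the system. The sketch below assumes such a sample exists; the AuditedCase record and the detection_report function are hypothetical names for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AuditedCase:
    agent_correct: bool  # verdict from an independent human re-review
    flagged: bool        # did the system raise this case for attention?


def detection_report(cases):
    """Separate average-case accuracy from the share of errors caught."""
    total = len(cases)
    errors = [c for c in cases if not c.agent_correct]
    caught = sum(1 for c in errors if c.flagged)
    return {
        "accuracy": (total - len(errors)) / total if total else None,
        "errors": len(errors),
        "errors_caught": caught,
        "silent_errors": len(errors) - caught,
        # None means the sample contained no errors, so nothing was learned.
        "detection_rate": caught / len(errors) if errors else None,
    }
```

Two systems with the same accuracy can differ completely on silent_errors, and it is the second number that says whether the workflow is ready.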
Designing the human-in-the-loop
The term human-in-the-loop is used loosely, and most implementations get it wrong by placing the human where they add cost but not safety. A human asked to approve every step becomes a rubber stamp within a week, clicks through without reading, and provides the appearance of oversight with none of the substance. A human asked to review only the cases the system itself flags as uncertain provides real oversight but inherits the system's blind spots, because the cases that need a person most are often the ones where the agent was most wrongly confident.
The design that works places the human at the irreversible boundary and nowhere else. The agent works autonomously through the bounded, reversible, instrumented steps, and a person reviews at the single point where an action becomes external and final, with the full reasoning chain and its sources visible at that point. This concentrates scarce human attention where consequence lives, keeps the reviewer engaged because every review is a real decision rather than a stamp, and gives you the audit trail you will need when something does go wrong. Build the boundary first, decide what the agent may do on each side of it, and the question of where the human goes answers itself.
