Essay · AI Stack

The Meter Is On. Most Teams Are Optimizing the Wrong Thing.

AI was always expensive. The meter just made it measurable. Those are different facts, and teams that confuse them will respond with the wrong instincts.

On June 1, GitHub Copilot, the biggest AI coding tool on the market, switched everyone to usage-based billing. Credits tied to tokens, priced at real API rates, draining against an allowance that used to be a flat ceiling. Within two days the forums were full of screenshots. Bills jumping 10x and 25x. Monthly allowances gone in days. Engineering leaders learning that a multi-hour agent session and a quick chat question never cost the same thing, even when the bill said they did.

Most of the reaction has been anger at vendors. I get it, but it is wasted energy. Flat-rate AI was a subsidy, and everyone building these platforms knew it. Running these models costs real money on every call. Metered pricing is just the honest version of what was always true.

Here is the reframe I keep offering people: AI was always expensive. The meter just made it measurable. The first fact everyone knew. The second is the one that changes how you should respond, and teams that confuse the two will respond with the wrong instincts.

So how should you actually think about token spend now? Not prompt tricks. Architecture. Here is the mental model I use.

Stop looking at cost per token. Look at cost per finished task.

The first instinct under a new meter is to ration. Ban the expensive models. Cap the context. Tell developers to write shorter prompts. I call this the flinch, and it usually makes things worse, because cost per token is the wrong number.

The right number is what it cost to get a piece of work done and accepted. That means every token across every retry, plus the hour a senior engineer spent fixing the output, plus the bug that slipped through and surfaced in production. Add it all up and the picture often flips. A top-tier model that solves a hard problem in one shot is frequently cheaper than a budget model that takes four tries and still needs cleanup. And a budget model that handles routine extraction perfectly can be ten times cheaper than reaching for the strongest model out of habit.
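
To make the arithmetic concrete, here is a minimal sketch. Every number in it is made up for illustration, and the function name and parameters are my own, not any vendor's billing API.

```python
def cost_per_finished_task(
    token_cost_per_attempt: float,
    attempts: int,
    rework_hours: float = 0.0,
    hourly_rate: float = 0.0,
    escaped_defect_cost: float = 0.0,
) -> float:
    """Everything it took to get one task done and accepted, in dollars."""
    tokens = token_cost_per_attempt * attempts    # every retry is billed
    rework = rework_hours * hourly_rate           # human cleanup time
    return tokens + rework + escaped_defect_cost  # plus bugs that shipped


# Illustrative numbers for one hard task (assumptions, not measurements).
frontier = cost_per_finished_task(token_cost_per_attempt=2.00, attempts=1)
budget = cost_per_finished_task(
    token_cost_per_attempt=0.20, attempts=4, rework_hours=0.5, hourly_rate=100.0
)
print(f"frontier: ${frontier:.2f}  budget: ${budget:.2f}")
```

With these inputs, the model that is ten times cheaper per attempt ends up far more expensive per finished task, almost entirely because of the half hour of cleanup.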

Once you see it this way, the real conclusion lands: token efficiency is not a procurement problem or a prompting problem. It is a workflow design problem, and it belongs to whoever owns your platform.

Four places the money actually goes

The rate card hides where the spend really comes from. Four things to know.

Output is the expensive direction. On most rate cards, output tokens cost several times more than input tokens, and they are the only tokens your system generates rather than supplies. An agent that writes a 3,000-token explanation when a 200-token summary was the job is wasting money on every single call. Yet almost everyone optimizes the input side and ignores this. Set output budgets per task type. Use structured formats wherever code consumes the result. Ask for diffs, not full files. There is a quality bonus hiding here too: a cut-off JSON object fails validation loudly, while a cut-off paragraph ships quietly.
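
A sketch of what per-task output budgets can look like. The task types and token limits are hypothetical, and how you pass the limit to your provider (usually some max-output-tokens setting) is left out on purpose. The second function shows the "fails loudly" property using only the standard library.

```python
import json

# Hypothetical budgets; tune these per task type from your own data.
OUTPUT_BUDGETS = {"summary": 200, "diff": 800, "explanation": 1500}
DEFAULT_BUDGET = 500


def max_output_tokens(task_type: str) -> int:
    """Look up the output ceiling for a task type."""
    return OUTPUT_BUDGETS.get(task_type, DEFAULT_BUDGET)


def parse_structured(raw: str) -> dict:
    """Parse model output that code will consume.

    A truncated JSON object raises json.JSONDecodeError here,
    instead of shipping quietly the way a truncated paragraph would.
    """
    return json.loads(raw)
```

A truncated response such as `{"status": "ok", "files": ["a.py"` raises immediately, so the caller can retry or stop instead of passing half an answer downstream.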

Caching is a trade, not free money. Cache reads are very cheap, but on some providers, Anthropic among them, cache writes cost more than normal input, and a cache hit requires your prompt prefix to be byte-for-byte identical between calls. One timestamp, one request ID, one reordered JSON key before your cache breakpoint, and you pay the write premium on every call while earning zero reads. This is how most caching projects fail. The prefix everyone believed was stable was not, sometimes because the SDK quietly reorders keys. Put stable content first, volatile content last, and watch your read ratio. If reads are not clearly beating writes, your cache is costing you money.
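
Here is a minimal sketch of the three habits in that paragraph: serialize the stable part deterministically, keep volatile fields out of it, and check whether reads are actually paying for writes. The function names are mine, and the price multipliers are placeholders you should replace with your provider's rate card.

```python
import hashlib
import json


def build_prompt(system_prompt: str, tool_schemas: list[dict], volatile: dict) -> tuple[str, str]:
    """Return (stable_prefix, volatile_suffix).

    sort_keys and fixed separators make the prefix byte-identical
    even if a caller or SDK hands us the same keys in another order.
    """
    prefix = system_prompt + "\n" + json.dumps(
        tool_schemas, sort_keys=True, separators=(",", ":")
    )
    suffix = json.dumps(volatile, sort_keys=True)  # timestamps, request IDs go last
    return prefix, suffix


def fingerprint(prefix: str) -> str:
    """Hash the prefix so you can log it and spot unexpected changes."""
    return hashlib.sha256(prefix.encode("utf-8")).hexdigest()


def caching_savings(
    read_tokens: int, write_tokens: int, write_multiplier: float, read_multiplier: float
) -> float:
    """Savings in base-input-token units; negative means the cache costs money."""
    uncached = read_tokens + write_tokens
    cached = write_tokens * write_multiplier + read_tokens * read_multiplier
    return uncached - cached
```

Logging the fingerprint on every call is a cheap way to catch the "stable prefix that was not stable" failure: if the hash changes between two calls that should share a cache, something volatile leaked into the prefix.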

The tail is where budgets die. Averages lie in agentic systems. The scary spend lives in the tail: an agent stuck re-running the same failing test, editing the same file back and forth, retrying a flaky tool call forever. One uncontained loop can burn a developer's monthly allowance in an afternoon. That is exactly the screenshot category that flooded social media in early June. Every agent needs ceilings: max iterations, max tool calls, max spend per task, and a defined behavior when it hits the ceiling. Stop and report, or escalate once. Never retry blindly. Then watch the p99 of cost per task, because that number predicts your surprise bill.
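
A sketch of those ceilings and the p99 check, assuming your agent loop can report tool calls and spend after each iteration. The class and function names are illustrative, not from any agent framework.

```python
from dataclasses import dataclass


class CeilingHit(Exception):
    """Raised when a task exceeds one of its ceilings."""


@dataclass
class TaskBudget:
    max_iterations: int
    max_tool_calls: int
    max_spend_usd: float
    iterations: int = 0
    tool_calls: int = 0
    spend_usd: float = 0.0

    def record_step(self, tool_calls: int, spend_usd: float) -> None:
        """Account for one agent iteration, then enforce every ceiling."""
        self.iterations += 1
        self.tool_calls += tool_calls
        self.spend_usd += spend_usd
        if self.iterations > self.max_iterations:
            raise CeilingHit("max iterations")
        if self.tool_calls > self.max_tool_calls:
            raise CeilingHit("max tool calls")
        if self.spend_usd > self.max_spend_usd:
            raise CeilingHit("max spend")


def p99(costs: list[float]) -> float:
    """Nearest-rank 99th percentile of cost per task."""
    if not costs:
        raise ValueError("need at least one task cost")
    ordered = sorted(costs)
    rank = (99 * len(ordered) + 99) // 100  # integer ceil(0.99 * n)
    return ordered[rank - 1]
```

The caller catches `CeilingHit` and does exactly one of the defined behaviors: stop and report, or escalate once. It does not start a new loop with a fresh budget.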

Many expensive calls should never have been AI calls. Validating JSON, formatting code, scanning dependencies, filtering logs. These are solved problems with deterministic tools that cost nothing and never hallucinate. Every one of them routed through a model is pure waste, and less reliable than the parser it replaced. Tools first, model second. It is the cheapest optimization available and the most commonly skipped.
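
Two of those examples in a few lines of standard-library Python. This is a sketch of the "tools first" idea rather than a complete toolchain, and the function names are mine.

```python
import json


def validate_json(raw: str) -> tuple[bool, str]:
    """Deterministic JSON validation: no tokens, no hallucinations."""
    try:
        json.loads(raw)
    except json.JSONDecodeError as exc:
        return False, f"line {exc.lineno}, column {exc.colno}: {exc.msg}"
    return True, ""


def filter_logs(lines: list[str], level: str = "ERROR") -> list[str]:
    """Keep only the lines that mention the given level."""
    return [line for line in lines if level in line]
```

If a model is still needed afterward, it sees only the filtered lines or the precise parse error, not the whole payload.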

The biggest lever: have the AI write the script, not do the work

This one deserves its own section because it changes the cost curve, not just the cost.

When an agent does a big repetitive task directly, the token bill scales with the size of the task. Five hundred files to migrate means five hundred rounds of model calls, each carrying context, each billed. When the agent instead writes a script and runs it, you pay a fixed token cost to write and test the script, and the execution costs almost nothing. Same outcome, but the bill no longer grows with the work.
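
The shape of the two curves, as a sketch in token units. The token counts in the usage note are assumptions chosen to make the arithmetic easy, not measurements.

```python
def direct_tokens(n_files: int, tokens_per_file: int) -> int:
    """Agent edits every file itself: cost grows with the task."""
    return n_files * tokens_per_file


def scripted_tokens(n_files: int, script_tokens: int) -> int:
    """Agent writes and tests a script once: cost is flat.

    n_files is deliberately unused; running the script costs compute, not tokens.
    """
    return script_tokens


def break_even_files(script_tokens: int, tokens_per_file: int) -> int:
    """Smallest task size at which the script path stops costing more."""
    return -(-script_tokens // tokens_per_file)  # integer ceiling division
```

If each file costs 8,000 tokens to edit directly and the script costs 40,000 tokens to write and test, the paths break even at 5 files. At 500 files the direct path costs 4,000,000 tokens against the same 40,000.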

The quality argument is even stronger than the cost argument. A script applies the identical change to file 1 and file 500. An agent doing 500 repetitive edits drifts. It gets inconsistent around file 200 and improvises on file 347, and you will not know which files it improvised on. The script you can review once, test on a sample, and re-run when requirements change. The model does the thinking, which is designing the transformation. The script does the repetition.

Two honest caveats. First, "compute is free" is becoming less true. GitHub now bills Actions minutes for its agentic code review on top of token credits, and others will follow. But compute is still orders of magnitude cheaper than frontier-model tokens, so the math holds. Second, the script path has its own failure mode: an agent thrashing to debug a script it cannot get working can burn more tokens than just doing the task. So cap the debug loop and keep a fallback.

The pattern that works best in practice is the hybrid. The script does the mechanical 95 percent and spits out a list of exceptions it could not handle. The model deals only with those. Whenever an agent is about to do something repetitive more than a handful of times, the right question is not "can the model do this." It is "should the model write the thing that does this."
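
A toy version of the hybrid, assuming a hypothetical migration from `legacy_log(...)` calls to `logger.info(...)`. Both names are invented for the example. The script rewrites the plain calls and reports any file where the old name survives in a form it does not understand.

```python
import re

# Hypothetical migration: plain calls are mechanical, anything else is an exception.
PLAIN_CALL = re.compile(r"\blegacy_log\(")
ANY_USE = re.compile(r"\blegacy_log\b")


def migrate(files: dict[str, str]) -> tuple[dict[str, str], list[str]]:
    """Return (migrated sources, paths the script could not handle)."""
    migrated: dict[str, str] = {}
    exceptions: list[str] = []
    for path, source in files.items():
        rewritten = PLAIN_CALL.sub("logger.info(", source)
        if ANY_USE.search(rewritten):
            # The old name is still referenced (aliased, passed as a callback...).
            # Leave the file untouched and hand it to the model.
            exceptions.append(path)
        else:
            migrated[path] = rewritten
    return migrated, exceptions
```

Only the paths in `exceptions` go back to the model, each with just the file and the rule it broke. Everything else was changed identically and can be reviewed as one diff.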

Stop dumping context

Flat-rate pricing trained a bad habit: send everything, because it might be useful. The whole repo. The full policy library. The entire conversation history, to every agent, every time. Under a meter, every speculative token is a charge. And it was never free anyway, because irrelevant context measurably degrades output quality.

The discipline is simple to state: each agent gets what its job needs and nothing else. A reviewer needs the diff, the acceptance criteria, and the quality rules. It does not need the product roadmap.

The same discipline applies across days, and this is where one of my own habits comes in. I do not resume yesterday's session. Resuming is quietly the worst-case billing scenario: the cache expired overnight, so the entire accumulated transcript, every file read, every dead end, gets re-sent and re-billed on your very first message of the morning. You pay for yesterday's whole journey just to ask today's first question.

Instead, I end every working session by having the agent write a handoff file. Decisions made and why, so tomorrow's session does not relitigate them. What is done and verified versus merely assumed. Open questions and next steps. And the section that matters most: discoveries, meaning anything that was expensive to learn. The test flag that has to be set, the API that silently truncates, the schema that differs in staging. Those things cost real tokens to find, and if they are not written down, tomorrow's session pays to find them again. The next morning starts fresh with that one small file instead of a replayed transcript. A handoff is about a thousand tokens. Yesterday's transcript can easily be fifty thousand. And the fresh session is not just cheaper, it is better, because agents measurably degrade as stale context piles up.
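
The handoff does not need tooling, but a fixed shape helps the agent fill it in the same way every day. Here is a sketch of that shape. The field names mirror the sections above, and the rendering format is just one reasonable choice.

```python
from dataclasses import dataclass, field


@dataclass
class Handoff:
    decisions: list[str] = field(default_factory=list)       # what was decided, and why
    verified: list[str] = field(default_factory=list)        # done and checked
    assumed: list[str] = field(default_factory=list)         # believed, not checked
    open_questions: list[str] = field(default_factory=list)
    next_steps: list[str] = field(default_factory=list)
    discoveries: list[str] = field(default_factory=list)     # expensive to learn

    def render(self) -> str:
        """Render the handoff as a small plain-text file."""
        sections = [
            ("Decisions (and why)", self.decisions),
            ("Done and verified", self.verified),
            ("Assumed, not verified", self.assumed),
            ("Open questions", self.open_questions),
            ("Next steps", self.next_steps),
            ("Discoveries", self.discoveries),
        ]
        lines: list[str] = []
        for title, items in sections:
            lines.append(title)
            lines.extend(f"- {item}" for item in (items or ["(none)"]))
            lines.append("")
        return "\n".join(lines)
```

Empty sections render as "(none)" rather than disappearing, so a missing discovery reads as a statement that nothing was found instead of a gap.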

And if you want to know which context actually matters, do not ask the model. Remove a section for a sample of tasks and watch what happens to quality. Sections whose removal changes nothing are dead weight you have been paying to ship on every call.
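
That experiment is a leave-one-out ablation, and it fits in a few lines. In this sketch, `score` is a stand-in for whatever you already use to judge a task (test pass rate, reviewer acceptance). It is a hypothetical callable you supply, not a library function.

```python
from statistics import mean
from typing import Callable, Sequence


def dead_weight_sections(
    sections: dict[str, str],
    tasks: Sequence[str],
    score: Callable[[str, str], float],
    tolerance: float = 0.01,
) -> list[str]:
    """Names of context sections whose removal does not hurt average quality."""

    def average_quality(active: dict[str, str]) -> float:
        context = "\n\n".join(active.values())
        return mean(score(context, task) for task in tasks)

    baseline = average_quality(sections)
    removable = []
    for name in sections:
        without = {k: v for k, v in sections.items() if k != name}
        if baseline - average_quality(without) <= tolerance:
            removable.append(name)
    return removable
```

Each section is tested on its own against the full baseline, so two sections that are redundant with each other can both show up as removable. Drop them one at a time and re-run before trusting the list.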

Savings that ship defects are not savings

Here is what worries me most about the next two quarters. Teams will cut token spend, declare victory, and quietly move the cost somewhere less visible. A prompt cut in half that drops a requirement is not efficiency. It is a deferred expense with interest, paid later in rework and incidents.

So attach a quality gate to every optimization. Same test pass rate, same acceptance coverage, same defect rate, or the change is rejected. In the evaluation frameworks I build, a system that cannot show quality preservation gets capped at a basic maturity rating no matter how clever its caching is. Cost, quality, and speed are one surface. Win the token dashboard while losing the release and you have won nothing.
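
The gate itself is small. A sketch, assuming you already collect these three metrics before and after a change. The names and the zero default tolerance are my choices.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QualityMetrics:
    test_pass_rate: float       # higher is better
    acceptance_coverage: float  # higher is better
    defect_rate: float          # lower is better


def accept_optimization(
    baseline: QualityMetrics,
    candidate: QualityMetrics,
    cost_before: float,
    cost_after: float,
    tolerance: float = 0.0,
) -> bool:
    """Accept a cost optimization only if it is cheaper and quality held."""
    cheaper = cost_after < cost_before
    quality_held = (
        candidate.test_pass_rate >= baseline.test_pass_rate - tolerance
        and candidate.acceptance_coverage >= baseline.acceptance_coverage - tolerance
        and candidate.defect_rate <= baseline.defect_rate + tolerance
    )
    return cheaper and quality_held
```

A change that halves the token bill but lets the defect rate creep up returns `False`, which is the whole point: the saving never reaches the dashboard.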

We have seen this movie

Cloud went through the same arc fifteen years ago. The move from owned capacity to metered consumption, the first shocking bill, the rationing reflex, and then the slow maturing into a real discipline with budgets, attribution, and architecture patterns. We called it FinOps. Token spend is on the same curve, compressed into quarters instead of years.

The destination looks the same too. Spend attributed to teams and task types. Cost per finished task as the headline metric, with the p99 tail right next to it. Prompt templates treated like deployed code, because changing a cached prefix invalidates caches across your whole fleet, and a read-ratio drop that lines up with a release is never a coincidence. Batch queues for anything that does not need an answer in seconds, at roughly half the price.

None of this is exotic. It is the difference between teams that will spend a fraction of today's run rate for the same output, and teams that either flinch into under-using a genuinely transformative capability or sleepwalk into a budget crisis.

The meter is not the story. The meter is the forcing function. AI-assisted engineering just became a measurable system, and measurable systems reward the people who design for the measurement.

Optimize the workflow, not the prompt. Price the whole task, not the call. Let the model write the script instead of doing the repetition. Gate every saving on quality. And put a ceiling on anything that can loop.

The bill was always real. Now it has your name on it.