The Knowledge Hub

Costs

Inference prices have fallen 1,000x in three years — and AI-operated workflow bills are rising anyway. Here is why, and what to do about it.

10 min read · Knowledge Hub module · by Kenny

Last reviewed June 2026

What this edition covers

  • First June 2026 edition: initial publication of the Costs module
  • Verified June 2026 pricing for Claude Haiku 4.5, Sonnet 4.6, and Opus 4.8, alongside models from OpenAI, Google, and DeepSeek
  • O(n²) context compounding mechanics documented with worked numeric example
  • TrueFoundry $8,400→$800/month case study included as the primary production reference
  • Singapore PDPA data-residency cost premium documented (~10% on Bedrock Singapore region)

Inference prices have fallen by a factor of roughly 1,000 over three years.1 A model reaching GPT-3 capability cost $60 per million tokens in November 2021. The same capability tier costs $0.06 today. That is not a rounding error. It is a structural change in what AI-operated work costs to run.

And yet teams building AI-operated workflows report that their monthly bills keep climbing. The paradox is real, and it has a name, the LLM cost paradox: as prices fall, teams build larger and more complex systems, which consume more tokens.2 Price reductions enable previously unviable use cases rather than simply lowering the cost of existing ones. The cost discipline question is not “how cheap is the API?” It is “how do you architect a system that stays within budget as it grows?”

This module covers the mechanics of AI inference pricing, the specific cost dynamics of multi-step agentic workflows, and the seven operational levers that matter most for a system running alongside a team over months.

How inference is priced

Every LLM API call is billed by the token. One token is approximately four characters or 0.75 words in English. The bill has two line items: input tokens (everything sent to the model — system prompt, conversation history, retrieved documents, tool schemas) and output tokens (everything the model generates in response).

Output tokens typically carry a 3–5x premium over input tokens at the major providers, although the June 2026 prices listed below range from 2x (DeepSeek V4 Flash) to roughly 8x (Gemini 2.5 Flash). The reason is computational: input tokens are processed in parallel via the transformer’s attention mechanism, while output tokens are generated one at a time, each requiring a full forward pass through the model. That sequential bottleneck is why output is more expensive, and why output compression is an underused cost lever.

The June 2026 price landscape across the major providers spans close to two orders of magnitude.3 Claude Haiku 4.5 sits at $1.00/$5.00 per million input/output tokens, Claude Sonnet 4.6 at $3.00/$15.00, and Claude Opus 4.8 at $5.00/$25.00. On the Google side, Gemini 2.5 Flash is at $0.30/$2.50 and Gemini 2.5 Flash-Lite at $0.10/$0.40. DeepSeek V4 Flash is at $0.14/$0.28, roughly 21x cheaper on input than Sonnet 4.6. OpenAI GPT-4o-mini is at $0.15/$0.60 and o3 at $2.00/$8.00.

Within a single provider, the cheapest and most capable model tiers differ by 5–25x. Across the prices listed here, the full range is 50x on input ($0.10 to $5.00) and roughly 90x on output ($0.28 to $25.00). Not all of that spread is exploitable: the cheapest models cannot do the same work as frontier models. But a significant portion of any agentic workflow is routine enough to run on budget-tier models, and that portion is where the compounding savings live.
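As a rough illustration of how the two line items combine, the sketch below prices a single call from the per-million rates quoted in this section. The dictionary keys and the `call_cost` helper are hypothetical names chosen for this example; they are not official API model identifiers.

```python
# Per-million-token prices (input, output) in USD, copied from the June 2026
# figures quoted in this module. Keys are illustrative labels, not API model IDs.
PRICES = {
    "claude-haiku-4.5": (1.00, 5.00),
    "claude-sonnet-4.6": (3.00, 15.00),
    "claude-opus-4.8": (5.00, 25.00),
    "gemini-2.5-flash": (0.30, 2.50),
    "gemini-2.5-flash-lite": (0.10, 0.40),
    "deepseek-v4-flash": (0.14, 0.28),
    "gpt-4o-mini": (0.15, 0.60),
    "o3": (2.00, 8.00),
}


def call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one call; input and output are billed separately."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


if __name__ == "__main__":
    # The same request (4,000 tokens in, 500 tokens out) priced on each model.
    for model in PRICES:
        print(f"{model:24s} ${call_cost(model, 4_000, 500):.6f}")
```

Running the same request across every entry makes the spread concrete before any routing decision is made.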

The compounding problem: why multi-step workflows cost more than you expect

Single-call pricing is intuitive. Multi-step agentic workflows are not. The critical mechanic: most LLM APIs bill for the entire conversation history on every call, not just the new tokens added to it. This means the cost of each successive step is higher than the last.

For a workflow starting at 500 tokens that adds 300 tokens per step, the context at step 10 is 3,500 tokens. But the total billed tokens across all 10 calls are approximately 20,000 — not 3,500. The cumulative cost grows at O(n²) in the number of steps.4 For a 50-step workflow (not unusual in research or content production agents), the compounding can push total billed tokens to 20–50x a naive per-call estimate.
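The arithmetic above can be checked in a few lines. This is a minimal sketch that assumes the context sent at step k is 500 + 300k tokens (which gives 3,500 at step 10) and counts input tokens only; the function names are hypothetical.

```python
def context_at_step(step: int, base: int = 500, added_per_step: int = 300) -> int:
    """Context size sent on a given step (1-indexed), assuming linear growth."""
    return base + added_per_step * step


def total_billed_input(steps: int, base: int = 500, added_per_step: int = 300) -> int:
    """Sum of context sizes over all calls: every call re-sends the history."""
    return sum(context_at_step(k, base, added_per_step) for k in range(1, steps + 1))


def closed_form(steps: int, base: int = 500, added_per_step: int = 300) -> int:
    """Same total as n*base + added*n*(n+1)/2; the n^2 term is the compounding."""
    return steps * base + added_per_step * steps * (steps + 1) // 2


if __name__ == "__main__":
    for n in (10, 50):
        print(n, context_at_step(n), total_billed_input(n), closed_form(n))
```

Under these assumptions the 10-step total is 21,500 input tokens and the 50-step total is 407,500, while the final context sizes are only 3,500 and 15,500 tokens.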

In the TrueFoundry case study, a three-step Claude code review agent ran at $8,400 per month before optimisation. After routing a single misconfigured step through a semantic cache, the same workflow cost under $800, a reduction of roughly 90% at unchanged capability.4

The cause in that case was a single step that injected a 50,000-token security manual into every pull request review, triggered by automated CI/CD pipelines running at high frequency. That step alone accounted for 92% of total cost. The workflow had no per-step attribution; the overrun was invisible until the monthly bill arrived.

Tool use adds a non-obvious overhead on top of the context compounding. Every call where tools are provided includes the tool schemas regardless of whether any tool is used. For an agent with ten tools, that fixed overhead adds 1,000–2,000 tokens to every request — a cost that accumulates silently across a long workflow.
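To put a number on that overhead, the sketch below multiplies a per-request schema size by the number of steps. The 150 tokens per tool is an assumed midpoint consistent with the 1,000–2,000-token range for ten tools, not a measured value, and the helper name is hypothetical.

```python
def tool_schema_overhead(num_tools: int, tokens_per_tool: int, steps: int) -> int:
    """Tokens spent re-sending tool schemas across a workflow.

    Schemas ride along on every request where tools are provided, so the
    overhead scales with the number of steps, not with how often tools are used.
    """
    return num_tools * tokens_per_tool * steps


if __name__ == "__main__":
    overhead = tool_schema_overhead(num_tools=10, tokens_per_tool=150, steps=50)
    # Priced at the Sonnet 4.6 uncached input rate quoted in this module.
    cost = overhead * 3.00 / 1_000_000
    print(f"{overhead:,} tokens, about ${cost:.3f} per 50-step run")
```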

The seven levers that actually move the number

Seven operational levers, in descending order of impact for most AI-operated workflows:

Prompt caching. Stable context (system prompts, tool schemas, retrieved reference documents) is the single highest-impact target. Anthropic’s cache read price for Claude Sonnet 4.6 is $0.30 per million tokens versus $3.00 for uncached input, a 90% reduction on that portion of the prompt.5 The break-even is one cache read within the five-minute window. For a 50-step workflow where the system prompt and tool schemas together total 3,500 tokens, reading that prefix from cache saves approximately $9.45 per 1,000 calls compared to uncached (3,500 tokens × $2.70 per million × 1,000 calls). As a side effect, prompt caching reduces time-to-first-token by up to 85% for long prompts, so the latency benefit compounds with the cost reduction.
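The break-even claim can be sketched as follows. The read and base prices come from the text; the 1.25x cache-write premium is an assumption that should be checked against the provider's pricing page, and the sketch further assumes every call after the first is a cache hit. Function names are hypothetical.

```python
BASE_INPUT_PER_MTOK = 3.00     # Sonnet 4.6 uncached input, from the text
CACHE_READ_PER_MTOK = 0.30     # Sonnet 4.6 cache read, from the text
CACHE_WRITE_MULTIPLIER = 1.25  # ASSUMPTION: write premium; verify before relying on it


def prefix_cost_uncached(prefix_tokens: int, calls: int) -> float:
    """Cost of sending the same prefix uncached on every call."""
    return prefix_tokens * calls * BASE_INPUT_PER_MTOK / 1_000_000


def prefix_cost_cached(prefix_tokens: int, calls: int) -> float:
    """One cache write, then cache reads, assuming every later call is a hit."""
    if calls <= 0:
        return 0.0
    write = prefix_tokens * BASE_INPUT_PER_MTOK * CACHE_WRITE_MULTIPLIER / 1_000_000
    reads = prefix_tokens * (calls - 1) * CACHE_READ_PER_MTOK / 1_000_000
    return write + reads


if __name__ == "__main__":
    for calls in (1, 2, 50):
        uncached = prefix_cost_uncached(3_500, calls)
        cached = prefix_cost_cached(3_500, calls)
        print(f"{calls:3d} calls: uncached ${uncached:.5f}  cached ${cached:.5f}")
```

With a single call the cached path costs slightly more because of the write premium; from the second call onward it is cheaper, which is the one-read break-even described above.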

Batch API routing. Anthropic’s Batch API delivers a flat 50% discount on all tokens for asynchronous requests completing within 24 hours.3 Batch and caching discounts stack: a batch request with a warm cache hit can cost as little as 5% of a real-time uncached request on input. Research, analysis, content generation, report drafting, competitor monitoring — any step where no one is waiting on a synchronous response belongs in a batch queue. This is not a niche optimisation. It is the correct default for the majority of steps in most AI-operated workflows.
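The stacking arithmetic is short enough to write down. This sketch uses only the two discount figures quoted in this module; the function name and its parameters are hypothetical.

```python
def effective_input_multiplier(
    batch: bool,
    cache_hit: bool,
    batch_discount: float = 0.50,       # flat Batch API discount, from the text
    cache_read_fraction: float = 0.10,  # cache read price / base input price, from the text
) -> float:
    """Fraction of the real-time, uncached input price actually paid."""
    multiplier = 1.0
    if batch:
        multiplier *= 1.0 - batch_discount
    if cache_hit:
        multiplier *= cache_read_fraction
    return multiplier


if __name__ == "__main__":
    for batch in (False, True):
        for cache_hit in (False, True):
            m = effective_input_multiplier(batch, cache_hit)
            print(f"batch={batch!s:5} cache_hit={cache_hit!s:5} -> {m:.0%} of list price")
```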

Model-tier routing. Mapping each workflow step to the cheapest model capable of handling it reliably can reduce overall inference cost by 30–70%.6 The three-tier framework: classification, extraction, and formatting steps on Haiku 4.5 or Gemini 2.5 Flash; synthesis, reasoning, and quality judgment on Sonnet 4.6; complex multi-step reasoning and high-stakes validation on Opus 4.8 or o3. The 5–25x price difference between the cheapest and most capable tiers at a single provider means that routing even half of calls to the cheaper tier reduces total cost by roughly 40–48%, assuming similar token volumes per call: the routed half then costs between one-fifth and one-twenty-fifth of what it did. Routing rules should be based on observed task failure rates per model tier, not assumptions.
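One way to turn observed failure rates into a routing rule is sketched below. It assumes that price order tracks capability within a provider and that failure rates come from your own task-specific evaluation; the tier labels, the `pick_tier` helper, and the example rates are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    input_price: float  # USD per million input tokens


# Prices are the June 2026 figures quoted in this module.
TIERS = [Tier("haiku-4.5", 1.00), Tier("sonnet-4.6", 3.00), Tier("opus-4.8", 5.00)]


def pick_tier(failure_rates: dict[str, float], max_failure_rate: float) -> Tier:
    """Return the cheapest tier whose observed failure rate is acceptable.

    Tiers with no observations are skipped. If nothing meets the threshold,
    fall back to the most expensive tier (ASSUMED to be the most capable).
    """
    for tier in sorted(TIERS, key=lambda t: t.input_price):
        rate = failure_rates.get(tier.name)
        if rate is not None and rate <= max_failure_rate:
            return tier
    return max(TIERS, key=lambda t: t.input_price)


if __name__ == "__main__":
    # Made-up failure rates for a hypothetical extraction step.
    observed = {"haiku-4.5": 0.08, "sonnet-4.6": 0.02, "opus-4.8": 0.01}
    print(pick_tier(observed, max_failure_rate=0.05).name)
```

Skipping tiers with no data keeps the rule honest: a cheap model is only chosen once it has been measured on the task in question.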

Context pruning at step boundaries. Design each agent step to receive only the context it needs — not the full session history. A content creation step does not need the audit trail of the research step that preceded it. Coordinator-managed context segmentation is the architectural mechanism. This is the structural fix for the O(n²) compounding problem.
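A minimal sketch of coordinator-managed segmentation follows, assuming session state is held as a dictionary and each step declares the keys it needs. The step names, state keys, and `context_for` helper are hypothetical.

```python
# Each step declares the state keys it needs; the coordinator passes nothing else.
STEP_INPUTS = {
    "research": ["brief"],
    "draft": ["brief", "research_summary"],
    "review": ["draft_text", "style_guide"],
}


def context_for(step: str, session_state: dict[str, str]) -> dict[str, str]:
    """Return only the state entries the named step has declared it needs."""
    needed = STEP_INPUTS[step]
    return {key: session_state[key] for key in needed if key in session_state}


if __name__ == "__main__":
    state = {
        "brief": "Launch post for product X",
        "research_summary": "Three key findings...",
        "research_audit_trail": "Very long log of every source visited...",
        "draft_text": "First draft...",
        "style_guide": "Short sentences. No jargon.",
    }
    # The draft step never sees the research audit trail.
    print(sorted(context_for("draft", state)))
```

Because each step's input is bounded by its declaration rather than by the session length, the per-call context stops growing with the number of steps.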

Output format constraints. Output tokens are priced at 3–5x input. Structured JSON outputs, explicit field constraints, and system-prompt-level brevity instructions reduce output token counts without reducing information density. Reducing output verbosity by 30% reduces total cost by 15–20% in a typical input-heavy workflow.
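How much a shorter output saves depends on the output share of the bill, which the sketch below makes explicit. The token counts are illustrative assumptions for an input-heavy step; the prices are the Sonnet 4.6 rates quoted in this module, and the function names are hypothetical.

```python
def total_cost(input_tokens: float, output_tokens: float,
               input_price: float = 3.00, output_price: float = 15.00) -> float:
    """USD cost of one call at per-million-token prices."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


def saving_from_shorter_output(input_tokens: float, output_tokens: float,
                               output_cut: float) -> float:
    """Fractional reduction in total cost when output shrinks by output_cut."""
    before = total_cost(input_tokens, output_tokens)
    after = total_cost(input_tokens, output_tokens * (1.0 - output_cut))
    return 1.0 - after / before


if __name__ == "__main__":
    # 5,000 tokens in, 1,000 tokens out: output is half the bill at a 5x premium.
    print(f"{saving_from_shorter_output(5_000, 1_000, 0.30):.1%}")
```

At a five-to-one input-to-output token ratio the 30% cut yields a 15% saving; steps with a larger output share save proportionally more.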

Per-step cost attribution. Without granular per-step token logging, cost spikes are invisible until the monthly bill arrives. The TrueFoundry case study — 92% of cost in a single misconfigured step — is not an edge case. It is what happens in any workflow that tracks only aggregate monthly spend. Build per-step attribution in from day one.
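Per-step attribution does not need much machinery. The sketch below is one possible shape, assuming token counts are taken from the usage metadata that provider APIs typically return with each response; the `StepCostLedger` class, the step names, and the token counts are hypothetical and are not taken from the case study.

```python
from collections import defaultdict


class StepCostLedger:
    """Accumulates inference cost per workflow step."""

    def __init__(self, input_price: float, output_price: float) -> None:
        self.input_price = input_price    # USD per million input tokens
        self.output_price = output_price  # USD per million output tokens
        self._cost_by_step: dict[str, float] = defaultdict(float)

    def record(self, step: str, input_tokens: int, output_tokens: int) -> None:
        """Add one call's cost to the running total for its step."""
        cost = (input_tokens * self.input_price
                + output_tokens * self.output_price) / 1_000_000
        self._cost_by_step[step] += cost

    def shares(self) -> dict[str, float]:
        """Each step's fraction of total recorded cost."""
        total = sum(self._cost_by_step.values())
        if total == 0:
            return {}
        return {step: cost / total for step, cost in self._cost_by_step.items()}

    def hotspots(self, threshold: float = 0.5) -> list[str]:
        """Steps that account for at least `threshold` of total cost."""
        return [step for step, share in self.shares().items() if share >= threshold]


if __name__ == "__main__":
    ledger = StepCostLedger(input_price=3.00, output_price=15.00)
    ledger.record("fetch_diff", 2_000, 200)
    ledger.record("security_check", 52_000, 500)  # large document injected here
    ledger.record("summarise", 2_500, 300)
    print(ledger.hotspots())
```

Checking `hotspots()` per run, or per day, surfaces a dominant step long before an aggregate monthly total would.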

Semantic caching for recurring patterns. For workflows that process similar inputs repeatedly — weekly competitor analysis, monthly report generation, recurring classification tasks — semantic caching can avoid redundant API calls entirely. This is more complex to implement than prefix caching but effective for recurring patterns.
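The idea can be illustrated with a toy cache. This sketch substitutes word-count cosine similarity for a real embedding model, which a production system would use instead, and does a linear scan rather than a vector index; the `SemanticCache` class and the 0.9 threshold are hypothetical choices for illustration.

```python
import math
from collections import Counter
from typing import Optional


def _vectorise(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(text.lower().split())


def _cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


class SemanticCache:
    """Returns a stored response when a new prompt is similar enough to an old one."""

    def __init__(self, threshold: float = 0.9) -> None:
        self.threshold = threshold
        self._entries: list[tuple[Counter, str]] = []

    def get(self, prompt: str) -> Optional[str]:
        """Return the best-matching cached response, or None on a miss."""
        query = _vectorise(prompt)
        best_score, best_response = 0.0, None
        for vector, response in self._entries:
            score = _cosine(query, vector)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None

    def put(self, prompt: str, response: str) -> None:
        """Store a prompt/response pair for later lookups."""
        self._entries.append((_vectorise(prompt), response))


if __name__ == "__main__":
    cache = SemanticCache()
    cache.put("weekly competitor analysis for acme", "cached report")
    print(cache.get("Weekly competitor analysis for Acme"))  # hit: no API call needed
    print(cache.get("draft a press release"))                # miss: call the model
```

The threshold is the main design decision: set it too low and the cache returns stale or wrong answers, so it should be tuned against real traffic rather than guessed.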

Open-weight and budget-tier models: where the trade-offs actually land

DeepSeek V4 Flash at $0.14 per million input tokens is 21x cheaper on input than Claude Sonnet 4.6. Gemini 2.5 Flash-Lite at $0.10 is 30x cheaper. For commodity tasks — classification, formatting, extraction, summarisation of structured documents — these models can perform comparably to frontier models at a fraction of the cost.7

Three trade-offs that the price spread does not resolve:

Quality threshold unpredictability. Budget models fail on reasoning tasks at rates that are not predictable without task-specific testing. The same model may perform differently at different quantisation levels across hosting providers. The price is only the starting point; the total cost of a wrong output — a misclassified customer, an incorrect extraction that propagates through downstream steps — is not in the per-token price.

Singapore PDPA compliance. DeepSeek’s API infrastructure routes through non-Singapore servers. For workflows that process identifiable customer data or confidential business information, this creates a compliance gap. Anthropic inference routed through AWS Bedrock’s Singapore region (ap-southeast-1) and Google Cloud’s Singapore region both support compliant regional routing, with an approximate 10% regional endpoint premium on Bedrock.7 That premium is the cost of compliance, not an inefficiency.

Self-hosting viability threshold. The infrastructure breakeven for self-hosting open-weight models is an enterprise-scale threshold — by one industry estimate, on the order of 8,000 or more conversations per day with H100 GPU capacity held above 50% utilisation.7 (Our open-vs-closed-models analysis treats the same breakeven at the token level and reaches the same conclusion: the crossover sits far above SMB volumes.) Most founder-led teams running AI-operated workflows will not reach that scale in the first year. The cost advantage of self-hosting is real at enterprise scale; it is irrelevant at SMB scale.

The practical pattern: use frontier closed models with prompt caching and Batch API for judgment-heavy steps; use budget-tier managed models for commodity steps; avoid self-hosting until volume clearly justifies the infrastructure overhead.

What this means for a system running over months

An AI-operated engagement — the kind RTSN builds, running alongside a team across months — has a cost profile that differs structurally from a single-use workflow. Several dynamics are specific to multi-month operation.

The context window resets with every API call, but the engagement does not reset. An agent managing content production across a six-month engagement accumulates state — brand decisions, prior content, client preferences, quality gate outcomes — that the next call needs access to. Without structured memory architecture (covered in the AI Memory module), that state either gets injected in full on every call (expensive, and the O(n²) compounding applies) or gets dropped (the agent repeats the same conversations indefinitely).

The Epoch AI analysis documents that median inference prices declined by approximately 50x per year, with post-January 2024 data showing acceleration to a median of 200x per year.1 For a system designed to run for 12 months, the pricing environment at the end of the engagement will be meaningfully cheaper than at the start. Cost projections for multi-month engagements should model price decline into the forecast, not assume static rates.
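One simple way to fold a price decline into a forecast is sketched below. It assumes a smooth exponential decline, constant token volume, and that the workflow actually captures the decline (for example by moving to cheaper models as they appear). The 10x factor in the example is an arbitrary illustrative value, not a prediction, and the function names are hypothetical.

```python
def price_multiplier(month: int, annual_decline_factor: float) -> float:
    """Fraction of the month-0 price still paid in a given month.

    An annual_decline_factor of 10 means prices fall 10x over 12 months.
    """
    return annual_decline_factor ** (-month / 12)


def forecast_spend(monthly_cost_at_start: float, months: int,
                   annual_decline_factor: float) -> float:
    """Total spend over the engagement at constant token volume."""
    return sum(
        monthly_cost_at_start * price_multiplier(month, annual_decline_factor)
        for month in range(months)
    )


if __name__ == "__main__":
    static = forecast_spend(100.0, 12, annual_decline_factor=1.0)
    declining = forecast_spend(100.0, 12, annual_decline_factor=10.0)
    print(f"static rates: ${static:.0f}   with 10x/year decline: ${declining:.0f}")
```

Comparing the static and declining totals side by side shows how much of a 12-month budget depends on the decline assumption, which is a useful sensitivity check before committing to a figure.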

The illustrative order of magnitude for a well-optimised multi-agent engagement — using prompt caching at a 60% hit rate, Batch API routing for eligible steps, and model-tier routing — puts full-engagement inference costs in the range of tens of dollars at current Claude Sonnet 4.6 pricing. That is not a guarantee; it depends on step-level token discipline and the specific workflow architecture. But it confirms that inference cost, at current prices and with competent optimisation, is not the dominant cost of an AI-operated engagement. The dominant costs are design, judgment, and the human time that the system is intended to free up.

The Knowledge Hub updates monthly — each module re-checked, with what changed summarised at the top. Follow the modules you care about; skip the ones you don’t.

Notes + references

  1. a16z, Welcome to LLMflation — LLM inference cost is going down fast (March 2025). a16z.com/llmflation-llm-inference-cost. Source for the 1,000x three-year price decline and the GPT-4-tier 62x figure. The Epoch AI companion analysis documents 50x median annual decline, accelerating to ~200x post-January 2024. epoch.ai/data-insights/llm-inference-price-trends
  2. The “LLM cost paradox” framing appears in multiple industry analyses reviewed. The structural dynamic — falling per-token prices enabling larger systems rather than lower bills — is consistently observed across operational cost guides for AI-operated workflows.
  3. Anthropic, Pricing. Claude API Documentation (verified June 2026). platform.claude.com/docs/en/about-claude/pricing. Source for Claude Haiku 4.5, Sonnet 4.6, and Opus 4.8 prices; Batch API 50% discount; prompt cache read prices. Google and DeepSeek pricing verified from their official API documentation at the same date.
  4. TrueFoundry, Agentic Token Explosion: How to Attribute, Budget, and Control LLM Costs When AI Runs in CI/CD. truefoundry.com/blog/llm-cost-attribution-agentic-cicd. Source for the $8,400/$800/month case study and the 92% single-step cost concentration finding. The O(n²) compounding formula is derived from the cumulative token-billing mechanic documented in this analysis.
  5. ngrok, Prompt caching: 10x cheaper LLM tokens, but how? ngrok.com/blog/prompt-caching. Source for the 90% cache-read discount mechanics, break-even calculation, and 85% latency reduction on long prompts. Confirmed against Anthropic pricing documentation.
  6. Zhao et al., Dynamic Model Routing and Cascading for Efficient LLM Inference: A Survey, arXiv:2603.04445; “Automatic Transmission for LLM Tiers: Optimizing Cost and Accuracy in Large Language Models,” ACL 2025 Findings, arXiv:2505.20921. Source for the 30–70% cost-reduction range from model-tier routing. The range represents the reported spread across industry analyses; the mechanism is well-established.
  7. Introl, Inference Unit Economics: The True Cost Per Million Tokens. introl.com/blog/inference-unit-economics-true-cost-per-million-tokens-guide. Source for the self-hosting breakeven at ~8,000+ conversations/day and H100 utilisation analysis. DeepSeek pricing and PDPA routing context verified from DeepSeek API documentation and Anthropic/AWS Bedrock regional documentation.

Maintained by

Kenny

Founder, RTSN Studios · Singapore

This module is researched with RTSN’s AI research agents and citation-checked by Kenny before publication.

Cost discipline starts at the architecture level, not the billing dashboard.