
What AI Models Actually Cost in 2026, And How to Stop Wasting Money on the Wrong Ones

Date: 05/03/2026

Stuart Watkins

There’s a conversation about AI model costs that most businesses aren’t having yet, but they will be.

It’s not “should we use AI?” That ship has sailed. It’s “why is our AI bill so high, and what on earth are we getting for it?”

When we built our first team of AI agents at Devstars, we made every mistake going. We defaulted to the most powerful model for everything, ran tasks sequentially when they could run in parallel, and wondered why we were burning through £162 before lunch.

[Chart: daily AI model costs for Claude Sonnet, Opus, and Haiku 4.5, 7–31 March, showing how much the spend varied day to day.]

A bit of attention, a bit of model-routing logic, and that bill dropped to around £10 a day. Same output quality. Same capabilities. Just smarter decisions about which brain to use for which job.

Here’s what we learned.

First, Let’s Talk Tokens

Before any pricing makes sense, you need to understand how AI is actually billed.

AI models don’t charge by the word, the query, or the hour. They charge by the token.

A token is roughly 0.75 of a word, or about 4 characters. So “digital marketing” is approximately 3 tokens. A 1,000-word blog post is around 1,300 tokens.

There are two types, and they’re priced very differently:

Input tokens are everything you send to the model. Your question, your instructions, any documents you’re asking it to analyse, and conversation history. This is relatively cheap.

Output tokens are everything the model generates back. The response, the analysis, the content. This is significantly more expensive, typically 3–8x the price of input.

Why the gap? Generating output is computationally heavy. The model is creating something new, token by token, using considerable processing power. Reading your input is comparatively light work.

This matters enormously when you’re designing AI workflows. If you can ask for a shorter, structured answer rather than a flowing essay, you save real money. A model that returns a JSON object rather than paragraphs of explanation costs a fraction of the price for the same underlying analysis.
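The arithmetic above can be sketched in a few lines. This is a rough estimator using the heuristics from this section (about 0.75 words per token, output priced several times higher than input); the $3/$15 rates are illustrative placeholders, not live prices.

```python
WORDS_PER_TOKEN = 0.75

def estimate_tokens(word_count: int) -> int:
    """Approximate token count for a given word count."""
    return round(word_count / WORDS_PER_TOKEN)

def estimate_cost(input_words: int, output_words: int,
                  input_price: float, output_price: float) -> float:
    """Cost in dollars, with prices quoted per million tokens."""
    input_tokens = estimate_tokens(input_words)
    output_tokens = estimate_tokens(output_words)
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000

# A 1,000-word prompt producing a 1,000-word essay, versus the same prompt
# producing a 150-word structured summary, at an illustrative $3/$15:
essay = estimate_cost(1000, 1000, 3.00, 15.00)
summary = estimate_cost(1000, 150, 3.00, 15.00)
```

Running this shows the summary costing well under a third of the essay for the same underlying analysis, which is exactly the point about output token discipline.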

The Main Models and What They Cost

All prices below are quoted in US dollars per million tokens as of April 2026. Think of a million tokens as roughly 750,000 words, or about 600 average blog posts. We’ve converted to pounds where it’s useful, but the industry prices in dollars, so we’ll stick with that convention for the comparison tables.

A note on pricing changes: AI model costs fall consistently. We verify these figures monthly and add a “last verified” date. If you’re reading this more than a few weeks after publication, check the official pricing pages linked below. The frameworks and strategies in this article hold regardless of the specific numbers.

Claude (Anthropic)

Anthropic’s Claude family is what we use most at Devstars, particularly for our OpenClaw agent platform. The naming is straightforward: Opus is the flagship, Sonnet is the workhorse, Haiku is the sprinter.

| Model | Input (per 1M tokens) | Output (per 1M tokens) | Context Window | Best For |
| --- | --- | --- | --- | --- |
| Claude Opus 4.6 | $5.00 | $25.00 | 1M tokens | Complex reasoning, strategy, nuanced analysis |
| Claude Sonnet 4.6 | $3.00 | $15.00 | 1M tokens | Balanced quality and cost, coding, content |
| Claude Haiku 4.5 | $1.00 | $5.00 | 200K tokens | Fast classification, routing, repetitive tasks |

The big story here is how far prices have fallen. Claude Opus 4 used to cost $15/$75 per million tokens. Opus 4.6 does more, costs 67% less, and now includes a full 1 million token context window at no extra charge. That’s a pretty significant shift.

Both Opus 4.6 and Sonnet 4.6 were released in February 2026 and include the 1M context window at standard pricing, meaning you’re not penalised for sending larger documents. If you’re still running legacy Claude models, migrating is worth doing sooner rather than later.

See Anthropic’s official pricing page for the latest figures.

OpenAI (GPT)

OpenAI’s lineup has expanded considerably. The current flagship is GPT-5.4, though GPT-5.2 remains excellent value for most production work.

| Model | Input (per 1M tokens) | Output (per 1M tokens) | Context Window | Best For |
| --- | --- | --- | --- | --- |
| GPT-5.4 | $2.50 | $15.00 | 270K tokens | Professional reasoning, complex analysis |
| GPT-5.2 | $1.75 | $14.00 | 200K tokens | Coding, agentic tasks, strong all-rounder |
| GPT-5 | $1.25 | $10.00 | 128K tokens | General reasoning, content production |
| GPT-5 Mini | $0.25 | $2.00 | 128K tokens | High-volume tasks at low cost |
| GPT-5 Nano | $0.05 | $0.40 | 128K tokens | Ultra-cheap classification and routing |
| GPT-4.1 | $2.00 | $8.00 | 1M tokens | Long-context document processing |
| GPT-4o Mini | $0.15 | $0.60 | 128K tokens | Budget workhorse, surprisingly capable |

GPT-5 Nano at $0.05 per million input tokens is extraordinary. For routing decisions, simple classification, and format checking, it costs almost nothing and handles the job reliably. That’s the kind of model you use to decide which expensive model to send the real work to.

GPT-4.1 is worth highlighting separately. It has a 1 million token context window at $2/$8, which makes it competitive with Gemini for long-document processing. If you’re analysing large PDFs, contracts, or codebases, this is a strong option.

See OpenAI’s official pricing page for the latest figures.

Gemini (Google)

Google’s Gemini family offers the cheapest entry point of any major provider, and their context windows are enormous. Gemini 2.0 Flash is being deprecated in June 2026, so we’ve focused on the current and next-generation models.

| Model | Input (per 1M tokens) | Output (per 1M tokens) | Context Window | Best For |
| --- | --- | --- | --- | --- |
| Gemini 3.1 Pro | $2.00 | $12.00 | 1M tokens | Flagship reasoning, complex analysis |
| Gemini 3 Flash | $0.50 | $3.00 | 1M tokens | Balanced speed and capability |
| Gemini 2.5 Pro | $1.25 | $10.00 | 1M tokens | Research, deep analysis, coding |
| Gemini 2.5 Flash | $0.30 | $2.50 | 1M tokens | Fast general-purpose work |
| Gemini 2.5 Flash-Lite | $0.10 | $0.40 | 1M tokens | Ultra-cheap volume processing |

Note: Pro models charge premium rates once a request’s input exceeds 200K tokens: the input rate doubles and the output rate rises too, so Gemini 3.1 Pro jumps to $4.00/$18.00 for long-context requests. Flash models maintain flat pricing regardless of context length, which makes them attractive for document-heavy workflows.
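That threshold behaviour is worth modelling before you commit to a workflow. A small sketch, using the standard and long-context rates quoted above for Gemini 3.1 Pro (the function and threshold are illustrative, not an official pricing API):

```python
LONG_CONTEXT_THRESHOLD = 200_000  # tokens; Pro rates jump beyond this

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one request under the tiered-rate scheme."""
    if input_tokens > LONG_CONTEXT_THRESHOLD:
        input_rate, output_rate = 4.00, 18.00   # long-context rates
    else:
        input_rate, output_rate = 2.00, 12.00   # standard rates
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000

# The same 5K-token answer costs very differently depending on prompt size:
short_prompt = request_cost(150_000, 5_000)  # stays on standard rates
long_prompt = request_cost(250_000, 5_000)   # whole request billed at premium rates
```

Here the long-context request costs roughly three times the short one, mostly because the entire input is billed at the higher rate once you cross the line.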

Gemini 2.5 Flash-Lite at $0.10/$0.40 per million tokens, with a 1 million token context window, is kind of crazy value. For research-heavy tasks where you need to process large amounts of text, Gemini Flash models are often the right call.

See Google’s official pricing page for the latest figures.

Other Providers Worth Knowing About

The market has expanded well beyond the big three. A few names worth watching:

Mistral offers open-weight models with competitive API pricing. Mistral Large sits around $2/$6 per million tokens, while Mistral Small is roughly $0.10/$0.30. Strong for European businesses with data residency requirements.

DeepSeek from China has made waves with models that benchmark well at rock-bottom prices. DeepSeek V3 runs around $0.27/$1.10 per million tokens. The trade-off is less transparent governance and potential data sovereignty concerns for regulated industries.

xAI’s Grok models are among the cheapest available at $0.20/$0.50 for Grok 4.1, though the ecosystem is less mature.

The thing is, for most businesses, Claude, GPT, and Gemini cover the full range of capability you’ll need. The others are worth knowing about for specific use cases or when you’re optimising costs at serious scale.

The Mental Model: Three Tiers of AI

Stop thinking about specific model names and start thinking in tiers. Model names change every few months. The tier logic holds.

Tier 1, The Strategists (Opus 4.6, GPT-5.4, Gemini 3.1 Pro)
These are your senior partners. Expensive, deliberate, exceptional at nuanced reasoning. Use them sparingly for the decisions that genuinely require deep thinking. Final strategy documents. Complex competitive analysis. High-stakes content where quality is non-negotiable.

Tier 2, The Workhorses (Sonnet 4.6, GPT-5.2, GPT-5, Gemini 2.5 Pro, Gemini 3 Flash)
This is where most of your AI work should live. Balanced capability and cost. Good enough for the vast majority of tasks, excellent for most. First drafts, analysis, client reports, research synthesis, content production at scale.

Tier 3, The Sprinters (Haiku 4.5, GPT-5 Nano, GPT-4o Mini, Gemini 2.5 Flash-Lite)
Fast and cheap. Deceptively capable for the right tasks. Routing decisions, classification, structured data extraction, summarisation, simple Q&A. When you need to process thousands of items, this is your tier.

Matching Work to Models: A Practical Guide

Use Tier 1 for:

  • Strategy documents requiring genuine reasoning and nuance
  • Situations where a hallucination or poor judgement would cause real damage
  • Code architecture decisions (not the actual code, the decisions)
  • Complex financial or legal document analysis
  • Final content review where brand voice and accuracy are critical

Use Tier 2 for:

  • Blog posts, email sequences, landing page copy
  • SEO content and keyword-focused articles
  • Client analysis reports and research synthesis
  • Coding, script writing, and debugging
  • Most agentic tasks where the agent needs to make judgement calls

Use Tier 3 for:

  • Deciding which tier to use for the next task (yes, really — a cheap model routing to an expensive one saves money)
  • Classifying customer queries before they reach a human
  • Extracting structured data from documents
  • Summarising long documents before sending them to a more expensive model
  • Checking outputs for format compliance
  • Any task where you know the output is simple and predictable
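To make the tier mapping concrete, here is a deliberately crude sketch of tier selection. In practice a cheap model can make this call far more robustly, but even a keyword heuristic illustrates the shape of the decision (the keyword lists are illustrative assumptions, not our production rules):

```python
TIER_KEYWORDS = {
    1: ["strategy", "architecture", "legal", "financial analysis"],
    3: ["classify", "extract", "summarise", "format check", "route"],
}

def choose_tier(task_description: str) -> int:
    """Return 1 (strategist), 2 (workhorse), or 3 (sprinter)."""
    text = task_description.lower()
    for tier, keywords in TIER_KEYWORDS.items():
        if any(kw in text for kw in keywords):
            return tier
    return 2  # default: most work belongs in the workhorse tier

choose_tier("Classify this support ticket")         # sprinter work
choose_tier("Draft a blog post on AI costs")        # workhorse work
choose_tier("Review our pricing strategy options")  # strategist work
```

Note the default: when in doubt, Tier 2. You escalate to Tier 1 deliberately, not by accident.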

How We Cut Our AI Model Costs by 75%

When we built our AI agent team at Devstars, we started naively. Every task went to the best model because it felt safer. Why take risks with a cheaper model?

The problem is that “safer” is relative. Yes, Opus produces slightly better output than Haiku for some tasks. But for a task like “does this piece of content contain a keyword?” or “summarise this in three bullet points,” Haiku does the job just as well at a fraction of the cost.

Here’s what we changed:

We introduced a routing layer. A cheap, fast model reads each incoming task and decides which tier it needs. The routing model itself costs almost nothing. GPT-5 Nano or Haiku 4.5 handles this beautifully. The savings from routing medium tasks away from flagship models are substantial.
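A minimal sketch of that routing layer, assuming a generic `call_model` helper standing in for whichever provider SDK you use. The model names and the router prompt are placeholders, not exact API identifiers:

```python
TIER_TO_MODEL = {
    "1": "claude-opus",    # strategist
    "2": "claude-sonnet",  # workhorse
    "3": "claude-haiku",   # sprinter
}

ROUTER_PROMPT = (
    "Reply with a single digit. Which tier does this task need?\n"
    "1 = deep reasoning, 2 = standard work, 3 = simple/repetitive.\n"
    "Task: {task}"
)

def call_model(model: str, prompt: str) -> str:
    """Placeholder for a real API call via your provider's SDK."""
    raise NotImplementedError

def route_and_run(task: str, call=call_model) -> str:
    # Step 1: a sprinter-tier model makes the routing decision cheaply.
    tier = call("gpt-5-nano", ROUTER_PROMPT.format(task=task)).strip()
    # Step 2: only the chosen model sees the full task.
    model = TIER_TO_MODEL.get(tier, TIER_TO_MODEL["2"])  # default to workhorse
    return call(model, task)
```

The router sees a one-line task description, not the full payload, so the routing decision itself stays almost free.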

We compressed outputs. Instead of asking for a detailed explanation with a flowing narrative, we ask for structured JSON with specific fields. Shorter outputs, same information, lower cost.

We used prompt caching. Anthropic, OpenAI, and Google all offer caching for prompts that repeat across requests. If your system instructions are 2,000 tokens and you send 1,000 requests, that’s 2 million input tokens. With caching, you can reduce that by up to 90%. Anthropic’s prompt caching, for example, drops cache reads to just 10% of the standard input price. Combined with their Batch API (50% off for non-urgent work), you can stack discounts and reduce costs by 75–95% on eligible workloads.
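The stacked savings are easy to verify on the back of an envelope. This sketch prices 1,000 requests sharing a 2,000-token system prompt, with cache reads billed at 10% of the input rate and a 50% batch discount on top; the $3 rate is illustrative, and this ignores the small one-off premium providers typically charge for the initial cache write:

```python
INPUT_RATE = 3.00  # dollars per million input tokens (illustrative)

requests = 1_000
prompt_tokens = 2_000
total_tokens = requests * prompt_tokens  # 2M input tokens of repeated prompt

full_price = total_tokens * INPUT_RATE / 1_000_000  # no optimisation: $6.00
cached = full_price * 0.10                          # cache reads at 10% of input rate
cached_and_batched = cached * 0.50                  # plus the 50% batch discount
```

That takes the repeated-prompt portion of the bill from $6.00 to $0.30, a 95% reduction, which is how the discounts stack to the 75–95% range quoted above.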

We reserved Tier 1 for Tier 1 work. Sounds obvious. Wasn’t in practice.

The result: costs dropped from around £32 to roughly £8 a day. Same agents, same tasks, same quality output. Just smarter model selection.

The Hidden Cost Levers Most People Miss

Beyond model selection, there are a few optimisation levers that make a real difference at scale:

Batch processing. All three major providers offer batch APIs with 50% discounts for work that doesn’t need real-time responses. If you’re generating reports overnight, processing data in bulk, or running content analysis, batch pricing halves your bill. Anthropic and OpenAI both offer this, and Google’s Gemini has an equivalent.

Context window management. The 1 million token context windows on Claude 4.6 and GPT-4.1 are genuinely useful, but sending everything to a million-token request when 50,000 tokens would do wastes money. With Pro models on Gemini, you’ll pay 2x once you cross 200K input tokens. Be deliberate about what you include.

Output token discipline. This is where the real savings hide. Output tokens cost 3–8x more than input. If you can get the same analysis as a structured JSON response instead of flowing prose, you’ll use fewer output tokens. We’ve seen 40–60% cost reductions on some workflows just by being specific about output format.

The Cost of Not Thinking About This

Here’s the reality check.

A startup running AI agents, processing content, analysing customer queries, and generating reports can easily spend £1,000–£3,000 a month on AI model costs if they’re not paying attention. Most of that spend comes down to two mistakes: using expensive models for low-value tasks, and generating more output tokens than necessary.

A few hours of audit, mapping each workflow to the appropriate model tier, compressing outputs, implementing caching, can cut that by half or more without any reduction in quality. That’s real money, especially when you’re scaling.

What This Means for You Right Now

If you’re just starting with AI tools, don’t default to the most powerful model. Start with a mid-tier model (Sonnet 4.6, GPT-5, Gemini 2.5 Pro), see what quality you’re getting, then decide whether you need to step up or can step down.

If you’re already running AI workflows, audit them. For each task, ask: does this genuinely require Tier 1, or am I paying premium prices out of habit?

If you’re building an AI-powered product or service, build model routing from the start. It’s far easier to do at the design stage than to retrofit later.

And if your AI bill is creeping up and you’re not sure why, get someone to look at it. The savings are usually sitting in plain sight.

A Cross-Provider Comparison at a Glance

For quick reference, here’s how the main tiers compare across providers (April 2026 pricing, per million tokens):

| Tier | Claude | OpenAI | Gemini | Best Value |
| --- | --- | --- | --- | --- |
| Flagship | Opus 4.6: $5/$25 | GPT-5.4: $2.50/$15 | Gemini 3.1 Pro: $2/$12 | Gemini on price. Claude on reasoning depth |
| Workhorse | Sonnet 4.6: $3/$15 | GPT-5.2: $1.75/$14 | Gemini 2.5 Pro: $1.25/$10 | GPT-5.2 or Gemini 2.5 Pro for most tasks |
| Budget | Haiku 4.5: $1/$5 | GPT-5 Nano: $0.05/$0.40 | Flash-Lite: $0.10/$0.40 | GPT-5 Nano is cheapest. Haiku most capable |

That said, cheapest isn’t always best value. We’ve tested workflows where Claude Sonnet produced better output on the first pass than a cheaper model that needed two or three iterations. The cost per successful output is what matters, not the cost per token.
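Cost per successful output is simple arithmetic, and worth doing explicitly. The per-attempt costs and retry rates below are illustrative, not measured benchmarks:

```python
def cost_per_success(cost_per_attempt: float, avg_attempts: float) -> float:
    """Expected cost of one acceptable deliverable, including retries."""
    return cost_per_attempt * avg_attempts

# A cheap model that usually needs rework versus a pricier one that
# usually gets it right first time:
cheap = cost_per_success(0.002, 2.5)    # $0.002 a call, ~2.5 tries on average
capable = cost_per_success(0.004, 1.1)  # $0.004 a call, ~1.1 tries on average
```

With these numbers the model that costs twice as much per call is still cheaper per deliverable, before you even count the review time the retries consume.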

A Note on Pricing Trends

AI model costs have been falling consistently and dramatically. Claude Opus 4 cost $15/$75 per million tokens at launch. The current Opus 4.6 is 67% cheaper and significantly more capable. GPT-4 launched at roughly $25/$60 per million tokens. GPT-5.2 now does more for $1.75/$14.

This trend will continue.

The implication: lock in flexible architecture. Don’t hardcode a specific model into your systems. Use an abstraction layer that lets you swap models when better options emerge. Which, in this market, is roughly every three to six months.
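One way to keep that flexibility: have application code reference models by tier name only, and keep the tier-to-model mapping in a single registry (or a config file). Swapping in a newer model is then a one-line change. The model identifiers here are examples, not exact API strings:

```python
# Single place where tier names map to concrete models.
MODEL_REGISTRY = {
    "strategist": "claude-opus-4.6",
    "workhorse": "gpt-5.2",
    "sprinter": "gpt-5-nano",
}

def model_for(tier: str) -> str:
    """Resolve a tier name to the currently configured model."""
    try:
        return MODEL_REGISTRY[tier]
    except KeyError:
        raise ValueError(f"Unknown tier: {tier!r}")

# Application code never mentions a model by name:
model_for("workhorse")
```

When a better workhorse ships in six months, you edit the registry, rerun your evaluation suite, and nothing else in the codebase changes.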

Frequently Asked Questions

How much does it cost to run an AI chatbot?

It depends entirely on volume and model choice. A customer support chatbot handling 10,000 conversations per month might cost £8–70 depending on whether you use GPT-5 Nano or a Tier 2 model. Start with the cheapest model that produces acceptable quality, then only upgrade where you need to.
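You can put rough numbers on this yourself. A hedged estimator, where the per-conversation token counts are assumptions you should replace with figures from your own logs:

```python
def monthly_chatbot_cost(conversations: int,
                         input_tokens_each: int,
                         output_tokens_each: int,
                         input_price: float,
                         output_price: float) -> float:
    """Estimated dollar cost per month; prices are per million tokens."""
    total_in = conversations * input_tokens_each
    total_out = conversations * output_tokens_each
    return (total_in * input_price + total_out * output_price) / 1_000_000

# 10,000 conversations, assuming ~2,000 input tokens each (history included)
# and ~500 output tokens each, at GPT-5 Nano rates versus a Tier 2 rate:
nano = monthly_chatbot_cost(10_000, 2_000, 500, 0.05, 0.40)
tier2 = monthly_chatbot_cost(10_000, 2_000, 500, 3.00, 15.00)
```

Under these assumptions the same workload differs by well over an order of magnitude between tiers, which is why the answer to "what does a chatbot cost" is always "which model, and how chatty".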

Which AI model is cheapest?

For raw token price, GPT-5 Nano ($0.05/$0.40 per million tokens) and Gemini 2.5 Flash-Lite ($0.10/$0.40) are the cheapest from major providers. But cheapest per token isn’t the same as cheapest per task. A model that needs three attempts costs more than one that gets it right first time.

How do I reduce AI API costs?

Four things make the biggest difference: implement model routing (use cheap models for simple tasks), compress outputs (JSON instead of prose), use prompt caching (up to 90% savings on repeated instructions), and batch non-urgent work (50% discount). We’ve seen combined savings of 75% using all four.

Is Claude more expensive than GPT?

At the flagship tier, yes. Claude Opus 4.6 at $5/$25 costs more than GPT-5.4 at $2.50/$15. At mid-tier, they’re comparable, with Sonnet 4.6 at $3/$15 versus GPT-5.2 at $1.75/$14. But Claude tends to produce higher-quality output on complex reasoning and content tasks, which can mean fewer iterations and lower total cost per deliverable.

What is prompt caching and how much does it save?

Prompt caching stores your system instructions and repeated context so you don’t pay full price every time. Anthropic charges just 10% of the standard input rate for cached reads. If you’re running agents with consistent system prompts across hundreds of requests, this alone can cut input costs by 90%.



Stuart Watkins is the founder of Devstars and London Web Design Agency, with 25+ years in digital. He runs AI agents in production daily through the OpenClaw platform and has strong opinions about not wasting money on flagship models for tasks a sprinter can handle. If your AI costs are higher than they should be, that’s worth a conversation.


Fancy a proper chat?

Tell me what you’re trying to fix. Half an hour, no pitch, no slide deck.

If we’re the right fit we’ll talk about what’s next. If we’re not, I’ll point you to someone who is.
