Cheapest AI API by Use Case: Chatbots, Code Gen, RAG & More
42 models compared across 7 real-world workloads. Stop guessing — find the cheapest AI API for your exact use case with per-request cost breakdowns.
"What's the cheapest AI API?" is the wrong question. The right question is: "What's the cheapest AI API for what I'm building?"
A chatbot that sends short messages and gets short replies has completely different cost drivers than a RAG system that processes 10,000 tokens of context per query. The cheapest model for one workload can be 10× more expensive for another.
This guide breaks down the cheapest AI API for 7 common use cases — with real per-request costs calculated from current pricing data across all 42 models.
💡 Key insight: Output tokens cost 2-8× as much as input tokens for nearly every model priced in this guide (Llama 3.1 8B, at a flat $0.10/M, is the exception). The cheapest API for your use case depends on your input/output ratio, not just the per-token price.
🤖 Chatbots & Conversational AI
🏆 Winner: DeepSeek V4 Flash
Typical workload: 500 input tokens (system prompt + history) → 200 output tokens (reply) per turn. 10 turns per conversation.
| Model | Input | Output | Cost/Conversation | Savings vs GPT-5 |
|---|---|---|---|---|
| DeepSeek V4 Flash | $0.14/M | $0.28/M | $0.0013 | ↓ 95% |
| Llama 3.1 8B | $0.10/M | $0.10/M | $0.0007 | ↓ 97% |
| Gemini 2.5 Flash-Lite | $0.10/M | $0.40/M | $0.0013 | ↓ 95% |
| GPT-5 mini | $0.25/M | $2.00/M | $0.0053 | ↓ 80% |
| Claude Haiku 4.5 | $1.00/M | $5.00/M | $0.0150 | ↓ 43% |
| GPT-5 | $1.25/M | $10.00/M | $0.0263 | baseline |
Verdict: For chatbots, DeepSeek V4 Flash delivers solid conversation quality at about $0.0013 per 10-turn conversation (5,000 input and 2,000 output tokens in total), 95% cheaper than GPT-5. Llama 3.1 8B is cheaper still, but with noticeably lower response quality on complex conversations. If you need GPT-5-level quality, DeepSeek V4 Pro ($0.435/M input, $0.87/M output) comes to about $0.0039 per conversation, still 85% cheaper.
When to spend more: Customer-facing chatbots handling sensitive topics (healthcare, finance) benefit from Claude Haiku 4.5 or GPT-5 mini's better instruction-following and safety guardrails.
💻 Code Generation & Completion
🏆 Winner: DeepSeek V4 Pro
Typical workload: 2,000 input tokens (prompt + context) → 4,000 output tokens (generated code). Code generation is output-heavy: output tokens dominate cost.
| Model | Input | Output | Cost/Request | Savings vs Codex |
|---|---|---|---|---|
| DeepSeek V4 Pro | $0.435/M | $0.87/M | $0.0044 | ↓ 93% |
| DeepSeek V4 Flash | $0.14/M | $0.28/M | $0.0014 | ↓ 98% |
| Mistral Large 3 | $0.50/M | $1.50/M | $0.0070 | ↓ 88% |
| Gemini 3 Flash | $0.50/M | $3.00/M | $0.013 | ↓ 78% |
| Claude Sonnet 4.6 | $3.00/M | $15.00/M | $0.066 | ↑ 11% |
| GPT-5.3 Codex | $1.75/M | $14.00/M | $0.0595 | baseline |
Verdict: DeepSeek V4 Pro is the clear winner for code generation: 93% cheaper than GPT-5.3 Codex with comparable code quality for most languages. Its output token pricing ($0.87/M) is absurdly cheap for code workloads where you're generating thousands of tokens per request.
When to spend more: Complex multi-file refactoring or code requiring deep reasoning about large codebases benefits from Claude Sonnet 4.6's superior context handling. For simple completions and boilerplate, DeepSeek V4 Flash at $0.0014/request is unbeatable.
📚 RAG (Retrieval-Augmented Generation)
🏆 Winner: DeepSeek V4 Pro (quality) / Gemini 2.5 Flash-Lite (volume)
Typical workload: 10,000 input tokens (retrieved context) → 1,000 output tokens (answer). RAG is input-heavy: large context windows, shorter outputs.
| Model | Input | Output | Cost/Query | Savings vs GPT-5 |
|---|---|---|---|---|
| Gemini 2.5 Flash-Lite | $0.10/M | $0.40/M | $0.0014 | ↓ 94% |
| DeepSeek V4 Flash | $0.14/M | $0.28/M | $0.0017 | ↓ 93% |
| DeepSeek V4 Pro | $0.435/M | $0.87/M | $0.0052 | ↓ 77% |
| Gemini 3 Flash | $0.50/M | $3.00/M | $0.008 | ↓ 64% |
| Claude Haiku 4.5 | $1.00/M | $5.00/M | $0.015 | ↓ 33% |
| GPT-5 | $1.25/M | $10.00/M | $0.0225 | baseline |
Verdict: RAG's input-heavy nature makes cheap input tokens critical. Gemini 2.5 Flash-Lite ($0.10/M input) is 94% cheaper than GPT-5 for RAG queries. If you need higher answer quality, DeepSeek V4 Pro at $0.0052/query is still 77% cheaper than GPT-5 with better reasoning on complex retrieved context.
Pro tip: For high-volume RAG (10K+ queries/day), consider a tiered approach: route simple factual queries to Flash-Lite and complex analytical queries to DeepSeek V4 Pro. If most traffic goes to Flash-Lite, this hybrid approach can cut costs by roughly 90% versus GPT-5 while maintaining quality where it matters.
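As a rough sketch of the tiered approach, the blended cost per query is a weighted average of the two per-query costs from the table above. The 80/20 traffic split below is an assumption chosen for illustration, not a measured figure, and `blended_cost_per_query` is a hypothetical helper name.

```python
def blended_cost_per_query(cheap_cost: float, premium_cost: float, cheap_share: float) -> float:
    """Average cost per query when `cheap_share` of traffic goes to the cheap model."""
    if not 0.0 <= cheap_share <= 1.0:
        raise ValueError("cheap_share must be between 0 and 1")
    return cheap_share * cheap_cost + (1.0 - cheap_share) * premium_cost


# Per-query RAG costs in dollars, taken from the table above.
FLASH_LITE = 0.0014
DEEPSEEK_V4_PRO = 0.0052
GPT_5 = 0.0225

# Assumption: 80% of queries are simple enough for Flash-Lite.
blended = blended_cost_per_query(FLASH_LITE, DEEPSEEK_V4_PRO, cheap_share=0.8)
savings = 1 - blended / GPT_5
print(f"${blended:.4f}/query, {savings:.0%} below GPT-5")
# prints: $0.0022/query, 90% below GPT-5
```

The savings shrink as more traffic needs the premium tier, so it is worth measuring your real split before committing to a projection.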
📝 Text Summarization
🏆 Winner: Gemini 2.5 Flash-Lite
Typical workload: 5,000 input tokens (document) → 300 output tokens (summary). Summarization is the most input-heavy common workload.
| Model | Input | Output | Cost/Document | Savings vs GPT-5 |
|---|---|---|---|---|
| Gemini 2.5 Flash-Lite | $0.10/M | $0.40/M | $0.0006 | ↓ 93% |
| DeepSeek V4 Flash | $0.14/M | $0.28/M | $0.0008 | ↓ 92% |
| Llama 3.1 8B | $0.10/M | $0.10/M | $0.0005 | ↓ 94% |
| Mistral Small 4 | $0.10/M | $0.30/M | $0.0006 | ↓ 94% |
| Claude Haiku 4.5 | $1.00/M | $5.00/M | $0.0065 | ↓ 30% |
| GPT-5 | $1.25/M | $10.00/M | $0.0093 | baseline |
Verdict: Summarization is input-dominated, making cheap input tokens everything. Gemini 2.5 Flash-Lite at $0.10/M input is 93% cheaper than GPT-5. For summarizing 1,000 documents/day, you're looking at about $0.60/day vs $9.25/day, saving roughly $260/month on a single workload.
Quality note: For simple extractive summaries (pull key points), Flash-Lite is excellent. For abstractive summaries requiring deep understanding (rephrase, synthesize, analyze), Claude Haiku 4.5 produces noticeably better results at roughly 10× the cost of Flash-Lite, while still coming in about 30% below GPT-5.
🔢 Embeddings & Vector Search
🏆 Winner: OpenAI text-embedding-3-small
Typical workload: 500 input tokens per document, no output tokens. Pure embedding generation for vector databases, semantic search, and classification.
| Model | Price | Cost/1M Docs | Notes |
|---|---|---|---|
| OpenAI text-embedding-3-small | $0.02/M tokens | $10 | 1536 dimensions, great quality |
| Cohere embed-english-v3.0 | $0.10/M tokens | $50 | 1024 dimensions, excellent for search |
| Mistral Small 4 (as embedder) | $0.10/M input | $50 | Can do embed + generate in one call |
| Voyage AI voyage-3 | $0.06/M tokens | $30 | 1024 dimensions, strong retrieval |
Verdict: OpenAI's embedding model is the cheapest dedicated option at $0.02/M tokens. For a database of 1M documents (500 tokens each), embedding costs just $10. The real cost of embeddings is usually the vector database hosting, not the embedding API.
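The corpus cost above is just documents × tokens per document × price per token. A minimal sketch, using the prices from the table and a hypothetical helper named `embedding_cost`:

```python
def embedding_cost(num_docs: int, tokens_per_doc: int, price_per_m_tokens: float) -> float:
    """One-off cost in dollars to embed a corpus; price is dollars per million tokens."""
    return num_docs * tokens_per_doc * price_per_m_tokens / 1_000_000


# Prices per million tokens, as listed in the table above.
prices = {
    "text-embedding-3-small": 0.02,
    "voyage-3": 0.06,
    "embed-english-v3.0": 0.10,
}

for name, price in prices.items():
    cost = embedding_cost(num_docs=1_000_000, tokens_per_doc=500, price_per_m_tokens=price)
    print(f"{name}: ${cost:,.2f}")
```

Re-embedding after a model change costs the same amount again, so this is worth rerunning with your own corpus size before switching embedders.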
Pro tip: If you're already using a chat/completion model for RAG, some providers (Mistral, Cohere) let you use the same model for both embedding and generation — simplifying your stack and potentially reducing API calls.
✍️ Content Generation (Marketing, Copywriting)
🏆 Winner: DeepSeek V4 Pro
Typical workload: 500 input tokens (brief) → 2,000 output tokens (article/copy). Content generation is output-heavy with high creative requirements.
| Model | Input | Output | Cost/Piece | Quality Rating |
|---|---|---|---|---|
| DeepSeek V4 Pro | $0.435/M | $0.87/M | $0.002 | ⭐⭐⭐⭐ |
| Gemini 3 Flash | $0.50/M | $3.00/M | $0.0063 | ⭐⭐⭐⭐ |
| Claude Sonnet 4.6 | $3.00/M | $15.00/M | $0.0315 | ⭐⭐⭐⭐⭐ |
| GPT-5 | $1.25/M | $10.00/M | $0.0206 | ⭐⭐⭐⭐½ |
| Claude Opus 4.8 | $5.00/M | $25.00/M | $0.0525 | ⭐⭐⭐⭐⭐ |
Verdict: DeepSeek V4 Pro at $0.002 per piece of content is about 90% cheaper than GPT-5 and produces surprisingly good marketing copy. For brand-sensitive content where tone and voice matter most, Claude Sonnet 4.6 is worth the roughly 16× premium: its writing quality is noticeably more natural and engaging.
Volume math: If you're generating 100 pieces of content/month, DeepSeek V4 Pro costs $0.20, GPT-5 costs $2.06, and Claude Sonnet 4.6 costs $3.15. The quality difference between DeepSeek and GPT-5 is much smaller than the price difference.
🔍 Data Extraction & Structured Output
🏆 Winner: Gemini 3 Flash
Typical workload: 3,000 input tokens (document) → 500 output tokens (extracted JSON/data). Balanced input/output, but requires reliable structured output formatting.
| Model | Input | Output | Cost/Extraction | JSON Reliability |
|---|---|---|---|---|
| Gemini 3 Flash | $0.50/M | $3.00/M | $0.003 | ⭐⭐⭐⭐⭐ |
| DeepSeek V4 Pro | $0.435/M | $0.87/M | $0.0017 | ⭐⭐⭐⭐ |
| GPT-5 mini | $0.25/M | $2.00/M | $0.0018 | ⭐⭐⭐⭐⭐ |
| Claude Haiku 4.5 | $1.00/M | $5.00/M | $0.0055 | ⭐⭐⭐⭐⭐ |
| GPT-5 | $1.25/M | $10.00/M | $0.0088 | ⭐⭐⭐⭐⭐ |
Verdict: DeepSeek V4 Pro is cheapest per-extraction ($0.0017) but Gemini 3 Flash ($0.003) has better structured output reliability — critical when you're parsing extracted JSON in production. GPT-5 mini ($0.0018) offers excellent JSON reliability at near-DeepSeek prices.
Reliability tip: For data extraction, JSON validity matters more than raw cost. A model that produces invalid JSON 5% of the time costs you in retry logic, error handling, and downstream failures. Pay the small premium for models with proven structured output (Gemini 3 Flash, GPT-5 mini, Claude Haiku 4.5).
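To make the retry logic concrete, here is a minimal sketch of validating extracted JSON and retrying on failure. `call_model` is a hypothetical stand-in for whatever API client you use (it is not any provider's real SDK), and the three-attempt limit is an arbitrary assumption.

```python
import json
from typing import Callable, Optional


def extract_json(call_model: Callable[[str], str], prompt: str, max_attempts: int = 3) -> Optional[dict]:
    """Call the model until it returns a parseable JSON object, up to max_attempts."""
    for _ in range(max_attempts):
        raw = call_model(prompt)
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            continue  # invalid JSON: this attempt is still billed, then we try again
        if isinstance(parsed, dict):
            return parsed
    return None  # caller decides whether to fall back to a more reliable model


# Stand-in for a real API client: returns truncated JSON once, then valid JSON.
responses = iter(['{"invoice": 42', '{"invoice": 42}'])
result = extract_json(lambda prompt: next(responses), "Extract the invoice number.")
print(result)  # {'invoice': 42}
```

Every pass through the loop is a paid request, which is why a slightly pricier model with fewer invalid responses can come out cheaper overall.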
The Input/Output Ratio Rule
The biggest mistake developers make when choosing an AI API: comparing models by per-token price without considering their workload's input/output ratio.
Here's why it matters:
- Input-heavy workloads (RAG, summarization, document analysis): Prioritize cheap input tokens. Gemini Flash-Lite ($0.10/M) and DeepSeek V4 Flash ($0.14/M) win.
- Output-heavy workloads (code generation, content creation, chatbot replies): Prioritize cheap output tokens. DeepSeek V4 Pro ($0.87/M output) dominates.
- Balanced workloads (data extraction, classification, simple Q&A): Look at the blended cost. Gemini 3 Flash and Mistral Small 4 are strong here.
🧮 Quick formula: Monthly cost = (monthly input tokens × input price) + (monthly output tokens × output price). A model with $0.10 input / $10.00 output is NOT cheaper than $1.00 input / $1.00 output if your workload is 50/50. Do the math.
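The formula translates directly into a few lines of code. This sketch reproduces the 50/50 comparison above; the 100M-token monthly volume is an assumption for illustration, and `monthly_cost` is a hypothetical helper name.

```python
def monthly_cost(input_tokens: int, output_tokens: int,
                 input_price_per_m: float, output_price_per_m: float) -> float:
    """Monthly cost in dollars; prices are dollars per million tokens."""
    return (input_tokens * input_price_per_m + output_tokens * output_price_per_m) / 1_000_000


# Assumed 50/50 workload: 100M input and 100M output tokens per month.
cheap_input = monthly_cost(100_000_000, 100_000_000, 0.10, 10.00)
flat_price = monthly_cost(100_000_000, 100_000_000, 1.00, 1.00)
print(f"${cheap_input:,.2f} vs ${flat_price:,.2f}")
# prints: $1,010.00 vs $200.00
```

The model with the eye-catching $0.10 input price ends up about five times more expensive here, because the output side of the bill dominates.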
Cost Comparison by Monthly Volume
Here's what your monthly bill looks like across 3 common workload profiles, at different volumes:
🤖 Chatbot (500 in / 200 out per turn, 10 turns per conversation)
| Volume | DeepSeek V4 Flash | GPT-5 mini | GPT-5 | Savings (Flash vs GPT-5) |
|---|---|---|---|---|
| 1K conversations/mo | $1.26 | $5.25 | $26.25 | $24.99/mo |
| 10K conversations/mo | $12.60 | $52.50 | $262.50 | $249.90/mo |
| 100K conversations/mo | $126.00 | $525.00 | $2,625.00 | $2,499/mo |
💻 Code Gen (2K in / 4K out per request)
| Volume | DeepSeek V4 Pro | Claude Sonnet 4.6 | GPT-5.3 Codex | Savings (Pro vs Codex) |
|---|---|---|---|---|
| 1K requests/mo | $4.35 | $66.00 | $59.50 | $55.15/mo |
| 10K requests/mo | $43.50 | $660.00 | $595.00 | $551.50/mo |
| 50K requests/mo | $217.50 | $3,300.00 | $2,975.00 | $2,757.50/mo |
📚 RAG (10K in / 1K out per query)
| Volume | Flash-Lite | DeepSeek V4 Pro | GPT-5 | Savings |
|---|---|---|---|---|
| 1K queries/mo | $1.40 | $5.20 | $22.50 | $21.10/mo |
| 10K queries/mo | $14.00 | $52.00 | $225.00 | $211/mo |
| 100K queries/mo | $140.00 | $520.00 | $2,250.00 | $2,110/mo |
The Hidden Costs Nobody Talks About
Per-token pricing is only part of the equation. These hidden costs can dwarf your API bill:
1. Latency vs Throughput Tradeoff
Cheaper models often have higher latency. If your chatbot takes 5 seconds to respond instead of 1 second, you lose users. Factor in the cost of lost conversions when choosing "the cheapest" option.
2. Retry Costs
Models with lower structured output reliability (some open-weight models) require JSON retry logic. A 5% retry rate on 100K requests/month = 5,000 extra API calls. That's a hidden 5% cost increase.
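A quick way to estimate the overhead: if each failed call is retried, the expected number of calls per request is a short geometric sum. The sketch below assumes every attempt fails independently at the same rate, which real traffic may not satisfy; `expected_calls` is a hypothetical helper.

```python
def expected_calls(requests: int, failure_rate: float, max_retries: int = 1) -> float:
    """Expected number of API calls when each failed call is retried.

    Assumes every attempt fails independently with probability `failure_rate`.
    """
    calls_per_request = sum(failure_rate ** k for k in range(max_retries + 1))
    return requests * calls_per_request


single = expected_calls(100_000, 0.05)                 # one retry per failure
triple = expected_calls(100_000, 0.05, max_retries=3)  # retries can fail too
print(f"{single:,.0f} calls with one retry; overhead with three retries: "
      f"{triple / 100_000 - 1:.2%}")
```

With a single retry this matches the 5,000 extra calls above; allowing further retries pushes the overhead slightly past 5%, since retried calls can themselves fail.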
3. Context Window Waste
You pay per token processed, not for the size of the context window, so an unused 1M-token window costs nothing by itself. The waste comes from paying a premium for a long-context model when your prompts only need 10K tokens and a cheaper model with a smaller window would suffice.
4. Prompt Engineering Overhead
Cheaper models often need more detailed prompts to match quality. If your engineers spend 10 extra hours/month tweaking prompts to save $50 on API costs, you're losing money.
How to Switch Models (Without Breaking Everything)
Found a cheaper model? Here's how to switch safely:
- A/B test first — Route 10% of traffic to the new model, compare quality metrics (user ratings, task completion, error rates).
- Use the same prompt — Most modern models handle similar prompt formats. Test with your existing prompts before rewriting.
- Monitor output distribution — If the new model produces longer/shorter outputs, your downstream systems might break.
- Keep a fallback — Route failed requests to your original model. The cost of a failed request far exceeds the savings from a cheaper model.
- Track per-model costs separately — Use APIpulse's calculator to model costs before switching, then verify against real usage.
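The first and fourth steps above (a 10% traffic split and a fallback) can be sketched in one small router. The model callables here are hypothetical stand-ins for real API clients, and the injectable `rng` parameter exists only to make the split testable.

```python
import random
from typing import Callable, Tuple


def route_request(
    prompt: str,
    candidate: Callable[[str], str],
    original: Callable[[str], str],
    candidate_share: float = 0.10,
    rng: Callable[[], float] = random.random,
) -> Tuple[str, str]:
    """Return (model_label, response) for one request.

    Roughly `candidate_share` of requests try the cheaper candidate model;
    any exception from it falls back to the original model.
    """
    if rng() < candidate_share:
        try:
            return "candidate", candidate(prompt)
        except Exception:
            pass  # in production, log the failure before falling back
    return "original", original(prompt)


# Hypothetical stand-ins for real API clients.
def cheap_model(prompt: str) -> str:
    return f"cheap: {prompt}"


def current_model(prompt: str) -> str:
    return f"current: {prompt}"


label, reply = route_request("Summarize this ticket.", cheap_model, current_model)
print(label, reply)
```

Returning the label alongside the response makes it straightforward to log quality metrics and costs per model, which covers the last step in the list.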