← Back to blog

Cheapest RAG Setup in 2026: Full Cost Breakdown

Retrieval-Augmented Generation (RAG) is the most common pattern for building AI applications that know your data. But costs can spiral fast if you pick the wrong models. Here's exactly how to build a production RAG pipeline for under $10/month — with real numbers across every cost component.

The Three Cost Components of RAG

Every RAG pipeline has three cost centers:

  1. Embedding — converting your documents into vectors (one-time per document, plus per query)
  2. Vector storage & search — storing vectors and finding relevant chunks at query time
  3. Generation — sending the retrieved context + query to an LLM for the final answer

Most guides focus only on generation costs. In reality, embedding and vector search often account for 40-60% of your total RAG bill — especially at scale.

Component 1: Embedding Costs

Embedding models convert text into vector representations. Prices are typically per 1M tokens:

Embedding model pricing (per 1M tokens)
Google text-embedding-004$0.10
OpenAI text-embedding-3-small$0.02
OpenAI text-embedding-3-large$0.13
Cohere embed-v4$0.10
Llama 3.1 8B (via Together.ai)$0.10

Best value: OpenAI's text-embedding-3-small at $0.02/1M tokens is the cheapest option with excellent quality. For most RAG use cases, the small model is indistinguishable from the large model.

Real embedding costs

Let's say you have 10,000 documents averaging 2,000 tokens each (20M tokens total):

Embedding 10K documents (one-time)
OpenAI text-embedding-3-small$0.40
Google text-embedding-004$2.00
Cohere embed-v4$2.00

At $0.40 for 10K documents, embedding is practically free. The ongoing cost is embedding each query (~500 tokens), which costs fractions of a cent.

Component 2: Vector Storage & Search

This is where free tiers really shine. You have three options:

Option A: Fully Local (Free)

$0/mo

Use ChromaDB, FAISS, or SQLite-VSS on your own machine. Great for development and small datasets (under 100K documents). No monthly cost, but you manage infrastructure.

  • ChromaDB — easiest setup, good for prototyping
  • FAISS — fastest search, best for large-scale
  • SQLite-VSS — lightweight, embeds in any app

Option B: Free Cloud Tiers

$0/mo

Several vector databases offer generous free tiers:

  • Pinecone — 2GB free (roughly 1M vectors)
  • Weaviate Cloud — 14-day free trial, then free tier
  • Qdrant Cloud — 1GB free
  • MongoDB Atlas — 512MB free (includes vector search)

Option C: Paid Vector DB

$25-70/mo

For production workloads beyond free tier limits:

  • Pinecone Starter — $25/mo for 10GB
  • Weaviate Cloud — $25/mo for 1M vectors
  • Qdrant Cloud — $25/mo for 1M vectors

For the cheapest setup, use ChromaDB locally during development and Pinecone's free tier in production. That keeps vector costs at $0.

Component 3: Generation Costs

This is the ongoing cost that grows with usage. Here's where model selection matters most. Let's compare costs for a typical RAG query: ~2,000 input tokens (retrieved context + query) and ~500 output tokens (answer).

Cost per RAG query (2K input + 500 output tokens)
Gemini 2.0 Flash Lite$0.00030
DeepSeek V4 Flash$0.00042
Gemini 2.0 Flash$0.00045
GPT-4o mini$0.00060
DeepSeek V4 Pro$0.00132
GPT-5 mini$0.00150
Mistral Small 4$0.00060
Claude Haiku 4.5$0.00450
Claude Sonnet 4.6$0.01350
GPT-5$0.00750

The difference is dramatic. Gemini 2.0 Flash Lite costs 45x less per query than Claude Sonnet 4.6. For RAG workloads where you're processing thousands of queries per day, this adds up fast.

Three Budget Tiers

Tier 1: Bootstrap ($0-5/mo)

Under $5/month

Best for: side projects, MVPs, internal tools

  • Embedding: OpenAI text-embedding-3-small ($0.40 for 10K docs)
  • Vector DB: ChromaDB (local, free) or Pinecone free tier
  • Generation: Gemini 2.0 Flash ($0.10/$0.40 per 1M tokens)
  • Query volume: ~1,000 queries/day for $1.35/mo

Total monthly cost at 1K queries/day: ~$1.50

Tier 2: Growth ($5-25/mo)

$5-25/month

Best for: production apps, startups with users

  • Embedding: OpenAI text-embedding-3-small
  • Vector DB: Pinecone free tier or Qdrant free tier
  • Generation: DeepSeek V4 Pro ($0.44/$0.87 per 1M tokens)
  • Query volume: ~5,000 queries/day for $19.80/mo

Total monthly cost at 5K queries/day: ~$20

Tier 3: Scale ($25-100/mo)

$25-100/month

Best for: SaaS products, high-traffic applications

  • Embedding: OpenAI text-embedding-3-small
  • Vector DB: Pinecone Starter ($25/mo for 10GB)
  • Generation: GPT-5 mini ($0.25/$2.00 per 1M tokens) or Claude Haiku 4.5 ($1/$5)
  • Query volume: ~10,000-20,000 queries/day

Total monthly cost at 10K queries/day: ~$45-75

The Complete Cheapest RAG Stack

If your only goal is minimizing cost, here's the absolute cheapest production-ready RAG setup:

Monthly cost at 1,000 queries/day
Embedding (OpenAI small)$0.30/mo
Vector DB (ChromaDB local)$0.00/mo
Generation (Gemini 2.0 Flash)$1.35/mo
Total$1.65/mo

That's $1.65/month for a production RAG pipeline processing 1,000 queries per day. A year ago, the same setup would have cost $15-30/month.

Quality vs. Cost Tradeoffs

The cheapest models aren't always the best for RAG. Here's the quality spectrum:

Recommended: Hybrid routing

Use a cheap model (DeepSeek V4 Flash) for simple queries and route complex queries to a better model (GPT-5 mini or Claude Haiku). This cuts costs by 60-70% while maintaining quality for the queries that matter most.

Optimization Tips

  1. Chunk smartly — Smaller chunks (256-512 tokens) mean less context sent to the LLM, reducing generation costs. Use overlapping chunks to maintain context.
  2. Cache common queries — If 20% of your queries are repetitive, caching eliminates those generation costs entirely.
  3. Use metadata filtering — Filter by date, category, or source before vector search to reduce the number of vectors compared.
  4. Batch embedding — Embed documents in batches of 100+ for better throughput and lower per-token costs.
  5. Set max tokens — Cap generation at the length you actually need. Shorter answers = lower costs.

Calculate your exact RAG costs — Enter your document count, query volume, and token usage to see what you'd pay across every provider.

Calculate Your RAG Costs →

Related Reading

Want to optimize your AI API costs?

APIpulse Pro ($29 one-time) includes saved scenarios, cost report exports, and personalized recommendations that can save you up to 40%.

Get Pro — $29