
LLM API Latency Compared: Speed Benchmarks 2026

Speed matters. A 200ms difference in API response time can mean the difference between a fluid chat experience and a frustrating one. Here's how every major LLM provider compares on real-world latency — and how speed intersects with cost.

Understanding LLM Latency

API latency has three components:

  1. Network overhead: the round trip between your server and the provider's endpoint
  2. Time to first token (TTFT): how long the model takes to begin responding
  3. Output speed: how quickly tokens stream once generation has started

For chat applications, TTFT is the most important metric — users judge responsiveness by how quickly the first word appears.
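Put together, those components give a back-of-the-envelope estimate of total response time: TTFT plus output length divided by generation speed (network overhead is folded into TTFT when you measure from your own server). A minimal sketch, using figures from the benchmarks later in this post:

```python
def estimate_latency_s(ttft_s: float, tokens_per_s: float, output_tokens: int) -> float:
    """Rough end-to-end latency: time to first token plus generation time."""
    return ttft_s + output_tokens / tokens_per_s

# e.g. GPT-4o: ~350ms TTFT, ~80 tok/s, 500-token answer
print(round(estimate_latency_s(0.35, 80, 500), 2))  # → 6.6 seconds
```

Note how the generation term dominates for long answers: at 500 output tokens, TTFT is barely 5% of the total wait.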

Time to First Token (TTFT) Benchmarks

Measured on a standard 100-token input prompt, US East region, streaming enabled:

TTFT by model (lower is better):

| Model | TTFT |
| --- | --- |
| Gemini 2.0 Flash | ~180ms |
| GPT-4o mini | ~220ms |
| GPT-4o | ~350ms |
| Claude Haiku 4.5 | ~250ms |
| Claude Sonnet 4 | ~450ms |
| Claude 4 Opus | ~800ms |
| Gemini 2.5 Pro | ~500ms |
| Mistral Large 3 | ~400ms |
| Mistral Small 4 | ~200ms |
| GPT-5 | ~600ms |
| DeepSeek V4 Flash | ~220ms |
| Llama 3.1 8B (Together.ai) | ~150ms |

Fastest TTFT: Llama 3.1 8B on Together.ai (~150ms) and Gemini 2.0 Flash (~180ms). Budget models consistently win on speed because they have fewer parameters to process.

Output Speed (Tokens per Second)

How fast each model generates tokens after the first one:

Output speed in tok/s (higher is better):

| Model | Output speed |
| --- | --- |
| Gemini 2.0 Flash | ~120 tok/s |
| GPT-4o mini | ~100 tok/s |
| Llama 3.1 8B (Together.ai) | ~150 tok/s |
| Mistral Small 4 | ~110 tok/s |
| GPT-4o | ~80 tok/s |
| Claude Haiku 4.5 | ~90 tok/s |
| Claude Sonnet 4 | ~65 tok/s |
| Gemini 2.5 Pro | ~70 tok/s |
| Mistral Large 3 | ~75 tok/s |
| GPT-5 | ~55 tok/s |
| Claude 4 Opus | ~40 tok/s |
| DeepSeek V4 Flash | ~130 tok/s |

Fastest output: Llama 3.1 8B (~150 tok/s) and DeepSeek V4 Flash (~130 tok/s). Open models on optimized infrastructure consistently outperform closed APIs on raw speed.

The Speed vs. Price Tradeoff

Faster isn't always better — sometimes speed costs more. Here's the real relationship:

Speed vs cost (input / output price per 1M tokens):

| Model | Output speed | Input / output per 1M tokens | Verdict |
| --- | --- | --- | --- |
| Llama 3.1 8B | ~150 tok/s | $0.18 / $0.18 | Best value |
| Gemini 2.0 Flash | ~120 tok/s | $0.10 / $0.40 | Cheapest |
| GPT-4o mini | ~100 tok/s | $0.15 / $0.60 | Good balance |
| GPT-4o | ~80 tok/s | $2.50 / $10.00 | Premium speed |
| Claude Sonnet 4 | ~65 tok/s | $3.00 / $15.00 | Quality focus |
| Claude 4 Opus | ~40 tok/s | $15.00 / $75.00 | Slowest, most expensive |

Key insight: The cheapest models are also the fastest. Budget models have fewer parameters, so they process and generate tokens faster. Premium models trade speed for reasoning quality.
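You can make the tradeoff concrete by computing both the wait time and the bill for a typical request. A sketch using the figures from the table above (the 1,000-in / 500-out request size is an illustrative assumption):

```python
# (ttft_s, tok/s, $ per 1M input, $ per 1M output) — figures from the table above
MODELS = {
    "Llama 3.1 8B":     (0.15, 150,  0.18,  0.18),
    "Gemini 2.0 Flash": (0.18, 120,  0.10,  0.40),
    "GPT-4o mini":      (0.22, 100,  0.15,  0.60),
    "GPT-4o":           (0.35,  80,  2.50, 10.00),
    "Claude Sonnet 4":  (0.45,  65,  3.00, 15.00),
    "Claude 4 Opus":    (0.80,  40, 15.00, 75.00),
}

def time_and_cost(model: str, input_tokens: int, output_tokens: int):
    """Wall-clock seconds and dollars for one request."""
    ttft, tps, in_price, out_price = MODELS[model]
    seconds = ttft + output_tokens / tps
    dollars = (input_tokens * in_price + output_tokens * out_price) / 1_000_000
    return round(seconds, 2), round(dollars, 6)

# A 1,000-token prompt with a 500-token answer:
for name in MODELS:
    print(name, time_and_cost(name, 1000, 500))
```

For that request shape, Claude 4 Opus is roughly three times slower and 175 times more expensive than Gemini 2.0 Flash.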

Latency by Use Case

Different applications have different speed requirements:

Real-Time Chatbots (TTFT < 500ms needed)

Users expect near-instant responses. Recommended models: Gemini 2.0 Flash (~180ms TTFT), GPT-4o mini (~220ms), Claude Haiku 4.5 (~250ms), and Llama 3.1 8B on Together.ai (~150ms).

Code Generation (TTFT < 1s acceptable)

Developers tolerate longer waits for better code. Recommended models: GPT-4o (~350ms TTFT), Claude Sonnet 4 (~450ms), Gemini 2.5 Pro (~500ms), and GPT-5 (~600ms).

Background Processing (Speed Less Critical)

Batch jobs, ETL pipelines, scheduled tasks — TTFT doesn't matter. Optimize for cost: Gemini 2.0 Flash ($0.10/$0.40 per 1M tokens) or Llama 3.1 8B ($0.18/$0.18).
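The groupings above are really just the TTFT benchmark table filtered by a latency budget. A sketch using this post's numbers, so you can plug in your own budget:

```python
# Approximate TTFT in milliseconds, from the benchmarks above
TTFT_MS = {
    "Gemini 2.0 Flash": 180, "GPT-4o mini": 220, "GPT-4o": 350,
    "Claude Haiku 4.5": 250, "Claude Sonnet 4": 450, "Claude 4 Opus": 800,
    "Gemini 2.5 Pro": 500, "Mistral Large 3": 400, "Mistral Small 4": 200,
    "GPT-5": 600, "DeepSeek V4 Flash": 220, "Llama 3.1 8B": 150,
}

def models_within(budget_ms: int) -> list[str]:
    """Models whose typical TTFT beats the budget, fastest first."""
    fits = [(ms, name) for name, ms in TTFT_MS.items() if ms < budget_ms]
    return [name for ms, name in sorted(fits)]

print(models_within(500))  # real-time chat candidates
```

Remember these are published medians; re-run the filter against your own measured p95 before committing to a model.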

How to Measure Your Own Latency

Published benchmarks are useful, but your mileage will vary. Here's how to measure:

  1. Measure from your server, not the browser — network latency adds noise
  2. Use streaming mode — TTFT is only meaningful with streaming
  3. Sample 100+ requests — latency varies by time of day and load
  4. Test with your actual prompt length — longer inputs increase TTFT
  5. Track p50, p95, and p99 — average latency hides outliers

Optimizing for Speed

Reducing latency without switching models:

  1. Stream responses so users see the first tokens immediately
  2. Keep prompts short; longer inputs increase TTFT
  3. Cap the maximum output length so responses finish sooner
  4. Cache responses to repeated or templated prompts
  5. Call the API from a server region close to the provider's endpoints
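Caching is often the cheapest win: a memoizing wrapper turns repeated identical prompts into instant, free responses. A minimal sketch, where `call_model` is a hypothetical stand-in for your real provider call:

```python
import functools

def call_model(prompt: str) -> str:
    """Hypothetical stand-in for a real (slow, billed) provider call."""
    return f"response to: {prompt}"

@functools.lru_cache(maxsize=1024)
def cached_completion(prompt: str) -> str:
    # Identical prompts skip the network round trip entirely
    return call_model(prompt)

cached_completion("summarize this ticket")   # hits the API
cached_completion("summarize this ticket")   # served from the cache
print(cached_completion.cache_info().hits)   # → 1
```

This only helps when prompts repeat exactly, so it pairs well with templated prompts; for unique prompts, look at the provider-side prompt-caching discounts some vendors offer instead.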

The Bottom Line

For most applications, Gemini 2.0 Flash or GPT-4o mini offer the best speed-to-cost ratio. They're fast enough for real-time chat, cheap enough for high volume, and capable enough for most tasks.

Reserve premium models (Claude Sonnet 4, GPT-4o) for tasks where quality justifies the speed and cost tradeoff. And for background processing, always use the cheapest model — speed doesn't matter when no one is waiting.

Calculate your API cost at any speed tier.

Try the APIpulse Calculator
