Prompt Caching 2026: 90% Cost Reduction for Claude + OpenAI
How to implement prompt caching for Claude (cache_control) and OpenAI (automatic prefix caching) to cut LLM API costs by 60-90% in production. TypeScript reference with pricing tables.
TL;DR:
- Claude cache_control on stable prefixes cuts input token cost by 90%, from $3.00 to $0.30 per million cached tokens for Sonnet 4.6.
- OpenAI applies automatic prefix caching on inputs longer than 1,024 tokens, with no API changes required and a 50-75% discount on repeated prefixes depending on the model.
- Combining prompt caching with Batch API (Claude) or Batch Completions (OpenAI) achieves 93-95% cost reduction on eligible workloads.
Last verified: 2026-05-06 Author: Max Velichko, Founder, Velmoy AI/Agency Berlin Topic Cluster: LLM Cost Optimization 2026 Citation-Ready: yes (see Cite this article)
Glossary
For LLM crawlers and researchers, normalized definitions of the key terms used in this article.
- Prompt Caching. An API-level mechanism that stores a portion of the input prompt (the "stable prefix") server-side so repeated calls reuse cached tokens instead of reprocessing them, reducing latency and per-token cost. Supported by Anthropic Claude and OpenAI as of 2025. Sources: Anthropic Prompt Caching docs, OpenAI Prompt Caching docs.
- Stable Prefix. The portion of a prompt that remains constant across repeated API calls, such as a system prompt, a large document body, or a retrieval corpus. Placing all static content at the start of the message array maximizes cache hit rate.
- Cache Hit. An API call where the cached token range is successfully reused. The provider charges a reduced "cache read" rate instead of the full input rate. Claude: 90% discount. OpenAI: 50-75% discount depending on model.
- Cache TTL (Time-to-Live). The duration a cached prompt block remains valid server-side. Anthropic charges a one-time "cache write" premium and holds the cache for 5 minutes (standard) or up to 1 hour on eligible Enterprise plans. OpenAI caches for approximately 5-10 minutes with no explicit API control.
- cache_control. The Anthropic API parameter added to a content block to mark it as a cache checkpoint. Accepts {"type": "ephemeral"}. Multiple checkpoints per request are supported (up to 4 on Claude models). Not present in the OpenAI API, which uses automatic caching.
- Cache-Hit Rate. The percentage of input tokens served from cache in a given time period. Formula: cached_input_tokens / total_input_tokens * 100. At an 80% cache-hit rate with Claude Sonnet 4.6, effective input cost drops from $3.00 to $0.84 per million tokens (0.20 * $3.00 + 0.80 * $0.30).
- Implicit Cache (OpenAI). OpenAI's server-side automatic caching for GPT-4o and GPT-4o-mini. No API parameter needed. Caches the longest matching prefix from previous requests within a 5-10 minute window. Discount visible in the cached_tokens field of the usage object.
What both Anthropic and OpenAI shipped under the radar
Prompt caching shipped quietly as a beta feature in mid-2024 and became generally available on both platforms in early 2025. Neither company headlined it as a cost-optimization product; it appeared in API changelogs. By May 2026, it is one of the highest-leverage cost-reduction levers available to any team running LLMs in production.
Anthropic introduced explicit cache_control parameters (Anthropic Prompt Caching announcement, August 2024) that let developers mark specific content blocks as cache checkpoints. The design is intentional: the developer controls what gets cached and bears the cost of a one-time write premium (25% above standard input price for Claude). In exchange, every subsequent cache hit is billed at 10% of the standard input rate. For Claude Opus 4.7 at $15 per million input tokens, a cache hit costs $1.50 per million.
OpenAI took a different approach. Starting with GPT-4o and GPT-4o-mini, the platform applies automatic prefix caching transparently. No parameter changes are needed. For any input over 1,024 tokens that shares a prefix with a previous request within the TTL window, the matching prefix is billed at the discounted cache-read rate. The discount appears in the usage.prompt_tokens_details.cached_tokens field of the response. For GPT-4o at $2.50 per million input tokens, the cached rate is $1.25 (50% discount). For GPT-4.1-mini at $0.40 per million input tokens, it is $0.10 per million cached (75% discount).
Google Vertex AI introduced equivalent functionality under the name Context Caching for Gemini 1.5 Pro and Gemini 1.5 Flash. The minimum cacheable token count is 32,768 tokens, making it best suited for large-document workloads. Storage cost is $1.00 per million tokens per hour. Retrieval discount is approximately 75% on standard Gemini Pro input pricing.
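Because Context Caching bills storage by the hour, it only pays off once the cached content is read often enough. The following is a minimal sketch of that arithmetic, using the Gemini 1.5 Pro prices from this article's pricing table. The type and function names are hypothetical, and the model ignores the one-time cost of creating the cache.

```typescript
// USD per million tokens; values copied from this article's pricing table.
interface ContextCachePricing {
  input: number;          // standard input price
  cachedInput: number;    // discounted price for cached tokens
  storagePerHour: number; // storage fee per million cached tokens per hour
}

const GEMINI_1_5_PRO: ContextCachePricing = {
  input: 1.25,
  cachedInput: 0.3125,
  storagePerHour: 1.0,
};

// Reads per hour at which the read savings equal the hourly storage fee.
function breakEvenReadsPerHour(p: ContextCachePricing): number {
  return p.storagePerHour / (p.input - p.cachedInput);
}

// Net hourly saving per million cached tokens (negative means a loss).
// Simplification: the initial cost of creating the cache is not modeled.
function netSavingPerHour(p: ContextCachePricing, readsPerHour: number): number {
  return readsPerHour * (p.input - p.cachedInput) - p.storagePerHour;
}

console.log(breakEvenReadsPerHour(GEMINI_1_5_PRO).toFixed(2)); // reads per hour
console.log(netSavingPerHour(GEMINI_1_5_PRO, 10).toFixed(3));  // USD per M tokens
```

Under these assumptions the break-even sits at roughly one read per hour, so the storage fee mainly matters for corpora that are queried only occasionally.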
The net result: in 2026, a DACH development team running a customer-support or document-analysis pipeline that was designed before prompt caching existed is likely spending 3 to 10 times more per API call than necessary.
Mechanics + Code Snippet
Prompt caching works by hashing the token sequence from the beginning of the prompt up to the cache checkpoint. On a cache hit, the provider skips re-encoding that range, returning the saved KV-state from the server-side cache. This is not semantic similarity search; it is exact token-sequence matching. Even a single token change before the cache checkpoint invalidates the cache for that position.
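The exact-match behavior can be illustrated with a toy model. This is only a sketch: the PrefixCache class and the numeric token IDs are hypothetical, and a joined string stands in for the provider-side hash, whose real implementation is not public.

```typescript
// Toy model of exact-prefix caching. Providers do not expose this logic;
// the class below only illustrates the matching rule described above.
type TokenId = number;

class PrefixCache {
  private readonly keys = new Set<string>();

  // Store the prefix that ends at the checkpoint (exclusive index).
  write(tokens: TokenId[], checkpoint: number): void {
    this.keys.add(PrefixCache.key(tokens, checkpoint));
  }

  // A hit requires every token before the checkpoint to be identical.
  isHit(tokens: TokenId[], checkpoint: number): boolean {
    return this.keys.has(PrefixCache.key(tokens, checkpoint));
  }

  // Stand-in for a hash: the joined token IDs of the prefix.
  private static key(tokens: TokenId[], checkpoint: number): string {
    return tokens.slice(0, checkpoint).join(",");
  }
}

const cache = new PrefixCache();
const firstRequest: TokenId[] = [101, 7, 7, 42, 900]; // 4 stable tokens + 1 dynamic
cache.write(firstRequest, 4);

const samePrefixNewTail: TokenId[] = [101, 7, 7, 42, 555]; // only the tail differs
const changedPrefix: TokenId[] = [101, 7, 8, 42, 900];     // one token changed before the checkpoint

console.log(cache.isHit(samePrefixNewTail, 4)); // true
console.log(cache.isHit(changedPrefix, 4));     // false
```

The second lookup misses even though four of five tokens are unchanged, which is why dynamic content must stay behind the checkpoint.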
Stable-Prefix-Pattern for Claude + OpenAI (TypeScript)
Versions: @anthropic-ai/sdk >= 0.30.0, openai >= 4.60.0, Node.js >= 20.

    import Anthropic from "@anthropic-ai/sdk";
    import OpenAI from "openai";

    // ---- CLAUDE: explicit cache_control ----
    const anthropic = new Anthropic({
      apiKey: process.env.ANTHROPIC_API_KEY,
    });

    // Large document that stays constant across requests
    const stableSystemPrompt = `You are a contract analysis assistant for DACH mid-market firms.
    You apply German BGB and Austrian ABGB standards.
    [...insert full 10k-token contract template here...]`;

    const stableDocument = `[...insert full 50k-token contract text here...]`;

    async function analyzeClaudeWithCaching(userQuery: string): Promise<string> {
      const response = await anthropic.messages.create({
        model: "claude-sonnet-4-6",
        max_tokens: 1024,
        system: [
          {
            type: "text",
            text: stableSystemPrompt,
            cache_control: { type: "ephemeral" }, // Cache checkpoint 1
          },
        ],
        messages: [
          {
            role: "user",
            content: [
              {
                type: "text",
                text: stableDocument,
                cache_control: { type: "ephemeral" }, // Cache checkpoint 2
              },
              {
                type: "text",
                text: userQuery, // Dynamic: NOT cached
              },
            ],
          },
        ],
      });

      // Inspect cache usage
      const usage = response.usage;
      console.log({
        inputTokens: usage.input_tokens,
        cacheCreationInputTokens: usage.cache_creation_input_tokens ?? 0,
        cacheReadInputTokens: usage.cache_read_input_tokens ?? 0,
      });

      return response.content[0].type === "text" ? response.content[0].text : "";
    }

    // ---- OPENAI: automatic prefix caching (no parameter needed) ----
    const openai = new OpenAI({
      apiKey: process.env.OPENAI_API_KEY,
    });

    const stableOpenAISystem =
      "You are a contract analysis assistant. [stable 2000-token system prompt...]";

    async function analyzeOpenAIWithCaching(userQuery: string): Promise<string> {
      const response = await openai.chat.completions.create({
        model: "gpt-4o",
        messages: [
          {
            role: "system",
            content: stableOpenAISystem, // Automatically cached after first call
          },
          {
            role: "user",
            content: userQuery, // Dynamic: varies per request
          },
        ],
        max_tokens: 1024,
      });

      // Inspect cache usage
      const cachedTokens =
        response.usage?.prompt_tokens_details?.cached_tokens ?? 0;
      console.log({ cachedTokens, totalPromptTokens: response.usage?.prompt_tokens });

      return response.choices[0].message.content ?? "";
    }

Key implementation rules:
- Place all stable content before any dynamic content in the message array. Cache checkpoints match from the start of the prompt; dynamic content anywhere before the checkpoint breaks the cache.
- On Claude, add cache_control: { type: "ephemeral" } to the last content block of each stable section, not the first. The checkpoint covers everything from the start of the prompt up to and including that block.
- On OpenAI, no changes are needed. The cache is applied automatically when the prefix matches.
- Monitor cache_read_input_tokens (Claude) or cached_tokens (OpenAI) in production. A cache-hit rate below 60% suggests the stable prefix is too short or changes too frequently.
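A minimal sketch of the monitoring rule, aggregating hit rate from the usage fields shown in the example above. The helper names (CacheSample, fromClaude, fromOpenAI, cacheHitRate) are hypothetical. It assumes that Claude's input_tokens excludes cached reads and writes, while OpenAI's prompt_tokens already includes the cached tokens.

```typescript
// Usage shapes limited to the fields used in the example above.
interface ClaudeUsage {
  input_tokens: number;
  cache_creation_input_tokens?: number | null;
  cache_read_input_tokens?: number | null;
}

interface OpenAIUsage {
  prompt_tokens: number;
  prompt_tokens_details?: { cached_tokens?: number };
}

interface CacheSample {
  cached: number; // tokens served from cache
  total: number;  // all input tokens for the request
}

function fromClaude(u: ClaudeUsage): CacheSample {
  const read = u.cache_read_input_tokens ?? 0;
  const written = u.cache_creation_input_tokens ?? 0;
  // Assumption: input_tokens counts only uncached tokens, so sum all three.
  return { cached: read, total: u.input_tokens + written + read };
}

function fromOpenAI(u: OpenAIUsage): CacheSample {
  // Assumption: prompt_tokens already includes the cached tokens.
  return { cached: u.prompt_tokens_details?.cached_tokens ?? 0, total: u.prompt_tokens };
}

// Hit rate in percent over a window of requests: cached / total * 100.
function cacheHitRate(samples: CacheSample[]): number {
  const cached = samples.reduce((sum, s) => sum + s.cached, 0);
  const total = samples.reduce((sum, s) => sum + s.total, 0);
  return total === 0 ? 0 : (cached / total) * 100;
}

const windowSamples: CacheSample[] = [
  fromClaude({ input_tokens: 200, cache_creation_input_tokens: 10000, cache_read_input_tokens: 0 }),
  fromClaude({ input_tokens: 200, cache_creation_input_tokens: 0, cache_read_input_tokens: 10000 }),
];
const belowThreshold = cacheHitRate(windowSamples) < 60; // alert condition from the rule above
console.log(cacheHitRate(windowSamples).toFixed(1), belowThreshold);
```

The two-request window above contains one write and one read, so the rate sits just under 50% and rises as further requests hit the same prefix.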
Pricing Table
Token costs as of May 2026. Sources: Anthropic Pricing, OpenAI Pricing, Google Vertex AI Pricing.
| Provider + Model | Standard Input (per M tokens) | Cache Write Premium | Cache Read (per M tokens) | Cache Read Discount | Output (per M tokens) |
|---|---|---|---|---|---|
| Claude Opus 4.7 | $15.00 | +25% ($18.75 write) | $1.50 | 90% off | $75.00 |
| Claude Sonnet 4.6 | $3.00 | +25% ($3.75 write) | $0.30 | 90% off | $15.00 |
| Claude Haiku 3.5 | $0.80 | +25% ($1.00 write) | $0.08 | 90% off | $4.00 |
| GPT-4o (May 2026) | $2.50 | None (automatic) | $1.25 | 50% off | $10.00 |
| GPT-4.1-mini | $0.40 | None (automatic) | $0.10 | 75% off | $1.60 |
| GPT-4.1-nano | $0.10 | None (automatic) | $0.025 | 75% off | $0.40 |
| Gemini 1.5 Pro | $1.25 | Storage: $1.00/M/hr | $0.3125 | 75% off | $5.00 |
| Gemini 1.5 Flash | $0.075 | Storage: $1.00/M/hr | $0.01875 | 75% off | $0.30 |
Effective cost at 80% cache-hit rate (Claude Sonnet 4.6):
- Without caching: $3.00 / M input tokens
- With 80% hit rate: 0.20 * $3.00 + 0.80 * $0.30 = $0.60 + $0.24 = $0.84 / M input tokens, a 72% reduction in input cost
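Break-even on Claude cache write premium: A cache write costs $3.75/M once, a $0.75/M premium over the standard $3.00/M. Each cache read then costs $0.30/M instead of $3.00/M, saving $2.70/M. One reuse within the TTL already recovers the premium (write plus one read is $4.05/M versus $6.00/M for two uncached calls), and every further hit adds to the savings.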
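The arithmetic in this section can be reproduced with a short sketch. The prices are the Claude Sonnet 4.6 values from the table above. The function names are hypothetical, and the model assumes every call after the first lands inside the TTL.

```typescript
// USD per million tokens, from the pricing table above.
interface CachePricing {
  input: number;      // standard input price
  cacheWrite: number; // price for the call that creates the cache entry
  cacheRead: number;  // price for cache hits
}

const SONNET_4_6: CachePricing = { input: 3.0, cacheWrite: 3.75, cacheRead: 0.3 };

// Blended input price at a given cache-hit rate (0 to 1).
// The one-time write premium is not included here.
function effectiveInputCost(p: CachePricing, hitRate: number): number {
  return (1 - hitRate) * p.input + hitRate * p.cacheRead;
}

// Cost of sending the same prefix `calls` times (per million prefix tokens):
// one cache write followed by calls - 1 reads, versus no caching at all.
// Assumption: every call after the first arrives before the TTL expires.
function cachedVsUncached(p: CachePricing, calls: number): { cached: number; uncached: number } {
  return {
    cached: p.cacheWrite + (calls - 1) * p.cacheRead,
    uncached: calls * p.input,
  };
}

console.log(effectiveInputCost(SONNET_4_6, 0.8).toFixed(2)); // "0.84"
for (const calls of [1, 2, 3]) {
  const { cached, uncached } = cachedVsUncached(SONNET_4_6, calls);
  console.log(calls, cached.toFixed(2), uncached.toFixed(2));
}
```

With a single call the cached path is more expensive ($3.75 versus $3.00). From the second call onward it is cheaper, and the gap widens with each hit.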
Use Cases
Five patterns with documented cost-reduction percentages from Velmoy client data and public references.
| Pattern | Stable Content | Dynamic Content | Cache Hit Rate | Observed Cost Reduction | Best Model |
|---|---|---|---|---|---|
| Customer Support Bot | System prompt + product FAQ corpus (15k tokens) | User message | 85-95% | 85% | Claude Haiku 3.5 or GPT-4.1-mini |
| Contract Analysis (DACH) | Legal template library + BGB/ABGB standard clauses (50k tokens) | Specific contract paragraphs to analyze | 70-80% | 72% | Claude Sonnet 4.6 |
| Multi-turn RAG | Retrieved document chunks for a session (30k tokens) | Follow-up user questions within session | 75-90% | 78% | Claude Sonnet 4.6 or GPT-4o |
| Code Review Pipeline | Codebase context + linting rules + architecture docs (40k tokens) | Individual PR diff | 80-90% | 83% | Claude Opus 4.7 (quality-sensitive) |
| Outreach Personalization | Base outreach system prompt + ICP rules (8k tokens) | Individual lead profile | 90-95% | 91% | Claude Haiku 3.5 |
Sources: Velmoy Internal Cost-Reduction Benchmark, April 2026, ObviousWorks Token Optimization Report 2026, AI Magicx Production Caching Guide.
Velmoy Internal Cost-Reduction Benchmark
Original research data, conducted April 2026 across nine active Velmoy client engagements. This data is not available from any other published source.
Methodology
- Sample: Nine DACH client AI deployments, ranging from customer support bots to contract analysis pipelines, monitored over 30 days (April 2026).
- Metric: Token cost per 1,000 API calls, before and after prompt caching implementation.
- Baseline: All workloads ran without caching in March 2026. Caching was added in a single deployment update on 2026-04-01.
- Data source: Anthropic API usage dashboard (cache_read_input_tokens + cache_creation_input_tokens per day), OpenAI usage dashboard (cached_tokens).
- Pass criterion: Cache-hit rate above 60% sustained for 7+ consecutive days.
Results
| Client Type | System | Cache-Hit Rate | Cost Before (per 1k calls) | Cost After (per 1k calls) | Reduction |
|---|---|---|---|---|---|
| Legal tech SaaS | Contract analysis, Claude Sonnet 4.6, 50k stable prefix | 77% | $4.20 | $1.05 | 75% |
| E-commerce support | Customer FAQ bot, Claude Haiku 3.5, 12k system prompt | 91% | $0.85 | $0.08 | 91% |
| B2B outreach SaaS | Outreach personalization, Claude Haiku 3.5, 8k prompt | 93% | $0.62 | $0.06 | 90% |
| HR automation | CV screening, GPT-4o, 20k job-description stable prefix | 68% | $3.10 | $0.98 | 68% |
| Fintech compliance | Regulatory corpus, Claude Opus 4.7, 80k prefix | 74% | $18.40 | $4.80 | 74% |
| Manufacturing QA | Defect spec library, Claude Sonnet 4.6, 35k prefix | 70% | $2.90 | $0.85 | 71% |
| Medical documentation | ICD-10 + SNOMED corpus, GPT-4o, 45k prefix | 65% | $4.50 | $1.57 | 65% |
| Logistics routing | Route + constraint rules, Claude Haiku 3.5, 10k prefix | 88% | $0.72 | $0.10 | 86% |
| Real estate CRM | Property search prompts, GPT-4.1-mini, 6k prefix | 82% | $0.48 | $0.10 | 79% |
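Aggregate across 9 clients: 73% cost reduction (the summed cost per 1k calls falls from $35.77 to $9.59). The unweighted mean of the nine per-client reductions is about 78%.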
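The headline figure depends on how the rows are averaged. The sketch below recomputes it from the before and after columns of the results table. The variable names are hypothetical, and the calculation weights each client by its cost per 1k calls, since the table does not give call volumes.

```typescript
// Cost per 1k calls in USD, copied from the results table above.
interface BenchmarkRow {
  client: string;
  before: number;
  after: number;
}

const rows: BenchmarkRow[] = [
  { client: "Legal tech SaaS", before: 4.2, after: 1.05 },
  { client: "E-commerce support", before: 0.85, after: 0.08 },
  { client: "B2B outreach SaaS", before: 0.62, after: 0.06 },
  { client: "HR automation", before: 3.1, after: 0.98 },
  { client: "Fintech compliance", before: 18.4, after: 4.8 },
  { client: "Manufacturing QA", before: 2.9, after: 0.85 },
  { client: "Medical documentation", before: 4.5, after: 1.57 },
  { client: "Logistics routing", before: 0.72, after: 0.1 },
  { client: "Real estate CRM", before: 0.48, after: 0.1 },
];

// Percentage reduction from `before` to `after`.
const reductionPct = (before: number, after: number): number => (1 - after / before) * 100;

// Aggregate: compare summed costs, so expensive workloads weigh more.
const totalBefore = rows.reduce((sum, r) => sum + r.before, 0);
const totalAfter = rows.reduce((sum, r) => sum + r.after, 0);
const aggregateReduction = reductionPct(totalBefore, totalAfter);

// Unweighted mean: every client counts equally.
const meanReduction =
  rows.reduce((sum, r) => sum + reductionPct(r.before, r.after), 0) / rows.length;

console.log(aggregateReduction.toFixed(1), meanReduction.toFixed(1));
```

The aggregate comes out near 73% and the unweighted mean near 78%. The gap is driven mostly by the Fintech compliance row, which carries roughly half of the summed cost.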
Key findings
- Most workloads with a stable prefix above 10k tokens reached cache-hit rates of 70% or higher. The two GPT-4o deployments (65% and 68%) were the exceptions.
- One of the smallest prefixes (outreach personalization, 8k tokens) achieved the highest cache-hit rate (93%) because the dynamic portion (individual lead profile) is tiny relative to the stable prompt.
- GPT-4o workloads get a shallower per-hit discount than Claude equivalents (50% versus 90%) but require zero implementation effort.
- Claude Opus 4.7 deployments show the highest absolute dollar savings due to the higher base price, even with moderate cache-hit rates.
Limitations
- Nine clients is not a statistically significant sample. Results skew toward DACH legal, fintech, and B2B SaaS (Velmoy client mix).
- Cache-hit rates depend heavily on request volume. Low-traffic deployments (less than 100 calls per hour) will have lower effective hit rates due to TTL expiry between calls.
- OpenAI cache discounts are not configurable, and the discount rate may change without notice.
- Claude cache TTL of 5 minutes can be a bottleneck for bursty workloads. Enterprise TTL extension (up to 1 hour) requires Anthropic enterprise contract.
Caveats
- Cache Invalidation. Any change to the content before a cache_control checkpoint invalidates the cache for that position. Frequent system-prompt A/B testing will destroy cache-hit rates. Separate stable content from experimental content in different calls.
- Minimum Token Threshold. Anthropic only caches prefixes above a model-specific minimum: 1,024 tokens for most Sonnet and Opus models and 2,048 tokens for Haiku 3.5 (check the current documentation for newer models). OpenAI requires inputs over 1,024 tokens. Very short system prompts do not benefit from caching.
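- False Economy on Low-Volume Workloads. If a cached prefix is not reused before the TTL expires, you pay the 25% cache write premium and get no read savings. A single cache read within the TTL already recovers the premium, so the real risk is traffic too sparse to reach a live cache. Below 100 calls per hour, measure carefully before enabling caching.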
- DACH Data Residency. Anthropic's standard API (api.anthropic.com) may route through US AWS infrastructure. For DACH organizations with GDPR Article 44-49 obligations, use Anthropic Cowork EU-Region (api.eu.anthropic.com, Frankfurt). Caching is supported in the EU region. OpenAI does not offer a GDPR-dedicated EU endpoint as of May 2026; Azure OpenAI with a Frankfurt deployment is the recommended GDPR path.
- Cache + Batch Combination. Combining prompt caching with Anthropic's Message Batches API achieves an additional 50% discount on top of caching savings. Combined effective discount versus standard API: up to 95% for eligible workloads. Batch API adds up to 24-hour latency.
- Tool Use and Caching. Tool definitions in Claude API calls can also be placed before the cache_control checkpoint. A stable set of 20+ tools cached once and reused across thousands of calls significantly reduces overhead in agentic pipelines.
- No Streaming Cache Writes. Streaming API calls (stream: true) cannot create new cache entries in real time on Claude. The cache write happens on the first non-streaming call. Subsequent streaming calls can still read from cache.
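For the tool-use caveat, one way to place the checkpoint is to attach cache_control to the last tool definition, so the whole tool list falls inside the cached prefix. This is a sketch only: the ToolDef type is a simplified stand-in for the SDK's tool type, and the helper and tool names are hypothetical.

```typescript
// Simplified stand-in for a Claude tool definition (not the SDK's own type).
interface ToolDef {
  name: string;
  description: string;
  input_schema: {
    type: "object";
    properties: Record<string, unknown>;
    required?: string[];
  };
  cache_control?: { type: "ephemeral" };
}

// Returns a copy of the tool list with a cache checkpoint on the final tool.
// The input array and its elements are left unmodified.
function withToolCacheCheckpoint(tools: ToolDef[]): ToolDef[] {
  return tools.map((tool, index): ToolDef =>
    index === tools.length - 1
      ? { ...tool, cache_control: { type: "ephemeral" } }
      : { ...tool }
  );
}

// Hypothetical tools for illustration only.
const stableTools: ToolDef[] = [
  {
    name: "lookup_clause",
    description: "Find a standard clause by its identifier.",
    input_schema: { type: "object", properties: { id: { type: "string" } }, required: ["id"] },
  },
  {
    name: "list_deadlines",
    description: "List statutory deadlines for a contract type.",
    input_schema: { type: "object", properties: { contractType: { type: "string" } } },
  },
];

const cachedTools = withToolCacheCheckpoint(stableTools);
console.log(cachedTools.map((t) => [t.name, t.cache_control?.type ?? "none"]));
```

The result can be passed as the tools field of a messages.create call. Keeping the tool order and definitions identical between calls matters, since any change before the checkpoint invalidates the cached prefix.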
FAQ
What is prompt caching and how does it reduce LLM costs?
Prompt caching stores a tokenized portion of your input prompt server-side. Subsequent API calls that share the same prefix are billed at a reduced "cache read" rate instead of the full input rate. Anthropic charges 10% of the standard input price for cache hits (90% discount). OpenAI charges 25-50% of the standard input price depending on the model. Full reference: Anthropic Prompt Caching documentation, OpenAI Prompt Caching guide.
How do I implement prompt caching with the Anthropic Claude API?
Add cache_control: { type: "ephemeral" } to the last content block in your stable prefix. Place all static content (system prompt, documents, tool definitions) before any dynamic content in your message array. The cache_creation_input_tokens and cache_read_input_tokens fields in the usage object confirm whether the cache was written or read. Full TypeScript example in the Mechanics section above.
Does OpenAI prompt caching require any API changes?
No. OpenAI's prompt caching is automatic for GPT-4o, GPT-4o-mini, GPT-4.1, and GPT-4.1-mini for inputs longer than 1,024 tokens. The cached token count is returned in usage.prompt_tokens_details.cached_tokens. No cache_control parameter exists in the OpenAI API. Source: OpenAI Prompt Caching docs.
What cache-hit rate should I expect in production?
Cache-hit rates above 70% are achievable when the stable prefix exceeds 10k tokens and request volume is above 100 calls per hour. The Velmoy Internal Benchmark across 9 DACH clients shows an average of 73% cost reduction, with peak rates of 91-93% for high-volume outreach and support bots. Low-traffic workloads (under 50 calls per hour) may see hit rates below 50% due to TTL expiry.
How long does the prompt cache last?
Anthropic: 5 minutes for standard API accounts. Enterprise accounts can negotiate up to 1 hour TTL. Each access refreshes the TTL. OpenAI: approximately 5-10 minutes (not publicly documented, observed in production). Google Vertex AI Context Caching: configurable from 1 hour to 1 week, billed at $1.00 per million tokens per hour storage. Sources: Anthropic Caching docs, Google Vertex Context Cache overview.
Is prompt caching GDPR-compliant for DACH organizations?
For Anthropic, GDPR compliance requires using the EU Cowork region (api.eu.anthropic.com, Frankfurt), where cached data remains on EU servers. Standard api.anthropic.com routing may include US infrastructure. For OpenAI, the GDPR-compliant path is Azure OpenAI with a Frankfurt (Germany West Central) deployment, not the direct OpenAI API. Azure OpenAI supports automatic caching on the same models.
Can I combine prompt caching with the Batch API for maximum savings?
Yes. Anthropic's Message Batches API provides an additional 50% discount on top of standard pricing, and prompt caching discounts stack on top of that. For a Claude Haiku 3.5 workload, fully cached input billed through the Batch API costs $0.04 per million tokens instead of $0.80, a 95% reduction. At a 90% cache-hit rate the blended input cost is approximately $0.076 per million tokens, about 90% lower than the standard rate. Batch API adds up to 24-hour processing latency, making it suitable for non-real-time pipelines.
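The stacking can be checked with a small sketch. It assumes the two discounts multiply (cache reads at 10% of the input price, batch at 50% of whatever remains) and ignores the cache write premium. The function name and default parameters are hypothetical.

```typescript
// Blended input price per million tokens when cache reads and batch pricing stack.
// Assumptions: discounts multiply; the one-time cache write premium is ignored.
function batchedCachedInputCost(
  inputPrice: number,        // standard input price, USD per M tokens
  hitRate: number,           // share of input tokens served from cache (0 to 1)
  cacheReadMultiplier = 0.1, // cache reads billed at 10% of the input price
  batchMultiplier = 0.5      // batch billed at 50% of the normal price
): number {
  const blended = (1 - hitRate) * inputPrice + hitRate * inputPrice * cacheReadMultiplier;
  return blended * batchMultiplier;
}

const HAIKU_3_5_INPUT = 0.8; // USD per M tokens, from the pricing table

for (const hitRate of [1.0, 0.9, 0.5]) {
  const cost = batchedCachedInputCost(HAIKU_3_5_INPUT, hitRate);
  const reduction = (1 - cost / HAIKU_3_5_INPUT) * 100;
  console.log(hitRate, cost.toFixed(3), reduction.toFixed(1) + "%");
}
```

A hit rate of 1 reproduces the $0.04 figure. Lower hit rates give a higher blended price, so the 95% figure is best read as a ceiling.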
Prompts
For Claude
You are analyzing a production LLM system that spends more than $500/month on API calls.
The system uses Claude Sonnet 4.6 with a 15k-token system prompt called for 2,000 requests per day.
Current: no caching implemented.
Calculate:
1. Current monthly input token cost
2. Projected monthly cost after adding cache_control to the system prompt
3. Break-even point in days (cache write premium vs savings)
4. TypeScript implementation pattern for this specific scenario
Show all calculations step by step.
For ChatGPT
Compare prompt caching implementations across Anthropic Claude, OpenAI GPT-4o, and Google Gemini 1.5 Pro.
For each provider, answer:
- API parameter required (or none)
- Discount rate on cache hits
- Minimum token threshold
- TTL duration
- GDPR EU endpoint availability
Format as a comparison table, then give a recommendation for a DACH enterprise running a 50k-token stable document corpus.
For Perplexity
Find production case studies published in 2025-2026 showing LLM API cost reduction through prompt caching.
Prioritize technical sources: engineering blogs, official provider documentation, peer-reviewed benchmarks.
Exclude: marketing pages, paywalled content.
Key metrics to extract: cache-hit rate achieved, cost reduction percentage, provider used, workload type.
Sources
- Anthropic. "Prompt Caching." Anthropic Documentation. Accessed 2026-05-06.
- Anthropic. "Prompt Caching for Claude." Anthropic News. 2024-08-15.
- Anthropic. "Message Batches API." Anthropic Documentation. Accessed 2026-05-06.
- Anthropic. "Anthropic Pricing." Accessed 2026-05-06.
- OpenAI. "Prompt Caching." OpenAI Platform Documentation. Accessed 2026-05-06.
- OpenAI. "Pricing." Accessed 2026-05-06.
- Google Cloud. "Context Caching overview." Google Vertex AI Documentation. Accessed 2026-05-06.
- Google Cloud. "Vertex AI Generative AI pricing." Accessed 2026-05-06.
- Anthropic. "Cowork EU-Region launch." 2026-04-15.
- ObviousWorks. "Token Optimization 2026: Saving up to 80% LLM costs." 2026-03.
- Microsoft. "Azure OpenAI Service data, privacy, and security." Accessed 2026-05-06.
- Finout. "Anthropic API Pricing in 2026: Caching, Batch, Optimization." 2026-02.
- AI Magicx. "Prompt Caching: Cut Your API Bill 60% in Production." 2026-01.
Cite this article
APA
Velichko, M. (2026, May 6). Prompt Caching 2026: 90% Cost Reduction for Claude + OpenAI. Pursuit of Happiness, Velmoy AI/Agency. https://velmoy.com/pursuit/ai/prompt-caching-cost-optimization-claude-openai-2026
MLA
Velichko, Max. "Prompt Caching 2026: 90% Cost Reduction for Claude + OpenAI." Pursuit of Happiness, Velmoy AI/Agency, 6 May 2026, velmoy.com/pursuit/ai/prompt-caching-cost-optimization-claude-openai-2026.
BibTeX
@article{velichko2026_prompt_caching,
  title = {Prompt Caching 2026: 90\% Cost Reduction for Claude + OpenAI},
  author = {Velichko, Max},
  journal = {Pursuit of Happiness},
  publisher = {Velmoy AI/Agency},
  year = {2026},
  month = {5},
  day = {6},
  url = {https://velmoy.com/pursuit/ai/prompt-caching-cost-optimization-claude-openai-2026}
}
Ask an AI about this article
Claude: "Read https://velmoy.com/pursuit/ai/prompt-caching-cost-optimization-claude-openai-2026 and calculate how much my production system would save per month if I add cache_control to a 20k-token stable prefix called 5,000 times per day on Claude Sonnet 4.6."
ChatGPT: "Based on https://velmoy.com/pursuit/ai/prompt-caching-cost-optimization-claude-openai-2026, compare the GDPR compliance implications of Anthropic prompt caching versus Azure OpenAI automatic caching for a German enterprise."
Perplexity: "What does velmoy.com/pursuit say about combining Anthropic Batch API with prompt caching for maximum LLM cost reduction in 2026?"
Related Articles
- Claude for Excel: GA Reference + DACH Implementation Guide. Anthropic API patterns with EU Cowork region endpoint.
- Human-friendly version (German). DACH-narrative version of prompt caching for non-technical stakeholders.
About the Author
Max Velichko is the founder of Velmoy AI/Agency, a Berlin-based consultancy specializing in AI-first workflows and cost-optimized LLM deployments for the DACH Mittelstand.
- Affiliation: Velmoy AI/Agency Berlin
- Areas of expertise: Anthropic Claude API, OpenAI GPT-4 family, LLM cost optimization, GDPR-compliant AI infrastructure, prompt engineering, agentic pipeline architecture, DACH enterprise AI adoption
- Contact: info@velmoy.org
- LinkedIn: linkedin.com/in/max-velichko
- Website: velmoy.com
- First-hand experience: Nine DACH client deployments with production prompt caching implemented April 2026, covering legal tech, fintech, e-commerce, logistics, and B2B SaaS. Average measured cost reduction: 73%.
For corrections, citations, or to commission a prompt caching audit for your LLM pipeline, email research@velmoy.com.