Why Your System Prompt Costs $50K/Month
Your system prompt is 2,000 tokens. Your app handles 10 million requests per month. Every request processes those same 2,000 tokens from scratch.
At GPT-4 Turbo rates ($10 per million input tokens): 2,000 tokens × 10M requests = 20B tokens, or $200,000/month on system prompts alone.
And that's just the API cost. Behind the scenes, real GPU compute is burned re-processing the same tokens, and every request pays for it again in prefill latency.
The Waste
Every request does the same work:
1. Tokenize the system prompt
2. Run it through all transformer layers
3. Build the KV cache for those tokens
4. Finally process the unique user query
Steps 1-3 are identical across requests. You're paying to recompute the same thing millions of times.
Prefix Caching
The fix is conceptually simple: compute the KV cache for your system prompt once, reuse it for every request.
# Conceptual implementation
class PrefixCache:
    def __init__(self, model, system_prompt):
        self.model = model
        # Compute the KV cache for the static prefix once, at startup
        self.kv_cache = model.prefill(system_prompt)

    def generate(self, user_query):
        # Reuse the cached KV; only the new query tokens need prefill
        return self.model.generate(
            user_query,
            prefix_kv_cache=self.kv_cache,
        )
Assuming a 100-token user query:
- Before: 2,000 + 100 = 2,100 tokens processed on every request
- After: 100 tokens processed per request (the 2,000-token prefix comes from cache)
That's a 21x reduction in prefill compute.
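For self-hosted models, here is a minimal sketch of the same pattern with Hugging Face transformers: precompute the prefix's KV cache (past_key_values) once, then pass a copy of it with each generation call. The model name and prompts are placeholders, and the exact cache behavior varies across transformers versions.

import copy
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"   # placeholder; any causal LM works the same way
SYSTEM_PROMPT = "You are a helpful assistant for code review.\n"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

# Prefill the static prefix once and keep its KV cache
prefix_ids = tokenizer(SYSTEM_PROMPT, return_tensors="pt").input_ids
with torch.no_grad():
    prefix_kv = model(prefix_ids, use_cache=True).past_key_values

def answer(user_query: str) -> str:
    # Per request: concatenate prefix + query token ids and pass a copy of the
    # prefix cache, so only the new query tokens are actually prefilled
    query_ids = tokenizer(user_query, return_tensors="pt",
                          add_special_tokens=False).input_ids
    full_ids = torch.cat([prefix_ids, query_ids], dim=1)
    out = model.generate(
        full_ids,
        past_key_values=copy.deepcopy(prefix_kv),  # don't mutate the shared cache
        max_new_tokens=64,
    )
    return tokenizer.decode(out[0][full_ids.shape[1]:], skip_special_tokens=True)

The deepcopy matters: generate appends to whatever cache it is handed, so each request needs its own copy of the shared prefix cache.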
Who Offers This
Anthropic: Prompt caching with explicit cache markers
response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,  # required by the Messages API
    system=[{
        "type": "text",
        "text": system_prompt,
        "cache_control": {"type": "ephemeral"}
    }],
    messages=user_messages
)
OpenAI: Automatic prompt caching for repeated prefixes. There's no explicit API: prompts of 1,024 tokens or more get their repeated prefixes cached automatically, cached input tokens are billed at a discount, and cache entries typically expire after a few minutes of inactivity (up to roughly an hour).
Self-hosted (vLLM, TensorRT-LLM): Prefix caching built-in
# vLLM with prefix caching
python -m vllm.entrypoints.openai.api_server \
--enable-prefix-caching
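If you embed vLLM directly instead of running the OpenAI-compatible server, the same switch exists on the offline LLM class. A minimal sketch, with a placeholder model name:

from vllm import LLM, SamplingParams

# enable_prefix_caching lets vLLM reuse KV-cache blocks for shared prefixes
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", enable_prefix_caching=True)

system_prompt = "You are a helpful assistant for code review.\n"
params = SamplingParams(max_tokens=128)

# Both prompts share the same prefix; the second reuses its cached KV blocks
outputs = llm.generate(
    [system_prompt + "Review this diff: ...",
     system_prompt + "Review this function: ..."],
    params,
)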
The Math
Let's revisit the numbers with caching:
| Scenario | Tokens Processed (per month) | Monthly Cost |
|---|---|---|
| No caching | 21B | $210,000 |
| With caching | 1B (queries only) | $10,000 |
Same functionality, 21x less prefill compute. With hosted APIs the dollar savings are somewhat smaller than this idealized table, because cached tokens are discounted rather than free (Anthropic bills cache reads at roughly 10% of the base input rate, OpenAI at 50%), but they are still substantial.
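As a sanity check, here is the arithmetic above as a small cost model. The request volume, token counts, and discount factor are this example's assumptions, not universal constants:

PRICE_PER_M_INPUT = 10.00        # $/1M input tokens (GPT-4 Turbo-class pricing)
REQUESTS_PER_MONTH = 10_000_000
PREFIX_TOKENS = 2_000            # static system prompt
QUERY_TOKENS = 100               # assumed average user query

def monthly_cost(cached_read_discount: float = 0.0) -> float:
    # cached_read_discount: fraction of full price paid for cached prefix tokens
    # (0.0 = free, as with self-hosting; ~0.1 for Anthropic reads, ~0.5 for OpenAI)
    prefix_cost = PREFIX_TOKENS * REQUESTS_PER_MONTH / 1e6 * PRICE_PER_M_INPUT
    query_cost = QUERY_TOKENS * REQUESTS_PER_MONTH / 1e6 * PRICE_PER_M_INPUT
    return query_cost + prefix_cost * cached_read_discount

print(monthly_cost(cached_read_discount=1.0))   # no caching: 210000.0
print(monthly_cost(cached_read_discount=0.0))   # fully cached, self-hosted: 10000.0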
Where It Breaks
Prefix caching requires exact prefix matches. These scenarios break it:
Dynamic system prompts
# Different every request = no caching
system = f"Today is {datetime.now()}. You are..."
Per-user personalization in system prompt
# User name in system prompt = no cache reuse
system = f"You are helping {user.name}. Their preferences are..."
Varying context order
# Documents shuffled per request = a different prefix every time, no cache hits
random.shuffle(context_docs)                 # shuffles in place and returns None
system = base_prompt + "\n\n".join(context_docs)
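The ordering case is usually easy to fix: impose a deterministic order on the retrieved context before it enters the prompt. A small sketch, assuming context_docs is a list of strings as above:

# Fix the order so the prefix is byte-identical across requests
# that retrieve the same context
stable_docs = sorted(context_docs)           # any deterministic key works
system = base_prompt + "\n\n".join(stable_docs)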
Best Practices
Static prefix, dynamic suffix
# Good: Static system prompt, dynamic user info in messages
system = "You are a helpful assistant for code review."
messages = [
    {"role": "user", "content": f"Review this code by {user.name}: {code}"}
]
Cache-aware prompt design
# Structure prompts with cacheable prefix
CACHEABLE_PREFIX = """
You are an expert financial analyst.
[Long instructions that never change...]
"""
# Dynamic parts come after
full_prompt = CACHEABLE_PREFIX + f"\nAnalyze this report: {report}"
Monitor cache hit rates
Track how often your prefix cache is actually used:
# Log cache statistics per response
# OpenAI Chat Completions: cached prefix tokens
logger.info(f"Cached tokens: {response.usage.prompt_tokens_details.cached_tokens}")
# Anthropic Messages API: cache reads and writes are reported separately
logger.info(f"Cache read tokens: {response.usage.cache_read_input_tokens}")
logger.info(f"Cache write tokens: {response.usage.cache_creation_input_tokens}")
If hit rate is low, your "static" prefix isn't static enough.
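To turn those per-response numbers into an actual hit rate, accumulate them over a window of traffic. A sketch against the OpenAI-style usage fields shown above; the alert threshold is arbitrary:

class CacheHitTracker:
    """Accumulates prompt vs. cached token counts across requests."""

    def __init__(self):
        self.prompt_tokens = 0
        self.cached_tokens = 0

    def record(self, usage) -> None:
        self.prompt_tokens += usage.prompt_tokens
        self.cached_tokens += usage.prompt_tokens_details.cached_tokens

    @property
    def hit_rate(self) -> float:
        return self.cached_tokens / self.prompt_tokens if self.prompt_tokens else 0.0

tracker = CacheHitTracker()
# ... call tracker.record(response.usage) after each request ...
if tracker.hit_rate < 0.5:   # arbitrary alert threshold
    print("Prefix cache hit rate is low: the 'static' prefix may be drifting")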
The Bigger Picture
Prefix caching is one instance of a broader principle: avoid redundant computation.
Other applications:
- Shared context across users: Cache company-wide knowledge base embeddings
- Conversation history: Cache early turns of long conversations (see the sketch after this list)
- Tool definitions: Cache function schemas that don't change
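As one concrete example of the conversation-history case, Anthropic's cache marker can be moved forward each turn so everything before the newest message stays cached. A hedged sketch; history, latest_user_message, and cached_system_blocks are placeholders:

# Place the cache marker on the newest user turn: the system prompt and all
# earlier turns become a cached prefix for the next request
messages = history + [{
    "role": "user",
    "content": [{
        "type": "text",
        "text": latest_user_message,
        "cache_control": {"type": "ephemeral"},
    }],
}]

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    system=cached_system_blocks,   # same cached system prefix as before
    messages=messages,
)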
Every token you process twice is a token you could have processed once. At scale, the savings are substantial.
That $50K/month system prompt tax ($200K/month at the scale in the example above)? It's optional.