▣ npm · @hapus/mcp-cache · ★9
MCP-Cache: a transparent cache for any MCP server
A transparent proxy that caches oversized MCP tool responses and hands the model query tools — so any MCP server works past the 25K-token wall.
- Period
- 2025 — present
- Role
- Author & maintainer
- Status
- open source
- 25K → unlimited token wall
- −30–50% LLM API cost
- <200ms cached query
- #mcp
- #ai-infrastructure
- #caching
- #typescript
The problem
You ask Claude to analyze a web page. It calls the browser tool, the tool does its job — and then the client throws: MCP tool response (154,162 tokens) exceeds maximum allowed tokens (25,000).
This is not an edge case. Most MCP clients cap tool responses around 25,000 tokens, and modern MCP servers blow past that constantly. A typical SPA’s DOM runs 1.3 MB+ — easily 300,000 tokens. A Playwright screenshot payload: 154,162 tokens. A Figma export: 351,378 tokens. Even a large GitHub PR diff lands at 36,464 tokens, 46% over the limit. Scan the issue trackers of the Playwright, Chrome, and GitHub MCP servers and you find the same failure reported again and again — across the ecosystem, this fails thousands of times a day.
The fixes people reach for are wrong. Patch each server to paginate, and every maintainer re-implements the same logic while the model loses the whole-document view. Tell users to ask narrower questions, and they mostly just give up.
The architecture
I had seen this problem before — just not in AI. When responses get too big between services, API gateways don’t make you rewrite the services. They put a proxy in the middle.
MCP-Cache applies that pattern to MCP. It’s a transparent proxy you wrap around any server:
Claude Desktop (MCP client)
↓
mcp-cache — spawns the real server as a child process,
forwards every JSON-RPC message unchanged,
intercepts oversized tools/call responses
↓
Playwright / Chrome / GitHub / any MCP server
When a response crosses the threshold (default 900 KB, with client-aware token limits — 25K for Claude, 30K for Cursor), MCP-Cache stores the full payload, chunks it, and returns a short summary with a cache ID instead of the raw blob. It also injects six management tools into the wrapped server’s tool list. The one that matters is query_response: text search, JSONPath, or regex over the cached payload, plus chunked reads for sequential access. Entries expire on a TTL (one hour by default), cleanup runs every five minutes, and optional compression keeps the cache small.
The interaction flips from a dead end into a loop. Instead of an error, the model sees “Response too large (1.29 MB, 133 chunks). Saved as resp_abc123 — use query_response to search,” queries for payment, and gets back the three forms it actually needed.
Decisions that mattered
Pure proxy, zero modifications. No server changes, no client changes, no forks. It works with any MCP server in any language over stdio; the entire integration is prefixing the launch command: npx @hapus/mcp-cache npx @playwright/mcp@latest.
Intercept at exactly one point. Initialization, capability negotiation, notifications — everything except tools/call responses passes through untouched. Protocol transparency is what makes “universal” true rather than aspirational; the moment you start rewriting other message types, you inherit every server’s quirks.
Client-aware limits instead of one magic number. Claude caps at 25K tokens, Cursor at 30K, others differ. The proxy adapts to the connected client rather than forcing everyone to the lowest common denominator.
Summary plus queries beats pagination. Pagination forces the model to read everything in order, burning tokens on irrelevant chunks. A summary with a queryable handle lets it narrow iteratively — which is the thing LLMs are actually good at. Three query modes (text, JSONPath, regex) cover unstructured, structured, and pattern cases without inventing a DSL.
Boring storage first. v1 is a file-based cache under ~/.mcp-cache. Redis-backed shared caching and vector-embedding semantic search are on the roadmap, but nobody needs them to stop hitting token errors today.
Numbers
- Token ceiling: 25K → effectively unlimited; 300K+ token responses handled
- Proxy overhead: <10 ms on responses that don’t need caching
- Cached queries: <200 ms
- Disk usage: −70% with compression enabled
- LLM API cost: −30–50% on tool-heavy workflows, because the model queries slices instead of re-ingesting full payloads
- Shipped as
@hapus/mcp-cacheon npm; runs via npx with zero configuration
Lessons
Infrastructure patterns transfer to AI systems almost embarrassingly well. This is a classic API-gateway idea pointed at a 2025 protocol, and it works because the underlying problem — payloads bigger than consumers can swallow — is the same one gateways solved a decade ago.
For ecosystem-wide problems, ship the fix that requires nobody else to change. MCP-Cache gets adopted because the cost of trying it is one command prefix, not a PR against every MCP server in existence.
And give models handles, not firehoses. An LLM with a query tool over cached data beats an LLM force-fed the same data — cheaper, faster, and it stops failing at exactly the moment the task gets interesting.