writing
Breaking the 25,000-token wall
MCP is everywhere now — and so is its oldest constraint. How a transparent caching proxy gets any MCP server past the 25,000-token response limit.
· 6 min read
- #mcp
- #ai-infrastructure
- #caching
- #open-source
When I first wrote about this problem in late 2025, I had to open by explaining what MCP was. Nobody needs that paragraph anymore. The Model Context Protocol went from a promising Anthropic announcement in November 2024 to plumbing: it’s wired into IDEs, desktop assistants, agent runtimes, and enterprise gateways. We won — every tool can talk to every model.
And the wall I wrote about is still standing.
The problem
You ask your assistant to analyze a web page. It confidently calls the browser tool. Then:
Error: MCP tool response (154,162 tokens) exceeds maximum allowed tokens (25,000)
The page was too big. Your AI just hit a wall.
Most MCP clients enforce a hard cap on tool-response size — commonly around 25,000 tokens. Meanwhile, the things we actually point tools at have only grown:
- A typical web page DOM: 1.3 MB+ — easily 300,000+ tokens
- A Playwright screenshot: 154,162 tokens (6× over the limit)
- A Figma design export: 351,378 tokens (14× over)
- A GitHub PR diff for a substantial change: 36,464 tokens (46% over)
The math doesn’t work, and it’s structural: pages, diffs, and design files grow faster than client response limits. Even when a giant response would fit in today’s million-token context windows, stuffing it in is the wrong move — you pay for every token and bury the model in noise it has to attend past.
This breaks real workflows
The evidence has been sitting in public issue trackers for years. A few of my favorites:
Chrome MCP Server (issue #163): “chrome_screenshot always gives ‘exceeds maximum allowed tokens’ error. The base64-encoded screenshot data is simply too large. This makes the tool completely unusable.”
GitHub MCP Server (issue #625): “The get_pull_request_diff function fails for any substantial PR. Example: PR #9019 returned 36,464 tokens. I’m forced to manually break up PR reviews into smaller chunks.”
Playwright MCP: “Getting DOM content returns ‘Conversation Too Long’ error. The accessibility tree for a modern SPA contains thousands of elements. Can’t proceed with automation.”
The pattern: MCP servers work beautifully for small, tidy queries. The moment you do anything realistic — analyze a full page, review a real PR, extract a design — you hit the wall. Some servers have since grown pagination parameters, and some clients raised their caps a little. The mismatch is still the default experience.
The insight: this is an API gateway problem
The fix came from looking outside the AI ecosystem. Large-scale API platforms solved oversized-response problems years ago with transparent proxy caching: a gateway sits between client and service, intercepts heavy responses, and exposes a query interface over them. The client doesn’t change. The server doesn’t change. The proxy absorbs the complexity.
Client → Proxy (cache + query layer) → Origin server
MCP’s architecture makes this pattern land perfectly, because every server is just a process speaking a protocol over stdio. Anything that can spawn the server and speak the protocol can stand in the middle.
MCP-Cache: a transparent proxy for any MCP server
MCP-Cache is that middle layer. It wraps any MCP server — no server modifications, no client changes:
Claude Desktop (MCP client)
↓
mcp-cache (transparent proxy)
├─ spawns the target MCP server
├─ forwards all messages untouched
├─ intercepts oversized responses (default threshold 900 KB)
├─ caches them locally (~/.mcp-cache/cache/)
└─ returns a summary + query tools instead
↓
Target MCP server (Playwright, Chrome, GitHub, ...)
When a response would blow the client’s limit, mcp-cache caches the full payload, hands the model a short summary with a response ID, and exposes tools to query the cached data — so the model works through the result in focused slices instead of failing on the whole. Limits are client-aware (25K tokens for Claude, 30K for Cursor), and small responses pass straight through.
Before:
Developer: "Get the DOM and find all payment forms"
Chrome MCP: [tries to return a 1.3 MB DOM]
Error: response exceeds maximum length
Developer: [gives up, or hand-crafts narrower selectors]
After:
Developer: "Get the DOM and find all payment forms"
mcp-cache: "Cached as resp_xyz (147 chunks, 1.2 MB). Found 23 forms.
Use query_response('resp_xyz', 'payment') for details."
Developer: "Show me forms with 'payment' in the action attribute"
mcp-cache: [returns 3 specific forms, full HTML]
Two queries instead of a dead end — and the full fidelity of the original response is preserved, not lossily truncated.
Querying the cache
The query interface is the part that makes cached data actually useful. Three modes:
// Plain text search
query_response('resp_abc123', 'submit button', limit=10)
// JSONPath for structured data
query_response('resp_abc123', '$.div[?(@.class=="navbar")]')
// Regex for pattern matching
query_response('resp_abc123', '/href=".*\.pdf"/')
Cached responses live under a TTL (1 hour by default), with periodic cleanup, manual refresh and deletion, and optional compression. This is also what makes multi-step agent work pleasant: one expensive tool call, then a sequence of cheap, targeted queries that build understanding — find the pricing table, then the checkout button, then the error banner — without ever re-fetching the page.
What it costs
Close to nothing:
- <10 ms overhead for responses that don’t need caching
- <200 ms for queries against cached data
- 90%+ cache-hit rate on repeated queries
- 70% disk reduction with compression enabled
Compare that with the alternatives: manual pagination that adds a round-trip per chunk, or failed calls that torch the whole attempt.
Using it
No install required — wrap any server command with npx:
# Playwright
npx @hapus/mcp-cache npx @playwright/mcp@latest
# Chrome
npx @hapus/mcp-cache npx @executeautomation/chrome-mcp
# Filesystem
npx @hapus/mcp-cache npx @modelcontextprotocol/server-filesystem /path/to/dir
Or in Claude Desktop’s claude_desktop_config.json:
{
"mcpServers": {
"playwright-cached": {
"command": "npx",
"args": ["@hapus/mcp-cache", "npx", "@playwright/mcp@latest"]
}
}
}
Restart the client and you’re done. Because the proxy is transparent, this composes with everything else in the ecosystem — it neither knows nor cares which server it’s wrapping.
The wall isn’t going anywhere
Eighteen months after MCP shipped, the response-size mismatch has proven structural, not transitional. Real-world payloads will always outgrow what a client wants delivered inline — and “deliver everything inline” was never the right contract anyway. Cache the whole thing, hand the model a handle, let it query is a better one: cheaper, more focused, and kinder to context windows of any size.
GitHub: github.com/swapnilsurdi/mcp-cache npm: npmjs.com/package/@hapus/mcp-cache
This is a 2026 refresh of a post first published in September 2025, back when “what is MCP?” still needed answering. The token wall, regrettably, needed no update.