GPT-4o has a 128K-token context window. GPT-5 has 400K. GPT-4.1 has 1M. Gemini 2.5 Pro has 1M. Llama 4 Scout has 10M. Every model generation ships with a bigger number on the box, and the marketing narrative is clear: more context equals better AI.
This narrative is wrong. Not because bigger windows aren't useful for single-session tasks. They are. But because the industry is selling context window size as the solution to a problem it fundamentally cannot solve: persistence.
The three real problems with AI context are not about window size. They're about reset, degradation, and cost. Bigger windows don't fix any of them.
Problem 1: Context windows reset
A million-token context window resets to zero when you start a new chat.
This is not a bug. It's the architecture. A context window is working memory for a single conversation. It holds everything the model can "see" during that session: your messages, the system prompt, uploaded files, the model's own responses. When the session ends, all of it disappears. The next conversation starts from nothing.
A larger context window lets you have a longer single conversation. You can paste more documents, discuss more topics, and go deeper before hitting the ceiling. That's genuinely useful for tasks that fit within one session: analyzing a long report, refactoring a large codebase, drafting a comprehensive document.
But it does nothing for the scenario that frustrates most people: continuing where you left off. Your Monday conversation and your Tuesday conversation are completely isolated from each other, regardless of whether the window is 8K tokens or 10M tokens. The AI on Tuesday has no knowledge that Monday's conversation ever happened.
Comparing context window sizes to solve the persistence problem is like comparing whiteboard sizes to solve the note-taking problem. It doesn't matter how big the whiteboard is if it gets erased every time you leave the room.
Problem 2: Bigger windows perform worse
This is the part the marketing materials don't mention. As context windows grow, accuracy degrades, even on simple retrieval tasks.
Chroma's "context rot" research, published in July 2025, tested 18 language models across increasing context lengths. The results were consistent across vendors: performance degrades as input grows, regardless of model size or architecture. Claude models decayed the slowest, but all models showed meaningful degradation.
The most striking result was GPT-4.1. On OpenAI's own MRCR benchmark (a multi-round coreference resolution test), GPT-4.1's accuracy dropped from 84% at 8K tokens to 50% at 1M tokens. That's coin-flip accuracy at the model's advertised capacity: the window accepts a million tokens, but the model sheds 34 points of accuracy, a 40% relative drop, when you actually fill it.
This wasn't a surprise to researchers. The "lost in the middle" problem, documented by Stanford and UC Berkeley in 2023, showed that language models recall information best from the beginning and end of their context, with a pronounced accuracy valley in the middle. The recall curve is U-shaped: if a key fact is buried in the middle of a long context, the model is significantly less likely to find it.
Google's own research on Gemini 2.5 Pro illustrates this tension. On the standard needle-in-a-haystack (NIAH) test, Gemini achieves over 99.7% recall at 1M tokens. Impressive. But on NoLiMa, a benchmark that removes literal word overlap between the question and the needle (the model must understand meaning, not just pattern-match), performance drops significantly. Google's own recommendation? Place key information near the beginning of the context, even with a million-token window.
The concept of a "Maximum Effective Context Window" captures this gap. Some models fail with as few as 100 tokens of distractor context. Most degrade significantly by 1,000 irrelevant tokens. The maximum effective context window, the point at which the model can still reliably use the information, falls short of the advertised maximum by over 99% in many cases.
In other words: a model might accept a million tokens, but it can't reliably process a million tokens. The number on the box is a theoretical maximum, not a practical one.
Problem 3: Bigger windows cost more
Context window utilization has a direct relationship with cost. More tokens in, more money out.
Zep's research on RAG versus long-context approaches quantified this: RAG query cost averages approximately $0.00008, while a long-context query (stuffing the full context into the window) costs approximately $0.10. That's a 1,250x cost difference per query.
For a single user running a handful of queries per day, this difference is noise. But for teams, applications, or anyone running AI workflows at scale, it's the difference between a viable product and an unsustainable one.
The economics get worse as context windows grow. A 10M-token window doesn't just cost more per query. It tempts users and developers into stuffing more context, which compounds both the cost problem and the degradation problem described above. You pay more and get less reliable results.
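The arithmetic is easy to run yourself. A minimal sketch using the per-query figures quoted above from Zep's research; the query volumes are illustrative assumptions, not data from the research:

```python
# Back-of-the-envelope cost projection using the per-query figures above.
# Query volumes (20/day solo, 10,000/day for an app) are illustrative.

RAG_COST_PER_QUERY = 0.00008        # ~$0.00008 per retrieval-augmented query
LONG_CONTEXT_COST_PER_QUERY = 0.10  # ~$0.10 per full-context query

def monthly_cost(queries_per_day: float, cost_per_query: float, days: int = 30) -> float:
    """Project a monthly bill from a daily query volume."""
    return queries_per_day * days * cost_per_query

# A single user: the difference is pocket change.
print(f"Solo, RAG:          ${monthly_cost(20, RAG_COST_PER_QUERY):.2f}/mo")           # ~$0.05
print(f"Solo, long-context: ${monthly_cost(20, LONG_CONTEXT_COST_PER_QUERY):.2f}/mo")  # ~$60

# An application at 10,000 queries/day: the difference is the budget.
print(f"App, RAG:           ${monthly_cost(10_000, RAG_COST_PER_QUERY):,.2f}/mo")           # ~$24
print(f"App, long-context:  ${monthly_cost(10_000, LONG_CONTEXT_COST_PER_QUERY):,.2f}/mo")  # ~$30,000
```

Same workload, roughly three orders of magnitude apart: the 1,250x per-query gap survives contact with any realistic volume.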
This is why 60% of production LLM applications use retrieval-augmented generation (RAG) rather than long-context approaches. The industry has already voted with its architecture: for anything beyond a single-session task, loading only the relevant information is cheaper, faster, and more accurate than loading everything.
The real problem is persistence, not capacity
Step back from the technical details and the pattern is clear. People don't want bigger context windows per se. They want their AI to know things across conversations. They want to explain something once and have the AI remember it. They want continuity.
Bigger context windows are the wrong tool for this. They address capacity within a single session while the actual need is persistence across sessions. It's like solving a storage problem by buying a faster processor. The processing is fine. It's the storage that's missing.
ChatGPT's memory feature tried to solve persistence, but it caps out at roughly 1,500 words and has suffered multiple wipes. Claude's compacting reduces earlier conversation content to summaries, preserving some continuity within long sessions but losing detail and nuance. Neither approach provides reliable, lossless, user-controlled persistence.
The solution that actually works is the simplest one: store the knowledge outside the AI entirely.
External knowledge bases: the architecture that works
An external knowledge base inverts the relationship between you and the AI. Instead of pushing context into the AI and hoping it retains enough, you maintain documents that the AI can pull from whenever it needs them.
Your company overview, product roadmap, style guide, meeting notes, project status, and personal preferences all live in documents you control. When the AI needs information, it reads the relevant document. When information changes, you update the document, or have the AI update it for you. The knowledge doesn't disappear between sessions because it was never stored in the session to begin with.
This is not a niche pattern. The Model Context Protocol (MCP), launched by Anthropic in November 2024, provides the standardized connection layer. Within its first year, MCP was adopted by OpenAI, Google, and Microsoft. It joined the Linux Foundation's Agentic AI Foundation in December 2025. The ecosystem has grown to over 16,000 servers with 97 million monthly SDK downloads.
The industry is converging on this architecture because it solves all three problems that bigger context windows cannot.
No reset. Documents persist independently of any conversation. Start a new chat, and the AI can read the same documents it read yesterday, last week, or last month.
No degradation. Instead of stuffing everything into the context window and hoping the model can find what it needs, you retrieve only the relevant documents. The model processes a focused, manageable amount of context. No lost-in-the-middle problem. No accuracy decay from window utilization.
Lower cost. Reading a specific document on demand costs a fraction of loading your entire knowledge base into every conversation. You pay for what you use, not for everything you might need.
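The "retrieve only what's relevant" step can be sketched in a few lines. This is a deliberately naive keyword-overlap scorer standing in for a real retrieval pipeline (embeddings, reranking); the documents and function names are illustrative:

```python
import re

# A toy knowledge base: documents that persist outside any conversation.
knowledge_base = {
    "roadmap.md": "Q2 goals: ship the API migration, launch billing, hire two engineers.",
    "style-guide.md": "Prefer short sentences. Avoid jargon. Always use active voice.",
    "meeting-notes.md": "Standup 3 June: API migration blocked on the auth team.",
}

def tokenize(text: str) -> set[str]:
    """Lowercase word set; crude, but enough to illustrate scoring."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query: str, docs: dict[str, str], top_k: int = 1) -> list[str]:
    """Return the top_k document names sharing the most words with the query."""
    query_words = tokenize(query)
    ranked = sorted(
        docs,
        key=lambda name: len(query_words & tokenize(docs[name])),
        reverse=True,
    )
    return ranked[:top_k]

# Only the relevant document enters the context window; the rest stay on disk.
print(retrieve("which team is blocking the migration?", knowledge_base))
# → ['meeting-notes.md']
```

The model then sees one focused document instead of the whole library: no middle to get lost in, and far fewer tokens on the bill.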
What this looks like in practice
Unmarkdown is a markdown document platform with a built-in MCP server. You write documents in markdown, the same format AI tools natively understand. You connect to Claude, and every conversation has access to your full document library.
The MCP integration gives Claude seven tools: create, list, read, update, publish, convert, and check usage. In practice, this means conversations like:
"Read my product roadmap and tell me what's at risk for the Q2 deadline."
The AI reads your actual roadmap, with the real priorities, timelines, and dependencies you've documented. Not a summary. Not a compressed approximation. The full document.
"Update the project status with today's standup notes. The API migration is blocked on the auth team."
The AI opens the document, adds the update, and saves it. Tomorrow's conversation will see the current status.
"Check the style guide and rewrite this paragraph to match our tone."
The AI reads your documented style preferences and applies them. No need to re-explain that you prefer short sentences, avoid jargon, and always use active voice.
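The read/update loop behind these requests reduces to a pair of tool handlers. Everything below is a sketch of the pattern, not Unmarkdown's actual MCP interface: the class and method names are hypothetical, and an in-memory dict stands in for real persistent storage. The point is only that state lives in the document, not in any one chat:

```python
# Hypothetical document store standing in for an MCP server's read/update tools.
class DocumentStore:
    def __init__(self) -> None:
        self._docs: dict[str, str] = {}

    def read(self, name: str) -> str:
        """Tool: return a document's full current contents (empty if absent)."""
        return self._docs.get(name, "")

    def update(self, name: str, new_section: str) -> None:
        """Tool: append an update; it persists for every future session."""
        self._docs[name] = self.read(name) + new_section

store = DocumentStore()

# Monday's conversation: the AI records a status update via the update tool.
store.update("project-status.md", "## Standup\nAPI migration blocked on the auth team.\n")

# Tuesday's conversation: a brand-new session reads the same document.
assert "blocked on the auth team" in store.read("project-status.md")
```

Monday's context window is gone by Tuesday, but the document isn't; that asymmetry is the whole architecture.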
The setup takes about two minutes. Claude's integration guide covers the three connection methods: OAuth on claude.ai, config file on Claude Desktop, and CLI command on Claude Code.
The compounding advantage
The people who build this habit now will have a structural advantage that grows over time.
Every document you add makes every future conversation smarter. A product roadmap document means the AI can give roadmap-informed advice in any context, from engineering discussions to stakeholder updates. A meeting notes document means past decisions are never lost, never misremembered, never summarized away. A style guide means consistent output without repeated instructions.
The knowledge accumulates. It doesn't evaporate when you close a tab, get truncated when a conversation runs long, or vanish in a backend wipe. It grows with your work, and every conversation benefits from the full accumulated knowledge.
This is what context engineering actually means: intentionally structuring and maintaining the information your AI needs to be useful. Not fighting the limitations of context windows, but working around them entirely.
Stop waiting for bigger windows
The next model generation will ship with an even bigger context window. The marketing will say it's a breakthrough. It will be useful for certain single-session tasks. And it will still reset to zero when you start a new conversation. It will still degrade in the middle. It will still cost more per query.
The persistence problem isn't a context window problem. It's a knowledge management problem. And the solution is already available: build a knowledge base, connect it to your AI, and stop re-explaining yourself.
You don't need a bigger whiteboard. You need a filing cabinet.
Start with Unmarkdown. Write three documents covering your most frequently re-explained context. Connect to Claude via the integration guide. Ask Claude to read a document and answer a question based on it. You'll feel the difference in the first conversation.
For developers who want to build their own integrations, the full API and MCP documentation provides programmatic access to the same document operations.
