
Why Your AI Forgets: Understanding Tokens and Context Windows

April 8, 2026 3 min read

What happens when you paste a 50-page document into ChatGPT and ask it to summarize the last section, but the response seems to miss everything after page 30? You've hit a context limit - and understanding why this happens is one of the most practically useful things anyone who uses AI tools can know.

The Unit of AI Memory: Tokens

Every AI model reads text in chunks called tokens - not words, not characters, but something in between. In English, a token averages roughly 3/4 of a word. A short, common word is usually a single token; a longer one like "extraordinary" might be split into two or more, and the exact split depends on the model's tokenizer. Punctuation and whitespace count toward the total too. A 1,000-word article is roughly 1,300 tokens. A 300-page book runs close to 100,000 tokens.
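The 3/4-of-a-word rule of thumb can be turned into a quick estimator. This is a rough sketch, not a real tokenizer: the function name estimate_tokens and the WORDS_PER_TOKEN constant are illustrative assumptions based on the ratio quoted here, and actual counts vary by model and language.

```python
import math

WORDS_PER_TOKEN = 0.75  # rule of thumb from the text; real tokenizers vary


def estimate_tokens(text: str) -> int:
    """Roughly estimate the token count from whitespace-separated words."""
    word_count = len(text.split())
    return math.ceil(word_count / WORDS_PER_TOKEN)


article = "word " * 1000  # stand-in for a 1,000-word article
print(estimate_tokens(article))  # 1334, in line with the ~1,300 figure
```

For anything where the exact count matters, such as billing or hard limits, use the tokenizer published for the specific model rather than a word-based estimate.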

Every model has a fixed ceiling - a context window - for how many tokens it can process at once. That ceiling covers everything: what you send (your question, pasted documents, the entire conversation history) and what the model sends back. Hit the ceiling and the application in front of the model either refuses to respond, cuts off your input, or quietly drops the oldest parts of your conversation.
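The "quietly drops the oldest parts" behavior can be sketched in a few lines. This is a simplified illustration, not how any particular chatbot is implemented: trim_history, its parameters, and the word-based token estimate are all assumptions made for the example.

```python
def estimate_tokens(text: str) -> int:
    """Rough estimate: about 4 tokens for every 3 words, rounded up."""
    words = len(text.split())
    return (words * 4 + 2) // 3


def trim_history(messages: list[str], context_window: int, reply_budget: int) -> list[str]:
    """Drop the oldest messages until the rest fit alongside the reply."""
    input_budget = context_window - reply_budget
    kept = list(messages)
    while kept and sum(estimate_tokens(m) for m in kept) > input_budget:
        kept.pop(0)  # the oldest message is the first to go
    return kept


history = [
    "Style guide: write in a warm, concise brand voice.",
    "Draft a product announcement for the spring launch.",
    "Now shorten it to two sentences.",
]
# A deliberately tiny window so the effect is visible.
trimmed = trim_history(history, context_window=30, reply_budget=10)
print(trimmed)  # the style guide is no longer in the list
```

Note that the first message to disappear is the style guide, which is exactly the kind of early instruction a long conversation tends to lose.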

Current context windows vary widely:

GPT-4o: 128,000 tokens (about a 350-page book)
Claude 3.7 Sonnet: 200,000 tokens (about 550 pages)
Gemini 1.5 Pro: up to 1,000,000 tokens
GPT-3.5: 16,000 tokens (about 45 pages)
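As a quick arithmetic check, the figures in this list can be used to see which models could hold a given document plus a reply. This is a sketch under stated assumptions: the dictionary simply copies the numbers above, and models_that_fit is a hypothetical helper, not part of any vendor's API.

```python
# Context windows as listed above, in tokens.
CONTEXT_WINDOWS = {
    "GPT-4o": 128_000,
    "Claude 3.7 Sonnet": 200_000,
    "Gemini 1.5 Pro": 1_000_000,
    "GPT-3.5": 16_000,
}


def models_that_fit(input_tokens: int, reply_tokens: int) -> list[str]:
    """Return the models whose window holds the input plus the reply."""
    needed = input_tokens + reply_tokens
    return [name for name, window in CONTEXT_WINDOWS.items() if needed <= window]


# A 300-page book at roughly 100,000 tokens, leaving 2,000 tokens for the answer.
print(models_that_fit(100_000, 2_000))
# ['GPT-4o', 'Claude 3.7 Sonnet', 'Gemini 1.5 Pro']
```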

Bigger Windows Don't Guarantee Better Performance

A million-token window sounds like a solution to everything, but there's a catch. Researchers have identified what they call "lost in the middle" - the tendency for models to pay close attention to text at the very beginning and end of their context window, while losing track of details buried in the middle. Send a model a 500-page document and details on pages 200-300 may get less attention than the opening and closing sections.

For everyday use, this creates predictable failure patterns. When you're running a long research session in ChatGPT or Claude, your early instructions eventually scroll out of context. This is why a chatbot that wrote in your exact brand voice for the first 20 messages might drift into generic language by message 35 - your style guide from the start of the conversation is gone.

Workarounds that help: paste your most important context (style rules, constraints, key facts) at the beginning of every session rather than pasting it once and assuming it sticks. For long documents, analyze in sections rather than pasting everything at once. For coding work, purpose-built tools like Cursor or Claude Code manage context more intelligently - loading only the files relevant to a task rather than your entire repository.
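Analyzing a long document in sections can be as simple as grouping paragraphs under a token budget. The sketch below is one possible approach, assuming paragraphs are separated by blank lines; split_into_sections and the word-based estimate are illustrative names and simplifications.

```python
def estimate_tokens(text: str) -> int:
    """Rough estimate: about 4 tokens for every 3 words, rounded up."""
    words = len(text.split())
    return (words * 4 + 2) // 3


def split_into_sections(text: str, max_tokens: int) -> list[str]:
    """Group paragraphs into sections that each stay under max_tokens.

    A single paragraph larger than max_tokens becomes its own oversized
    section; this sketch does not split inside a paragraph.
    """
    sections, current, current_tokens = [], [], 0
    for paragraph in text.split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        tokens = estimate_tokens(paragraph)
        if current and current_tokens + tokens > max_tokens:
            sections.append("\n\n".join(current))
            current, current_tokens = [], 0
        current.append(paragraph)
        current_tokens += tokens
    if current:
        sections.append("\n\n".join(current))
    return sections


# Six short paragraphs of about 43 estimated tokens each.
document = "\n\n".join(f"Paragraph {i}: " + "word " * 30 for i in range(6))
sections = split_into_sections(document, max_tokens=100)
print(len(sections))  # 3
```

Each section can then be sent with the same short instructions, so the important context sits at the start of every request instead of far back in a long conversation.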

Tokens also drive cost on paid APIs. Most AI APIs charge per token, for both input and output, so sending the same 50-page document with every API request adds up fast. This is the main reason RAG (retrieval-augmented generation) has become standard in production AI applications. Instead of stuffing an entire knowledge base into every prompt, a RAG system fetches only the few most relevant passages for each question, for example the top three, and sends just those.
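The retrieval step can be illustrated with a toy example. Production RAG systems typically rank passages by embedding similarity; the sketch below swaps in plain word overlap so it runs with no external services. The knowledge base, the score and retrieve helpers, and the token estimate are all made up for illustration.

```python
import re


def words(text: str) -> set[str]:
    """Lowercased set of the words in a text, ignoring punctuation."""
    return set(re.findall(r"\w+", text.lower()))


def score(query: str, chunk: str) -> int:
    """Count distinct query words that also appear in the chunk."""
    return len(words(query) & words(chunk))


def retrieve(query: str, chunks: list[str], k: int = 3) -> list[str]:
    """Return the k chunks with the highest overlap score."""
    return sorted(chunks, key=lambda c: score(query, c), reverse=True)[:k]


def estimate_tokens(text: str) -> int:
    """Rough estimate: about 4 tokens for every 3 words, rounded up."""
    return (len(text.split()) * 4 + 2) // 3


knowledge_base = [
    "Refunds are issued within 14 days of a return request.",
    "Shipping takes three to five business days.",
    "Our office is closed on public holidays.",
    "A return request must include the order number.",
    "Gift cards cannot be exchanged for cash.",
]
query = "How long does a refund take after a return request?"

top = retrieve(query, knowledge_base, k=2)
full_tokens = estimate_tokens(" ".join(knowledge_base))
sent_tokens = estimate_tokens(" ".join(top))
print(top)
print(f"{sent_tokens} of {full_tokens} estimated tokens sent")
```

Because billing is per token, the saving scales with the size of the knowledge base: the prompt carries only the retrieved passages no matter how large the full collection grows.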
