Context Engineering in 2026: Compaction, Memory & Cost
Every long agent session eventually breaks: the assistant that swore it would never push to main does exactly that forty turns later. The model did not get dumber; its context did. We measured how to fix that on Towards AI’s open-source AI tutor, across big cloud models and cheap local ones.
Try the AI tutor
The live production tutor, embedded below. Ask it about RAG, agents, embeddings, or anything from the course corpus. It grounds answers in a curated knowledge base and shows its sources.
The problem, by the numbers
Context engineering exists because of these magnitudes: a finite window, a long lesson, retrieval payloads that dwarf the chat, and sessions that grow for dozens of turns.
[Chart: context windows, in tokens]
[Chart: what goes into one turn, in tokens]
[Chart: conversation length tested, in turns]
Experiments and results
Each study isolates one question and reports it on the same lesson and question set. Open any card for the interactive results.
Gemini 3.5 Flash: the full eval program
On a large model with a big window and prompt caching, which context strategy gives the best answers for the least cost, tokens, and latency?
The naive baseline wins. Up to about 13 turns, keeping the full history is the cheapest and fastest option and gives the best memory. Compaction saves tokens but often costs more in dollars because it breaks the cache, and it drops the oldest facts. Retrieval payloads (not chat history) dominate the token bill, a stored profile wins on personalization, and GraphRAG does not beat classical RAG.
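The claim that compaction can send fewer tokens yet cost more dollars is easier to see with a toy cost model. The sketch below is not the workshop's evaluation code: the function names, the price, and the token counts are all hypothetical, and it assumes that a rewritten summary invalidates the whole cached prefix while keep-all reuses everything sent on the previous turn. Only the roughly 10x cache discount is taken from the results on this page.

```python
def keep_all_cost(turns, base_tokens, turn_tokens, price, cache_discount):
    """Return (input_tokens, dollars) when the full history is resent every turn.

    Assumption: the prefix already sent on the previous turn is billed at
    price / cache_discount, and only the newly added tokens pay full price.
    """
    tokens, dollars = 0, 0.0
    for t in range(1, turns + 1):
        prompt = base_tokens + t * turn_tokens
        cached = prompt - turn_tokens if t > 1 else 0  # prefix seen last turn
        fresh = prompt - cached
        tokens += prompt
        dollars += fresh * price + cached * price / cache_discount
    return tokens, dollars


def compaction_cost(turns, base_tokens, turn_tokens, summary_tokens, price):
    """Return (input_tokens, dollars) when older turns are rewritten into a summary.

    Assumption: the summary changes every turn and invalidates the cached
    prefix, so every token pays full price. The cost of the extra
    summarization call is ignored, which favors compaction.
    """
    tokens, dollars = 0, 0.0
    for t in range(1, turns + 1):
        history = min(summary_tokens, (t - 1) * turn_tokens)
        prompt = base_tokens + history + turn_tokens
        tokens += prompt
        dollars += prompt * price
    return tokens, dollars


PRICE = 1e-6  # hypothetical dollars per input token
keep = keep_all_cost(turns=13, base_tokens=20_000, turn_tokens=500,
                     price=PRICE, cache_discount=10)
compact = compaction_cost(turns=13, base_tokens=20_000, turn_tokens=500,
                          summary_tokens=1_000, price=PRICE)
print("keep-all:", keep)
print("compaction:", compact)
```

With these made-up numbers, compaction sends fewer input tokens but pays more, because a large retrieval payload is billed at full price on every turn. Real providers cache at different granularities, so treat the gap as illustrative only.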
Cloud · DeepSeek · F25 + F26
DeepSeek V4-Flash: cost and long horizon
Does a cheaper model with stronger caching change the compaction story, even when sessions get long enough that keep-all should finally lose?
No, it sharpens it. DeepSeek's roughly 50x cache discount makes keeping everything the cheapest arm even at 36 turns, undercutting every compaction method (each of which breaks the cache). Keep-all also resolves contradictions perfectly where summarization fails. Per turn, it runs about 10-15x cheaper than Gemini.
Cloud · provider comparison
DeepSeek vs Gemini
Same context strategies, two providers: how much do the model and its caching change the conclusion?
The ranking is the same on both (keep-all wins cost and memory), but the economics differ. DeepSeek's 50x cache discount beats Gemini's roughly 10x, so keep-all's cost lead is larger and holds further out, and per-turn cost is about 10-15x lower. The lesson is provider-independent; it just gets cheaper the stronger the cache.
Local · Ollama · F24 + F25
Small local models (SLMs)
On a cheap local model with a small window and no caching, what is the best way to survive a long lesson: keep it, compact it, or retrieve it?
On a 32k local model the lesson does not fit, so keep-everything is not even an option (the runtime evicts it). For fetching a document, RAG wins on every model. For compacting a growing chat, no single method wins (it depends on the model), and the model's own capability matters more than the strategy.
Local · qwen2.5:32b
Does a bigger local model change it?
If keep-everything breaks on a small local model, does a 4-5x bigger local model fix it?
No. RAG still wins the document task at 100% on the 32B model; the win comes from retrieving the right ~3k tokens, not from model size. Conversation memory stays pinned in a 27-40% band whatever the method, because the 32k window, not capability, is the binding constraint. A bigger model answers better per turn but does not change the verdict, and it is slower.
Cloud + Local · DeepSeek + qwen2.5:32b
Context rot: keep-all can lose on quality
Keep-everything fits and is cheap under caching. But does it quietly answer worse when the fact you need is buried in a big context?
Yes, and this is the one axis where keep-everything can lose even when nothing is truncated and nothing is expensive. On a ~1M-token DeepSeek window, a fact buried in the middle of a long context is recovered only ~40% of the time at 200k+, vs ~96% when it leads. On a 32k local model it is a hard cliff: 100% inside the window, 0% the instant it overflows. Retrieval restores it, but the retriever matters.
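A buried-fact probe of this kind can be sketched in a few lines. This is a minimal illustration, not the harness behind the numbers above: `build_probe`, `recall_rate`, and the `ask` callable (any function that sends a prompt to a model and returns its text) are hypothetical names, and the scoring is a plain substring match.

```python
def build_probe(filler, needle, depth):
    """Insert `needle` into a list of filler paragraphs at a relative depth.

    depth=0.0 puts the fact first, 0.5 in the middle, 1.0 last.
    """
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    position = int(depth * len(filler))
    return filler[:position] + [needle] + filler[position:]


def recall_rate(ask, filler, needle, question, expected, depth, trials=5):
    """Fraction of trials in which the model's answer contains `expected`.

    `ask` is a caller-supplied function: prompt string in, answer string out.
    """
    prompt = "\n\n".join(build_probe(filler, needle, depth)) + "\n\n" + question
    hits = sum(expected.lower() in ask(prompt).lower() for _ in range(trials))
    return hits / trials


# Usage with a stand-in model; swap `fake_model` for a real client call.
paragraphs = [f"Filler paragraph number {i}." for i in range(10)]
fact = "The deploy passphrase is 4417."


def fake_model(prompt):
    return "The passphrase is 4417."


middle = recall_rate(fake_model, paragraphs, fact,
                     "What is the deploy passphrase?", "4417", depth=0.5)
```

Sweeping `depth` and the amount of filler against a real model is what would expose a position-dependent drop; the stand-in model here always answers correctly, so it only checks the plumbing.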
Synthesis · all models
Cross-model: what to actually do
Across a large cloud model, a cheap API model, and tiny local ones, what is the best context strategy for cost, latency, and quality?
There are two regimes. With a big window and caching (Gemini, DeepSeek), keep everything: it is cheapest, fastest, and best on memory, and compaction must justify itself. With a small local window (SLMs), you cannot keep everything (the runtime evicts it), so you must compact or retrieve. There, RAG wins document tasks, and the model's capability matters more than the method. Cost spans three orders of magnitude.
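The two regimes reduce to a small decision rule. The sketch below is one hedged reading of it: `choose_strategy` and its labels are hypothetical, and it only models whether the session fits in the window. The cost argument for keep-all also depends on prompt caching, which this function does not model.

```python
def choose_strategy(session_tokens, window_tokens, task="chat"):
    """Pick a context strategy from whether the session fits the window.

    task is "document" for fetch-a-document questions, anything else for
    a growing conversation.
    """
    if session_tokens <= window_tokens:
        # Big-window regime: nothing is evicted, so keep the full history.
        return "keep_all"
    if task == "document":
        # Small-window regime, document task: retrieval won in the results above.
        return "retrieve"
    # Small-window regime, growing chat: no single method won.
    return "compact_or_retrieve"


print(choose_strategy(session_tokens=20_000, window_tokens=1_000_000))
print(choose_strategy(session_tokens=60_000, window_tokens=32_000, task="document"))
print(choose_strategy(session_tokens=60_000, window_tokens=32_000))
```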
About this workshop
Context engineering is deciding what the model sees on every single call: instructions, history, retrieved course content, memory, and tool outputs. It is the line between a tutor that holds a coherent session and one that forgets the student halfway through. We show it on Towards AI’s open-source AI tutor for our AI-engineering courses: the compaction toolkit, memory that survives across sessions, and production retrieval.
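As a rough illustration of "deciding what the model sees", the sketch below assembles one call from those five ingredients under a token budget. It is not the tutor's implementation: `TurnContext`, `assemble`, the whitespace token count, and the priority order (drop the oldest history first, never drop instructions, memory, retrieved content, or tool outputs) are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class TurnContext:
    instructions: str
    memory: list = field(default_factory=list)        # facts kept across sessions
    retrieved: list = field(default_factory=list)     # course content for this turn
    tool_outputs: list = field(default_factory=list)
    history: list = field(default_factory=list)       # chat messages, oldest first


def count_tokens(text):
    """Crude stand-in for a real tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def assemble(ctx, budget):
    """Return the ordered list of text parts to send, within `budget` tokens."""
    fixed = [ctx.instructions, *ctx.memory, *ctx.retrieved, *ctx.tool_outputs]
    used = sum(count_tokens(part) for part in fixed)
    if used > budget:
        raise ValueError("fixed parts alone exceed the token budget")
    kept = []
    for message in reversed(ctx.history):  # walk from the newest message back
        cost = count_tokens(message)
        if used + cost > budget:
            break
        kept.append(message)
        used += cost
    kept.reverse()  # restore chronological order
    return fixed + kept


ctx = TurnContext(
    instructions="be helpful",
    memory=["student prefers python"],
    history=["a b c", "d e", "f"],
)
parts = assemble(ctx, budget=8)
```

Here the oldest message, "a b c", is the one that gets dropped. Keeping the fixed parts first and in a stable order is a deliberate choice in this sketch, since an unchanged prefix is what prompt caches can reuse.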
Everything here is measured, not vibe-checked: tokens, cost, latency, and memory probes across more than a thousand runs. At real volume even Gemini Flash got expensive, so we tested whether a cheaper API model (DeepSeek) and free local models could match the quality for a fraction of the cost. Everything is open source.