Towards AI · Workshop

Context Engineering in 2026: Compaction, Memory & Cost

Every long agent session eventually breaks: the assistant that swore it would never push to main does exactly that forty turns later. The model did not get dumber; its context did. We measured how to fix that on Towards AI’s open-source AI tutor, across big cloud models and cheap local ones.

Try the AI tutor

The live production tutor, embedded below. Ask it about RAG, agents, embeddings, or anything from the course corpus. It grounds answers in a curated knowledge base and shows its sources.

The problem, by the numbers

Context engineering exists because of these magnitudes: a finite window, a long lesson, retrieval payloads that dwarf the chat, and sessions that grow for dozens of turns.

Context windows (tokens)

Local SLM (8B): 32k (the lesson does not fit)

Gemini / DeepSeek: ~1M (the lesson is a rounding error)

What goes into one turn (tokens)

One course lesson: 37.7k (overflows a 32k window)

Retrieval payload per turn (Gemini): ~200k (where the tokens actually are, F1)

Cached prefix at 36 turns (DeepSeek): 1.78M (~97% cache-hit, so it is cheap)

Conversation length tested (turns)

Short / medium sessions: 13 (keep-all still wins here)

Contradiction tier: 22 (plant a fact, update it, probe)

Long-horizon tier: 36 (keep-all still cheapest, DeepSeek)
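
The magnitudes above reduce to simple arithmetic. The sketch below only restates the quoted figures; the constant names are illustrative, and the 1M window is the source's approximate value rounded to a round number.

```python
# Figures quoted in the cards above. The cloud window is "~1M" in the source,
# rounded here for illustration; the names are hypothetical.
LOCAL_WINDOW = 32_000
CLOUD_WINDOW = 1_000_000
LESSON_TOKENS = 37_700
RETRIEVAL_PAYLOAD = 200_000  # "~200k" per turn on Gemini


def headroom(window_tokens: int, used_tokens: int) -> int:
    """Tokens left in the window; a negative value means overflow."""
    return window_tokens - used_tokens


def window_share(window_tokens: int, used_tokens: int) -> float:
    """Fraction of the window consumed by the given payload."""
    return used_tokens / window_tokens


print(headroom(LOCAL_WINDOW, LESSON_TOKENS))                   # -5700
print(f"{window_share(CLOUD_WINDOW, LESSON_TOKENS):.1%}")      # 3.8%
print(f"{window_share(CLOUD_WINDOW, RETRIEVAL_PAYLOAD):.1%}")  # 20.0%
```

The lesson alone overshoots the 32k window by 5,700 tokens before any question or answer is counted, while on a 1M window it takes under 4% of the space.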

Experiments and results

Each study isolates one question and reports it on the same lesson and question set. Open any card for the interactive results.

Cloud · Gemini 3.5 · F1-F23

Gemini 3.5 Flash: the full eval program

On a large model with a big window and prompt caching, which context strategy gives the best answers for the least cost, tokens, and latency?

The naive baseline wins. Up to about 13 turns, keeping the full history is cheapest, fastest, AND best on memory. Compaction saves tokens but often costs more dollars because it breaks the cache, and it drops the oldest facts. Retrieval payloads (not chat history) dominate the token bill, a stored profile wins personalization, and GraphRAG does not beat classical RAG.
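
Why compaction can send fewer tokens yet cost more is easier to see in a toy cost model. The sketch below is not the eval harness: it assumes a perfect cache hit on the unchanged prefix, a summary that is rewritten every turn (so it never hits the cache), and it ignores output tokens, retrieval payloads, and the cost of the summarizer call itself. All parameter values are made up for illustration, and costs are in full-price input-token equivalents.

```python
def keep_all_cost(turns: int, tokens_per_turn: int, cache_discount: float) -> float:
    """Billed cost when the full history is resent and the old prefix is cached."""
    total = 0.0
    for k in range(turns):
        cached_prefix = k * tokens_per_turn  # earlier turns, byte-identical, so cacheable
        total += cached_prefix / cache_discount + tokens_per_turn
    return total


def rolling_summary_cost(turns: int, tokens_per_turn: int, summary_tokens: int) -> float:
    """Billed cost when history is replaced by a summary rewritten each turn."""
    total = 0.0
    for k in range(turns):
        summary = summary_tokens if k > 0 else 0  # new text every turn: no cache hit
        total += summary + tokens_per_turn
    return total


def keep_all_tokens_sent(turns: int, tokens_per_turn: int) -> int:
    """Raw tokens sent by keep-all, before any cache discount."""
    return sum((k + 1) * tokens_per_turn for k in range(turns))


# Hypothetical session: 13 turns, 1,000 tokens per turn, 10x cache discount,
# 2,000-token rolling summary.
print(keep_all_tokens_sent(13, 1_000))        # 91000 tokens sent
print(keep_all_cost(13, 1_000, 10))           # 20800.0 billed
print(rolling_summary_cost(13, 1_000, 2_000)) # 37000.0 sent and billed
```

Under these assumptions keep-all sends more than twice the tokens but is billed for roughly half as much, which is the shape of the result reported here. Different parameters can flip the outcome, so treat it as an illustration of the mechanism, not a prediction.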

Cloud · DeepSeek · F25 + F26

DeepSeek V4-Flash: cost and long horizon

Does a cheaper model with stronger caching change the compaction story, even when sessions get long enough that keep-all should finally lose?

No, it sharpens it. DeepSeek's roughly 50x cache discount makes keeping everything the cheapest arm even at 36 turns, undercutting every compaction method (each of which breaks the cache). Keep-all also resolves contradictions perfectly where summarization fails. Per turn it runs about 10-15x cheaper than Gemini.

Cloud · provider comparison

DeepSeek vs Gemini

Same context strategies, two providers: how much do the model and its caching change the conclusion?

The ranking is the same on both (keep-all wins cost and memory), but the economics differ. DeepSeek's 50x cache discount beats Gemini's roughly 10x, so keep-all's cost lead is larger and holds further out, and per-turn cost is about 10-15x lower. The lesson is provider-independent; it just gets cheaper the stronger the cache.
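
The effect of the two discounts can be put in one line of arithmetic. A minimal sketch, assuming the ~97% cache-hit rate reported for DeepSeek applies to both providers (an assumption made only for comparison) and treating the discounts as exactly 50x and 10x:

```python
def effective_multiplier(hit_rate: float, cache_discount: float) -> float:
    """Fraction of the full input price actually paid per prompt token.

    Cached tokens are billed at 1/cache_discount of the price, the rest in full.
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate / cache_discount + (1.0 - hit_rate)


strong_cache = effective_multiplier(0.97, 50)  # 50x discount
weak_cache = effective_multiplier(0.97, 10)    # 10x discount
print(f"{strong_cache:.4f} {weak_cache:.4f}")  # 0.0494 0.1270
```

At the same hit rate the stronger discount pays about 5% of list price per prompt token versus about 13%. This covers only the caching term; base prices also differ between providers, so it does not by itself account for the per-turn gap quoted above.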

Local · Ollama · F24 + F25

Small local models (SLMs)

On a cheap local model with a small window and no caching, what is the best way to survive a long lesson: keep it, compact it, or retrieve it?

On a 32k local model the lesson does not fit, so keep-everything is not even an option (the runtime evicts it). For fetching a document, RAG wins on every model. For compacting a growing chat, no single method wins (it depends on the model), and the model's own capability matters more than the strategy.
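
What "the runtime evicts it" means for a planted fact can be sketched as follows. The policy shown (drop the oldest messages first until the rest fits) is an assumption for illustration, not a description of any particular runtime, and the token counts are invented.

```python
def evict_oldest(messages: list[tuple[str, int]], window_tokens: int) -> list[tuple[str, int]]:
    """Keep the most recent (text, token_count) messages that fit in the window."""
    kept = []
    used = 0
    for text, tokens in reversed(messages):  # walk from newest to oldest
        if used + tokens > window_tokens:
            break                            # everything older is dropped
        kept.append((text, tokens))
        used += tokens
    kept.reverse()                           # restore chronological order
    return kept


# Hypothetical session: one early fact, then forty 1,000-token turns.
fact = ("FACT: the exam is on Friday", 20)
session = [fact] + [(f"turn {i}", 1_000) for i in range(40)]

small = evict_oldest(session, 32_000)
large = evict_oldest(session, 1_000_000)
print(fact in small, fact in large)  # False True
```

On the small window the early fact is silently gone, with no error raised, while the large window still holds it.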

Local · qwen2.5:32b

Does a bigger local model change it?

If keep-everything breaks on a small local model, does a 4-5x bigger local model fix it?

No. RAG still wins the document task at 100% on the 32B: the win comes from retrieving the right ~3k tokens, not from model size. Conversation memory stays pinned in a 27-40% band whatever the method, because the 32k window, not capability, is the binding constraint. A bigger model answers better per turn but does not change the verdict, and it is slower.
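
Selecting "the right ~3k tokens" amounts to packing retrieved chunks into a token budget. The sketch below is a greedy packer under stated assumptions: the `Chunk` type, the relevance scores, and the token counts are all hypothetical, and the tutor's production retriever may work differently.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    score: float  # retriever relevance, higher is better (hypothetical scale)
    tokens: int   # token count, assumed to be precomputed


def pack_chunks(chunks: list[Chunk], budget_tokens: int) -> tuple[list[Chunk], int]:
    """Greedily take the highest-scoring chunks that still fit in the budget."""
    picked: list[Chunk] = []
    used = 0
    for chunk in sorted(chunks, key=lambda c: c.score, reverse=True):
        if used + chunk.tokens <= budget_tokens:
            picked.append(chunk)
            used += chunk.tokens
    return picked, used


candidates = [
    Chunk("section on embeddings", 0.9, 1_500),
    Chunk("section on rerankers", 0.8, 1_200),
    Chunk("section on chunking", 0.7, 1_000),
    Chunk("course logistics", 0.6, 900),
]
picked, used = pack_chunks(candidates, budget_tokens=3_000)
print([c.text for c in picked], used)
```

With a 3,000-token budget this keeps the two best chunks (2,700 tokens) and skips the rest, leaving most of a 32k window free for the conversation.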

Cloud + Local · DeepSeek + qwen2.5:32b

Context rot: keep-all can lose on quality

Keep-everything fits and is cheap under caching. But does it quietly answer worse when the fact you need is buried in a big context?

Yes, and this is the one axis where keep-everything can lose even when nothing is truncated and nothing is expensive. On a ~1M-token DeepSeek window, a fact buried in the middle of a long context is recovered only ~40% of the time at 200k+ tokens, versus ~96% when it sits at the start of the context. On a 32k local model it is a hard cliff: 100% inside the window, 0% the instant it overflows. Retrieval restores it, but the retriever matters.
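
A position probe of this kind can be built by inserting one fact at a chosen depth in filler text and then checking whether the answer contains it. The sketch below is an assumed construction, not the workshop's harness: `bury_fact`, `recalled`, and the substring scoring rule are hypothetical.

```python
def bury_fact(filler: list[str], fact: str, depth: float) -> list[str]:
    """Insert `fact` at a relative depth: 0.0 leads the context, 1.0 ends it."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0 and 1")
    index = round(depth * len(filler))
    return filler[:index] + [fact] + filler[index:]


def recalled(answer: str, expected: str) -> bool:
    """Crude scoring rule: case-insensitive substring match."""
    return expected.lower() in answer.lower()


filler = [f"Filler paragraph {i}." for i in range(10)]
fact = "The exam is on Friday."

leading = bury_fact(filler, fact, 0.0)
middle = bury_fact(filler, fact, 0.5)
print(leading.index(fact), middle.index(fact))  # 0 5
```

Running the same question against the `leading` and `middle` variants at increasing filler sizes, and averaging `recalled` over many facts, gives a recovery rate per position and context length.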

Synthesis · all models

Cross-model: what to actually do

Across a large cloud model, a cheap API model, and tiny local ones, what is the best context strategy for cost, latency, and quality?

There are two regimes. With a big window and caching (Gemini, DeepSeek), keep everything: it is cheapest, fastest, and best on memory, and compaction must justify itself. With a small local window (SLMs), you cannot keep everything (the runtime evicts it), so you must compact or retrieve. There, RAG wins document tasks, and the model's capability matters more than the method. Cost spans three orders of magnitude.
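
The two regimes can be written down as a small decision rule. This is a sketch of the summary above rather than a tested policy: the function name, the output reserve, and the third branch (fits but no cache, which the two regimes do not cover) are assumptions.

```python
def choose_strategy(
    context_tokens: int,
    window_tokens: int,
    has_prompt_cache: bool,
    reserve_tokens: int = 2_000,  # assumed headroom for the model's answer
) -> str:
    """Map a session onto the two regimes described above."""
    fits = context_tokens + reserve_tokens <= window_tokens
    if not fits:
        # Small-window regime: keeping everything is not an option.
        return "compact_or_retrieve"
    if has_prompt_cache:
        # Big-window regime: keep-all is the default; compaction must justify itself.
        return "keep_all"
    # Not covered by the two regimes: it fits, but every turn is billed in full.
    return "keep_all_and_measure_cost"


print(choose_strategy(37_700, 32_000, has_prompt_cache=False))    # compact_or_retrieve
print(choose_strategy(37_700, 1_000_000, has_prompt_cache=True))  # keep_all
```

The rule only looks at fit and caching. It does not capture the quality loss from facts buried deep in a very long context, which is a separate reason to retrieve even when everything fits.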

About this workshop

Context engineering is deciding what the model sees on every single call: instructions, history, retrieved course content, memory, and tool outputs. It is the line between a tutor that holds a coherent session and one that forgets the student halfway through. We show it on Towards AI’s open-source AI tutor for our AI-engineering courses: the compaction toolkit, memory that survives across sessions, and production retrieval.
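
In code, "what the model sees on every single call" is a prompt assembled from those five ingredients. A minimal sketch, assuming a plain-text prompt with one section per ingredient; the `TurnContext` class, the section titles, and the ordering are hypothetical and not taken from the tutor's codebase. Putting the parts that rarely change first is a choice made here with prefix caching in mind.

```python
from dataclasses import dataclass, field


@dataclass
class TurnContext:
    instructions: str
    memory: list[str] = field(default_factory=list)        # facts kept across sessions
    history: list[str] = field(default_factory=list)       # prior turns in this session
    retrieved: list[str] = field(default_factory=list)     # course content for this turn
    tool_outputs: list[str] = field(default_factory=list)  # results of tool calls

    def render(self) -> str:
        """Join the non-empty sections in a fixed order, most stable first."""
        sections = [
            ("INSTRUCTIONS", [self.instructions]),
            ("MEMORY", self.memory),
            ("HISTORY", self.history),
            ("RETRIEVED", self.retrieved),
            ("TOOL OUTPUTS", self.tool_outputs),
        ]
        parts = [
            f"## {title}\n" + "\n".join(items)
            for title, items in sections
            if items
        ]
        return "\n\n".join(parts)


ctx = TurnContext(
    instructions="You are a course tutor. Never push to main.",
    history=["student: what is RAG?"],
    retrieved=["RAG combines retrieval with generation."],
)
print(ctx.render())
```

Every strategy compared in this workshop is a different way of filling or trimming these fields before `render` is called.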

Everything here is measured, not vibe-checked: tokens, cost, latency, and memory probes across more than a thousand runs. At real volume even Gemini Flash got expensive, so we tested whether a cheaper API model (DeepSeek) and free local models could match the quality for a fraction of the cost. Everything is open source.