Reducing Total Token Consumption of Agentic Coding
TL;DR: Two levers reduce cost: 1. Fewer turns (parallel tool calls → fewer API round-trips) 2. Less context (methodology + snippets → only relevant info survives)
Understanding the problem: why tokens grow quadratically
What you see vs. what is sent
[Interactive demo: two panels, "User side" and "Sent to the API", each with a running character count]
The bottom line after 10 exchanges
[Interactive counters: characters typed by the user · characters sent to the API (cumulative) · multiplication factor]
Cumulative cost complexity: O(n²)
Each new message resends the entire history. The total cost grows like the sum 1+2+3+…+n = n(n+1)/2.
Interactive simulation
[Sliders, default values: 10 messages · 2k chars per message · 1 tool loop per turn]
Classic chatbot vs. agentic coding
Classic chatbot
User: "Fix this bug" (200 chars)
API response #1: system + msg = 4,200 chars
User: "Also add tests" (150 chars)
API response #2: system + history = 8,500 chars
User: "Rename the var" (100 chars)
API response #3: system + history = 12,800 chars
Agentic coding
User: "Fix this bug" (200 chars)
API call #1: system + msg = 4,200 chars
→ Read(main.py), tool result: 3,000 chars
API call #2: everything + tool = 7,400 chars
→ Edit(main.py), tool result: 500 chars
API call #3: everything + tools = 8,100 chars
→ RunTests(), tool result: 2,000 chars
API call #4: everything + tools = 10,300 chars
→ Result returned to user: 1 user message → 4 API calls
1 user message = 3-15 API calls
Each tool call (Read, Edit, Bash…) triggers a new API call with the full accumulated context.
Agentic cost: the hidden multiplier
Classic chatbot
n × s → O(n²) (n messages of s chars each)
10 msgs × 2k = ~110k chars total (2k × (1+2+…+10))
Agentic (k loops/turn)
n × k × s → O(n² × k)
10 msgs × 5 loops × 2k = ~550k chars
With agentic coding, the cumulative cost grows k times faster. A 20-message session with 5 tool loops per turn sends millions of characters to the API (about 2.1 million at 2k chars per message).
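The arithmetic behind these estimates can be checked in a few lines of Python. This is a minimal sketch of the simplified model used above, assuming every message adds a fixed number of characters and every API call resends the whole history; the function name cumulative_chars is illustrative and not part of any tool.

```python
def cumulative_chars(n_messages: int, chars_per_message: int, loops_per_turn: int = 1) -> int:
    """Total characters sent to the API over a session (simplified model)."""
    total = 0
    for i in range(1, n_messages + 1):
        context = i * chars_per_message      # history size after the i-th message
        total += loops_per_turn * context    # each tool loop resends that history
    return total


print(cumulative_chars(10, 2_000))      # classic chatbot: 110000
print(cumulative_chars(10, 2_000, 5))   # agentic, 5 loops per turn: 550000
print(cumulative_chars(20, 2_000, 5))   # 20 messages, 5 loops per turn: 2100000
```

The model ignores tool results and assistant output, which only make the real agentic figure larger.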
1. Fewer Turns = Less Token Consumption
Aggressive parallelization of tool calls reduces API round-trips. I prefer a failed tool call to another full API call.
The 3-turn process
Discover → Read → Act. Group independent calls in the same turn to minimize context resends.
Sequential (8 turns)
Turn 1: Glob("**/*.py")
↓ wait
Turn 2: Grep("handler")
↓ wait
Turn 3: GetFolderDescription()
↓ wait
Turn 4: Read(main.py)
↓ wait
Turn 5: Read(config.py)
↓ wait
Turn 6: Edit(main.py)
↓ wait
Turn 7: Write(test.py)
↓ wait
Turn 8: RunTests()
8 turns = 8 × full context resent
DAG parallel (3 turns)
TURN 1: discover
Glob("**/*.py")
Grep("handler")
GetFolderDescription()
↓ need results to know what to read
TURN 2: read
Read(main.py, symbol="handle")
Read(config.py, symbol="Config")
↓ need content to write correct code
TURN 3: act (all at once with depends_on)
WritePlan
Edit(main.py)
Write(test.py)
RunTests()
3 turns = 3 × context → fewer tokens
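How a harness might run the calls of TURN 3 in dependency order can be sketched with the standard library. This is an illustration under assumptions: the depends_on name comes from the text, but the dictionary shape and the execution_waves helper are hypothetical. It uses graphlib.TopologicalSorter (Python 3.9+) to group one turn's calls into waves that can run concurrently.

```python
from graphlib import TopologicalSorter


def execution_waves(batch: dict[str, list[str]]) -> list[list[str]]:
    """Group one turn's tool calls into waves; calls in a wave are independent.

    `batch` maps each call to the calls it depends on (its depends_on list).
    """
    sorter = TopologicalSorter({call: set(deps) for call, deps in batch.items()})
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        waves.append(ready)
        sorter.done(*ready)
    return waves


# Hypothetical dependencies for the "act" turn shown above.
turn_3 = {
    "WritePlan": [],
    "Edit(main.py)": ["WritePlan"],
    "Write(test.py)": ["WritePlan"],
    "RunTests()": ["Edit(main.py)", "Write(test.py)"],
}
for wave in execution_waves(turn_3):
    print(wave)
```

All four calls still travel in a single API turn; the ordering is resolved locally, so no extra round-trip is paid for it.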
2. Less Context = Fewer Tokens
The context is append-only. Condense early or pay forever.
The naive approach: keep everything
Each turn resends the entire conversation history. File reads, bash outputs, reasoning: nothing is thrown away.
Append-only: previous turns stay in place (cached), new content goes at the end (full price). The 400 lines of main.py are sent 3 times, yet you only needed 1 function.
With Snippets: same work, smaller context
You still pay to Read once. But instead of carrying 400 lines forever, you save a 20-line Snippet, so the savings start on the next turn. Works with any tool result: Read, Grep, Skills, GetFolderDescription…
Same structure [S][U][A][...], just 20L instead of 400L from Call 3 onward
With Methodology: append-only working memory
Each turn, the model appends a Methodology note (goal, plan, discoveries) to the cached prefix. Old tool_results are destroyed: only Methodology + Snippets + the original user message survive.
API Call 1: [S] [U]
→ Response: tool_call(Read) + Methodology#1
API Call 2 (400L paid once): [S] [U] [Meth#1] [Snippet] [Read(main.py) 400L FRESH]
→ Response: Methodology#2 + Edit(main.py)
API Call 3 (400L gone! prefix + last result only): [S] [U] [M#1] [Snip] [M#2] [edit OK]
→ Response: Methodology#3 + RunTests()
API Call 4 (prefix grew, fresh = just pytest): [S] [U] [M#1] [Snip] [M#2] [M#3] [pytest 200L]
→ Response: "all tests pass! Task done."
Legend: cached prefix = S + U + Meth#1..#N + Snippet; fresh = only the last tool_result
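The bookkeeping in this walkthrough can be sketched as a small data structure. This is a simplified illustration, not the actual implementation: the Session class, its field names, and the ordering of notes before snippets are all assumptions. The point it shows is that raw tool results are replaced each turn while notes and snippets are only ever appended.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Session:
    system: str
    user: str
    methodology: list[str] = field(default_factory=list)  # append-only notes
    snippets: list[str] = field(default_factory=list)     # condensed tool results
    last_tool_result: str | None = None                   # only the newest survives

    def end_turn(self, note: str, tool_result: str, snippet: str | None = None) -> None:
        """Record one turn: append the note, optionally keep a snippet,
        and replace (not append) the raw tool result."""
        self.methodology.append(note)
        if snippet is not None:
            self.snippets.append(snippet)
        self.last_tool_result = tool_result

    def wire(self) -> list[str]:
        """Content of the next API call: cached prefix first, fresh tail last."""
        prefix = [self.system, self.user, *self.methodology, *self.snippets]
        fresh = [] if self.last_tool_result is None else [self.last_tool_result]
        return prefix + fresh


read_400 = "\n".join(f"line {i}" for i in range(400))
session = Session(system="S", user="fix the bug in main.py")
session.end_turn("M#1: locate the handler", tool_result=read_400)
session.end_turn("M#2: bug is in handle()", tool_result="edit OK",
                 snippet="def handle(): ...  # 20-line excerpt")
print(read_400 in session.wire())  # False: the 400-line read is no longer carried
```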
Demo: naive vs. optimized
Same task: "fix the bug in main.py". 4 API calls each. Compare what's sent.
An agent's bill is number of API calls × (fresh input + output + cache writes). Every mechanism in this section pulls one of four levers: 1 make fewer API calls · 2 send less fresh input · 3 emit less output · 4 pay cache prices instead of fresh prices.
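As a rough way to compare scenarios, the bill formula can be written as a function. This is a sketch only: the session_bill name, the per-call token counts, and the choice to express everything in input-token equivalents are assumptions; the 5× output and 1.25× cache-write weights are the ratios quoted elsewhere in this document.

```python
def session_bill(calls: int, fresh_in: int, output: int, cache_write: int,
                 output_mult: float = 5.0, cache_write_mult: float = 1.25) -> float:
    """Session cost in input-token equivalents: calls × weighted per-call tokens."""
    per_call = fresh_in + output_mult * output + cache_write_mult * cache_write
    return calls * per_call


# Illustrative numbers only: same per-call profile, 8 sequential turns vs 3 parallel turns.
print(session_bill(calls=8, fresh_in=2_000, output=500, cache_write=1_000))  # 46000.0
print(session_bill(calls=3, fresh_in=2_000, output=500, cache_write=1_000))  # 17250.0
```

Each lever maps to one argument: fewer calls, smaller fresh_in, smaller output, and a larger share of input moved from fresh to cached.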
Lever 1: Make Fewer API Calls
The biggest lever: every avoided call saves its entire input and output. The story is a causal chain: a risk, an expensive cure, then an optimization that makes the cure rarely needed.
1. The risk: LLMs forget instructions in long contexts
If the model skips Methodology, it develops amnesia: it no longer knows what it already tried and gets stuck in loops. If it skips Snippet, it re-reads files it has already seen. Both burn extra calls. Measured spontaneous omission rates:
81.8%: Methodology omission (turn 1). Spontaneous, before any fix.
34%: Snippet omission (Opus). Spontaneous, before enforcement.
2. The cure: enforcement → detection → recovery side-calls
Because these omissions break the system, harnesses are not optional: every turn is checked, and a detected omission triggers a recovery side-call that forces the missing call. Omission becomes near-zero.
Enforcement Methodology
check_enforcement() detects a missing Methodology call → recover_methodology() fires a side-call with tool_choice locked on Methodology. Triggered only when thinking is non-empty.
Enforcement Snippet
get_unsnippeted_reads() scans for Read results ≥50 lines without a matching Snippet. A recovery side-call decides snippet/discard per result. Churn discard post-recovery: 45%.
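The two detection steps can be sketched as pure functions. The function names are taken from the text, but their bodies, signatures, and the dictionary shapes used for tool calls and results are assumptions made for illustration; the recovery side-call itself is not shown.

```python
SNIPPET_MIN_LINES = 50  # threshold quoted in the text


def check_enforcement(tool_calls: list[dict], thinking: str) -> bool:
    """Return True when a Methodology recovery side-call is needed:
    the model produced thinking but did not call Methodology this turn."""
    called = {call["name"] for call in tool_calls}
    return bool(thinking.strip()) and "Methodology" not in called


def get_unsnippeted_reads(tool_results: list[dict], snippeted_ids: set[str]) -> list[str]:
    """Return ids of Read results long enough to require a snippet/discard
    decision that have no matching Snippet yet."""
    pending = []
    for result in tool_results:
        if result["tool"] != "Read":
            continue
        n_lines = len(result["output"].splitlines())
        if n_lines >= SNIPPET_MIN_LINES and result["id"] not in snippeted_ids:
            pending.append(result["id"])
    return pending


results = [
    {"id": "r1", "tool": "Read", "output": "x\n" * 400},  # long, no snippet
    {"id": "r2", "tool": "Read", "output": "x\n" * 10},   # under the threshold
    {"id": "r3", "tool": "Read", "output": "x\n" * 60},   # long, already snippeted
]
print(get_unsnippeted_reads(results, snippeted_ids={"r3"}))  # ['r1']
```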
3. The catch: each recovery is one extra API call, so we make firings rare
Enforcement works, but at 34% omission it fires constantly. The cheapest call is the one that never happens: prompt engineering lowers the omission rate itself (placeholders, fresh-token reminders), so the harness almost never has to fire:
Seed Placeholder (Turn 1)
When methodology is empty, injects a transient system block with header + syntax + turn rule. Eliminates the 81.8% first-turn omission entirely: the methodology harness no longer fires on turn 1.
Bigctx-Reminder
~70 fresh tokens injected at the end of the last user message EVERY turn: "This turn MUST end with ≥1 tool_use… Think ≤15 lines then ACT." Wire-only (not persisted).
81.8% → 0%: Methodology omission, turn 1. All Opus sessions before the seed deploy (18/22) vs all sessions after (0/36); Fisher p < 10⁻¹⁰
3.2% → 1.7%: Empty turns in production. All Opus sessions over the 3 days before the action-first reminder deploy (61/1,895 turns, 78 sessions) vs the day after (39/2,334 turns, 98 sessions); same definition both sides
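The quoted figures can be re-derived from the raw counts. The sketch below computes a one-sided Fisher exact test with the standard library only; whether the original analysis used a one-sided or two-sided test is not stated, so treat this as a plausibility check rather than a reproduction. The helper name fisher_one_sided is illustrative.

```python
from math import comb


def fisher_one_sided(a: int, b: int, c: int, d: int) -> float:
    """P(X >= a) for the 2x2 table [[a, b], [c, d]] under the hypergeometric
    null hypothesis (row and column totals held fixed)."""
    row1, row2, col1 = a + b, c + d, a + c
    upper = min(row1, col1)
    favourable = sum(comb(row1, x) * comb(row2, col1 - x) for x in range(a, upper + 1))
    return favourable / comb(row1 + row2, col1)


# Methodology omission on turn 1: 18 of 22 sessions before the seed, 0 of 36 after.
p_value = fisher_one_sided(18, 4, 0, 36)
print(p_value < 1e-10)               # True

# The percentages follow directly from the counts given above.
print(round(100 * 18 / 22, 1))       # 81.8
print(round(100 * 61 / 1895, 1))     # 3.2
print(round(100 * 39 / 2334, 1))     # 1.7
```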
4. Last-resort stoppers: cap what a derailed session can burn
Loop Detection
MD5 hash of each tool_call batch → cycle detection (size 1–8, repeated ≥3×) → a LoopWarning is injected. A session stuck in a cycle stops burning identical calls on the spot.
Paralysis Abort
If turns without progress reach a threshold (default 12), the session is aborted: a frozen agent can never burn more than a bounded number of calls.
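The loop detector described above can be sketched as follows. The cycle sizes (1 to 8) and the repeat count (3) come from the text; the batch_hash and detect_loop names, the JSON serialization, and the rule of checking only the most recent batches are assumptions.

```python
import hashlib
import json


def batch_hash(tool_calls: list[dict]) -> str:
    """Stable MD5 fingerprint of one turn's batch of tool calls."""
    payload = json.dumps(tool_calls, sort_keys=True).encode("utf-8")
    return hashlib.md5(payload).hexdigest()


def detect_loop(hashes: list[str], max_cycle: int = 8, min_repeats: int = 3):
    """Return the cycle length if the latest batches repeat a cycle of
    1..max_cycle batches at least min_repeats times in a row, else None."""
    for size in range(1, max_cycle + 1):
        window = size * min_repeats
        if len(hashes) < window:
            break  # larger cycles need even more history
        tail = hashes[-window:]
        cycle = tail[:size]
        if all(tail[i] == cycle[i % size] for i in range(window)):
            return size
    return None


read_call = [{"name": "Read", "args": {"path": "main.py"}}]
test_call = [{"name": "RunTests", "args": {}}]
history = [batch_hash(b) for b in (read_call, test_call) * 3]
print(detect_loop(history))  # 2: a Read/RunTests cycle repeated three times
```

A harness would call detect_loop after every turn and inject the LoopWarning when it returns a value.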
The chain in one line: amnesia/re-reads → harnesses are mandatory → each firing costs one call → optimize the prompt so omissions are rare → fewer firings, fewer turns, net savings.
Lever 2: Send Less Fresh Input · a. Compress Tool Results
Fresh input per call is just the last user message + the last batch of tool results (~1–3k tokens on the minimal wire). Tool results are its #1 source, so every result is compressed before it enters the turn.
Generic Truncation
Thresholds: max_lines=200, max_chars=8000, head_lines=80. Overflows are saved to disk; only the first 80 lines are returned, with a pointer to the full output.
Pytest Compaction
Detects pytest output. On failure: keeps FAILURES/ERRORS/warnings/summary. On all-green with ≤20 tests: just lists test names + summary line.
Edit: Fuzzy Error
When old_string is not found: a SequenceMatcher sliding window finds the closest match and shows ±7 lines of context. Saving: the full re-read turn the model would otherwise need to locate the mismatch.
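Two of these mechanisms are simple enough to sketch with the standard library. The thresholds and the ±7 lines of context are the values quoted above; the truncate_result and closest_match names, the pointer message format, and the line-based window are assumptions, and writing the overflow to disk is left out.

```python
import difflib

MAX_LINES, MAX_CHARS, HEAD_LINES = 200, 8000, 80  # thresholds quoted above


def truncate_result(output: str, saved_path: str) -> str:
    """Return small outputs unchanged; otherwise keep the head plus a pointer
    to where the full text is assumed to have been saved."""
    lines = output.splitlines()
    if len(lines) <= MAX_LINES and len(output) <= MAX_CHARS:
        return output
    head = "\n".join(lines[:HEAD_LINES])[:MAX_CHARS]
    return f"{head}\n[truncated: {len(lines)} lines total, full output at {saved_path}]"


def closest_match(file_text: str, old_string: str, context: int = 7):
    """Slide a window the height of old_string over the file and return
    (similarity, best-matching region with `context` lines on each side)."""
    file_lines = file_text.splitlines()
    width = max(1, len(old_string.splitlines()))
    best_ratio, best_start = -1.0, 0
    for start in range(max(1, len(file_lines) - width + 1)):
        window = "\n".join(file_lines[start:start + width])
        ratio = difflib.SequenceMatcher(None, window, old_string).ratio()
        if ratio > best_ratio:
            best_ratio, best_start = ratio, start
    lo = max(0, best_start - context)
    hi = min(len(file_lines), best_start + width + context)
    return best_ratio, "\n".join(file_lines[lo:hi])


source_lines = [f"x = {i}" for i in range(30)]
source_lines[15] = "def handle(request):"
ratio, region = closest_match("\n".join(source_lines), "def handle(req):")
print(round(ratio, 2))
print(region)
```

Returning the nearby region in the error message gives the model what it needs to correct old_string without issuing a new Read.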
Lever 2: Send Less Fresh Input · b. Never Re-Read a File
The other big source of fresh input is re-reading files already seen. Snippets eliminate it: read once, keep the useful part in the cached methodology note, re-resolve it every turn at cache price. Lifecycle:
1. Wrapping: Results ≥50 lines get markers: ==== A SNIPPETER id: file=… ====. Lines numbered 1..N.
3. Re-resolution: Symbol snippets are re-resolved dynamically every turn via find_symbol(). Immune to line shifts after Edits.
4. Stale marking: After Edit/Write, range-based snippets on the same file get marked ## snippet-stale:. Append-only.
5. Compaction: When the methodology note exceeds 20,000 tokens (≈ P90 of all notes), structural compaction fires immediately: stale snippets, duplicates, and orphans (file deleted) are purged. No LLM call needed.
50 lines: SNIPPET_MIN_LINES threshold. Tool results under 50 lines are not wrapped in snippet markers and impose no save-or-discard obligation on the model.
9.6%: Dead snippets (Opus). Snippets saved but never re-referenced (file path/label never reappears in a later tool_call or Methodology). Measured on 202 Opus sessions. These dead snippets account for ≈3% of total session cost.
Key principle: Edit/Write results are exempt from snippet wrapping (_SNIPPET_EXEMPT_TOOLS): they are never stored in the persistent context (methodology note), so there is no snippet to manage. Meanwhile, symbol-based snippet re-resolution exists precisely to avoid re-reading files: the agent can reference a function by name and the system resolves it fresh each turn without an explicit Read call.
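Why symbol-based snippets survive line shifts can be shown with Python's ast module. The find_symbol name comes from the text, but this body is a hypothetical stand-in that only handles Python sources; the real resolver may work differently and support other languages.

```python
import ast


def find_symbol(source: str, name: str):
    """Return (start_line, end_line, text) for the first function or class
    named `name` in a Python source string, or None if it is absent."""
    kinds = (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, kinds) and node.name == name:
            lines = source.splitlines()
            text = "\n".join(lines[node.lineno - 1:node.end_lineno])
            return node.lineno, node.end_lineno, text
    return None


before = "def handle(x):\n    return x + 1\n"
after = "import os\nimport sys\n\n" + before   # an Edit shifted the function down

print(find_symbol(before, "handle")[:2])  # (1, 2)
print(find_symbol(after, "handle")[:2])   # (4, 5)
```

A range-based snippet pinned to lines 1-2 would now point at the imports, which is why such snippets are marked stale after an Edit, while the symbol lookup returns the same function text at its new position.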
Lever 3: Emit Less Output
Output tokens cost 5× input tokens, so every line of deliberation is the most expensive line of the call. Two mechanisms keep it short.
Think less, act more
The bigctx-reminder (Lever 1) ends every turn's input with "Think ≤15 lines then ACT": shorter deliberation is a direct output saving on top of the turn reduction.
Thinking Overflow
If thinking exceeds 20,000 chars with no tool_call parsed → streaming is cut (~$0.10–0.15 per occurrence: the cut tokens are billed but discarded). The overflow is summarized via claude-sonnet-4-6 into ≤5 bullets / max 300 tokens for the next turn.
Lever 4: Pay Cache Prices, Not Fresh Prices
Whatever still has to travel should be cached, not re-billed. System blocks are sent in a fixed order, [stable_prefix+cc, tool_docs+cc, methodology+cc, seed, delta+cc, volatile], where +cc marks a cache_control breakpoint (TTL 5 min). Only the tail is fresh.
Snapshot (cached): Stable prefix of the note already seen by the model. cache_control: ephemeral → cache hit
Delta (cached): New content appended this turn. Also gets cache_control. → cache hit
Fresh tokens: Bigctx-reminder + messages + tool outputs → fresh
Why doesn't the seed get +cc? The seed placeholder only exists on turn 1, while the methodology note is still empty, and vanishes as soon as the first Methodology call lands. Caching content that is guaranteed to disappear on the next turn would burn one of the 4 available cache breakpoints for zero future hits.
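The block ordering can be sketched as a function that assembles the system blocks for one call. The order and the placement of breakpoints follow the list above; the build_system_blocks helper, its parameters, and the rule of skipping empty blocks are assumptions. The block shape (type, text, optional cache_control of type ephemeral) follows the Anthropic Messages API.

```python
def build_system_blocks(stable_prefix: str, tool_docs: str, methodology: str,
                        delta: str, volatile: str, seed: str = "") -> list[dict]:
    """Assemble system blocks in a fixed order; cached blocks carry a breakpoint."""

    def block(text: str, cached: bool) -> dict:
        entry = {"type": "text", "text": text}
        if cached:
            entry["cache_control"] = {"type": "ephemeral"}
        return entry

    blocks = [block(stable_prefix, True), block(tool_docs, True)]
    if methodology:
        blocks.append(block(methodology, True))
    elif seed:
        # Turn 1 only: the seed vanishes next turn, so it gets no breakpoint.
        blocks.append(block(seed, False))
    if delta:
        blocks.append(block(delta, True))
    blocks.append(block(volatile, False))

    # Stay within the 4 available cache breakpoints.
    assert sum("cache_control" in b for b in blocks) <= 4
    return blocks


turn_1 = build_system_blocks("PREFIX", "DOCS", methodology="", delta="",
                             volatile="REMINDER", seed="SEED")
later = build_system_blocks("PREFIX", "DOCS", methodology="NOTE", delta="NEW",
                            volatile="REMINDER")
print([b["text"] for b in turn_1])   # ['PREFIX', 'DOCS', 'SEED', 'REMINDER']
print([b["text"] for b in later])    # ['PREFIX', 'DOCS', 'NOTE', 'NEW', 'REMINDER']
```

On later turns all four breakpoints are in use, which is why spending one on the seed would have displaced a block that actually gets reused.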
Keep the cached note small and valid
Compaction Trigger
When the methodology note exceeds 20,000 tokens (≈ P90 of all notes), structural compaction fires immediately: purge stale snippets, deduplicate, remove orphans. Pure code logic, no LLM call.
Cache Invalidation
After compaction, _methodology_cache_snapshot = None: the cached prefix is no longer valid. On the next turn the system rebuilds all blocks from scratch (one-time cache_create cost, amortized over subsequent turns).
Cache TTL: 5 minutes by default (ephemeral). cache_create (billed at 1.25× the one-shot price) accounts for ~15% of the cost of the top-10 sessions.