Reducing Total Token Consumption
of Agentic Coding

TL;DR: Two levers reduce cost:
1. Fewer turns (parallel tool calls → fewer API round-trips)
2. Less context (methodology + snippets → only relevant info survives)

📊 Understanding the problem: why tokens grow quadratically


The bottom line after 10 exchanges

Cumulative cost complexity: O(n²)

Each new message resends the entire history. The total cost grows like the sum 1+2+3+…+n = n(n+1)/2.


Classic chatbot vs. agentic coding

💬 Classic chatbot

User: "Fix this bug" → 200 chars
API response #1: system + msg = 4,200 chars
User: "Also add tests" → 150 chars
API response #2: system + history = 8,500 chars
User: "Rename the var" → 100 chars
API response #3: system + history = 12,800 chars

🤖 Agentic coding

User: "Fix this bug" → 200 chars
API call #1: system + msg = 4,200 chars
→ Read(main.py), tool result: 3,000 chars
API call #2: everything + tool = 7,400 chars
→ Edit(main.py), tool result: 500 chars
API call #3: everything + tools = 8,100 chars
→ RunTests(), tool result: 2,000 chars
API call #4: everything + tools = 10,300 chars
✓ Result returned to user: 1 user message → 4 API calls
1 user message = 3-15 API calls

Each tool call (Read, Edit, Bash…) triggers a new API call with the full accumulated context.

Agentic cost: the hidden multiplier

Classic chatbot: n × s → O(n²)
10 msgs × 2k = ~110k chars total
Agentic (k loops/turn): n × k × s → O(n² × k)
10 msgs × 5 loops × 2k = ~550k chars

With agentic coding, the context grows k times faster. A 20-message session with 5 tool loops per turn sends millions of characters to the API.
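
The figures above can be reproduced with a few lines of arithmetic. This is a minimal sketch of the simplified model the numbers imply, assuming the context grows by a fixed amount per user message and that every API call resends the full history; the function and parameter names are illustrative, not part of any tool.

```python
def cumulative_chars(n_messages: int, chars_per_step: int, loops_per_turn: int = 1) -> int:
    """Total characters sent over a session under the simplified model.

    Assumes the context grows by `chars_per_step` per user message and that
    each message triggers `loops_per_turn` API calls resending the history.
    """
    # 1 + 2 + ... + n = n(n+1)/2 resends of one step's worth of context
    resends = n_messages * (n_messages + 1) // 2
    return loops_per_turn * chars_per_step * resends


print(cumulative_chars(10, 2_000))      # 110000  (classic chatbot)
print(cumulative_chars(10, 2_000, 5))   # 550000  (agentic, 5 loops per turn)
print(cumulative_chars(20, 2_000, 5))   # 2100000 (20-message agentic session)
```

Doubling the number of messages roughly quadruples the total, which is why trimming either the number of calls or the size of each resend pays off so quickly.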

1. Fewer Turns = Less Token Consumption

Aggressive parallelization of tool calls reduces API round-trips. I prefer a failed tool call to another full API call.

The 3-turn process

Discover → Read → Act. Group independent calls in the same turn to minimize context resends.

Sequential (8 turns)

Turn 1: Glob("**/*.py")
↓ wait
Turn 2: Grep("handler")
↓ wait
Turn 3: GetFolderDescription()
↓ wait
Turn 4: Read(main.py)
↓ wait
Turn 5: Read(config.py)
↓ wait
Turn 6: Edit(main.py)
↓ wait
Turn 7: Write(test.py)
↓ wait
Turn 8: RunTests()
8 turns = 8 × full context resent

DAG parallel (3 turns)

TURN 1: discover
Glob("**/*.py")
Grep("handler")
GetFolderDescription()
↓ need results to know what to read
TURN 2: read
Read(main.py, symbol="handle")
Read(config.py, symbol="Config")
↓ need content to write correct code
TURN 3: act (all at once with depends_on)
WritePlan
Edit(main.py)
Write(test.py)
RunTests()
3 turns = 3 × context → fewer tokens
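
One way to let a single turn carry calls that depend on each other is to have the harness order them before execution. The sketch below uses the standard-library graphlib module; the shape of the batch (a mapping from each call to the calls it depends on) and the specific dependencies shown are assumptions for illustration, since the text only names a depends_on field.

```python
from graphlib import TopologicalSorter


def execution_order(batch: dict[str, list[str]]) -> list[str]:
    """Return an order in which one turn's tool calls can be executed.

    `batch` maps each call name to the calls it depends on (hypothetical shape).
    Raises graphlib.CycleError if the dependencies are circular.
    """
    return list(TopologicalSorter(batch).static_order())


# Turn 3 from the example above, with assumed dependencies.
turn_3 = {
    "WritePlan": [],
    "Edit(main.py)": ["WritePlan"],
    "Write(test.py)": ["WritePlan"],
    "RunTests()": ["Edit(main.py)", "Write(test.py)"],
}
order = execution_order(turn_3)
print(order[0], "...", order[-1])  # WritePlan ... RunTests()
```

The two middle calls have no dependency on each other, so their relative order is not fixed and a harness is free to run them concurrently.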

2. Less Context = Fewer Tokens

The context is append-only. Condense early or pay forever.

The naive approach: keep everything

Each turn resends the entire conversation history. File reads, bash outputs, reasoning: nothing is thrown away.

→ API Call 1: [sys][u]
← Response: "I'll read main.py" + tool_call(Read)
→ API Call 2: [sys][u][asst][Read(main.py) 400L]
← Response: "Found bug at L42" + tool_call(Edit)
→ API Call 3: [sys][u][asst][main.py 400L][asst][edit OK]
← Response: "Running tests" + tool_call(RunTests)
→ API Call 4: [sys][u][asst][main.py 400L][asst][edit][asst][pytest 200L]
← Response: "all tests pass! Task done."
■ cached prefix ($0.30/MTok) ■ fresh content ($3.00/MTok)

Append-only: previous turns stay in place (cached), new content goes at the end (full price). The 400 lines of main.py are sent 3 times, yet you only needed 1 function.

With Snippets: same work, smaller context

You still pay to Read once. But instead of carrying 400 lines forever, you save a 20-line Snippet, and the savings start next turn. Works with any tool result: Read, Grep, Skills, GetFolderDescription…

→ API Call 1 (same as naive): [S][U]
← Response: "I'll read main.py" → tool_call(Read)
→ API Call 2 (same as naive: you pay the 400L once): [S][U][A][Read(main.py) 400L FRESH]
← Response: finds bug, emits Snippet(L40-60) + Edit(main.py)
→ API Call 3 (HERE is the difference: 400L → 20L Snippet): [S][U][A][Snippet 20L][A][edit OK]
← Response: "done, running tests" → tool_call(RunTests)
→ API Call 4 (still small): [S][U][A][20L][A][edit][A][pytest 200L]
← Response: "all tests pass! Task done."
Same structure [S][U][A][...], just 20L instead of 400L from Call 3 onward

With Methodology: append-only working memory

Each turn, the model appends a Methodology note (goal, plan, discoveries) to the cached prefix. Old tool_results are destroyed: only Methodology + Snippets + the original user message survive.

→ API Call 1: [S][U]
← Response: tool_call(Read) + Methodology#1
→ API Call 2 (400L paid once): [S][U][Meth#1][Snippet][Read(main.py) 400L FRESH]
← Response: Methodology#2 + Edit(main.py)
→ API Call 3 (400L gone! prefix + last result only): [S][U][M#1][Snip][M#2][edit OK]
← Response: Methodology#3 + RunTests()
→ API Call 4 (prefix grew, fresh = just pytest): [S][U][M#1][Snip][M#2][M#3][pytest 200L]
← Response: "all tests pass! Task done."
■ cached prefix: S + U + Meth#1..#N + Snippet ■ fresh: only last tool_result

Demo: naive vs. optimized

Same task: "fix the bug in main.py". 4 API calls each. Compare what's sent.

❌ Naive (keep everything)
→ API Call #1 (2,100 ch)
System prompt 2,000ch
User: "fix the bug in main.py" 100ch
← Response:
Assistant: "I'll read main.py" → tool_call(Read)
→ API Call #2 (6,300 ch)
System prompt 2,000ch
User: "fix the bug" 100ch
Asst#1: "I'll read main.py" 200ch
tool_result: Read(main.py) → 400 lines 4,000ch FRESH
← Response:
Assistant: "Found bug at L42, editing" → tool_call(Edit)
→ API Call #3 (6,700 ch)
System prompt 2,000ch
User 100ch
Asst#1 200ch
Read(main.py) 400 lines 4,000ch STILL HERE
Asst#2 200ch
tool_result: Edit OK 200ch FRESH
← Response:
Assistant: "Fixed! Running tests" → tool_call(RunTests)
→ API Call #4 (8,400 ch)
System prompt 2,000ch
User 100ch
Asst#1 200ch
Read(main.py) 400 lines 4,000ch STILL HERE
Asst#2 200ch
Edit OK 200ch STILL HERE
Asst#3 200ch
tool_result: pytest output 1,500ch FRESH
← Response:
Assistant: "All tests pass! Task done."

✅ Optimized (Methodology + Snippets)
→ API Call #1 (2,100 ch)
System prompt 2,000ch
User: "fix the bug in main.py" 100ch
← Response:
tool_call(Read) + Methodology#1("Goal: fix bug")
→ API Call #2 (6,200 ch)
System prompt 2,000ch
User: "fix the bug" 100ch
Meth#1: "Goal: fix bug. Reading main.py" 100ch (cached)
tool_result: Read(main.py) → 400 lines 4,000ch FRESH
← Response:
Methodology#2 + Snippet(L40-60) + tool_call(Edit)
→ API Call #3 (2,850 ch) ← 400L GONE!
System prompt 2,000ch
User 100ch
Meth#1 100ch (cached)
Snippet: main.py L40-60 300ch (cached)
Meth#2: "Found off-by-one at L42" 150ch (cached)
tool_result: Edit OK 200ch FRESH
← Response:
Methodology#3 + tool_call(RunTests)
→ API Call #4 (4,300 ch)
System prompt 2,000ch
User 100ch
Meth#1 100ch (cached)
Snippet: main.py L40-60 300ch (cached)
Meth#2 150ch (cached)
Meth#3: "Edit done. Running tests." 150ch (cached)
tool_result: pytest output 1,500ch FRESH
← Response:
"All tests pass! Task done."

Implementation Deep Dive

An agent's bill is number of API calls × (fresh input + output + cache writes). Every mechanism in this section pulls one of four levers: (1) make fewer API calls · (2) send less fresh input · (3) emit less output · (4) pay cache prices instead of fresh prices.

Lever 1: Make Fewer API Calls

The biggest lever: every avoided call saves its entire input and output. The story is a causal chain: a risk, an expensive cure, then an optimization that makes the cure rarely needed.

1. The risk: LLMs forget instructions in long contexts

If the model skips Methodology, it develops amnesia: it no longer knows what it already tried and gets stuck in loops. If it skips Snippet, it re-reads files it has already seen. Both burn extra calls. Measured spontaneous omission rates:

81.8%: Methodology omission (turn 1). Spontaneous, before any fix.
34%: Snippet omission (Opus). Spontaneous, before enforcement.

2. The cure: enforcement (detection → recovery side-calls)

Because these omissions break the system, harnesses are not optional: every turn is checked, and a detected omission triggers a recovery side-call that forces the missing call. Omission becomes near-zero.

🔒 Enforcement Methodology

check_enforcement() detects a missing Methodology call → recover_methodology() fires a side-call with tool_choice locked on Methodology. Triggered only when thinking is non-empty.

📎 Enforcement Snippet

get_unsnippeted_reads() scans for Read results ≥50 lines without a matching Snippet. Recovery side-call decides snippet/discard per result. Churn discard post-recovery: 45%.
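
The detection half of the Snippet harness can be sketched as a simple scan over the turn's tool results. The function name and the 50-line threshold come from the text; the record layout (dicts with id, tool and output keys) and the snippeted_ids argument are assumptions made for this example.

```python
SNIPPET_MIN_LINES = 50  # threshold stated in the text


def get_unsnippeted_reads(tool_results, snippeted_ids):
    """Return ids of Read results long enough to need a Snippet but lacking one.

    `tool_results`: list of dicts with hypothetical keys id, tool, output.
    `snippeted_ids`: ids already covered by a Snippet or an explicit Discard.
    """
    pending = []
    for result in tool_results:
        if result["tool"] != "Read":
            continue
        n_lines = len(result["output"].splitlines())
        if n_lines >= SNIPPET_MIN_LINES and result["id"] not in snippeted_ids:
            pending.append(result["id"])
    return pending


results = [
    {"id": "r1", "tool": "Read", "output": "\n".join(["x"] * 60)},
    {"id": "r2", "tool": "Read", "output": "\n".join(["x"] * 10)},
    {"id": "r3", "tool": "Read", "output": "\n".join(["x"] * 80)},
    {"id": "r4", "tool": "Bash", "output": "\n".join(["x"] * 90)},
]
print(get_unsnippeted_reads(results, snippeted_ids={"r3"}))  # ['r1']
```

A non-empty return value is what would trigger the recovery side-call; an empty list means the turn passes the check at no extra cost.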

3. The catch: each recovery is one extra API call, so we make firings rare

Enforcement works, but at 34% omission it fires constantly. The cheapest call is the one that never happens: prompt engineering lowers the omission rate itself (placeholders, fresh-token reminders), so the harness almost never has to fire:

🌱 Seed Placeholder (Turn 1)

When methodology is empty, injects a transient system block with header + syntax + turn rule. Eliminates the 81.8% first-turn omission entirely: the methodology harness no longer fires on turn 1.

⚡ Bigctx-Reminder

~70 fresh tokens injected at the end of the last user message EVERY turn: "This turn MUST end with ≥1 tool_use… Think ≤15 lines then ACT." Wire-only (not persisted).

81.8% → 0%: Methodology omission, turn 1
All Opus sessions before the seed deploy (18/22) vs all sessions after (0/36); Fisher p < 10⁻¹⁰
3.2% → 1.7%: Empty turns in production
All Opus sessions over the 3 days before the action-first reminder deploy (61/1,895 turns, 78 sessions) vs the day after (39/2,334 turns, 98 sessions); same definition both sides
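
The significance claim for the seed placeholder can be checked by hand. The text does not say which variant of Fisher's exact test was used, so the sketch below assumes a one-sided test on the 2×2 table of omissions (18 of 22 sessions before, 0 of 36 after), computed directly from the hypergeometric distribution with the standard library.

```python
from math import comb


def fisher_one_sided(a: int, b: int, c: int, d: int) -> float:
    """One-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with at least `a`
    in the top-left cell, holding row and column totals fixed.
    """
    row1, col1, total = a + b, a + c, a + b + c + d
    p = 0.0
    for x in range(a, min(row1, col1) + 1):
        p += comb(row1, x) * comb(total - row1, col1 - x) / comb(total, col1)
    return p


# Rows: before / after the seed deploy. Columns: omitted / not omitted.
p_value = fisher_one_sided(18, 4, 0, 36)
print(p_value < 1e-10)  # True
```

Because every observed omission falls in the "before" group, the observed table is already the most extreme one, so the sum has a single term.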

4. Last-resort stoppers: cap what a derailed session can burn

Loop Detection

MD5 hash of each tool_call batch → cycle detection (size 1–8, repeated ≥3×) → a LoopWarning is injected. A session stuck in a cycle stops burning identical calls on the spot.
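
A cycle detector of this kind fits in a few lines. The sketch below hashes each batch with MD5 and checks whether the history ends in a cycle of 1 to 8 batches repeated at least 3 times, as described above; the JSON serialization of a batch and both function names are assumptions for illustration.

```python
import hashlib
import json


def batch_hash(tool_calls) -> str:
    """MD5 of one batch of tool calls (serialization format is an assumption)."""
    payload = json.dumps(tool_calls, sort_keys=True).encode("utf-8")
    return hashlib.md5(payload).hexdigest()


def detect_cycle(hashes, max_size: int = 8, min_repeats: int = 3):
    """Return the cycle length if the history ends with a cycle of
    1..max_size batches repeated at least min_repeats times, else None."""
    for size in range(1, max_size + 1):
        needed = size * min_repeats
        if len(hashes) < needed:
            break  # longer cycles need even more history
        tail = hashes[-needed:]
        cycle = tail[:size]
        if all(tail[i] == cycle[i % size] for i in range(needed)):
            return size
    return None


read = batch_hash([{"tool": "Read", "path": "main.py"}])
edit = batch_hash([{"tool": "Edit", "path": "main.py"}])
print(detect_cycle([read, edit] * 3))   # 2
print(detect_cycle([read, edit]))       # None
```

Hashing whole batches rather than single calls means a turn that repeats one call but adds a new one alongside it is not flagged.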

🧊 Paralysis Abort

If turns without progress reach a threshold (default 12), the session is aborted: a frozen agent can never burn more than a bounded number of calls.

The chain in one line: amnesia/re-reads → harnesses are mandatory → each firing costs one call → optimize the prompt so omissions are rare → fewer firings, fewer turns, net savings.

Lever 2: Send Less Fresh Input · a. Compress Tool Results

Fresh input per call is just the last user message + the last batch of tool results (~1–3k tokens on the minimal wire). Tool results are its #1 source, so every result is compressed before it enters the turn.

✂️ Generic Truncation

Thresholds: max_lines=200, max_chars=8000, head_lines=80. Overflows are saved to disk, and only the first 80 lines are returned, with a pointer to the full output.
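
A truncation step with these thresholds might look like the sketch below. The three constants come from the text; the function name, the spill_path argument and the wording of the pointer line are assumptions.

```python
from pathlib import Path

MAX_LINES, MAX_CHARS, HEAD_LINES = 200, 8000, 80  # thresholds from the text


def truncate_result(output: str, spill_path: Path) -> str:
    """Return the output unchanged if small, else its head plus a pointer."""
    lines = output.splitlines()
    if len(lines) <= MAX_LINES and len(output) <= MAX_CHARS:
        return output
    # Keep the full output on disk so the agent can fetch it if needed.
    spill_path.write_text(output, encoding="utf-8")
    head = "\n".join(lines[:HEAD_LINES])
    return (f"{head}\n[truncated: {len(lines)} lines total, "
            f"full output saved to {spill_path}]")
```

Returning a pointer instead of silently cutting lets the model decide whether the missing tail is worth a second, targeted read.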

🧪 Pytest Compaction

Detects pytest output. On failure: keeps FAILURES/ERRORS/warnings/summary. On all-green with ≤20 tests: just lists test names + summary line.

Edit: Fuzzy Error

When old_string is not found, a SequenceMatcher sliding window finds the closest match and shows ±7 lines of context. Saving: the full re-read turn the model would otherwise need to locate the mismatch.
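
The sliding-window search can be sketched with difflib from the standard library. This is an illustration of the idea rather than the actual harness code: the function name, its return value and the choice of a window as tall as old_string are assumptions.

```python
from difflib import SequenceMatcher


def closest_match(file_text: str, old_string: str, context: int = 7):
    """Slide a window the height of old_string over the file and return
    (best similarity ratio, lines around the best window)."""
    file_lines = file_text.splitlines()
    height = max(1, len(old_string.splitlines()))
    best_ratio, best_start = 0.0, 0
    for start in range(max(1, len(file_lines) - height + 1)):
        window = "\n".join(file_lines[start:start + height])
        ratio = SequenceMatcher(None, window, old_string).ratio()
        if ratio > best_ratio:
            best_ratio, best_start = ratio, start
    lo = max(0, best_start - context)
    hi = min(len(file_lines), best_start + height + context)
    return best_ratio, file_lines[lo:hi]


source_lines = [f"value_{i} = {i}" for i in range(30)]
source_lines[15] = "def handle(request):"
ratio, nearby = closest_match("\n".join(source_lines), "def handle(requests):")
print(len(nearby))  # 15
```

Returning the surrounding lines in the error message gives the model what it needs to retry the Edit immediately.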

Lever 2: Send Less Fresh Input · b. Never Re-Read a File

The other big source of fresh input is re-reading files already seen. Snippets eliminate it: read once, keep the useful part in the cached methodology note, re-resolve it every turn at cache price. Lifecycle:

1. Wrapping: Results ≥50 lines get markers: ==== A SNIPPETER id: file=… ====. Lines numbered 1..N.
2. Storage: Agent calls Snippet(symbol=…) or Snippet(ranges=…). An explicit Discard covers unwanted results.
3. Re-resolution: Symbol snippets are re-resolved dynamically every turn via find_symbol(). Immune to line shifts after Edits.
4. Stale marking: After Edit/Write, range-based snippets on the same file get marked ## snippet-stale:. Append-only.
5. Compaction: When the methodology note exceeds 20,000 tokens (≈ P90 of all notes), structural compaction fires immediately: stale snippets, duplicates, and orphans (file deleted) are purged, with no LLM call needed.
50 lines: SNIPPET_MIN_LINES threshold. Tool results under 50 lines are not wrapped in snippet markers and impose no save-or-discard obligation on the model.
9.6%: Dead snippets (Opus). Snippets saved but never re-referenced (file path/label never reappears in a later tool_call or Methodology). Measured on 202 Opus sessions. These dead snippets account for ≈3% of total session cost.
Key principle: Edit/Write results are exempt from snippet wrapping (_SNIPPET_EXEMPT_TOOLS): they are never stored in the persistent context (methodology note), so there is no snippet to manage. Meanwhile, symbol-based snippet re-resolution exists precisely to avoid re-reading files: the agent can reference a function by name and the system resolves it fresh each turn without an explicit Read call.
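
Symbol-based re-resolution is what makes a snippet immune to line shifts. The text names find_symbol() but not its signature, so the version below is a hypothetical sketch that handles Python sources only, using the standard ast module to locate a function or class by name.

```python
import ast


def find_symbol(source: str, symbol: str):
    """Return (start_line, end_line, text) for the function or class named
    `symbol`, or None if absent. Line numbers are 1-based."""
    kinds = (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, kinds) and node.name == symbol:
            lines = source.splitlines()
            text = "\n".join(lines[node.lineno - 1:node.end_lineno])
            return node.lineno, node.end_lineno, text
    return None


before = "import os\n\ndef handle(x):\n    return x + 1\n"
after = "import sys\n" + before   # an Edit shifted everything down one line
print(find_symbol(before, "handle")[:2])  # (3, 4)
print(find_symbol(after, "handle")[:2])   # (4, 5)
```

The line numbers differ between the two versions while the resolved text is identical, which is exactly the property a range-based snippet lacks.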

Lever 3: Emit Less Output

Output tokens cost 5× input tokens, so every line of deliberation is the most expensive line of the call. Two mechanisms keep it short.

⚡ Think less, act more

The bigctx-reminder (Lever 1) ends every turn's input with "Think ≤15 lines then ACT": shorter deliberation is a direct output saving on top of the turn reduction.

💭 Thinking Overflow

If thinking exceeds 20,000 chars with no tool_call parsed → streaming is cut (~$0.10–0.15 per occurrence: the cut tokens are billed but discarded). The overflow is summarized via claude-sonnet-4-6 into ≤5 bullets / max 300 tokens for the next turn.

Lever 4: Pay Cache Prices, Not Fresh Prices

Whatever still has to travel should be cached, not re-billed. System blocks are sent in a fixed order, [stable_prefix+cc, tool_docs+cc, methodology+cc, seed, delta+cc, volatile], where +cc marks a cache_control breakpoint (TTL 5 min). Only the tail is fresh.

Snapshot (cached): Stable prefix of the note already seen by the model. cache_control: ephemeral. ✓ cache hit
Delta (cached): New content appended this turn. Also gets cache_control. ✓ cache hit
Fresh tokens: Bigctx-reminder + messages + tool outputs. ✗ fresh
Why doesn't the seed get +cc? The seed placeholder only exists on turn 1, while the methodology note is still empty, and vanishes as soon as the first Methodology call lands. Caching content that is guaranteed to disappear on the next turn would burn one of the 4 available cache breakpoints for zero future hits.
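
The block ordering can be sketched as a small builder. The order and the choice of which blocks carry a breakpoint follow the text; the function name, the argument names, the rule of skipping empty blocks and the exact dictionary shape (text blocks with a cache_control field of type ephemeral) are assumptions to verify against the API documentation in use.

```python
def build_system_blocks(stable_prefix, tool_docs, methodology, delta,
                        volatile, seed=None):
    """Build the ordered system blocks; `+cc` blocks carry cache_control."""

    def block(text, cached):
        b = {"type": "text", "text": text}
        if cached:
            b["cache_control"] = {"type": "ephemeral"}
        return b

    blocks = [block(stable_prefix, True), block(tool_docs, True)]
    if methodology:
        blocks.append(block(methodology, True))
    if seed:
        blocks.append(block(seed, False))   # transient, so no breakpoint
    if delta:
        blocks.append(block(delta, True))
    blocks.append(block(volatile, False))   # always fresh
    return blocks


blocks = build_system_blocks("prefix", "docs", "note", "new lines", "reminder")
breakpoints = sum("cache_control" in b for b in blocks)
print(breakpoints)  # 4
```

With all four cacheable blocks present the budget of 4 breakpoints is fully used, which is why the seed cannot have one of its own.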

Keep the cached note small and valid

Compaction Trigger

When the methodology note exceeds 20,000 tokens (≈ P90 of all notes), structural compaction fires immediately: purge stale snippets, deduplicate, remove orphans. Pure code logic, no LLM call.
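
Structural compaction needs no model because each rule is a plain predicate. The sketch below applies the three rules named above to a list of snippet records; the record keys (file, label, text, stale) and the function name are hypothetical.

```python
import os


def compact_snippets(snippets, file_exists=os.path.exists):
    """Drop stale, duplicate and orphaned snippets, preserving order.

    Each snippet is a dict with hypothetical keys: file, label, text, stale.
    """
    kept, seen = [], set()
    for snip in snippets:
        key = (snip["file"], snip["label"], snip["text"])
        if snip.get("stale") or key in seen or not file_exists(snip["file"]):
            continue
        seen.add(key)
        kept.append(snip)
    return kept


notes = [
    {"file": "a.py", "label": "handle", "text": "def handle(): ...", "stale": False},
    {"file": "a.py", "label": "handle", "text": "def handle(): ...", "stale": False},
    {"file": "a.py", "label": "L10-20", "text": "x = 1", "stale": True},
    {"file": "gone.py", "label": "f", "text": "def f(): ...", "stale": False},
]
kept = compact_snippets(notes, file_exists=lambda path: path != "gone.py")
print(len(kept))  # 1
```

Passing file_exists as a parameter keeps the function testable without touching the real filesystem.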

🔄 Cache Invalidation

After compaction, _methodology_cache_snapshot = None → the cached prefix is no longer valid. On the next turn the system rebuilds all blocks from scratch (one-time cache_create cost, amortized over subsequent turns).

Cache TTL: 5 minutes by default (ephemeral). Cache writes (cache_create, billed at 1.25× the one-shot price) account for ~15% of the cost of the top-10 sessions.
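
Putting the prices quoted in this document together gives a per-call cost estimate. The sketch below assumes the $3.00 and $0.30 per million tokens shown earlier for fresh and cached input, the 1.25× multiplier for cache writes and the 5× multiplier for output; the function and constant names are illustrative, and actual prices depend on the model.

```python
FRESH_PER_MTOK = 3.00                           # fresh input, from the text
CACHED_PER_MTOK = 0.30                          # cached prefix, from the text
CACHE_WRITE_PER_MTOK = FRESH_PER_MTOK * 1.25    # cache_create at 1.25x
OUTPUT_PER_MTOK = FRESH_PER_MTOK * 5            # output at 5x input


def call_cost(fresh: int, cached: int, cache_write: int, output: int) -> float:
    """Dollar cost of one API call, given token counts per billing category."""
    total = (fresh * FRESH_PER_MTOK
             + cached * CACHED_PER_MTOK
             + cache_write * CACHE_WRITE_PER_MTOK
             + output * OUTPUT_PER_MTOK)
    return total / 1_000_000


# A 20,000-token note: read from cache vs. rewritten after an invalidation.
print(round(call_cost(0, 20_000, 0, 0), 4))   # 0.006
print(round(call_cost(0, 0, 20_000, 0), 4))   # 0.075
```

Under these assumptions a cache write costs 12.5 times as much as a cache read of the same content, which is why invalidations, whether from compaction or from an expired TTL, are worth keeping rare.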