Reducing Total Token Consumption
of Agentic Coding

TL;DR — Two levers reduce cost:
1. Fewer turns (parallel tool calls → fewer API round-trips)
2. Less context (methodology + snippets → only relevant info survives)

📊 Understanding the problem: why tokens grow quadratically

What you see vs. what is sent

[Interactive counters: characters on the user side vs. characters sent to the API]

The bottom line after 10 exchanges

[Interactive panel: characters typed by the user · characters sent to the API (cumulative) · multiplication factor · cumulative cost complexity, O(n²)]

Each new message resends the entire history. The total cost grows like the sum 1+2+3+…+n = n(n+1)/2.
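
The sum can be checked with a short simulation. This is a minimal sketch, assuming every exchange adds the same number of characters s to the history; the function names are illustrative and not taken from any tool described here.

```python
def cumulative_chars(n: int, s: int) -> int:
    """Total characters sent over n exchanges when each call resends the history."""
    total = 0
    history = 0
    for _ in range(n):
        history += s      # the new message joins the history
        total += history  # the whole history is sent again
    return total


def closed_form(n: int, s: int) -> int:
    """Same quantity via s * (1 + 2 + ... + n) = s * n(n+1)/2."""
    return s * n * (n + 1) // 2


print(cumulative_chars(10, 2_000))  # 110000
print(closed_form(10, 2_000))       # 110000
```

Doubling n from 10 to 20 roughly quadruples the total (110,000 → 420,000 characters), which is the quadratic growth in practice.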

Interactive simulation

Messages (n): 10

Avg size per message (s): 2k

Tool loops per turn (k): 1

Classic chatbot vs. agentic coding

💬 Classic chatbot

User: "Fix this bug" → 200 chars

API response #1: system + msg = 4,200 chars

User: "Also add tests" → 150 chars

API response #2: system + history = 8,500 chars

User: "Rename the var" → 100 chars

API response #3: system + history = 12,800 chars

🤖 Agentic coding

User: "Fix this bug" → 200 chars

API call #1: system + msg = 4,200 chars

→ Read(main.py), tool result: 3,000 chars

API call #2: everything + tool = 7,400 chars

→ Edit(main.py), tool result: 500 chars

API call #3: everything + tools = 8,100 chars

→ RunTests(), tool result: 2,000 chars

API call #4: everything + tools = 10,300 chars

✓ Result returned to user: 1 user message → 4 API calls

1 user message = 3-15 API calls

Each tool call (Read, Edit, Bash…) triggers a new API call with the full accumulated context.
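
The agentic trace above can be reproduced in a few lines. This sketch assumes each assistant reply adds about 200 characters between calls; that figure is inferred from the gaps between the listed call sizes, not stated in the trace, and the function name is hypothetical.

```python
def agentic_call_sizes(first_call: int, assistant_reply: int, tool_results: list[int]) -> list[int]:
    """Size of each API call when every tool result triggers a new full-context call."""
    sizes = [first_call]
    for result in tool_results:
        # The next call carries everything so far, plus the reply and the tool result.
        sizes.append(sizes[-1] + assistant_reply + result)
    return sizes


sizes = agentic_call_sizes(4_200, 200, [3_000, 500, 2_000])
print(sizes)       # [4200, 7400, 8100, 10300]
print(sum(sizes))  # 30000
```

One 200-character user message ends up costing 30,000 characters of input across the four calls.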

Agentic cost: the hidden multiplier

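Classic chatbot

n × s → s × n(n+1)/2 = O(n²)

10 msgs × 2k → 2k × 55 = 110k chars total

Agentic (k loops/turn)

n × k × s → k × s × n(n+1)/2 = O(n² × k)

10 msgs × 5 loops × 2k → 5 × 110k = 550k chars

With agentic coding, every user message costs k full-context API calls instead of one, so the cumulative total is roughly k times larger. A 20-message session with 5 tool loops per turn sends about 2.1 million characters to the API under this model (5 × 2k × 210), and more in practice because tool results also enlarge the history.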
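
The two formulas can be compared directly. This sketch treats k as a pure multiplier, exactly as the O(n² × k) formula does; the function names are illustrative.

```python
def chat_total(n: int, s: int) -> int:
    """Classic chatbot: s * n(n+1)/2 characters over n messages."""
    return s * n * (n + 1) // 2


def agentic_total(n: int, s: int, k: int) -> int:
    """Agentic session: each of the n turns is paid k times."""
    return k * chat_total(n, s)


print(chat_total(10, 2_000))        # 110000
print(agentic_total(10, 2_000, 5))  # 550000
print(agentic_total(20, 2_000, 5))  # 2100000
```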

1. Fewer Turns = Less Token Consumption

Aggressive parallelization of tool calls reduces API round-trips. I would rather risk a failed tool call inside a batch than pay for another full API call.

The 3-turn process

Discover → Read → Act. Group independent calls in the same turn to minimize context resends.

Sequential (8 turns)

Turn 1: Glob("**/*.py")

↓ wait

Turn 2: Grep("handler")

↓ wait

Turn 3: GetFolderDescription()

↓ wait

Turn 4: Read(main.py)

↓ wait

Turn 5: Read(config.py)

↓ wait

Turn 6: Edit(main.py)

↓ wait

Turn 7: Write(test.py)

↓ wait

Turn 8: RunTests()

8 turns = 8 × full context resent

DAG parallel (3 turns)

TURN 1 — discover

Glob("**/*.py")

Grep("handler")

GetFolderDescription()

↓ need results to know what to read

TURN 2 — read

Read(main.py, symbol="handle")

Read(config.py, symbol="Config")

↓ need content to write correct code

TURN 3 — act (all at once with depends_on)

WritePlan

Edit(main.py)

Write(test.py)

RunTests()

3 turns = 3 × context → fewer tokens
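
The grouping into turns can be derived mechanically from the dependencies. The sketch below uses Python's standard `graphlib.TopologicalSorter` and assumes the graph only records "the model must see this result first" edges; ordering inside a turn (for example RunTests after Edit) is assumed to be handled by the harness through depends_on. The `plan_turns` helper and the graph layout are illustrative, not the author's implementation.

```python
from graphlib import TopologicalSorter


def plan_turns(needs_result_of: dict[str, set[str]]) -> list[list[str]]:
    """Group tool calls into turns: a call joins the earliest turn in which
    every result the model must see first is already available."""
    sorter = TopologicalSorter(needs_result_of)
    sorter.prepare()
    turns = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything runnable right now
        turns.append(ready)
        sorter.done(*ready)
    return turns


discover = {"Glob", "Grep", "GetFolderDescription"}
reads = {"Read(main.py)", "Read(config.py)"}
acts = {"WritePlan", "Edit(main.py)", "Write(test.py)", "RunTests"}

graph = {call: set() for call in discover}
graph.update({call: discover for call in reads})  # reads need the discovery results
graph.update({call: reads for call in acts})      # actions need the file contents

for number, turn in enumerate(plan_turns(graph), start=1):
    print(f"Turn {number}: {turn}")
```

With these nine calls the planner yields three turns instead of one turn per call.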

2. Less Context = Less Tokens

The context is append-only. Condense early or pay forever.

The naive approach: keep everything

Each turn resends the entire conversation history. File reads, bash outputs, reasoning — nothing is thrown away.

→ API Call 1: [sys]

← Response: "I'll read main.py" + tool_call(Read)

→ API Call 2: [sys] [asst] [Read(main.py) 400L]

← Response: "Found bug at L42" + tool_call(Edit)

→ API Call 3: [sys] [asst] [main.py 400L] [asst] [edit OK]

← Response: "Running tests" + tool_call(RunTests)

→ API Call 4: [sys] [asst] [main.py 400L] [asst] [edit] [asst] [pytest 200L]

← Response: "all tests pass! Task done."

■ cached prefix ($0.30/MTok) ■ fresh content ($3.00/MTok)

Append-only: previous turns stay in place (cached), new content goes at the end (full price). The 400 lines of main.py are sent 3 times — you only needed 1 function.

With Snippets: same work, smaller context

You still pay to Read once. But instead of carrying 400 lines forever, you save a 20-line Snippet — the savings start next turn. Works with any tool result: Read, Grep, Skills, GetFolderDescription…

→ API Call 1 (same as naive)

← Response: "I'll read main.py" → tool_call(Read)

→ API Call 2 (same as naive: you pay the 400L once): [Read(main.py) 400L FRESH]

← Response: finds bug, emits Snippet(L40-60) + Edit(main.py)

→ API Call 3 (here is the difference: 400L → 20L Snippet): [Snippet 20L] [edit OK]

← Response: "done, running tests" → tool_call(RunTests)

→ API Call 4 (still small): [20L] [edit] [pytest 200L]

← Response: "all tests pass! Task done."

Same structure [S][U][A][...] — just 20L instead of 400L from Call 3 onward

With Methodology: append-only working memory

Each turn, the model appends a Methodology note (goal, plan, discoveries) to the cached prefix. Old tool_results are destroyed — only Methodology + Snippets + the original user message survive.

→ API Call 1

← Response: tool_call(Read) + Methodology#1

→ API Call 2 (400L paid once): [Meth#1] [Snippet] [Read(main.py) 400L FRESH]

← Response: Methodology#2 + Edit(main.py)

→ API Call 3 (400L gone! prefix + last result only): [M#1] [Snip] [M#2] [edit OK]

← Response: Methodology#3 + RunTests()

→ API Call 4 (prefix grew, fresh = just pytest): [M#1] [Snip] [M#2] [M#3] [pytest 200L]

← Response: "all tests pass! Task done."

■ cached prefix: S + U + Meth#1..#N + Snippet ■ fresh: only last tool_result
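
The retention rule can be written as a small function. This is a sketch of the policy as described here, not the actual harness: the `Block` type, its `kind` values and `next_call` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Block:
    kind: str   # "system", "user", "methodology", "snippet" or "tool_result"
    label: str


def next_call(previous: list[Block], model_notes: list[Block], new_result: Block) -> list[Block]:
    """Build the next call: durable blocks survive, old tool results are dropped,
    the notes emitted by the model are appended, and only the newest result is fresh."""
    kept = [b for b in previous if b.kind != "tool_result"]
    return kept + model_notes + [new_result]


call_2 = [
    Block("system", "sys"),
    Block("user", "user"),
    Block("methodology", "M#1"),
    Block("tool_result", "Read(main.py) 400L"),
]
call_3 = next_call(
    call_2,
    [Block("snippet", "Snip"), Block("methodology", "M#2")],
    Block("tool_result", "edit OK"),
)
call_4 = next_call(call_3, [Block("methodology", "M#3")], Block("tool_result", "pytest 200L"))

print([b.label for b in call_3])  # ['sys', 'user', 'M#1', 'Snip', 'M#2', 'edit OK']
print([b.label for b in call_4])  # ['sys', 'user', 'M#1', 'Snip', 'M#2', 'M#3', 'pytest 200L']
```

Because the kept blocks never change order, the prefix of each call is identical to the previous one and stays eligible for caching.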

Demo: naive vs. optimized

Same task: "fix the bug in main.py". 4 API calls each. Compare what's sent.

❌ Naive (keep everything)

→ API Call #1 (2,100 ch)

System prompt 2,000ch

User: "fix the bug in main.py" 100ch

← Response:

Assistant: "I'll read main.py" → tool_call(Read)

→ API Call #2 (6,300 ch)

System prompt 2,000ch

User: "fix the bug" 100ch

Asst#1: "I'll read main.py" 200ch

tool_result: Read(main.py) → 400 lines 4,000ch FRESH

← Response:

Assistant: "Found bug at L42, editing" → tool_call(Edit)

→ API Call #3 (6,700 ch)

System prompt 2,000ch

User 100ch

Asst#1 200ch

Read(main.py) 400 lines 4,000ch STILL HERE

Asst#2 200ch

tool_result: Edit OK 200ch FRESH

← Response:

Assistant: "Fixed! Running tests" → tool_call(RunTests)

→ API Call #4 (8,400 ch)

System prompt 2,000ch

User 100ch

Asst#1 200ch

Read(main.py) 400 lines 4,000ch STILL HERE

Asst#2 200ch

Edit OK 200ch STILL HERE

Asst#3 200ch

tool_result: pytest output 1,500ch FRESH

← Response:

Assistant: "All tests pass! Task done."

✅ Optimized (Methodology + Snippets)

→ API Call #1 (2,100 ch)

System prompt 2,000ch

User: "fix the bug in main.py" 100ch

← Response:

tool_call(Read) + Methodology#1("Goal: fix bug")

→ API Call #2 (6,200 ch)

System prompt 2,000ch

User: "fix the bug" 100ch

Meth#1: "Goal: fix bug. Reading main.py" 100ch (cached)

tool_result: Read(main.py) → 400 lines 4,000ch FRESH

← Response:

Methodology#2 + Snippet(L40-60) + tool_call(Edit)

→ API Call #3 (2,850 ch) ← 400L GONE!

System prompt 2,000ch

User 100ch

Meth#1 100ch (cached)

Snippet: main.py L40-60 300ch (cached)

Meth#2: "Found off-by-one at L42" 150ch (cached)

tool_result: Edit OK 200ch FRESH

← Response:

Methodology#3 + tool_call(RunTests)

→ API Call #4 (4,300 ch)

System prompt 2,000ch

User 100ch

Meth#1 100ch (cached)

Snippet: main.py L40-60 300ch (cached)

Meth#2 150ch (cached)

Meth#3: "Edit done. Running tests." 150ch (cached)

tool_result: pytest output 1,500ch FRESH

← Response:

"All tests pass! Task done."

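The two traces can be tallied by summing the itemized lines of each call. This is plain arithmetic on the character counts listed in the demo; the variable names are illustrative.

```python
# Character counts per call, copied from the itemized lines of the demo.
naive = [
    [2000, 100],
    [2000, 100, 200, 4000],
    [2000, 100, 200, 4000, 200, 200],
    [2000, 100, 200, 4000, 200, 200, 200, 1500],
]
optimized = [
    [2000, 100],
    [2000, 100, 100, 4000],
    [2000, 100, 100, 300, 150, 200],
    [2000, 100, 100, 300, 150, 150, 1500],
]

naive_calls = [sum(call) for call in naive]
optimized_calls = [sum(call) for call in optimized]

print(naive_calls, sum(naive_calls))          # [2100, 6300, 6700, 8400] 23500
print(optimized_calls, sum(optimized_calls))  # [2100, 6200, 2850, 4300] 15450
```

On this four-call task the optimized session sends about a third fewer characters, and the gap widens with every further call because the 400-line read never comes back.
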
Implementation Deep Dive

An agent's bill is the number of API calls × (fresh input + output + cache writes) per call. Every mechanism in this section pulls one of four levers: (1) make fewer API calls · (2) send less fresh input · (3) emit less output · (4) pay cache prices instead of fresh prices.

Lever 1 — Make Fewer API Calls

The biggest lever: every avoided call saves its entire input and output. The story is a causal chain — a risk, an expensive cure, then an optimization that makes the cure rarely needed.

1. The risk: LLMs forget instructions in long contexts

If the model skips Methodology, it develops amnesia: it no longer knows what it already tried and gets stuck in loops. If it skips Snippet, it re-reads files it has already seen. Both burn extra calls. Measured spontaneous omission rates:

81.8%

Methodology omission (turn 1)
Spontaneous, before any fix

34%

Snippet omission (Opus)
Spontaneous, before enforcement

2. The cure: enforcement — detection → recovery side-calls

Because these omissions break the system, harnesses are not optional: every turn is checked, and a detected omission triggers a recovery side-call that forces the missing call. Omission becomes near-zero.

🔒 Enforcement Methodology

check_enforcement() detects missing Methodology call → recover_methodology() fires a side-call with tool_choice locked on Methodology. Triggered only when thinking is non-empty.

📎 Enforcement Snippet

get_unsnippeted_reads() scans for Read results ≥50 lines without a matching Snippet. Recovery side-call decides snippet/discard per result. Churn discard post-recovery: 45%.
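
The two detection checks could look roughly like this. The real check_enforcement() and get_unsnippeted_reads() are not shown in this document, so the data shapes and function names below are hypothetical stand-ins; only the 50-line threshold and the "thinking is non-empty" condition come from the text.

```python
from dataclasses import dataclass

SNIPPET_MIN_LINES = 50  # threshold quoted in the text


@dataclass
class ToolResult:
    result_id: str
    tool: str
    text: str


def methodology_missing(tool_names: list[str], thinking: str) -> bool:
    """True when the turn produced thinking but no Methodology call."""
    return bool(thinking.strip()) and "Methodology" not in tool_names


def unsnippeted_reads(results: list[ToolResult], handled_ids: set[str]) -> list[ToolResult]:
    """Read results of at least 50 lines that were neither snippeted nor discarded."""
    return [
        r for r in results
        if r.tool == "Read"
        and len(r.text.splitlines()) >= SNIPPET_MIN_LINES
        and r.result_id not in handled_ids
    ]


long_read = ToolResult("r1", "Read", "\n".join(["x"] * 60))
short_read = ToolResult("r2", "Read", "\n".join(["x"] * 10))
print([r.result_id for r in unsnippeted_reads([long_read, short_read], set())])  # ['r1']
print(methodology_missing(["Read", "Edit"], "I should look at main.py"))         # True
```

A non-empty return value from either check is what would trigger the recovery side-call.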

3. The catch: each recovery is one extra API call — so we make firings rare

Enforcement works, but at 34% omission it fires constantly. The cheapest call is the one that never happens: prompt engineering lowers the omission rate itself (placeholders, fresh-token reminders), so the harness almost never has to fire:

🌱 Seed Placeholder (Turn 1)

When the methodology note is empty, the harness injects a transient system block with header + syntax + turn rule. This eliminates the 81.8% first-turn omission entirely: the methodology harness no longer fires on turn 1.

⚡ Bigctx-Reminder

~70 fresh tokens injected at the end of the last user message EVERY turn: "This turn MUST end with ≥1 tool_use… Think ≤15 lines then ACT." Wire-only (not persisted).
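
"Wire-only" means the reminder is added to the copy of the conversation that goes out on the wire, never to the stored history. A minimal sketch, assuming messages are dicts with a string `content` field; the helper name is hypothetical and the reminder text is quoted as abbreviated above.

```python
import copy

REMINDER = "This turn MUST end with ≥1 tool_use… Think ≤15 lines then ACT."


def with_reminder(messages: list[dict], reminder: str = REMINDER) -> list[dict]:
    """Return a copy of the conversation with the reminder appended to the
    last user message. The persisted history is left untouched."""
    wire = copy.deepcopy(messages)
    for message in reversed(wire):
        if message["role"] == "user":
            message["content"] = f'{message["content"]}\n\n{reminder}'
            break
    return wire


history = [
    {"role": "user", "content": "fix the bug in main.py"},
    {"role": "assistant", "content": "I'll read main.py"},
]
wire = with_reminder(history)
print(wire[0]["content"].endswith("then ACT."))  # True
print(history[0]["content"])                     # fix the bug in main.py
```

Since the reminder is re-added on every turn and never stored, it always sits in the fresh tail of the request and does not disturb the cached prefix.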

81.8% → 0%

Methodology omission, turn 1
All Opus sessions before the seed deploy (18/22) vs all sessions after (0/36) — Fisher p < 10⁻¹⁰

3.2% → 1.7%

Empty turns in production
All Opus sessions over the 3 days before the action-first reminder deploy (61/1,895 turns, 78 sessions) vs the day after (39/2,334 turns, 98 sessions) — same definition both sides
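
The reported figures can be rechecked from the raw counts. The sketch below computes a one-sided Fisher exact p-value with the hypergeometric formula; which variant of the test the original analysis used is not stated, so this is an assumption, as is the function name.

```python
from math import comb


def fisher_one_sided(a: int, b: int, c: int, d: int) -> float:
    """One-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]:
    probability of at least `a` events in the first row, margins fixed."""
    row1, row2, col1 = a + b, c + d, a + c
    denominator = comb(row1 + row2, col1)
    upper = min(row1, col1)
    numerator = sum(comb(row1, x) * comb(row2, col1 - x) for x in range(a, upper + 1))
    return numerator / denominator


# Methodology omission on turn 1: 18 of 22 sessions before, 0 of 36 after.
print(round(100 * 18 / 22, 1))                 # 81.8
print(fisher_one_sided(18, 4, 0, 36) < 1e-10)  # True

# Empty turns: 61 of 1,895 before, 39 of 2,334 after.
print(round(100 * 61 / 1895, 1), round(100 * 39 / 2334, 1))  # 3.2 1.7
```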

4. Last-resort stoppers: cap what a derailed session can burn

🔁 Loop Detection

MD5 hash of each tool_call batch → cycle detection (size 1–8, repeated ≥3×) → a LoopWarning is injected. A session stuck in a cycle stops burning identical calls on the spot.
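
A sketch of the detection logic, assuming each tool-call batch is a JSON-serializable list of dicts. The function names are illustrative; only MD5, the 1 to 8 cycle sizes and the 3 repetitions come from the description above.

```python
import hashlib
import json
from typing import Optional


def batch_hash(tool_calls: list[dict]) -> str:
    """Fingerprint one batch of tool calls (MD5 is used for identity, not security)."""
    payload = json.dumps(tool_calls, sort_keys=True).encode("utf-8")
    return hashlib.md5(payload).hexdigest()


def find_cycle(hashes: list[str], max_size: int = 8, min_repeats: int = 3) -> Optional[int]:
    """Return the length of the shortest cycle that ends the history, or None."""
    for size in range(1, max_size + 1):
        needed = size * min_repeats
        if len(hashes) < needed:
            break
        tail = hashes[-needed:]
        pattern = tail[:size]
        if all(tail[i] == pattern[i % size] for i in range(needed)):
            return size
    return None


read = batch_hash([{"tool": "Read", "path": "main.py"}])
edit = batch_hash([{"tool": "Edit", "path": "main.py"}])
print(find_cycle([read, edit, read, edit, read, edit]))  # 2
print(find_cycle([read, edit]))                          # None
```

A non-None result is the point where a LoopWarning would be injected.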

🧊 Paralysis Abort

If the count of turns without progress reaches a threshold (default 12), the session is aborted: a frozen agent can never burn more than a bounded number of calls.
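
The abort rule is a simple counter. This sketch assumes the counter resets whenever a turn makes progress, and leaves the definition of "progress" to the caller; both points are assumptions, as is the class name.

```python
class ParalysisGuard:
    """Abort a session after too many turns in a row without progress."""

    def __init__(self, threshold: int = 12) -> None:
        self.threshold = threshold
        self.stalled_turns = 0

    def record_turn(self, made_progress: bool) -> bool:
        """Record one turn and return True when the session should be aborted."""
        self.stalled_turns = 0 if made_progress else self.stalled_turns + 1
        return self.stalled_turns >= self.threshold


guard = ParalysisGuard()
decisions = [guard.record_turn(made_progress=False) for _ in range(12)]
print(decisions[-2], decisions[-1])  # False True
```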

The chain in one line: amnesia/re-reads → harnesses are mandatory → each firing costs one call → optimize the prompt so omissions are rare → fewer firings, fewer turns, net savings.

Lever 2 — Send Less Fresh Input · a. Compress Tool Results

Fresh input per call is just the last user message + the last batch of tool results (~1–3k tokens on the minimal wire). Tool results are its #1 source — so every result is compressed before it enters the turn.

✂️ Generic Truncation

Thresholds: max_lines=200, max_chars=8000, head_lines=80. Overflows saved to disk, only first 80 lines returned with a pointer to the full output.
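
A possible shape for the truncation step, using the three thresholds quoted above. Where the overflow is stored and how the pointer is worded are assumptions; here the full output goes to a temporary file.

```python
import tempfile

MAX_LINES = 200
MAX_CHARS = 8000
HEAD_LINES = 80


def truncate_result(text: str) -> str:
    """Return the text unchanged if it is small, else its head plus a pointer."""
    lines = text.splitlines()
    if len(lines) <= MAX_LINES and len(text) <= MAX_CHARS:
        return text
    # Keep the full output on disk so the agent can still reach it later.
    with tempfile.NamedTemporaryFile(
        "w", suffix=".log", delete=False, encoding="utf-8"
    ) as handle:
        handle.write(text)
        path = handle.name
    head = "\n".join(lines[:HEAD_LINES])
    return f"{head}\n[truncated: {len(lines)} lines total, full output saved to {path}]"


big = "\n".join(f"line {i}" for i in range(300))
print(len(truncate_result(big).splitlines()))  # 81
print(truncate_result("ok"))                   # ok
```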

🧪 Pytest Compaction

Detects pytest output. On failure: keeps FAILURES/ERRORS/warnings/summary. On all-green with ≤20 tests: just lists test names + summary line.
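
A rough heuristic version of this compaction is sketched below. It assumes verbose pytest output (one `path::test PASSED` line per test) and keeps everything from the first FAILURES or ERRORS header onward; the regular expressions, the function name and the fallback for green runs with more than 20 tests are assumptions, not the author's code.

```python
import re

SUMMARY = re.compile(r"^=+ .*(passed|failed|error).* in [\d.]+s.* =+$")
SECTION = re.compile(r"^=+ (FAILURES|ERRORS) =+$")
PASSED = re.compile(r"^(\S+::\S+) PASSED")


def compact_pytest(output: str, max_green_tests: int = 20) -> str:
    lines = output.splitlines()
    summary = next((line for line in reversed(lines) if SUMMARY.match(line)), None)
    if summary is None:
        return output  # not recognised as pytest output: leave untouched
    for index, line in enumerate(lines):
        if SECTION.match(line):
            # Failure case: keep failure details, warnings and the summary.
            return "\n".join(lines[index:])
    names = [m.group(1) for line in lines if (m := PASSED.match(line))]
    if 0 < len(names) <= max_green_tests:
        return "\n".join(names + [summary])
    return summary  # assumption: larger green runs collapse to the summary line


green = "\n".join([
    "============================= test session starts ==============================",
    "collected 2 items",
    "",
    "tests/test_a.py::test_one PASSED                                         [ 50%]",
    "tests/test_a.py::test_two PASSED                                         [100%]",
    "",
    "============================== 2 passed in 0.03s ===============================",
])
print(compact_pytest(green))
```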

🔍 Edit — Fuzzy Error

When old_string not found: SequenceMatcher sliding window finds the closest match, shows ±7 lines of context. Saving: the full re-read turn the model would otherwise need to locate the mismatch.
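
The sliding-window search can be sketched with the standard library's `difflib.SequenceMatcher`. The window size (the number of lines in old_string), the error message format and the function name are assumptions; the ±7 lines of context follow the description above.

```python
from difflib import SequenceMatcher


def closest_match(file_text: str, old_string: str, context: int = 7) -> str:
    """Find the window of the file most similar to old_string and show it in context."""
    file_lines = file_text.splitlines()
    window = len(old_string.splitlines()) or 1
    best_ratio, best_start = -1.0, 0
    for start in range(max(1, len(file_lines) - window + 1)):
        candidate = "\n".join(file_lines[start:start + window])
        ratio = SequenceMatcher(None, candidate, old_string).ratio()
        if ratio > best_ratio:
            best_ratio, best_start = ratio, start
    first = max(0, best_start - context)
    last = min(len(file_lines), best_start + window + context)
    body = "\n".join(f"{n + 1}: {file_lines[n]}" for n in range(first, last))
    return (
        f"old_string not found; closest match at line {best_start + 1} "
        f"(similarity {best_ratio:.2f}):\n{body}"
    )


source_lines = [f"x{i} = {i}" for i in range(1, 31)]
source_lines[14] = "def handler(request):"
report = closest_match("\n".join(source_lines), "def handle(request):")
print(report.splitlines()[0])
```

The model sees the near-miss and its surroundings in the error itself, so it can correct old_string without spending a turn on another Read.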

Lever 2 — Send Less Fresh Input · b. Never Re-Read a File

The other big source of fresh input is re-reading files already seen. Snippets eliminate it: read once, keep the useful part in the cached methodology note, re-resolve it every turn at cache price. Lifecycle:

1. Wrapping — Results ≥50 lines get markers: ==== A SNIPPETER id: file=… ====. Lines numbered 1..N.

2. Storage — The agent calls Snippet(symbol=…) or Snippet(ranges=…). An explicit Discard covers unwanted results.

3. Re-resolution — Symbol snippets are re-resolved dynamically every turn via find_symbol(). Immune to line shifts after Edits.

4. Stale marking — After Edit/Write, range-based snippets on the same file get marked ## snippet-stale:. Append-only.

5. Compaction — When the methodology note exceeds 20,000 tokens (≈ P90 of all notes), structural compaction fires immediately: stale snippets, duplicates, and orphans (file deleted) are purged — no LLM call needed.
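
Step 3 is what makes symbol snippets robust: the symbol is looked up by name in the current file, so line shifts do not matter. The real find_symbol() is not shown here; the sketch below is a hypothetical Python-only stand-in built on the standard `ast` module.

```python
import ast
from typing import Optional


def resolve_symbol(source: str, name: str) -> Optional[str]:
    """Return the current source text of a function or class, looked up by name."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        is_definition = isinstance(
            node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)
        )
        if is_definition and node.name == name:
            return ast.get_source_segment(source, node)
    return None


before = "def handle(x):\n    return x + 1\n"
after_edit = "import os\n\n\n" + before  # an Edit pushed the function down 3 lines

print(resolve_symbol(before, "handle") == resolve_symbol(after_edit, "handle"))  # True
```

A range-based snippet such as L1-2 would now point at the import, which is why those are marked stale after an Edit while symbol snippets are simply resolved again.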

50 lines

SNIPPET_MIN_LINES threshold
Tool results under 50 lines are not wrapped in snippet markers and impose no save-or-discard obligation on the model.

9.6%

Dead snippets (Opus)
Snippets saved but never re-referenced (file path/label never reappears in a later tool_call or Methodology). Measured on 202 Opus sessions. These dead snippets account for ≈3% of total session cost.

Key principle: Edit/Write results are exempt from snippet wrapping (_SNIPPET_EXEMPT_TOOLS) — they are never stored in the persistent context (methodology note), so there is no snippet to manage. Meanwhile, symbol-based snippet re-resolution exists precisely to avoid re-reading files: the agent can reference a function by name and the system resolves it fresh each turn without an explicit Read call.

Lever 3 — Emit Less Output

Output tokens cost 5× input tokens — every line of deliberation is the most expensive line of the call. Two mechanisms keep it short.

⚡ Think less, act more

The bigctx-reminder (Lever 1) ends every turn's input with "Think ≤15 lines then ACT" — shorter deliberation is a direct output saving on top of the turn reduction.

💭 Thinking Overflow

If thinking exceeds 20,000 chars with no tool_call parsed → streaming is cut (~$0.10–0.15 per occurrence: the cut tokens are billed but discarded). The overflow is summarized via claude-sonnet-4-6 into ≤5 bullets / max 300 tokens for the next turn.
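
The cut itself amounts to stopping consumption of the stream. A minimal sketch, assuming the stream is exposed as (kind, payload) events where kind is "thinking" or "tool_call"; that event shape and the function name are assumptions, and the summarization step is left out.

```python
from typing import Iterable, Optional, Tuple

THINKING_LIMIT_CHARS = 20_000


def consume_turn(
    events: Iterable[Tuple[str, str]], limit: int = THINKING_LIMIT_CHARS
) -> Tuple[str, Optional[str], bool]:
    """Read a streamed turn and return (thinking, tool_call, was_cut)."""
    thinking = []
    size = 0
    for kind, payload in events:
        if kind == "tool_call":
            return "".join(thinking), payload, False
        if kind == "thinking":
            thinking.append(payload)
            size += len(payload)
            if size > limit:
                # Stop reading: with a real client the stream would be closed here.
                return "".join(thinking), None, True
    return "".join(thinking), None, False


pulled = []


def fake_stream():
    for i in range(100):
        pulled.append(i)
        yield ("thinking", "x" * 1000)


text, call, was_cut = consume_turn(fake_stream())
print(was_cut, len(pulled), len(text))  # True 21 21000
```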

Lever 4 — Pay Cache Prices, Not Fresh Prices

Whatever still has to travel should be cached, not re-billed. System blocks are sent in a fixed order — [stable_prefix+cc, tool_docs+cc, methodology+cc, seed, delta+cc, volatile] — where +cc marks a cache_control breakpoint (TTL 5 min). Only the tail is fresh.

Snapshot (cached)

Stable prefix of the note already seen by the model. cache_control: ephemeral

✓ cache hit

Delta (cached)

New content appended this turn. Also gets cache_control.

✓ cache hit

Fresh tokens

Bigctx-reminder + messages + tool outputs

✗ fresh

Why doesn't the seed get +cc? The seed placeholder only exists on turn 1, while the methodology note is still empty, and vanishes as soon as the first Methodology call lands. Caching content that is guaranteed to disappear on the next turn would burn one of the 4 available cache breakpoints for zero future hits.
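
The block order could be assembled as follows. This sketch assumes an Anthropic-style Messages API, where each system block is a dict and a cache breakpoint is expressed as `cache_control: {"type": "ephemeral"}`; the function name, its parameters and the choice to skip empty blocks are assumptions.

```python
def build_system_blocks(
    stable_prefix: str, tool_docs: str, methodology: str,
    seed: str, delta: str, volatile: str,
) -> list[dict]:
    """Assemble system blocks in the fixed order, marking cache breakpoints."""
    layout = [
        (stable_prefix, True),
        (tool_docs, True),
        (methodology, True),
        (seed, False),      # turn-1 placeholder: gone next turn, never cached
        (delta, True),
        (volatile, False),  # fresh tail
    ]
    blocks = []
    for text, cached in layout:
        if not text:
            continue  # assumption: empty blocks are not sent at all
        block = {"type": "text", "text": text}
        if cached:
            block["cache_control"] = {"type": "ephemeral"}
        blocks.append(block)
    return blocks


turn_1 = build_system_blocks("rules", "tool docs", "", "seed placeholder", "", "clock")
later = build_system_blocks("rules", "tool docs", "note so far", "", "new notes", "clock")
print(sum("cache_control" in b for b in turn_1))  # 2
print(sum("cache_control" in b for b in later))   # 4
```

On later turns all four breakpoints are in use, which is why spending one on the seed would displace a block that actually gets cache hits.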

Keep the cached note small and valid

📏 Compaction Trigger

When the methodology note exceeds 20,000 tokens (≈ P90 of all notes), structural compaction fires immediately: purge stale snippets, deduplicate, remove orphans. Pure code logic — no LLM call.
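
Structural compaction needs no model call because all three purge rules are mechanical. A sketch under assumed data shapes: the `Snippet` record, the duplicate key and the `file_exists` hook are illustrative choices.

```python
import os
from dataclasses import dataclass
from typing import Callable

COMPACTION_THRESHOLD_TOKENS = 20_000


@dataclass(frozen=True)
class Snippet:
    file: str
    label: str
    text: str
    stale: bool = False


def compact(
    snippets: list[Snippet], file_exists: Callable[[str], bool] = os.path.exists
) -> list[Snippet]:
    """Drop stale snippets, exact duplicates and orphans (file deleted)."""
    seen = set()
    kept = []
    for snippet in snippets:
        key = (snippet.file, snippet.label, snippet.text)
        if snippet.stale or key in seen or not file_exists(snippet.file):
            continue
        seen.add(key)
        kept.append(snippet)
    return kept


notes = [
    Snippet("main.py", "handle", "def handle(): ..."),
    Snippet("main.py", "handle", "def handle(): ..."),        # duplicate
    Snippet("main.py", "L40-60", "old lines", stale=True),    # stale after an Edit
    Snippet("gone.py", "helper", "def helper(): ..."),        # file was deleted
]
kept = compact(notes, file_exists=lambda path: path != "gone.py")
print([(s.file, s.label) for s in kept])  # [('main.py', 'handle')]
```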

🔄 Cache Invalidation

After compaction, _methodology_cache_snapshot = None → the cached prefix is no longer valid. On the next turn the system rebuilds all blocks from scratch (one-time cache_create cost, amortized over subsequent turns).

Cache TTL: 5 minutes by default (ephemeral). Cache writes (cache_create, billed at 1.25× the one-shot input price) account for ~15% of the cost of the top-10 sessions.
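
The price points quoted in this document can be combined into one cost function. The sketch assumes that "one-shot price" means the fresh input price, so a cache write costs 1.25 × $3.00 per MTok and output costs 5 × $3.00 per MTok; the function name and the 10,000-token example are illustrative.

```python
FRESH_INPUT = 3.00                # $/MTok, fresh content
CACHED_READ = 0.30                # $/MTok, cached prefix
CACHE_WRITE = 1.25 * FRESH_INPUT  # $/MTok, cache_create
OUTPUT = 5 * FRESH_INPUT          # $/MTok, output tokens


def call_cost(fresh: int = 0, cached: int = 0, written: int = 0, output: int = 0) -> float:
    """Dollar cost of one API call, given token counts per billing category."""
    dollars_per_mtok = (
        fresh * FRESH_INPUT + cached * CACHED_READ
        + written * CACHE_WRITE + output * OUTPUT
    )
    return dollars_per_mtok / 1_000_000


# A 10,000-token prefix sent on 10 consecutive calls.
uncached = 10 * call_cost(fresh=10_000)
cached = call_cost(written=10_000) + 9 * call_cost(cached=10_000)
print(f"{uncached:.4f} {cached:.4f}")  # 0.3000 0.0645
```

The write premium is recovered on the first reuse: one write plus one read costs $0.0405 here, against $0.0600 for two fresh sends.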
