Why My OpenClaw Sessions Burned 21.5M Tokens in a Day (And What Actually Fixed It)
By MOSHIII
Compiled by Peggy, BlockBeats
Editor’s note: As the Agent app rapidly becomes popular, many teams notice a seemingly counterintuitive phenomenon: the system runs smoothly, but token costs keep rising unnoticed. This article dissects a real OpenClaw workload and finds that the explosion in costs is often not caused by user input or model output, but by overlooked cached prefix replay. The model repeatedly reads large amounts of historical context in each call, leading to massive token consumption.
Using specific session data, the article shows how tool outputs, browser snapshots, JSON logs, and other large intermediate artifacts are continuously written into the history context and repeatedly read during agent loops.
Through this case, the author proposes a clear optimization approach: from context structure design and tool output management to compaction mechanism configuration. For developers building Agent systems, this is not only a technical troubleshooting record but also a practical money-saving guide.
Below is the original text:
I analyzed a real OpenClaw workload and found a pattern I believe many Agent users will recognize:
Token usage appears very “active”
Responses seem normal
But token consumption suddenly skyrockets
Here is the breakdown of the analysis, root causes, and practical fixes.
TL;DR
The main cost driver is not overly long user messages. Instead, it’s the massive cached prefix being repeatedly replayed.
From session data:
Total tokens: 21,543,714
cacheRead: 17,105,970 (79.40%)
input: 4,345,264 (20.17%)
output: 92,480 (0.43%)
In other words: most of the call costs are not from processing new user intents but from repeatedly reading huge amounts of historical context.
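A breakdown like the one above can be reproduced with a small pass over a session file. This is a sketch, not the author's actual script: it assumes each JSONL entry carries a `usage` object with `input`, `output`, and `cacheRead` fields, matching the field names quoted in this post.

```python
import json
from collections import Counter
from pathlib import Path

def usage_breakdown(session_path):
    """Sum token-usage fields across all entries in a session JSONL file.

    Assumes each line is a JSON object that may carry a `usage` dict with
    `input`, `output`, and `cacheRead` counts (schema is an assumption).
    """
    totals = Counter()
    for line in Path(session_path).read_text().splitlines():
        if not line.strip():
            continue  # skip blank lines
        entry = json.loads(line)
        usage = entry.get("usage") or {}
        for key in ("input", "output", "cacheRead"):
            totals[key] += usage.get(key, 0)
    totals["total"] = sum(totals.values())
    return totals
```

Dividing each field by `totals["total"]` yields the percentage split shown above.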
“Wait, how can this be?” moment
I initially thought the high token usage came from very long user prompts, heavy output generation, or expensive tool calls.
But the dominant pattern is:
input: hundreds to thousands of tokens
cacheRead: 170,000 to 180,000 tokens per call
That is, the model repeatedly reads the same large, stable prefix in each round.
Data scope
I analyzed data from two levels:
Runtime logs
Session transcripts
Note that:
Runtime logs mainly observe behavioral signals (like restarts, errors, configuration issues)
Precise token counts come from the usage field in session JSONL files
Scripts used:
scripts/session_token_breakdown.py
scripts/session_duplicate_waste_analysis.py
Generated analysis files:
tmp/session_token_stats_v2.txt
tmp/session_token_stats_v2.json
tmp/session_duplicate_waste.txt
tmp/session_duplicate_waste.json
tmp/session_duplicate_waste.png
Where is the actual token consumption?
1) Session concentration
One session’s consumption far exceeds others:
570587c3-dc42-47e4-9dd4-985c2a50af86: 19,204,645 tokens
Followed by a sharp drop:
ef42abbb-d8a1-48d8-9924-2f869dea6d4a: 1,505,038
ea880b13-f97f-4d45-ba8c-a236cf6f2bb5: 649,584
2) Behavior concentration
Tokens mainly come from:
toolUse: 16,372,294
stop: 5,171,420
This indicates the problem mainly lies in tool call chains looping, not regular chat.
3) Time concentration
Token peaks are not random but clustered in specific hours:
2026-03-08 16:00: 4,105,105
2026-03-08 09:00: 4,036,070
2026-03-08 07:00: 2,793,648
What’s inside the huge cached prefix?
It’s not dialogue content but mainly large intermediate artifacts:
Large toolResult data blocks
Long reasoning / thinking traces
Large JSON snapshots
File lists
Browser-captured data
Sub-agent conversation records
In the largest session, character count is approximately:
toolResult: 366,469 characters
assistant:thinking: 331,494 characters
assistant:toolCall: 53,039 characters
Once these are retained in the history context, subsequent calls may re-read them via cache prefixes.
Specific example (from session files)
Large context blocks appear repeatedly at:
sessions/570587c3-dc42-47e4-9dd4-985c2a50af86.jsonl:70
Large gateway JSON log (~37,000 characters)
sessions/570587c3-dc42-47e4-9dd4-985c2a50af86.jsonl:134
Browser snapshot + secure encapsulation (~29,000 characters)
sessions/570587c3-dc42-47e4-9dd4-985c2a50af86.jsonl:219
Huge file list output (~41,000 characters)
sessions/570587c3-dc42-47e4-9dd4-985c2a50af86.jsonl:311
session/status snapshot + large prompt structure (~30,000 characters)
“Duplicate content waste” vs “Cache replay burden”
I also measured the proportion of duplicate content within a single call:
Duplicate ratio: approximately 1.72%
It exists, but is not the main issue.
The real problem is: the absolute size of cached prefixes is too large.
The structure is: huge historical context, re-read each round, with only a small amount of new input added on top.
Therefore, the optimization focus is not deduplication but context structure design.
Why is the Agent loop particularly prone to this problem?
Three mechanisms stack:
1. Large tool outputs are written into the history context
2. Tool loops generate many calls at short intervals
3. The prefix changes very little → the cache re-reads it on every call
If context compaction isn’t reliably triggered, the problem quickly escalates.
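The third mechanism is what makes costs compound: every round replays the entire accumulated prefix. A toy model makes the arithmetic concrete (the numbers are illustrative, chosen to match the ~150–180k-token prefix scale reported above):

```python
def cumulative_cache_read(rounds, base_prefix, delta_per_round):
    """Model cacheRead growth in an agent loop: each round replays the
    prefix accumulated so far, then appends delta_per_round new tokens
    of tool output to the history."""
    prefix = base_prefix
    total_read = 0
    for _ in range(rounds):
        total_read += prefix       # whole prefix replayed via cacheRead
        prefix += delta_per_round  # tool output appended to history
    return total_read
```

With a 150,000-token starting prefix and only 300 new tokens per round, 100 rounds replay `cumulative_cache_read(100, 150_000, 300)` = 16,485,000 tokens, the same order of magnitude as the session analyzed here, even though the new input per round is tiny.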
Most effective fixes (by impact order)
P0—Don’t put huge tool outputs into long-term context
For oversized tool outputs:
Keep summaries + reference paths / IDs
Write original payloads into artifact files
Don’t keep full original texts in chat history
Prioritize limiting these categories:
Large JSON
Long directory lists
Browser full snapshots
Sub-agent full transcripts
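The P0 fix above can be sketched as a small gate in front of the history: keep small results inline, spill oversized ones to an artifact file, and retain only a summary plus a reference path. The threshold and file layout here are illustrative assumptions, not OpenClaw settings.

```python
import hashlib
from pathlib import Path

MAX_INLINE_CHARS = 2_000  # tunable threshold; an assumption, not an OpenClaw default

def compact_tool_result(result_text, artifact_dir="artifacts"):
    """Keep small tool results inline; spill oversized ones to an artifact
    file and return only a short preview plus a reference path."""
    if len(result_text) <= MAX_INLINE_CHARS:
        return result_text
    out_dir = Path(artifact_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    digest = hashlib.sha256(result_text.encode()).hexdigest()[:12]
    path = out_dir / f"tool-result-{digest}.txt"
    path.write_text(result_text)  # full payload lives on disk, not in context
    preview = result_text[:500]
    return (f"[tool result truncated: {len(result_text):,} chars; "
            f"full payload at {path}]\n{preview}")
```

The agent can still fetch the full payload on demand by reading the referenced file, but the history context only ever carries the summary.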
P1—Ensure compaction mechanism actually works
In this data, one configuration issue appears repeatedly: invalid compaction keys.
This silently disables optimization.
Correct approach: use only version-compatible configurations
Then verify:
openclaw doctor --fix
And check startup logs to confirm compaction is accepted.
P1—Reduce persistence of reasoning text
Avoid replaying long reasoning texts repeatedly
In production: save brief summaries instead of full reasoning
P3—Improve prompt caching design
The goal is not to maximize cacheRead, but to apply the cache to stable, concise, high-value prefixes.
Suggestions:
Put stable rules into system prompt
Don’t put unstable data into stable prefix
Avoid injecting large amounts of debug data each round
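The separation above can be sketched as a message builder that keeps volatile data out of the cacheable prefix. The message shape follows the common chat-API convention; the field names and layout are illustrative assumptions.

```python
def build_messages(stable_rules, volatile_state, user_msg):
    """Place stable rules first so the provider can cache them as a prefix;
    append volatile data (debug info, state snapshots) at the tail, where a
    change does not invalidate the cached prefix."""
    return [
        {"role": "system", "content": stable_rules},  # stable, cacheable prefix
        {"role": "user",
         "content": f"{user_msg}\n\n[current state]\n{volatile_state}"},  # volatile tail
    ]
```

Because the system message is byte-identical across rounds, the cached prefix stays valid even as the state snapshot changes every call.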
Practical stop-loss plan (if I had to act tomorrow)
Identify sessions with highest cacheRead ratio
Run /compact on runaway sessions
Truncate and artifact-ify tool outputs
After each change, rerun token statistics
Track four KPIs:
cacheRead / totalTokens
toolUse avgTotal / call
Number of calls exceeding 100k tokens
Largest session’s proportion
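The four KPIs above can be computed from per-call usage records. This sketch assumes each record is a dict with `session`, `total`, `cacheRead`, and `toolUse` fields; the field names are illustrative assumptions about the log schema.

```python
from collections import Counter

def track_kpis(calls):
    """Compute the four tracking KPIs from per-call records.

    Each record (assumed schema):
      {"session": str, "total": int, "cacheRead": int, "toolUse": bool}
    """
    total = sum(c["total"] for c in calls)
    tool_calls = [c for c in calls if c["toolUse"]]
    per_session = Counter()
    for c in calls:
        per_session[c["session"]] += c["total"]
    return {
        "cacheRead_ratio": sum(c["cacheRead"] for c in calls) / total if total else 0.0,
        "toolUse_avg_total": (sum(c["total"] for c in tool_calls) / len(tool_calls)
                              if tool_calls else 0.0),
        "calls_over_100k": sum(1 for c in calls if c["total"] > 100_000),
        "largest_session_share": max(per_session.values()) / total if total else 0.0,
    }
```

Rerunning this after each change gives a before/after snapshot of all four metrics at once.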
Signs of success
If the optimization works, you should see:
A significant reduction in 100k+ token calls
Decreased cacheRead proportion
Lower weight of toolUse calls
Reduced dominance of individual sessions
If these metrics don’t improve, your context strategy is still too loose.
Reproduction commands
python3 scripts/session_token_breakdown.py sessions \
  --include-deleted \
  --top 20 \
  --outlier-threshold 120000 \
  --json-out tmp/session_token_stats_v2.json

python3 scripts/session_duplicate_waste_analysis.py sessions \
  --include-deleted \
  --top 20 \
  --png-out tmp/session_duplicate_waste.png \
  --json-out tmp/session_duplicate_waste.json
Conclusion
If your Agent system seems fine but costs keep rising, check one thing first: Are you paying for new inference or for large-scale replay of old context?
In my case, most costs actually come from context replay.
Once you realize this, the solution becomes clear: strictly control what data enters long-term context.