OpenClaw burns 21.5 million tokens in a day? Three optimization strategies to drastically reduce costs

Why My OpenClaw Sessions Burned 21.5M Tokens in a Day (And What Actually Fixed It)

By MOSHIII

Compiled by Peggy, BlockBeats

Editor’s note: As the Agent app rapidly becomes popular, many teams notice a seemingly counterintuitive phenomenon: the system runs smoothly, but token costs keep rising unnoticed. This article dissects a real OpenClaw workload and finds that the explosion in costs is often not caused by user input or model output, but by overlooked cached prefix replay. The model repeatedly reads large amounts of historical context in each call, leading to massive token consumption.

Using specific session data, the article shows how tool outputs, browser snapshots, JSON logs, and other large intermediate artifacts are continuously written into the history context and repeatedly read during agent loops.

Through this case, the author proposes a clear optimization approach: from context structure design and tool output management to compaction mechanism configuration. For developers building Agent systems, this is not only a technical troubleshooting record but also a practical money-saving guide.

Below is the original text:

I analyzed a real OpenClaw workload and found a pattern I believe many Agent users will recognize:

Token usage appears very “active”

Responses seem normal

But token consumption suddenly skyrockets

Here is the breakdown of the analysis, root causes, and practical fixes.

TL;DR

The main cost driver is not overly long user messages. Instead, it’s the massive cached prefix being repeatedly replayed.

From session data:

Total tokens: 21,543,714

cacheRead: 17,105,970 (79.40%)

input: 4,345,264 (20.17%)

output: 92,480 (0.43%)

In other words: most of the call costs are not from processing new user intents but from repeatedly reading huge amounts of historical context.

The “wait, how can this be?” moment

I initially thought high token usage came from very long user prompts, heavy output generation, or expensive tool calls.

But the dominant pattern is:

input: hundreds to thousands of tokens

cacheRead: 170,000 to 180,000 tokens per call

That is, the model repeatedly reads the same large, stable prefix in each round.
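A back-of-envelope sketch makes this structure concrete. The per-call numbers below are illustrative, taken roughly from the breakdown above, and the call count is hypothetical:

```python
# Back-of-envelope: why cached-prefix replay dominates total tokens.
# All numbers are illustrative, not exact figures from the session data.
calls = 100                    # hypothetical number of agent-loop calls
cache_read_per_call = 175_000  # stable prefix re-read on every round
new_input_per_call = 1_000     # fresh user/tool input per round
output_per_call = 900          # generated tokens per round

total = calls * (cache_read_per_call + new_input_per_call + output_per_call)
replay_share = calls * cache_read_per_call / total
print(f"total tokens: {total:,}, replay share: {replay_share:.1%}")
```

Even with modest new input and output per round, the replayed prefix ends up accounting for nearly all of the total, which matches the roughly 80% cacheRead share observed in the real data.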

Data scope

I analyzed data from two levels:

  1. Runtime logs

  2. Session transcripts

Note that:

Runtime logs mainly observe behavioral signals (like restarts, errors, configuration issues)

Precise token counts come from the usage field in session JSONL files
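The per-session totals can be reproduced with a short aggregation over those usage fields. A minimal sketch, assuming each JSONL line may carry a `usage` object with numeric fields such as `input`, `output`, and `cacheRead` (field names here are illustrative, not necessarily OpenClaw's exact schema):

```python
import json
from collections import Counter

def sum_usage(jsonl_path):
    """Aggregate token usage across all records in a session JSONL file.

    Assumes each line is a JSON object that may carry a `usage` dict
    with numeric fields (e.g. input, output, cacheRead); lines without
    a usage field are skipped.
    """
    totals = Counter()
    with open(jsonl_path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            usage = json.loads(line).get("usage") or {}
            for key, value in usage.items():
                if isinstance(value, (int, float)):
                    totals[key] += value
    return totals
```

Running this over every file under `sessions/` and sorting by total is essentially what the breakdown script above does.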

Scripts used:

scripts/session_token_breakdown.py

scripts/session_duplicate_waste_analysis.py

Generated analysis files:

tmp/session_token_stats_v2.txt

tmp/session_token_stats_v2.json

tmp/session_duplicate_waste.txt

tmp/session_duplicate_waste.json

tmp/session_duplicate_waste.png

Where is the actual token consumption?

1) Session concentration

One session’s consumption far exceeds others:

570587c3-dc42-47e4-9dd4-985c2a50af86: 19,204,645 tokens

Followed by a sharp drop:

ef42abbb-d8a1-48d8-9924-2f869dea6d4a: 1,505,038

ea880b13-f97f-4d45-ba8c-a236cf6f2bb5: 649,584

2) Behavior concentration

Tokens mainly come from:

toolUse: 16,372,294

stop: 5,171,420

This indicates the problem lies mainly in looping tool-call chains, not regular chat.

3) Time concentration

Token peaks are not random but clustered in specific hours:

2026-03-08 16:00: 4,105,105

2026-03-08 09:00: 4,036,070

2026-03-08 07:00: 2,793,648

What’s inside the huge cached prefix?

It’s not dialogue content but mainly large intermediate artifacts:

Large toolResult data blocks

Long reasoning / thinking traces

Large JSON snapshots

File lists

Browser-captured data

Sub-agent conversation records

In the largest session, character count is approximately:

toolResult: 366,469 characters

assistant:thinking: 331,494 characters

assistant:toolCall: 53,039 characters

Once these are retained in the history context, subsequent calls may re-read them via cache prefixes.

Specific example (from session files)

Large context blocks appear repeatedly at:

sessions/570587c3-dc42-47e4-9dd4-985c2a50af86.jsonl:70

Large gateway JSON log (~37,000 characters)

sessions/570587c3-dc42-47e4-9dd4-985c2a50af86.jsonl:134

Browser snapshot + secure encapsulation (~29,000 characters)

sessions/570587c3-dc42-47e4-9dd4-985c2a50af86.jsonl:219

Huge file list output (~41,000 characters)

sessions/570587c3-dc42-47e4-9dd4-985c2a50af86.jsonl:311

session/status snapshot + large prompt structure (~30,000 characters)

“Duplicate content waste” vs “Cache replay burden”

I also measured the proportion of duplicate content within a single call:

Duplicate ratio: approximately 1.72%

It exists, but is not the main issue.

The real problem is: the absolute size of cached prefixes is too large.

The structure is: huge historical context, re-read each round, with only a small amount of new input added on top.

Therefore, the optimization focus is not deduplication but context structure design.

Why is the Agent loop particularly prone to this problem?

Three mechanisms stack:

1. Large amounts of tool output are written into the history context

2. Tool loops produce many calls at short intervals

3. The prefix changes very little → the cache re-reads it on every call

If context compaction isn’t reliably triggered, the problem quickly escalates.
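A threshold-based trigger is one common way to make compaction reliable. The following is an assumed design sketch, not OpenClaw's actual mechanism; `COMPACT_THRESHOLD`, `KEEP_RECENT`, and the `summarize` callback are invented names:

```python
# Sketch of a threshold-based compaction trigger (assumed design):
# once the running context exceeds a token budget, older turns are
# collapsed into a single summary message and only recent turns
# remain verbatim.
COMPACT_THRESHOLD = 150_000   # illustrative token budget
KEEP_RECENT = 10              # turns kept verbatim after compaction

def maybe_compact(history, token_count, summarize):
    """Return history unchanged while under budget; otherwise replace
    everything except the most recent turns with one summary message."""
    if token_count <= COMPACT_THRESHOLD:
        return history
    old, recent = history[:-KEEP_RECENT], history[-KEEP_RECENT:]
    return [{"role": "system", "content": summarize(old)}] + recent
```

The key property is that compaction must actually fire: if the trigger is misconfigured (as with the invalid keys discussed below), the history grows unbounded and every subsequent call replays it.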

Most effective fixes (by impact order)

P0: Don’t put huge tool outputs into long-term context

For oversized tool outputs:

  • Keep summaries + reference paths / IDs

  • Write original payloads into artifact files

  • Don’t keep full original texts in chat history

Prioritize limiting these categories:

  • Large JSON

  • Long directory lists

  • Browser full snapshots

  • Sub-agent full transcripts
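The summarize-and-reference pattern above can be sketched as a small helper. `ARTIFACT_DIR`, `MAX_INLINE_CHARS`, and `artifact_ify` are all hypothetical names introduced for illustration, not part of OpenClaw:

```python
import hashlib
import os

ARTIFACT_DIR = "artifacts"    # hypothetical location for full payloads
MAX_INLINE_CHARS = 2_000      # illustrative cap on inline tool output

def artifact_ify(tool_output: str) -> str:
    """If a tool result is too large, write the full payload to an
    artifact file and keep only a short head + reference path in the
    chat history, so the bulk never enters the cached prefix."""
    if len(tool_output) <= MAX_INLINE_CHARS:
        return tool_output
    os.makedirs(ARTIFACT_DIR, exist_ok=True)
    digest = hashlib.sha256(tool_output.encode()).hexdigest()[:12]
    path = os.path.join(ARTIFACT_DIR, f"tool-{digest}.txt")
    with open(path, "w") as f:
        f.write(tool_output)
    head = tool_output[:500]
    return (f"[tool output truncated: {len(tool_output):,} chars total; "
            f"full payload at {path}]\n{head}")
```

Applied to the examples above, a 41,000-character file listing would shrink to a few hundred characters of context plus a file reference the agent can re-open on demand.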

P1: Ensure the compaction mechanism actually works

In this data, one configuration issue appeared repeatedly: invalid compaction keys.

This silently disables the optimization.

Correct approach: use only version-compatible configurations

Then verify:

openclaw doctor --fix

And check startup logs to confirm compaction is accepted.

P1: Reduce persistence of reasoning text

Avoid replaying long reasoning texts repeatedly

In production: save brief summaries instead of full reasoning

P3: Improve prompt caching design

The goal is not to maximize cacheRead. The goal is to apply caching to stable, concise, high-value prefixes.

Suggestions:

  • Put stable rules into system prompt

  • Don’t put unstable data into stable prefix

  • Avoid injecting large amounts of debug data each round
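One way to picture a cache-friendly layout (an assumed message structure, not any specific provider's API): stable content goes first so the prefix can be cached across rounds, and volatile per-turn data goes last, outside the cached region:

```python
# Sketch of a cache-friendly message layout (assumed structure):
# stable, rarely-changing content first -> cacheable prefix;
# volatile per-turn data last -> never part of the cached prefix.
def build_messages(stable_rules, tool_specs, turn_data, user_msg):
    return (
        [{"role": "system", "content": stable_rules}]  # stable: cached
        + [{"role": "system", "content": tool_specs}]  # stable: cached
        + [{"role": "user",                            # volatile: not cached
           "content": f"{user_msg}\n\n[per-turn data]\n{turn_data}"}]
    )
```

If debug data were instead prepended before the stable rules, the prefix would change every round and the cache would be useless.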

Practical stop-loss plan (if I had to handle it tomorrow)

  1. Identify sessions with highest cacheRead ratio

  2. Run /compact on runaway sessions

  3. Truncate and artifact-ify tool outputs

  4. After each change, rerun token statistics

Track four KPIs:

cacheRead / totalTokens

toolUse avgTotal / call

Number of calls exceeding 100k tokens

Largest single session’s share of total tokens
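The four KPIs above can be computed from per-call records. A sketch, assuming `sessions` maps session IDs to lists of per-call dicts with `total`, `cacheRead`, and `kind` fields (illustrative names, not OpenClaw's exact schema):

```python
def kpis(sessions):
    """Compute the four tracking KPIs from per-call usage records.

    `sessions` maps session id -> list of per-call dicts carrying
    `total` (total tokens), `cacheRead`, and `kind` (e.g. "toolUse").
    """
    calls = [c for cs in sessions.values() for c in cs]
    total = sum(c["total"] for c in calls)
    cache = sum(c["cacheRead"] for c in calls)
    tool_calls = [c for c in calls if c["kind"] == "toolUse"]
    per_session = {sid: sum(c["total"] for c in cs)
                   for sid, cs in sessions.items()}
    return {
        "cacheRead_ratio": cache / total,
        "toolUse_avg_total": (sum(c["total"] for c in tool_calls)
                              / max(len(tool_calls), 1)),
        "calls_over_100k": sum(c["total"] > 100_000 for c in calls),
        "top_session_share": max(per_session.values()) / total,
    }
```

Rerunning this after each change gives a quick before/after comparison without re-reading the full transcripts.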

Signs of success

If the optimization works, you should see:

A significant reduction in 100k+ token calls

Decreased cacheRead proportion

Lower weight of toolUse calls

Reduced dominance of individual sessions

If these metrics don’t improve, your context strategy is still too loose.

Reproduction commands

python3 scripts/session_token_breakdown.py 'sessions' \
  --include-deleted \
  --top 20 \
  --outlier-threshold 120000 \
  --json-out tmp/session_token_stats_v2.json \
  tmp/session_token_stats_v2.txt

python3 scripts/session_duplicate_waste_analysis.py 'sessions' \
  --include-deleted \
  --top 20 \
  --png-out tmp/session_duplicate_waste.png \
  --json-out tmp/session_duplicate_waste.json \
  tmp/session_duplicate_waste.txt

Conclusion

If your Agent system seems fine but costs keep rising, check one thing first: Are you paying for new inference or for large-scale replay of old context?

In my case, most costs actually come from context replay.

Once you realize this, the solution becomes clear: strictly control what data enters long-term context.
