Context Window Calculation Analysis
Problem Statement
Our /context overlay shows inconsistent numbers:
- Total shown: 122.4k tokens (from API's actual count)
- Breakdown sum: ~73k tokens (our length/4 estimates)
- Free space: Calculated from breakdown, not actual total
This leads to confusing UX where numbers don't add up.
Additionally, our compaction decision uses a different calculation than /context, leading to inconsistency.
Critical Finding #1: Reasoning Tokens Not Sent Back to LLM
Current State (Dexto)
We have the type but DON'T actually store reasoning:
// AssistantMessage in context/types.ts
interface AssistantMessage {
reasoning?: string; // Field EXISTS but is never populated!
tokenUsage?: TokenUsage;
// ...
}
Two separate bugs:

1. `stream-processor.ts` never persists reasoning text:

   ```typescript
   // Line 24: Reasoning IS accumulated during streaming
   private reasoningText: string = '';

   // Lines 97-108: Accumulated from reasoning-delta events
   case 'reasoning-delta':
       this.reasoningText += event.text; // ✓ Collected

   // BUT lines 314-320: Only tokenUsage is persisted!
   await this.contextManager.updateAssistantMessage(
       this.assistantMessageId,
       { tokenUsage: usage } // ✗ No reasoning field!
   );
   ```

2. `formatAssistantMessage()` in `vercel.ts` ignores `msg.reasoning`:
   - Only extracts `msg.content` (text parts) and `msg.toolCalls`
   - Even if reasoning WAS stored, it wouldn't be sent back

Result: Reasoning is collected → emitted to events → but never persisted or round-tripped.
How OpenCode Handles It (Correctly)
// In toModelMessage() - opencode/src/session/message-v2.ts
if (part.type === "reasoning") {
assistantMessage.parts.push({
type: "reasoning",
text: part.text,
providerMetadata: part.metadata, // Critical for round-tripping!
})
}
OpenCode:
- Stores reasoning as `ReasoningPart` in message parts
- Includes `providerMetadata` (contains thought signatures for Gemini, etc.)
- Sends reasoning back in `toModelMessage()` conversion
- Tracks `reasoning` tokens separately in token usage
How Gemini-CLI Handles It (Different Approach)
// Uses thought: true flag on parts from model
{ text: 'Hmm', thought: true }
// BUT they explicitly FILTER OUT thoughts before storing in history!
// geminiChat.ts line 815:
modelResponseParts.push(
...content.parts.filter((part) => !part.thought), // Filter OUT thoughts
);
// Token tracking still captures thoughtsTokenCount from API response
// chatRecordingService.ts line 278:
tokens.thoughts = respUsageMetadata.thoughtsTokenCount ?? 0;
Key difference: Gemini-CLI tracks thought tokens for display/cost but does NOT round-trip them. This works because Google's API doesn't require thought history for context continuity.
Why We Follow OpenCode's Approach
- We use Vercel AI SDK like OpenCode, not Google's native SDK
- Provider-agnostic: OpenCode's approach works across all providers
- No provider-specific logic: We shouldn't special-case Google's behavior
- Context continuity: Some providers (especially via AI SDK) may need reasoning for proper state
Impact of Current Bugs
- Context continuity broken: Reasoning traces lost between turns
- Token counting incorrect: Reasoning tokens used but not tracked in context
- Provider metadata lost: Cannot round-trip provider-specific metadata (e.g., OpenAI item IDs)
Critical Finding #2: Token Usage Storage
What We Track
Session Level (session-manager.ts):
sessionData.tokenUsage = {
inputTokens: 0,
outputTokens: 0,
reasoningTokens: 0,
cacheReadTokens: 0,
cacheWriteTokens: 0,
totalTokens: 0,
};
Message Level (AssistantMessage):
interface AssistantMessage {
tokenUsage?: TokenUsage; // Available but...
}
Current Flow
1. `stream-processor.ts` creates the assistant message with empty metadata:

   ```typescript
   await this.contextManager.addAssistantMessage('', [], {});
   ```

2. After streaming completes, we DO update it with token usage:

   ```typescript
   await this.contextManager.updateAssistantMessage(
       this.assistantMessageId,
       { tokenUsage: usage }
   );
   ```
So we HAVE the data on each message, we just don't use it for context calculation!
Critical Finding #3: Estimate vs Actual Mismatch
The Problem
API actual inputTokens: 122.4k
Our length/4 estimate: 73.0k
Difference: 49.4k (67% underestimate!)
Why So Different?
1. Tokenizers don't split evenly by characters
   - Code tokenizes differently than prose
   - JSON schemas are verbose when tokenized
   - Special characters, whitespace handling varies

2. We're comparing different things
   - `actualTokens` = from the last LLM call (includes everything sent)
   - `breakdown estimate` = calculated now on the current history

3. Context has grown since the last call
   - Last call's `inputTokens` doesn't include the response that followed
   - New user messages have been added since
How Other Tools Handle This
Claude Code (Anthropic)
Uses /v1/messages/count_tokens API for exact counts!
// From cli.js (minified)
countTokens(A,Q) {
return this._client.post("/v1/messages/count_tokens", { body: A, ...Q })
}
Categories tracked:
- System prompt
- System tools
- Memory files
- Skills
- MCP tools (with deferred loading)
- Agents
- Messages (with sub-breakdown)
- Free space
- Autocompact buffer
Free space calculation:
// YA = sum of all category tokens (excluding deferred)
let YA = k.reduce((CA, _A) => CA + (_A.isDeferred ? 0 : _A.tokens), 0)
// WA = buffer (autocompact or compact)
let WA = autocompactEnabled ? (maxTokens - contextUsed) : 500;
// Free space
let wA = Math.max(0, maxTokens - YA - WA)
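De-minified, the same logic reads roughly as follows. This is a sketch with invented names; only `isDeferred` and the 500-token fallback buffer come from the minified source above:

```typescript
interface Category {
  name: string;
  tokens: number;
  isDeferred: boolean; // deferred MCP tools are excluded from the sum
}

// Mirrors the minified cli.js logic: sum non-deferred categories,
// reserve a buffer, and clamp free space at zero.
function freeSpace(
  categories: Category[],
  maxTokens: number,
  autocompactEnabled: boolean,
  contextUsed: number
): number {
  const used = categories.reduce(
    (sum, c) => sum + (c.isDeferred ? 0 : c.tokens),
    0
  );
  const buffer = autocompactEnabled ? maxTokens - contextUsed : 500;
  return Math.max(0, maxTokens - used - buffer);
}
```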
gemini-cli
Hybrid approach:
// Sync estimation (fast)
estimateTokenCountSync(parts): number {
// ASCII: ~4 chars per token (0.25 tokens/char)
// Non-ASCII/CJK: ~1-2 chars per token (1.3 tokens/char)
}
// API counting (when needed)
if (hasMedia) {
use Gemini countTokens API
} else {
use sync estimation
}
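The sync heuristic can be sketched as runnable code; the per-character rates are the approximations quoted above, not gemini-cli's exact constants:

```typescript
// Character-class token estimate: ASCII runs at ~0.25 tokens/char,
// non-ASCII (e.g. CJK) at ~1.3 tokens/char.
function estimateTokenCountSync(text: string): number {
  let estimate = 0;
  for (const ch of text) {
    estimate += ch.charCodeAt(0) < 128 ? 0.25 : 1.3;
  }
  return Math.ceil(estimate);
}
```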
Token tracking from API response:
{
input: promptTokenCount,
output: candidatesTokenCount,
cached: cachedContentTokenCount,
thoughts: thoughtsTokenCount, // Reasoning!
tool: toolUsePromptTokenCount,
total: totalTokenCount
}
opencode
Simple estimation + detailed tracking:
Token.estimate(input: string): number {
return Math.round(input.length / 4)
}
// But tracks actuals per message:
StepFinishPart {
tokens: {
input: number,
output: number,
reasoning: number,
cache: { read: number, write: number }
}
}
Current Architecture Issues
1. Reasoning Pipeline (BROKEN - Two Bugs)
Current (broken):
LLM Response → reasoning-delta events received
↓
stream-processor.ts → accumulates reasoningText ✓
↓
updateAssistantMessage() → ONLY saves tokenUsage, NOT reasoning ✗
↓
AssistantMessage.reasoning = undefined (never set!)
↓
formatAssistantMessage() → has nothing to format anyway
↓
Reasoning NOT sent back to LLM ❌
Should be (following OpenCode):
LLM Response → reasoning-delta events received (with providerMetadata)
↓
stream-processor.ts → accumulates reasoningText AND reasoningMetadata
↓
updateAssistantMessage() → saves reasoning + reasoningMetadata + tokenUsage
↓
AssistantMessage.reasoning = "thinking..." ✓
AssistantMessage.reasoningMetadata = { openai: { itemId: "..." } } ✓
↓
formatAssistantMessage() → includes reasoning part with providerMetadata
↓
Reasoning sent back to LLM ✓
2. Token Calculation (/context)
Current:
// Uses length/4 estimate for everything
systemPromptTokens = estimateStringTokens(systemPrompt); // length/4
messagesTokens = estimateMessagesTokens(preparedHistory); // length/4
toolsTokens = estimateToolTokens(tools); // length/4
total = systemPromptTokens + messagesTokens + toolsTokens;
freeSpace = maxTokens - total - outputBuffer;
Problem: Total doesn't match API's actual count.
3. Compaction Decision
Current (turn-executor.ts):
const estimatedTokens = estimateMessagesTokens(prepared.preparedHistory);
if (estimatedTokens > compactionThreshold) {
// Compact!
}
Problem: Uses different calculation than /context, and both are wrong!
Proposed Solution
Principle: Single Source of Truth
- Use actual token counts from API as ground truth
- Track tokens per message for accurate history calculation
- Estimate only what we cannot measure
- Same formula for
/contextAND compaction decisions
THE FORMULA (Precise Specification)
Core Formula
estimatedNextInput = lastInputTokens + lastOutputTokens + newMessagesEstimate
Variable Definitions
| Variable | Definition | Source | When Updated |
|---|---|---|---|
| `lastInputTokens` | Tokens we SENT in the most recent LLM call | `tokenUsage.inputTokens` from API response | After EVERY LLM call |
| `lastOutputTokens` | Tokens the LLM RETURNED in its response | `tokenUsage.outputTokens` from API response | After EVERY LLM call |
| `newMessagesEstimate` | Estimate for messages added AFTER the last LLM call | length/4 heuristic | Calculated on demand |
What Counts as "New Messages"?
Messages added to history AFTER lastInputTokens was recorded:
- Tool results (role='tool') from the last assistant's tool calls
- New user messages typed since last LLM call
- Any injected system messages added between calls
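Putting the formula and the new-message rule together, a minimal sketch (the `Message` shape and index-based tracking are illustrative assumptions, not our actual types):

```typescript
interface Message {
  role: 'user' | 'assistant' | 'tool' | 'system';
  content: string;
  index: number; // position in history; entries after lastCallIndex are "new"
}

const estimateTokens = (text: string): number => Math.round(text.length / 4);

// estimatedNextInput = lastInputTokens + lastOutputTokens + newMessagesEstimate
function estimateNextInput(
  lastInputTokens: number,
  lastOutputTokens: number,
  history: Message[],
  lastCallIndex: number
): number {
  const newMessagesEstimate = history
    .filter((m) => m.index > lastCallIndex)
    .reduce((sum, m) => sum + estimateTokens(m.content), 0);
  return lastInputTokens + lastOutputTokens + newMessagesEstimate;
}
```

With the example-flow numbers (5000 input, 100 output, one ~80-char tool result), this reproduces the 5120 estimate shown below.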
Example Flow
┌─────────────────────────────────────────────────────────────────┐
│ Turn 1: User asks "What's the weather in NYC?" │
├─────────────────────────────────────────────────────────────────┤
│ LLM Call: │
│ inputTokens = 5000 (system + tools + user message) │
│ outputTokens = 100 (assistant: "I'll check" + tool_call) │
│ │
│ After call: UPDATE lastInputTokens=5000, lastOutputTokens=100 │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ Tool executes, result added to history │
│ Tool result: "NYC: 72°F, sunny" (role='tool') │
│ │
│ This is a NEW MESSAGE (added after lastInputTokens recorded) │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ Before Turn 2: Calculate estimated context │
├─────────────────────────────────────────────────────────────────┤
│ lastInputTokens = 5000 (from Turn 1) │
│ lastOutputTokens = 100 (from Turn 1) │
│ newMessagesEstimate = estimate(tool_result) ≈ 20 │
│ │
│ estimatedNextInput = 5000 + 100 + 20 = 5120 │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ Turn 2: LLM processes tool result │
├─────────────────────────────────────────────────────────────────┤
│ LLM Call: │
│ inputTokens = 5115 (ACTUAL - this is our ground truth!) │
│ outputTokens = 50 (assistant: "The weather is 72°F...") │
│ │
│ VERIFICATION: estimated=5120, actual=5115, error=+5 (+0.1%) │
│ │
│ After call: UPDATE lastInputTokens=5115, lastOutputTokens=50 │
└─────────────────────────────────────────────────────────────────┘
Verification Metrics
On EVERY LLM call, log the accuracy of our previous estimate:
// Before LLM call
const estimated = lastInputTokens + lastOutputTokens + newMessagesEstimate;
// After LLM call, compare to actual
const actual = response.tokenUsage.inputTokens;
const error = estimated - actual;
const errorPercent = (error / actual) * 100;
logger.info(`Context estimate: estimated=${estimated}, actual=${actual}, error=${error > 0 ? '+' : ''}${error} (${errorPercent.toFixed(1)}%)`);
Breakdown for Display (Back-Calculation)
For /context overlay, we show a breakdown. Since we only know the TOTAL accurately, we back-calculate messages:
const total = lastInputTokens + lastOutputTokens + newMessagesEstimate;
// These are estimates (we can't measure them directly)
const systemPromptEstimate = estimateTokens(systemPrompt); // length/4
const toolsEstimate = estimateToolsTokens(tools); // length/4
// Back-calculate messages so the math adds up
let messagesDisplay = total - systemPromptEstimate - toolsEstimate;
// If negative, our estimates are too high - cap at 0 and log warning
if (messagesDisplay < 0) {
    logger.warn(`Back-calculated messages negative (${messagesDisplay}), estimates may be too high`);
    messagesDisplay = 0;
}
Edge Cases
| Scenario | Behavior |
|---|---|
| No LLM call yet | lastInputTokens=null, fall back to pure estimation, show "(estimated)" label |
| After compaction | History changed significantly, set lastInputTokens=null, fall back to estimation until next call |
| messagesDisplay negative | Cap at 0, log warning - indicates system/tools estimates too high |
| System prompt changed | Next estimate may be off, but next actual will correct it |
| Tools changed (MCP) | Same as above - self-correcting after next call |
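The first two rows can be sketched as a fallback wrapper (the null sentinel and `isEstimated` flag are assumptions for illustration, not existing fields):

```typescript
interface ContextUsage {
  total: number;
  isEstimated: boolean; // true → display the "(estimated)" label
}

// Falls back to pure length/4 estimation when we have no actuals
// (new session, or lastInputTokens reset to null after compaction).
function contextUsage(
  lastInputTokens: number | null,
  lastOutputTokens: number | null,
  newMessagesEstimate: number,
  fullHistoryEstimate: number
): ContextUsage {
  if (lastInputTokens === null || lastOutputTokens === null) {
    return { total: fullHistoryEstimate, isEstimated: true };
  }
  return {
    total: lastInputTokens + lastOutputTokens + newMessagesEstimate,
    isEstimated: false,
  };
}
```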
What /context Should Display
Context Usage: 52,100 / 200,000 tokens (26%)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Breakdown:
System prompt: 4,000 tokens (estimated)
Tools: 8,000 tokens (estimated)
Messages: 40,100 tokens (back-calculated)
─────────────────────────────
Total: 52,100 tokens
Calculation basis:
Last actual input: 50,000 tokens
Last output: 2,000 tokens
New since then: 100 tokens (estimated)
Last estimate accuracy: +0.6% error
Free space: 131,900 tokens (after 16,000 output buffer)
Implementation Checklist
- Store `lastInputTokens` and `lastOutputTokens` after each LLM call
- Track which messages are "new" since the last LLM call (needs message timestamp or index tracking)
- Calculate `newMessagesEstimate` only for messages added after the last call
- Log verification metrics on every LLM call
- Update the `/context` overlay to show this breakdown
- Handle edge cases (no call yet, after compaction)
- Use the SAME formula for compaction decisions
Legacy Edge Cases (keeping for reference)
1. No LLM call yet (new session)
   - Fall back to pure estimation
   - All numbers are estimates with "(estimated)" label

2. `messagesDisplay` comes out negative
   - Our estimates for system/tools are too high
   - Cap at 0, log warning
   - Indicates estimation needs calibration

3. After compaction
   - Token counts reset with the new session
   - `compactionCount` tracks how many times we've compacted

4. Reasoning tokens
   - Must be sent back to LLM (fix formatter) ✅ DONE
   - Include in context calculation
   - Track separately for display
Verification: Why lastOutputTokens Is Safe to Use Directly
Verified on 2025-01-20 by analyzing AI SDK source code and our codebase
Question: Does outputTokens include content that might be pruned before the next LLM call?
Answer: No. outputTokens is safe to use directly because:
Part 1: What does outputTokens include? (AI SDK Verification)
Anthropic - verified via ai/packages/anthropic/src/__fixtures__/anthropic-json-tool.1.chunks.txt:
{"type":"message_delta","delta":{"stop_reason":"tool_use"},"usage":{"output_tokens":47}}
Tool call response reports output_tokens: 47 - includes tool calls ✅
OpenAI - verified via ai/packages/openai/src/responses/__fixtures__/openai-shell-tool.1.chunks.txt:
{"output":[{"type":"shell_call","action":{"commands":["ls -a ~/Desktop"]}}],"usage":{"output_tokens":41}}
Shell tool call reports output_tokens: 41 - includes tool calls ✅
Google - verified via ai/packages/google/src/google-generative-ai-language-model.test.ts lines 2274-2302:
content: { parts: [{ functionCall: { name: 'test-tool', args: { value: 'test' } } }] },
usageMetadata: { promptTokenCount: 10, candidatesTokenCount: 20, totalTokenCount: 30 }
Function call response reports candidatesTokenCount: 20 - includes tool calls ✅
Part 2: What gets pruned in our system?
From `manager.ts` `prepareHistory()`:
- Only tool result messages (role='tool') can be pruned
- They're marked with a `compactedAt` timestamp
- Replaced with the placeholder `[Old tool result content cleared]`
What is NEVER pruned:
- Assistant messages (text content)
- Assistant's tool calls
- User messages
Verification Table
| Message Type | Pruned? | Part of outputTokens? |
|---|---|---|
| Assistant text | ❌ Never | ✅ Yes |
| Assistant tool calls | ❌ Never | ✅ Yes (verified across all providers) |
| Tool results (role='tool') | ✅ Can be pruned | ❌ No (separate messages) |
Code Evidence
- `stream-processor.ts`: Tool calls stored via `addToolCall()` with full arguments
- `manager.ts` line 279: Only `msg.role === 'tool' && msg.compactedAt` gets the placeholder
- No code path exists to prune assistant messages
Conclusion: The formula `lastInputTokens + lastOutputTokens + newMessagesEstimate` is correct because:
- `lastInputTokens` reflects pruned history (the API tells us exactly what was sent)
- `lastOutputTokens` is the assistant's response (text + tool calls), which is stored and sent back as-is
- All major providers (Anthropic, OpenAI, Google) include tool calls in their output token counts
- Only tool results (separate messages) can be pruned, and those are in `inputTokens`
Implementation Plan
Phase 1: Fix Reasoning Storage (HIGH PRIORITY - Bug #1) ✅ COMPLETED
The root cause: stream-processor.ts collects reasoning but never persists it.
Files to modify:
- packages/core/src/llm/executor/stream-processor.ts
- packages/core/src/context/types.ts
Changes:
1. Add a `reasoningMetadata` field to the `AssistantMessage` type:

   ```typescript
   // In context/types.ts
   interface AssistantMessage {
       reasoning?: string;
       reasoningMetadata?: Record<string, unknown>; // NEW - for provider round-tripping
       // ...
   }
   ```

2. Capture `providerMetadata` from reasoning-delta events:

   ```typescript
   // In stream-processor.ts, add field:
   private reasoningMetadata: Record<string, unknown> | undefined;

   // In reasoning-delta case:
   case 'reasoning-delta':
       this.reasoningText += event.text;
       // Capture provider metadata for round-tripping (OpenAI itemId, etc.)
       if (event.providerMetadata) {
           this.reasoningMetadata = event.providerMetadata;
       }
       // ... emit events
   ```

3. Fix the bug - persist reasoning in `updateAssistantMessage()`:

   ```typescript
   // In stream-processor.ts, 'finish' case (around line 315):
   if (this.assistantMessageId) {
       await this.contextManager.updateAssistantMessage(
           this.assistantMessageId,
           {
               tokenUsage: usage,
               reasoning: this.reasoningText || undefined, // ADD THIS
               reasoningMetadata: this.reasoningMetadata,  // ADD THIS
           }
       );
   }
   ```
Phase 2: Fix Reasoning Round-Trip (Bug #2) ✅ COMPLETED
Files to modify:
- packages/core/src/llm/formatters/vercel.ts
Changes:
- Update `formatAssistantMessage()` to include reasoning:

  ```typescript
  // In formatAssistantMessage(), before returning:
  if (msg.reasoning) {
      contentParts.push({
          type: 'reasoning',
          text: msg.reasoning,
          providerMetadata: msg.reasoningMetadata,
      });
  }
  ```
Verified: Vercel AI SDK's AssistantContent type supports ReasoningPart:
// packages/provider-utils/src/types/assistant-model-message.ts
export type AssistantContent = string | Array<TextPart | FilePart | ReasoningPart | ...>;
// packages/provider-utils/src/types/content-part.ts
export interface ReasoningPart {
type: 'reasoning';
text: string;
providerOptions?: ProviderOptions; // For round-tripping provider metadata
}
Phase 3: Unified Context Calculation ✅ COMPLETED
Files to modify:
- packages/core/src/context/manager.ts - getContextTokenEstimate()
- packages/core/src/llm/executor/turn-executor.ts - compaction check
- packages/cli/src/cli/ink-cli/components/overlays/ContextStatsOverlay.tsx
Changes:
1. Create a shared `calculateContextUsage()` function:

   ```typescript
   // New file: packages/core/src/context/context-calculator.ts
   export async function calculateContextUsage(
       contextManager: ContextManager,
       tools: ToolDefinitions,
       maxContextTokens: number,
       outputBuffer: number
   ): Promise<ContextUsage> {
       // Implement the formula above
   }
   ```

2. Use it in `/context`:

   ```typescript
   // In DextoAgent.getContextStats()
   const usage = await calculateContextUsage(...);
   return usage;
   ```

3. Use it in the compaction decision:

   ```typescript
   // In turn-executor.ts
   const usage = await calculateContextUsage(...);
   if (usage.total > compactionThreshold) {
       // Compact!
   }
   ```
Phase 4: Message-Level Token Tracking
Already implemented! We just need to use it:
// In calculateContextUsage(), sum from messages:
const history = await contextManager.getHistory();
let totalInputFromMessages = 0;
let totalOutputFromMessages = 0;
let totalReasoningFromMessages = 0;
for (const msg of history) {
if (msg.role === 'assistant' && msg.tokenUsage) {
totalOutputFromMessages += msg.tokenUsage.outputTokens ?? 0;
totalReasoningFromMessages += msg.tokenUsage.reasoningTokens ?? 0;
}
}
Phase 5: Calibration & Logging
- Log estimate vs actual on every LLM call (already done, level=info)
- Track calibration ratio over time
- Consider adaptive estimation based on observed ratios
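One possible shape for the adaptive estimation, sketched as an exponential moving average of actual/estimated (the 0.3 smoothing factor is an arbitrary starting point, not a tuned value):

```typescript
// Tracks how far our length/4 estimates drift from API actuals and
// applies the learned ratio to future estimates.
class EstimateCalibrator {
  private ratio = 1.0;

  // Call after every LLM response with our prior estimate and the
  // API-reported inputTokens.
  observe(estimated: number, actual: number, alpha = 0.3): void {
    if (estimated <= 0) return;
    this.ratio = (1 - alpha) * this.ratio + alpha * (actual / estimated);
  }

  calibrate(rawEstimate: number): number {
    return Math.round(rawEstimate * this.ratio);
  }
}
```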
Phase 6: Future - API Token Counting
For Anthropic:
// New method in Anthropic service
async countTokens(messages: Message[], tools: Tool[]): Promise<{
input_tokens: number;
}>
For other providers:
- tiktoken for OpenAI
- Gemini countTokens API
- Fallback to estimation
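A hedged sketch of the hybrid counter: prefer an exact provider count when one is wired up, fall back to length/4 otherwise. The `countExact` callback is a stand-in for Anthropic's count_tokens endpoint, tiktoken, or Gemini's countTokens; no real provider call is shown:

```typescript
type ExactCounter = (text: string) => Promise<number>;

// Prefer an exact provider count; fall back to length/4 on absence or error.
async function countTokens(
  text: string,
  countExact?: ExactCounter
): Promise<number> {
  if (countExact) {
    try {
      return await countExact(text);
    } catch {
      // Provider counting failed; fall through to the heuristic.
    }
  }
  return Math.round(text.length / 4);
}
```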
Data Flow Diagram
Current State (BROKEN)
┌─────────────────────────────────────────────────────────────────────┐
│ LLM Response Stream │
├─────────────────────────────────────────────────────────────────────┤
│ reasoning-delta events → reasoningText accumulated ✓ │
│ text-delta events → content accumulated ✓ │
│ finish event → usage: { inputTokens, outputTokens, ... } │
└─────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────┐
│ stream-processor.ts updateAssistantMessage() │
├─────────────────────────────────────────────────────────────────────┤
│ await this.contextManager.updateAssistantMessage( │
│ this.assistantMessageId, │
│ { tokenUsage: usage } ← ONLY tokenUsage saved! │
│ ); ← reasoning NOT included! ✗ │
└─────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────┐
│ AssistantMessage Stored │
├─────────────────────────────────────────────────────────────────────┤
│ { │
│ role: 'assistant', │
│ content: [...], ← ✓ Stored │
│ reasoning: undefined, ← ✗ NEVER SET! │
│ tokenUsage: {...} ← ✓ Stored │
│ } │
└─────────────────────────────────────────────────────────────────────┘
Target State (FIXED)
┌─────────────────────────────────────────────────────────────────────┐
│ LLM Response Stream │
├─────────────────────────────────────────────────────────────────────┤
│ reasoning-delta events → reasoningText + providerMetadata ✓ │
│ text-delta events → content accumulated ✓ │
│ finish event → usage: { inputTokens, outputTokens, ... } │
└─────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────┐
│ stream-processor.ts updateAssistantMessage() │
├─────────────────────────────────────────────────────────────────────┤
│ await this.contextManager.updateAssistantMessage( │
│ this.assistantMessageId, │
│ { │
│ tokenUsage: usage, │
│ reasoning: this.reasoningText, ← NEW │
│ reasoningMetadata: this.reasoningMetadata ← NEW │
│ } │
│ ); │
└─────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────┐
│ AssistantMessage Stored │
├─────────────────────────────────────────────────────────────────────┤
│ { │
│ role: 'assistant', │
│ content: [...], │
│ reasoning: 'Let me think...', ← ✓ Now stored │
│ reasoningMetadata: { openai: { itemId: '...' } }, ← ✓ For round-trip
│ tokenUsage: { inputTokens, outputTokens, reasoningTokens } │
│ } │
└─────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────┐
│ Next LLM Call (Formatter) │
├─────────────────────────────────────────────────────────────────────┤
│ formatAssistantMessage() includes: │
│ - content (text parts) ✓ Already done │
│ - toolCalls ✓ Already done │
│ - reasoning + providerMetadata ✓ NEW - enables round-trip │
└─────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────┐
│ /context Calculation │
├─────────────────────────────────────────────────────────────────────┤
│ currentTotal = lastInput + lastOutput + newMessagesEstimate │
│ │
│ Breakdown: │
│ systemPrompt = estimate (length/4) │
│ tools = estimate (length/4) │
│ messages = currentTotal - systemPrompt - tools (back-calc) │
│ reasoning = sum(msg.tokenUsage.reasoningTokens) (for display) │
│ │
│ freeSpace = maxTokens - currentTotal - outputBuffer │
└─────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────┐
│ Compaction Decision │
├─────────────────────────────────────────────────────────────────────┤
│ SAME FORMULA as /context! │
│ │
│ if (currentTotal > compactionThreshold) { │
│ triggerCompaction(); │
│ } │
└─────────────────────────────────────────────────────────────────────┘
Testing Strategy
Unit Tests
1. Reasoning storage test (Phase 1)
   - Mock LLM stream with reasoning-delta events
   - Verify `stream-processor.ts` calls `updateAssistantMessage()` with reasoning
   - Verify `reasoningMetadata` is captured from `providerMetadata`

2. Reasoning round-trip test (Phase 2)
   - Create an `AssistantMessage` with `reasoning` and `reasoningMetadata`
   - Call `formatAssistantMessage()`
   - Verify the output contains a reasoning part with `providerMetadata`

3. Token calculation test (Phase 3)
   - Mock a message with known `tokenUsage`
   - Verify the calculation matches the expected value

4. Edge case tests
   - New session (no actuals) - falls back to estimation
   - Negative `messagesDisplay` (capped at 0)
   - Post-compaction state
   - Empty reasoning (should not create an empty reasoning part)
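The negative-back-calculation case can be pinned down with a concrete assertion (the `backCalculateMessages` helper is hypothetical, mirroring the clamping logic described earlier):

```typescript
// Back-calculates the messages slice of the breakdown, clamping at 0
// when the system/tools estimates exceed the known total.
function backCalculateMessages(
  total: number,
  systemPromptEstimate: number,
  toolsEstimate: number
): number {
  return Math.max(0, total - systemPromptEstimate - toolsEstimate);
}
```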
Integration Tests
1. Full reasoning flow test
   - Enable extended thinking on Claude
   - Send a message that triggers reasoning
   - Verify reasoning is persisted to the message
   - Send a follow-up message
   - Verify reasoning is sent back to the LLM (check formatted messages)

2. Token tracking test
   - Send a message
   - Verify `tokenUsage` is stored on the message
   - Open /context
   - Verify the numbers use actuals from the last call

3. Compaction alignment test
   - Fill context near the threshold
   - Verify /context and compaction trigger at the same point
Success Criteria
- Numbers add up: Total = SystemPrompt + Tools + Messages
- Consistency: /context and compaction use same calculation
- Reasoning works: Traces sent back to LLM correctly
- Calibration visible: Logs show estimate vs actual ratio
- Provider compatibility: Works with Anthropic, OpenAI, Google, etc.
Appendix: Verification Against Other Implementations
This plan was verified against actual implementations on 2025-01-20.
OpenCode Verification (~/Projects/external/opencode)
| Claim | Verified | Evidence |
|---|---|---|
| Stores reasoning as `ReasoningPart` | ✅ | message-v2.ts lines 78-89 |
| Includes `providerMetadata` for round-tripping | ✅ | message-v2.ts lines 554-560 |
| `toModelMessage()` sends reasoning back | ✅ | message-v2.ts lines 435-569 |
| Tracks reasoning tokens separately | ✅ | session/index.ts line 432, schemas throughout |
| Handles provider-specific metadata | ✅ | openai-responses-language-model.ts lines 520-538 |
OpenCode approach: Full round-trip of reasoning with provider metadata. This is our reference implementation.
Gemini-CLI Verification (~/Projects/external/gemini-cli)
| Claim in Original Plan | Actual Behavior | Status |
|---|---|---|
| "Parts with thought: true included when sending history back" | WRONG - They filter OUT thoughts at line 815 | ❌ Corrected |
| Uses `thought: true` flag | ✅ Correct | ✅ |
| Tracks `thoughtsTokenCount` | ✅ Correct - chatRecordingService.ts line 278 | ✅ |
Gemini-CLI approach: Track thought tokens for cost/display but do NOT round-trip them. This is a simpler approach but requires Google-specific handling.
Why We Follow OpenCode
- Same SDK: Both use Vercel AI SDK
- Provider-agnostic: Works across all providers without special-casing
- Future-proof: Preserves metadata for providers that need it
- Simpler code: No provider-specific filtering logic
Dexto Implementation Verification
| Component | Current State | Bug |
|---|---|---|
| `stream-processor.ts` | Accumulates `reasoningText` but doesn't persist it | Bug #1 |
| `vercel.ts` formatter | Ignores `msg.reasoning` | Bug #2 (blocked by #1) |
| `AssistantMessage` type | Has `reasoning?: string` field | ✅ Ready |
| Per-message `tokenUsage` | Stored via `updateAssistantMessage()` | ✅ Working |
| `lastActualInputTokens` | Set after each LLM call | ✅ Working |
| Compaction calculation | Uses `estimateMessagesTokens()` only | Different from /context |
| /context calculation | Uses full estimation (system + tools + messages) | Different from compaction |