Context Window Calculation Analysis
Problem Statement
Our /context overlay shows inconsistent numbers:
- Total shown: 122.4k tokens (from API's actual count)
- Breakdown sum: ~73k tokens (our length/4 estimates)
- Free space: Calculated from breakdown, not actual total
This leads to confusing UX where numbers don't add up.
Additionally, our compaction decision uses a different calculation than /context, leading to inconsistency.
Critical Finding #1: Reasoning Tokens Not Sent Back to LLM
Current State (Dexto)
We have the type but DON'T actually store reasoning:
// AssistantMessage in context/types.ts
interface AssistantMessage {
reasoning?: string; // Field EXISTS but is never populated!
tokenUsage?: TokenUsage;
// ...
}
Two separate bugs:

1. `stream-processor.ts` never persists reasoning text:

   ```typescript
   // Line 24: Reasoning IS accumulated during streaming
   private reasoningText: string = '';

   // Lines 97-108: Accumulated from reasoning-delta events
   case 'reasoning-delta':
       this.reasoningText += event.text; // ✓ Collected

   // BUT lines 314-320: Only tokenUsage is persisted!
   await this.contextManager.updateAssistantMessage(
       this.assistantMessageId,
       { tokenUsage: usage } // ✗ No reasoning field!
   );
   ```

2. `formatAssistantMessage()` in `vercel.ts` ignores `msg.reasoning`:
   - Only extracts `msg.content` (text parts) and `msg.toolCalls`
   - Even if reasoning WAS stored, it wouldn't be sent back

Result: Reasoning is collected → emitted to events → but never persisted or round-tripped.
How OpenCode Handles It (Correctly)
// In toModelMessage() - opencode/src/session/message-v2.ts
if (part.type === "reasoning") {
assistantMessage.parts.push({
type: "reasoning",
text: part.text,
providerMetadata: part.metadata, // Critical for round-tripping!
})
}
OpenCode:
- Stores reasoning as `ReasoningPart` in message parts
- Includes `providerMetadata` (contains thought signatures for Gemini, etc.)
- Sends reasoning back in `toModelMessage()` conversion
- Tracks `reasoning` tokens separately in token usage
How Gemini-CLI Handles It (Different Approach)
// Uses thought: true flag on parts from model
{ text: 'Hmm', thought: true }
// BUT they explicitly FILTER OUT thoughts before storing in history!
// geminiChat.ts line 815:
modelResponseParts.push(
...content.parts.filter((part) => !part.thought), // Filter OUT thoughts
);
// Token tracking still captures thoughtsTokenCount from API response
// chatRecordingService.ts line 278:
tokens.thoughts = respUsageMetadata.thoughtsTokenCount ?? 0;
Key difference: Gemini-CLI tracks thought tokens for display/cost but does NOT round-trip them. This works because Google's API doesn't require thought history for context continuity.
Why We Follow OpenCode's Approach
- We use Vercel AI SDK like OpenCode, not Google's native SDK
- Provider-agnostic: OpenCode's approach works across all providers
- No provider-specific logic: We shouldn't special-case Google's behavior
- Context continuity: Some providers (especially via AI SDK) may need reasoning for proper state
Impact of Current Bugs
- Context continuity broken: Reasoning traces lost between turns
- Token counting incorrect: Reasoning tokens used but not tracked in context
- Provider metadata lost: Cannot round-trip provider-specific metadata (e.g., OpenAI item IDs)
Critical Finding #2: Token Usage Storage
What We Track
Session Level (session-manager.ts):
sessionData.tokenUsage = {
inputTokens: 0,
outputTokens: 0,
reasoningTokens: 0,
cacheReadTokens: 0,
cacheWriteTokens: 0,
totalTokens: 0,
};
Message Level (AssistantMessage):
interface AssistantMessage {
tokenUsage?: TokenUsage; // Available but...
}
Current Flow
1. `stream-processor.ts` creates the assistant message with empty metadata:

   ```typescript
   await this.contextManager.addAssistantMessage('', [], {});
   ```

2. After streaming completes, we DO update it with token usage:

   ```typescript
   await this.contextManager.updateAssistantMessage(
       this.assistantMessageId,
       { tokenUsage: usage }
   );
   ```
So we HAVE the data on each message, we just don't use it for context calculation!
Critical Finding #3: Estimate vs Actual Mismatch
The Problem
API actual inputTokens: 122.4k
Our length/4 estimate: 73.0k
Difference: 49.4k (67% underestimate!)
Why So Different?
1. Tokenizers don't split evenly by characters
   - Code tokenizes differently than prose
   - JSON schemas are verbose when tokenized
   - Special characters, whitespace handling varies

2. We're comparing different things
   - `actualTokens` = from the last LLM call (includes everything sent)
   - `breakdown estimate` = calculated now on the current history

3. Context has grown since the last call
   - Last call's `inputTokens` doesn't include the response that followed
   - New user messages have been added since
How Other Tools Handle This
Claude Code (Anthropic)
Uses /v1/messages/count_tokens API for exact counts!
// From cli.js (minified)
countTokens(A,Q) {
return this._client.post("/v1/messages/count_tokens", { body: A, ...Q })
}
Categories tracked:
- System prompt
- System tools
- Memory files
- Skills
- MCP tools (with deferred loading)
- Agents
- Messages (with sub-breakdown)
- Free space
- Autocompact buffer
Free space calculation:
// YA = sum of all category tokens (excluding deferred)
let YA = k.reduce((CA, _A) => CA + (_A.isDeferred ? 0 : _A.tokens), 0)
// WA = buffer (autocompact or compact)
let WA = autocompactEnabled ? (maxTokens - contextUsed) : 500;
// Free space
let wA = Math.max(0, maxTokens - YA - WA)
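De-minified, the same logic reads roughly as follows. This is a sketch with invented names; only `isDeferred` and the 500-token fallback buffer come from the minified source above:

```typescript
interface Category {
  name: string;
  tokens: number;
  isDeferred: boolean; // deferred MCP tools are excluded from the sum
}

// Mirrors the minified cli.js logic: sum non-deferred categories,
// reserve a buffer, and clamp free space at zero.
function freeSpace(
  categories: Category[],
  maxTokens: number,
  autocompactEnabled: boolean,
  contextUsed: number
): number {
  const used = categories.reduce(
    (sum, c) => sum + (c.isDeferred ? 0 : c.tokens),
    0
  );
  const buffer = autocompactEnabled ? maxTokens - contextUsed : 500;
  return Math.max(0, maxTokens - used - buffer);
}
```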
gemini-cli
Hybrid approach:
// Sync estimation (fast)
estimateTokenCountSync(parts): number {
// ASCII: ~4 chars per token (0.25 tokens/char)
// Non-ASCII/CJK: ~1-2 chars per token (1.3 tokens/char)
}
// API counting (when needed)
if (hasMedia) {
use Gemini countTokens API
} else {
use sync estimation
}
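The sync heuristic can be sketched as runnable code; the per-character rates are the approximations quoted above, not gemini-cli's exact constants:

```typescript
// Character-class token estimate: ASCII runs at ~0.25 tokens/char,
// non-ASCII (e.g. CJK) at ~1.3 tokens/char.
function estimateTokenCountSync(text: string): number {
  let estimate = 0;
  for (const ch of text) {
    estimate += ch.charCodeAt(0) < 128 ? 0.25 : 1.3;
  }
  return Math.ceil(estimate);
}
```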
Token tracking from API response:
{
input: promptTokenCount,
output: candidatesTokenCount,
cached: cachedContentTokenCount,
thoughts: thoughtsTokenCount, // Reasoning!
tool: toolUsePromptTokenCount,
total: totalTokenCount
}
opencode
Simple estimation + detailed tracking:
Token.estimate(input: string): number {
return Math.round(input.length / 4)
}
// But tracks actuals per message:
StepFinishPart {
tokens: {
input: number,
output: number,
reasoning: number,
cache: { read: number, write: number }
}
}
Current Architecture Issues
1. Reasoning Pipeline (BROKEN - Two Bugs)
Current (broken):
LLM Response → reasoning-delta events received
↓
stream-processor.ts → accumulates reasoningText ✓
↓
updateAssistantMessage() → ONLY saves tokenUsage, NOT reasoning ✗
↓
AssistantMessage.reasoning = undefined (never set!)
↓
formatAssistantMessage() → has nothing to format anyway
↓
Reasoning NOT sent back to LLM ❌
Should be (following OpenCode):
LLM Response → reasoning-delta events received (with providerMetadata)
↓
stream-processor.ts → accumulates reasoningText AND reasoningMetadata
↓
updateAssistantMessage() → saves reasoning + reasoningMetadata + tokenUsage
↓
AssistantMessage.reasoning = "thinking..." ✓
AssistantMessage.reasoningMetadata = { openai: { itemId: "..." } } ✓
↓
formatAssistantMessage() → includes reasoning part with providerMetadata
↓
Reasoning sent back to LLM ✓
2. Token Calculation (/context)
Current:
// Uses length/4 estimate for everything
systemPromptTokens = estimateStringTokens(systemPrompt); // length/4
messagesTokens = estimateMessagesTokens(preparedHistory); // length/4
toolsTokens = estimateToolTokens(tools); // length/4
total = systemPromptTokens + messagesTokens + toolsTokens;
freeSpace = maxTokens - total - outputBuffer;
Problem: Total doesn't match API's actual count.
3. Compaction Decision
Current (turn-executor.ts):
const estimatedTokens = estimateMessagesTokens(prepared.preparedHistory);
if (estimatedTokens > compactionThreshold) {
// Compact!
}
Problem: Uses different calculation than /context, and both are wrong!
Proposed Solution
Principle: Single Source of Truth
- Use actual token counts from API as ground truth
- Track tokens per message for accurate history calculation
- Estimate only what we cannot measure
- Same formula for
/contextAND compaction decisions
THE FORMULA (Precise Specification)
Core Formula
estimatedNextInput = lastInputTokens + lastOutputTokens + newMessagesEstimate
Variable Definitions
| Variable | Definition | Source | When Updated |
|---|---|---|---|
| `lastInputTokens` | Tokens we SENT in the most recent LLM call | `tokenUsage.inputTokens` from API response | After EVERY LLM call |
| `lastOutputTokens` | Tokens the LLM RETURNED in its response | `tokenUsage.outputTokens` from API response | After EVERY LLM call |
| `newMessagesEstimate` | Estimate for messages added AFTER the last LLM call | length/4 heuristic | Calculated on demand |
What Counts as "New Messages"?
Messages added to history AFTER lastInputTokens was recorded:
- Tool results (role='tool') from the last assistant's tool calls
- New user messages typed since last LLM call
- Any injected system messages added between calls
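Putting the formula and the new-message rule together, a minimal sketch (the `Message` shape and index-based tracking are illustrative assumptions, not our actual types):

```typescript
interface Message {
  role: 'user' | 'assistant' | 'tool' | 'system';
  content: string;
  index: number; // position in history; entries after lastCallIndex are "new"
}

const estimateTokens = (text: string): number => Math.round(text.length / 4);

// estimatedNextInput = lastInputTokens + lastOutputTokens + newMessagesEstimate
function estimateNextInput(
  lastInputTokens: number,
  lastOutputTokens: number,
  history: Message[],
  lastCallIndex: number
): number {
  const newMessagesEstimate = history
    .filter((m) => m.index > lastCallIndex)
    .reduce((sum, m) => sum + estimateTokens(m.content), 0);
  return lastInputTokens + lastOutputTokens + newMessagesEstimate;
}
```

With the example-flow numbers (5000 input, 100 output, one ~80-char tool result), this reproduces the 5120 estimate shown below.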
Example Flow
┌─────────────────────────────────────────────────────────────────┐
│ Turn 1: User asks "What's the weather in NYC?" │
├─────────────────────────────────────────────────────────────────┤
│ LLM Call: │
│ inputTokens = 5000 (system + tools + user message) │
│ outputTokens = 100 (assistant: "I'll check" + tool_call) │
│ │
│ After call: UPDATE lastInputTokens=5000, lastOutputTokens=100 │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ Tool executes, result added to history │
│ Tool result: "NYC: 72°F, sunny" (role='tool') │
│ │
│ This is a NEW MESSAGE (added after lastInputTokens recorded) │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ Before Turn 2: Calculate estimated context │
├─────────────────────────────────────────────────────────────────┤
│ lastInputTokens = 5000 (from Turn 1) │
│ lastOutputTokens = 100 (from Turn 1) │
│ newMessagesEstimate = estimate(tool_result) ≈ 20 │
│ │
│ estimatedNextInput = 5000 + 100 + 20 = 5120 │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ Turn 2: LLM processes tool result │
├─────────────────────────────────────────────────────────────────┤
│ LLM Call: │
│ inputTokens = 5115 (ACTUAL - this is our ground truth!) │
│ outputTokens = 50 (assistant: "The weather is 72°F...") │
│ │
│ VERIFICATION: estimated=5120, actual=5115, error=+5 (+0.1%) │
│ │
│ After call: UPDATE lastInputTokens=5115, lastOutputTokens=50 │
└─────────────────────────────────────────────────────────────────┘
Verification Metrics
On EVERY LLM call, log the accuracy of our previous estimate:
// Before LLM call
const estimated = lastInputTokens + lastOutputTokens + newMessagesEstimate;
// After LLM call, compare to actual
const actual = response.tokenUsage.inputTokens;
const error = estimated - actual;
const errorPercent = (error / actual) * 100;
logger.info(`Context estimate: estimated=${estimated}, actual=${actual}, error=${error > 0 ? '+' : ''}${error} (${errorPercent.toFixed(1)}%)`);
Breakdown for Display (Back-Calculation)
For /context overlay, we show a breakdown. Since we only know the TOTAL accurately, we back-calculate messages:
const total = lastInputTokens + lastOutputTokens + newMessagesEstimate;
// These are estimates (we can't measure them directly)
const systemPromptEstimate = estimateTokens(systemPrompt); // length/4
const toolsEstimate = estimateToolsTokens(tools); // length/4
// Back-calculate messages so the math adds up
let messagesDisplay = total - systemPromptEstimate - toolsEstimate;
// If negative, our estimates are too high - cap at 0 and log warning
if (messagesDisplay < 0) {
    logger.warn(`Back-calculated messages negative (${messagesDisplay}), estimates may be too high`);
    messagesDisplay = 0;
}
Edge Cases
| Scenario | Behavior |
|---|---|
| No LLM call yet | lastInputTokens=null, fall back to pure estimation, show "(estimated)" label |
| After compaction | History changed significantly, set lastInputTokens=null, fall back to estimation until next call |
| messagesDisplay negative | Cap at 0, log warning - indicates system/tools estimates too high |
| System prompt changed | Next estimate may be off, but next actual will correct it |
| Tools changed (MCP) | Same as above - self-correcting after next call |
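The first two rows can be sketched as a fallback wrapper (the null sentinel and `isEstimated` flag are assumptions for illustration, not existing fields):

```typescript
interface ContextUsage {
  total: number;
  isEstimated: boolean; // true → display the "(estimated)" label
}

// Falls back to pure length/4 estimation when we have no actuals
// (new session, or lastInputTokens reset to null after compaction).
function contextUsage(
  lastInputTokens: number | null,
  lastOutputTokens: number | null,
  newMessagesEstimate: number,
  fullHistoryEstimate: number
): ContextUsage {
  if (lastInputTokens === null || lastOutputTokens === null) {
    return { total: fullHistoryEstimate, isEstimated: true };
  }
  return {
    total: lastInputTokens + lastOutputTokens + newMessagesEstimate,
    isEstimated: false,
  };
}
```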
What /context Should Display
Context Usage: 52,100 / 200,000 tokens (26%)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Breakdown:
System prompt: 4,000 tokens (estimated)
Tools: 8,000 tokens (estimated)
Messages: 40,100 tokens (back-calculated)
─────────────────────────────
Total: 52,100 tokens
Calculation basis:
Last actual input: 50,000 tokens
Last output: 2,000 tokens
New since then: 100 tokens (estimated)
Last estimate accuracy: +0.6% error
Free space: 131,900 tokens (after 16,000 output buffer)
Implementation Checklist
- Store `lastInputTokens` and `lastOutputTokens` after each LLM call
- Track which messages are "new" since the last LLM call (needs message timestamp or index tracking)
- Calculate `newMessagesEstimate` only for messages added after the last call
- Log verification metrics on every LLM call
- Update the `/context` overlay to show this breakdown
- Handle edge cases (no call yet, after compaction)
- Use the SAME formula for compaction decisions
Legacy Edge Cases (keeping for reference)
1. No LLM call yet (new session)
   - Fall back to pure estimation
   - All numbers are estimates with "(estimated)" label

2. `messagesDisplay` comes out negative
   - Our estimates for system/tools are too high
   - Cap at 0, log warning
   - Indicates estimation needs calibration

3. After compaction
   - Token counts reset with the new session
   - `compactionCount` tracks how many times we've compacted

4. Reasoning tokens
   - Must be sent back to LLM (fix formatter) ✅ DONE
   - Include in context calculation
   - Track separately for display
Verification: Why lastOutputTokens Is Safe to Use Directly
Verified on 2025-01-20 by analyzing AI SDK source code and our codebase
Question: Does outputTokens include content that might be pruned before the next LLM call?
Answer: No. outputTokens is safe to use directly because:
Part 1: What does outputTokens include? (AI SDK Verification)
Anthropic - verified via ai/packages/anthropic/src/__fixtures__/anthropic-json-tool.1.chunks.txt:
{"type":"message_delta","delta":{"stop_reason":"tool_use"},"usage":{"output_tokens":47}}
Tool call response reports output_tokens: 47 - includes tool calls ✅
OpenAI - verified via ai/packages/openai/src/responses/__fixtures__/openai-shell-tool.1.chunks.txt:
{"output":[{"type":"shell_call","action":{"commands":["ls -a ~/Desktop"]}}],"usage":{"output_tokens":41}}
Shell tool call reports output_tokens: 41 - includes tool calls ✅
Google - verified via ai/packages/google/src/google-generative-ai-language-model.test.ts lines 2274-2302:
content: { parts: [{ functionCall: { name: 'test-tool', args: { value: 'test' } } }] },
usageMetadata: { promptTokenCount: 10, candidatesTokenCount: 20, totalTokenCount: 30 }
Function call response reports candidatesTokenCount: 20 - includes tool calls ✅
Part 2: What gets pruned in our system?
From `manager.ts` `prepareHistory()`:
- Only tool result messages (role='tool') can be pruned
- They're marked with a `compactedAt` timestamp
- Replaced with the placeholder `[Old tool result content cleared]`
What is NEVER pruned:
- Assistant messages (text content)
- Assistant's tool calls
- User messages
Verification Table
| Message Type | Pruned? | Part of outputTokens? |
|---|---|---|
| Assistant text | ❌ Never | ✅ Yes |
| Assistant tool calls | ❌ Never | ✅ Yes (verified across all providers) |
| Tool results (role='tool') | ✅ Can be pruned | ❌ No (separate messages) |
Code Evidence
- `stream-processor.ts`: Tool calls stored via `addToolCall()` with full arguments
- `manager.ts` line 279: Only `msg.role === 'tool' && msg.compactedAt` gets the placeholder
- No code path exists to prune assistant messages
Conclusion: The formula `lastInputTokens + lastOutputTokens + newMessagesEstimate` is correct because:
- `lastInputTokens` reflects pruned history (the API tells us exactly what was sent)
- `lastOutputTokens` is the assistant's response (text + tool calls), which is stored and sent back as-is
- All major providers (Anthropic, OpenAI, Google) include tool calls in their output token counts
- Only tool results (separate messages) can be pruned, and those are in `inputTokens`
Implementation Plan
Phase 1: Fix Reasoning Storage (HIGH PRIORITY - Bug #1) ✅ COMPLETED
The root cause: stream-processor.ts collects reasoning but never persists it.
Files to modify:
- packages/core/src/llm/executor/stream-processor.ts
- packages/core/src/context/types.ts
Changes:
1. Add a `reasoningMetadata` field to the `AssistantMessage` type:

   ```typescript
   // In context/types.ts
   interface AssistantMessage {
       reasoning?: string;
       reasoningMetadata?: Record<string, unknown>; // NEW - for provider round-tripping
       // ...
   }
   ```

2. Capture `providerMetadata` from reasoning-delta events:

   ```typescript
   // In stream-processor.ts, add field:
   private reasoningMetadata: Record<string, unknown> | undefined;

   // In reasoning-delta case:
   case 'reasoning-delta':
       this.reasoningText += event.text;
       // Capture provider metadata for round-tripping (OpenAI itemId, etc.)
       if (event.providerMetadata) {
           this.reasoningMetadata = event.providerMetadata;
       }
       // ... emit events
   ```

3. Fix the bug - persist reasoning in `updateAssistantMessage()`:

   ```typescript
   // In stream-processor.ts, 'finish' case (around line 315):
   if (this.assistantMessageId) {
       await this.contextManager.updateAssistantMessage(
           this.assistantMessageId,
           {
               tokenUsage: usage,
               reasoning: this.reasoningText || undefined, // ADD THIS
               reasoningMetadata: this.reasoningMetadata,  // ADD THIS
           }
       );
   }
   ```
Phase 2: Fix Reasoning Round-Trip (Bug #2) ✅ COMPLETED
Files to modify:
- packages/core/src/llm/formatters/vercel.ts
Changes:
- Update `formatAssistantMessage()` to include reasoning:

  ```typescript
  // In formatAssistantMessage(), before returning:
  if (msg.reasoning) {
      contentParts.push({
          type: 'reasoning',
          text: msg.reasoning,
          providerMetadata: msg.reasoningMetadata,
      });
  }
  ```
Verified: Vercel AI SDK's AssistantContent type supports ReasoningPart:
// packages/provider-utils/src/types/assistant-model-message.ts
export type AssistantContent = string | Array<TextPart | FilePart | ReasoningPart | ...>;
// packages/provider-utils/src/types/content-part.ts
export interface ReasoningPart {
type: 'reasoning';
text: string;
providerOptions?: ProviderOptions; // For round-tripping provider metadata
}
Phase 3: Unified Context Calculation ✅ COMPLETED
Files to modify:
- packages/core/src/context/manager.ts - getContextTokenEstimate()
- packages/core/src/llm/executor/turn-executor.ts - compaction check
- packages/cli/src/cli/ink-cli/components/overlays/ContextStatsOverlay.tsx
Changes:
1. Create a shared `calculateContextUsage()` function:

   ```typescript
   // New file: packages/core/src/context/context-calculator.ts
   export async function calculateContextUsage(
       contextManager: ContextManager,
       tools: ToolDefinitions,
       maxContextTokens: number,
       outputBuffer: number
   ): Promise<ContextUsage> {
       // Implement the formula above
   }
   ```

2. Use it in `/context`:

   ```typescript
   // In DextoAgent.getContextStats()
   const usage = await calculateContextUsage(...);
   return usage;
   ```

3. Use it in the compaction decision:

   ```typescript
   // In turn-executor.ts
   const usage = await calculateContextUsage(...);
   if (usage.total > compactionThreshold) {
       // Compact!
   }
   ```
Phase 4: Message-Level Token Tracking
Already implemented! We just need to use it:
// In calculateContextUsage(), sum from messages:
const history = await contextManager.getHistory();
let totalInputFromMessages = 0;
let totalOutputFromMessages = 0;
let totalReasoningFromMessages = 0;
for (const msg of history) {
if (msg.role === 'assistant' && msg.tokenUsage) {
totalOutputFromMessages += msg.tokenUsage.outputTokens ?? 0;
totalReasoningFromMessages += msg.tokenUsage.reasoningTokens ?? 0;
}
}
Phase 5: Calibration & Logging
- Log estimate vs actual on every LLM call (already done, level=info)
- Track calibration ratio over time
- Consider adaptive estimation based on observed ratios
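One possible shape for the adaptive estimation, sketched as an exponential moving average of actual/estimated (the 0.3 smoothing factor is an arbitrary starting point, not a tuned value):

```typescript
// Tracks how far our length/4 estimates drift from API actuals and
// applies the learned ratio to future estimates.
class EstimateCalibrator {
  private ratio = 1.0;

  // Call after every LLM response with our prior estimate and the
  // API-reported inputTokens.
  observe(estimated: number, actual: number, alpha = 0.3): void {
    if (estimated <= 0) return;
    this.ratio = (1 - alpha) * this.ratio + alpha * (actual / estimated);
  }

  calibrate(rawEstimate: number): number {
    return Math.round(rawEstimate * this.ratio);
  }
}
```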
Phase 6: Future - API Token Counting
For Anthropic:
// New method in Anthropic service
async countTokens(messages: Message[], tools: Tool[]): Promise<{
input_tokens: number;
}>
For other providers:
- tiktoken for OpenAI
- Gemini countTokens API
- Fallback to estimation
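A hedged sketch of the hybrid counter: prefer an exact provider count when one is wired up, fall back to length/4 otherwise. The `countExact` callback is a stand-in for Anthropic's count_tokens endpoint, tiktoken, or Gemini's countTokens; no real provider call is shown:

```typescript
type ExactCounter = (text: string) => Promise<number>;

// Prefer an exact provider count; fall back to length/4 on absence or error.
async function countTokens(
  text: string,
  countExact?: ExactCounter
): Promise<number> {
  if (countExact) {
    try {
      return await countExact(text);
    } catch {
      // Provider counting failed; fall through to the heuristic.
    }
  }
  return Math.round(text.length / 4);
}
```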
Data Flow Diagram
Current State (BROKEN)
┌─────────────────────────────────────────────────────────────────────┐
│ LLM Response Stream │
├─────────────────────────────────────────────────────────────────────┤
│ reasoning-delta events → reasoningText accumulated ✓ │
│ text-delta events → content accumulated ✓ │
│ finish event → usage: { inputTokens, outputTokens, ... } │
└─────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────┐
│ stream-processor.ts updateAssistantMessage() │
├─────────────────────────────────────────────────────────────────────┤
│ await this.contextManager.updateAssistantMessage( │
│ this.assistantMessageId, │
│ { tokenUsage: usage } ← ONLY tokenUsage saved! │
│ ); ← reasoning NOT included! ✗ │
└─────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────┐
│ AssistantMessage Stored │
├─────────────────────────────────────────────────────────────────────┤
│ { │
│ role: 'assistant', │
│ content: [...], ← ✓ Stored │
│ reasoning: undefined, ← ✗ NEVER SET! │
│ tokenUsage: {...} ← ✓ Stored │
│ } │
└─────────────────────────────────────────────────────────────────────┘
Target State (FIXED)
┌─────────────────────────────────────────────────────────────────────┐
│ LLM Response Stream │
├─────────────────────────────────────────────────────────────────────┤
│ reasoning-delta events → reasoningText + providerMetadata ✓ │
│ text-delta events → content accumulated ✓ │
│ finish event → usage: { inputTokens, outputTokens, ... } │
└─────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────┐
│ stream-processor.ts updateAssistantMessage() │
├─────────────────────────────────────────────────────────────────────┤
│ await this.contextManager.updateAssistantMessage( │
│ this.assistantMessageId, │
│ { │
│ tokenUsage: usage, │
│ reasoning: this.reasoningText, ← NEW │
│ reasoningMetadata: this.reasoningMetadata ← NEW │
│ } │
│ ); │
└─────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────┐
│ AssistantMessage Stored │
├─────────────────────────────────────────────────────────────────────┤
│ { │
│ role: 'assistant', │
│ content: [...], │
│ reasoning: 'Let me think...', ← ✓ Now stored │
│ reasoningMetadata: { openai: { itemId: '...' } }, ← ✓ For round-trip
│ tokenUsage: { inputTokens, outputTokens, reasoningTokens } │
│ } │
└─────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────┐
│ Next LLM Call (Formatter) │
├─────────────────────────────────────────────────────────────────────┤
│ formatAssistantMessage() includes: │
│ - content (text parts) ✓ Already done │
│ - toolCalls ✓ Already done │
│ - reasoning + providerMetadata ✓ NEW - enables round-trip │
└─────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────┐
│ /context Calculation │
├─────────────────────────────────────────────────────────────────────┤
│ currentTotal = lastInput + lastOutput + newMessagesEstimate │
│ │
│ Breakdown: │
│ systemPrompt = estimate (length/4) │
│ tools = estimate (length/4) │
│ messages = currentTotal - systemPrompt - tools (back-calc) │
│ reasoning = sum(msg.tokenUsage.reasoningTokens) (for display) │
│ │
│ freeSpace = maxTokens - currentTotal - outputBuffer │
└─────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────┐
│ Compaction Decision │
├─────────────────────────────────────────────────────────────────────┤
│ SAME FORMULA as /context! │
│ │
│ if (currentTotal > compactionThreshold) { │
│ triggerCompaction(); │
│ } │
└─────────────────────────────────────────────────────────────────────┘
Testing Strategy
Unit Tests
1. Reasoning storage test (Phase 1)
   - Mock LLM stream with reasoning-delta events
   - Verify `stream-processor.ts` calls `updateAssistantMessage()` with reasoning
   - Verify `reasoningMetadata` is captured from `providerMetadata`

2. Reasoning round-trip test (Phase 2)
   - Create an `AssistantMessage` with `reasoning` and `reasoningMetadata`
   - Call `formatAssistantMessage()`
   - Verify the output contains a reasoning part with `providerMetadata`

3. Token calculation test (Phase 3)
   - Mock a message with known `tokenUsage`
   - Verify the calculation matches the expected value

4. Edge case tests
   - New session (no actuals) - falls back to estimation
   - Negative `messagesDisplay` (capped at 0)
   - Post-compaction state
   - Empty reasoning (should not create an empty reasoning part)
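The negative-back-calculation case can be pinned down with a concrete assertion (the `backCalculateMessages` helper is hypothetical, mirroring the clamping logic described earlier):

```typescript
// Back-calculates the messages slice of the breakdown, clamping at 0
// when the system/tools estimates exceed the known total.
function backCalculateMessages(
  total: number,
  systemPromptEstimate: number,
  toolsEstimate: number
): number {
  return Math.max(0, total - systemPromptEstimate - toolsEstimate);
}
```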
Integration Tests
1. Full reasoning flow test
   - Enable extended thinking on Claude
   - Send a message that triggers reasoning
   - Verify reasoning is persisted to the message
   - Send a follow-up message
   - Verify reasoning is sent back to the LLM (check formatted messages)

2. Token tracking test
   - Send a message
   - Verify `tokenUsage` is stored on the message
   - Open /context
   - Verify the numbers use actuals from the last call

3. Compaction alignment test
   - Fill context near the threshold
   - Verify /context and compaction trigger at the same point
Success Criteria
- Numbers add up: Total = SystemPrompt + Tools + Messages
- Consistency: /context and compaction use same calculation
- Reasoning works: Traces sent back to LLM correctly
- Calibration visible: Logs show estimate vs actual ratio
- Provider compatibility: Works with Anthropic, OpenAI, Google, etc.
Appendix: Verification Against Other Implementations
This plan was verified against actual implementations on 2025-01-20.
OpenCode Verification (~/Projects/external/opencode)
| Claim | Verified | Evidence |
|---|---|---|
| Stores reasoning as `ReasoningPart` | ✅ | message-v2.ts lines 78-89 |
| Includes `providerMetadata` for round-tripping | ✅ | message-v2.ts lines 554-560 |
| `toModelMessage()` sends reasoning back | ✅ | message-v2.ts lines 435-569 |
| Tracks reasoning tokens separately | ✅ | session/index.ts line 432, schemas throughout |
| Handles provider-specific metadata | ✅ | openai-responses-language-model.ts lines 520-538 |
OpenCode approach: Full round-trip of reasoning with provider metadata. This is our reference implementation.
Gemini-CLI Verification (~/Projects/external/gemini-cli)
| Claim in Original Plan | Actual Behavior | Status |
|---|---|---|
| "Parts with thought: true included when sending history back" | WRONG - They filter OUT thoughts at line 815 | ❌ Corrected |
| Uses `thought: true` flag | ✅ Correct | ✅ |
| Tracks `thoughtsTokenCount` | ✅ Correct - chatRecordingService.ts line 278 | ✅ |
Gemini-CLI approach: Track thought tokens for cost/display but do NOT round-trip them. This is a simpler approach but requires Google-specific handling.
Why We Follow OpenCode
- Same SDK: Both use Vercel AI SDK
- Provider-agnostic: Works across all providers without special-casing
- Future-proof: Preserves metadata for providers that need it
- Simpler code: No provider-specific filtering logic
Dexto Implementation Verification
| Component | Current State | Bug |
|---|---|---|
| `stream-processor.ts` | Accumulates `reasoningText` but doesn't persist it | Bug #1 |
| `vercel.ts` formatter | Ignores `msg.reasoning` | Bug #2 (blocked by #1) |
| `AssistantMessage` type | Has `reasoning?: string` field | ✅ Ready |
| Per-message `tokenUsage` | Stored via `updateAssistantMessage()` | ✅ Working |
| `lastActualInputTokens` | Set after each LLM call | ✅ Working |
| Compaction calculation | Uses `estimateMessagesTokens()` only | Different from /context |
| /context calculation | Uses full estimation (system + tools + messages) | Different from compaction |