Context Window Calculation Analysis

Problem Statement

Our /context overlay shows inconsistent numbers:

  • Total shown: 122.4k tokens (from API's actual count)
  • Breakdown sum: ~73k tokens (our length/4 estimates)
  • Free space: Calculated from breakdown, not actual total

This leads to confusing UX where numbers don't add up.

Additionally, our compaction decision uses a different calculation than /context, leading to inconsistency.


Critical Finding #1: Reasoning Tokens Not Sent Back to LLM

Current State (Dexto)

We have the type but DON'T actually store reasoning:

// AssistantMessage in context/types.ts
interface AssistantMessage {
    reasoning?: string;  // Field EXISTS but is never populated!
    tokenUsage?: TokenUsage;
    // ...
}

Two separate bugs:

  1. stream-processor.ts never persists reasoning text:

    // Line 24: Reasoning IS accumulated during streaming
    private reasoningText: string = '';
    
    // Lines 97-108: Accumulated from reasoning-delta events
    case 'reasoning-delta':
        this.reasoningText += event.text;  // ✓ Collected
    
    // BUT lines 314-320: Only tokenUsage is persisted!
    await this.contextManager.updateAssistantMessage(
        this.assistantMessageId,
        { tokenUsage: usage }  // ✗ No reasoning field!
    );
    
  2. formatAssistantMessage() in vercel.ts ignores msg.reasoning:

    • Only extracts msg.content (text parts) and msg.toolCalls
    • Even if reasoning WAS stored, it wouldn't be sent back

Result: Reasoning is collected → emitted to events → but never persisted or round-tripped.

How OpenCode Handles It (Correctly)

// In toModelMessage() - opencode/src/session/message-v2.ts
if (part.type === "reasoning") {
    assistantMessage.parts.push({
        type: "reasoning",
        text: part.text,
        providerMetadata: part.metadata,  // Critical for round-tripping!
    })
}

OpenCode:

  1. Stores reasoning as ReasoningPart in message parts
  2. Includes providerMetadata (contains thought signatures for Gemini, etc.)
  3. Sends reasoning back in toModelMessage() conversion
  4. Tracks reasoning tokens separately in token usage

How Gemini-CLI Handles It (Different Approach)

// Uses thought: true flag on parts from model
{ text: 'Hmm', thought: true }

// BUT they explicitly FILTER OUT thoughts before storing in history!
// geminiChat.ts line 815:
modelResponseParts.push(
  ...content.parts.filter((part) => !part.thought),  // Filter OUT thoughts
);

// Token tracking still captures thoughtsTokenCount from API response
// chatRecordingService.ts line 278:
tokens.thoughts = respUsageMetadata.thoughtsTokenCount ?? 0;

Key difference: Gemini-CLI tracks thought tokens for display/cost but does NOT round-trip them. This works because Google's API doesn't require thought history for context continuity.

Why We Follow OpenCode's Approach

  1. We use Vercel AI SDK like OpenCode, not Google's native SDK
  2. Provider-agnostic: OpenCode's approach works across all providers
  3. No provider-specific logic: We shouldn't special-case Google's behavior
  4. Context continuity: Some providers (especially via AI SDK) may need reasoning for proper state

Impact of Current Bugs

  1. Context continuity broken: Reasoning traces lost between turns
  2. Token counting incorrect: Reasoning tokens used but not tracked in context
  3. Provider metadata lost: Cannot round-trip provider-specific metadata (e.g., OpenAI item IDs)

Critical Finding #2: Token Usage Storage

What We Track

Session Level (session-manager.ts):

sessionData.tokenUsage = {
    inputTokens: 0,
    outputTokens: 0,
    reasoningTokens: 0,
    cacheReadTokens: 0,
    cacheWriteTokens: 0,
    totalTokens: 0,
};

Message Level (AssistantMessage):

interface AssistantMessage {
    tokenUsage?: TokenUsage;  // Available but...
}

Current Flow

  1. stream-processor.ts creates assistant message with empty metadata:

    await this.contextManager.addAssistantMessage('', [], {});
    
  2. After streaming completes, we DO update with token usage:

    await this.contextManager.updateAssistantMessage(
        this.assistantMessageId,
        { tokenUsage: usage }
    );
    

So we HAVE the data on each message, we just don't use it for context calculation!


Critical Finding #3: Estimate vs Actual Mismatch

The Problem

API actual inputTokens: 122.4k
Our length/4 estimate:   73.0k
Difference:              49.4k (the actual is 67% higher than our estimate)

Why So Different?

  1. Tokenizers don't split evenly by characters

    • Code tokenizes differently than prose
    • JSON schemas are verbose when tokenized
    • Special characters, whitespace handling varies
  2. We're comparing different things

    • actualTokens = from last LLM call (includes everything sent)
    • breakdown estimate = calculated now on current history
  3. Context has grown since last call

    • Last call's inputTokens doesn't include the response that followed
    • New user messages added since

How Other Tools Handle This

Claude Code (Anthropic)

Uses /v1/messages/count_tokens API for exact counts!

// From cli.js (minified)
countTokens(A,Q) {
  return this._client.post("/v1/messages/count_tokens", { body: A, ...Q })
}

Categories tracked:

  • System prompt
  • System tools
  • Memory files
  • Skills
  • MCP tools (with deferred loading)
  • Agents
  • Messages (with sub-breakdown)
  • Free space
  • Autocompact buffer

Free space calculation:

// YA = sum of all category tokens (excluding deferred)
let YA = k.reduce((CA, _A) => CA + (_A.isDeferred ? 0 : _A.tokens), 0)

// WA = buffer (autocompact or compact)
let WA = autocompactEnabled ? (maxTokens - contextUsed) : 500;

// Free space
let wA = Math.max(0, maxTokens - YA - WA)

gemini-cli

Hybrid approach:

// Sync estimation (fast)
estimateTokenCountSync(parts): number {
  // ASCII: ~4 chars per token (0.25 tokens/char)
  // Non-ASCII/CJK: denser, ~1.3 tokens/char
}

// API counting (when needed)
if (hasMedia) {
  use Gemini countTokens API
} else {
  use sync estimation
}

Token tracking from API response:

{
  input: promptTokenCount,
  output: candidatesTokenCount,
  cached: cachedContentTokenCount,
  thoughts: thoughtsTokenCount,      // Reasoning!
  tool: toolUsePromptTokenCount,
  total: totalTokenCount
}

opencode

Simple estimation + detailed tracking:

Token.estimate(input: string): number {
  return Math.round(input.length / 4)
}

// But tracks actuals per message:
StepFinishPart {
  tokens: {
    input: number,
    output: number,
    reasoning: number,
    cache: { read: number, write: number }
  }
}

Current Architecture Issues

1. Reasoning Pipeline (BROKEN - Two Bugs)

Current (broken):

LLM Response → reasoning-delta events received
                          ↓
stream-processor.ts → accumulates reasoningText ✓
                          ↓
updateAssistantMessage() → ONLY saves tokenUsage, NOT reasoning ✗
                          ↓
AssistantMessage.reasoning = undefined (never set!)
                          ↓
formatAssistantMessage() → has nothing to format anyway
                          ↓
Reasoning NOT sent back to LLM ❌

Should be (following OpenCode):

LLM Response → reasoning-delta events received (with providerMetadata)
                          ↓
stream-processor.ts → accumulates reasoningText AND reasoningMetadata
                          ↓
updateAssistantMessage() → saves reasoning + reasoningMetadata + tokenUsage
                          ↓
AssistantMessage.reasoning = "thinking..." ✓
AssistantMessage.reasoningMetadata = { openai: { itemId: "..." } } ✓
                          ↓
formatAssistantMessage() → includes reasoning part with providerMetadata
                          ↓
Reasoning sent back to LLM ✓

2. Token Calculation (/context)

Current:

// Uses length/4 estimate for everything
systemPromptTokens = estimateStringTokens(systemPrompt);  // length/4
messagesTokens = estimateMessagesTokens(preparedHistory); // length/4
toolsTokens = estimateToolTokens(tools);                  // length/4

total = systemPromptTokens + messagesTokens + toolsTokens;
freeSpace = maxTokens - total - outputBuffer;

Problem: Total doesn't match API's actual count.

3. Compaction Decision

Current (turn-executor.ts):

const estimatedTokens = estimateMessagesTokens(prepared.preparedHistory);
if (estimatedTokens > compactionThreshold) {
  // Compact!
}

Problem: Uses different calculation than /context, and both are wrong!


Proposed Solution

Principle: Single Source of Truth

  1. Use actual token counts from API as ground truth
  2. Track tokens per message for accurate history calculation
  3. Estimate only what we cannot measure
  4. Same formula for /context AND compaction decisions

THE FORMULA (Precise Specification)

Core Formula

estimatedNextInput = lastInputTokens + lastOutputTokens + newMessagesEstimate

Variable Definitions

| Variable | Definition | Source | When Updated |
|---|---|---|---|
| lastInputTokens | Tokens we SENT in the most recent LLM call | tokenUsage.inputTokens from API response | After EVERY LLM call |
| lastOutputTokens | Tokens the LLM RETURNED in its response | tokenUsage.outputTokens from API response | After EVERY LLM call |
| newMessagesEstimate | Estimate for messages added AFTER the last LLM call | length/4 heuristic | Calculated on demand |

What Counts as "New Messages"?

Messages added to history AFTER lastInputTokens was recorded:

  • Tool results (role='tool') from the last assistant's tool calls
  • New user messages typed since last LLM call
  • Any injected system messages added between calls
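
The definitions above can be sketched as a small TypeScript helper. This is a minimal sketch: the length/4 heuristic and the null fallback come from this document, while the function and type names are illustrative, not the actual Dexto code.

```typescript
// Minimal sketch of the core formula. LastCallUsage mirrors the actuals we
// record from the API response; everything else is illustrative.
interface LastCallUsage {
    inputTokens: number;   // lastInputTokens, from the API response
    outputTokens: number;  // lastOutputTokens, from the API response
}

function estimateNextInput(
    lastCall: LastCallUsage | null,
    newMessagesText: string[]  // messages added AFTER the last LLM call
): number {
    // length/4 heuristic, only for messages we have no actuals for
    const newMessagesEstimate = newMessagesText.reduce(
        (sum, text) => sum + Math.round(text.length / 4),
        0
    );
    // Edge case: no LLM call yet -> fall back to pure estimation
    if (lastCall === null) return newMessagesEstimate;
    return lastCall.inputTokens + lastCall.outputTokens + newMessagesEstimate;
}
```

Fed with the Turn 1 numbers from the example flow (5000 input, 100 output, an ~80-character tool result), this returns 5120, matching the walkthrough.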

Example Flow

┌─────────────────────────────────────────────────────────────────┐
│ Turn 1: User asks "What's the weather in NYC?"                  │
├─────────────────────────────────────────────────────────────────┤
│ LLM Call:                                                       │
│   inputTokens = 5000 (system + tools + user message)            │
│   outputTokens = 100 (assistant: "I'll check" + tool_call)      │
│                                                                 │
│ After call: UPDATE lastInputTokens=5000, lastOutputTokens=100   │
└─────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────┐
│ Tool executes, result added to history                          │
│ Tool result: "NYC: 72°F, sunny" (role='tool')                   │
│                                                                 │
│ This is a NEW MESSAGE (added after lastInputTokens recorded)    │
└─────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────┐
│ Before Turn 2: Calculate estimated context                      │
├─────────────────────────────────────────────────────────────────┤
│ lastInputTokens = 5000 (from Turn 1)                            │
│ lastOutputTokens = 100 (from Turn 1)                            │
│ newMessagesEstimate = estimate(tool_result) ≈ 20                │
│                                                                 │
│ estimatedNextInput = 5000 + 100 + 20 = 5120                     │
└─────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────┐
│ Turn 2: LLM processes tool result                               │
├─────────────────────────────────────────────────────────────────┤
│ LLM Call:                                                       │
│   inputTokens = 5115 (ACTUAL - this is our ground truth!)       │
│   outputTokens = 50 (assistant: "The weather is 72°F...")       │
│                                                                 │
│ VERIFICATION: estimated=5120, actual=5115, error=+5 (+0.1%)     │
│                                                                 │
│ After call: UPDATE lastInputTokens=5115, lastOutputTokens=50    │
└─────────────────────────────────────────────────────────────────┘

Verification Metrics

On EVERY LLM call, log the accuracy of our previous estimate:

// Before LLM call
const estimated = lastInputTokens + lastOutputTokens + newMessagesEstimate;

// After LLM call, compare to actual
const actual = response.tokenUsage.inputTokens;
const error = estimated - actual;
const errorPercent = (error / actual) * 100;

logger.info(`Context estimate: estimated=${estimated}, actual=${actual}, error=${error > 0 ? '+' : ''}${error} (${errorPercent.toFixed(1)}%)`);

Breakdown for Display (Back-Calculation)

For /context overlay, we show a breakdown. Since we only know the TOTAL accurately, we back-calculate messages:

const total = lastInputTokens + lastOutputTokens + newMessagesEstimate;

// These are estimates (we can't measure them directly)
const systemPromptEstimate = estimateTokens(systemPrompt);  // length/4
const toolsEstimate = estimateToolsTokens(tools);           // length/4

// Back-calculate messages so the math adds up
let messagesDisplay = total - systemPromptEstimate - toolsEstimate;

// If negative, our estimates are too high - cap at 0 and log warning
if (messagesDisplay < 0) {
    logger.warn(`Back-calculated messages negative (${messagesDisplay}), estimates may be too high`);
    messagesDisplay = 0;
}

Edge Cases

| Scenario | Behavior |
|---|---|
| No LLM call yet | lastInputTokens=null, fall back to pure estimation, show "(estimated)" label |
| After compaction | History changed significantly; set lastInputTokens=null, fall back to estimation until next call |
| messagesDisplay negative | Cap at 0, log warning - indicates system/tools estimates too high |
| System prompt changed | Next estimate may be off, but the next actual will correct it |
| Tools changed (MCP) | Same as above - self-correcting after next call |

What /context Should Display

Context Usage: 52,100 / 200,000 tokens (26%)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Breakdown:
  System prompt:  4,000 tokens (estimated)
  Tools:          8,000 tokens (estimated)
  Messages:      40,100 tokens (back-calculated)
  ─────────────────────────────
  Total:         52,100 tokens

Calculation basis:
  Last actual input:  50,000 tokens
  Last output:         2,000 tokens
  New since then:        100 tokens (estimated)

Last estimate accuracy: +0.6% error

Free space: 131,900 tokens (after 16,000 output buffer)

Implementation Checklist

  • Store lastInputTokens and lastOutputTokens after each LLM call
  • Track which messages are "new" since last LLM call (need message timestamp or index tracking)
  • Calculate newMessagesEstimate only for messages added after last call
  • Log verification metrics on every LLM call
  • Update /context overlay to show this breakdown
  • Handle edge cases (no call yet, after compaction)
  • Use SAME formula for compaction decisions
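
One way to satisfy the second and third checklist items is to record the history length at the moment the actuals are captured. A hedged sketch under that assumption; the field and function names are hypothetical, not the real Dexto types:

```typescript
// Hypothetical index-based tracking of messages added since the last LLM call.
interface ActualsTracker {
    lastInputTokens: number | null;
    lastOutputTokens: number | null;
    historyLengthAtLastCall: number; // history.length when actuals were recorded
}

// Called after every LLM call with the API's actual usage
function recordActuals(
    tracker: ActualsTracker,
    usage: { inputTokens: number; outputTokens: number },
    historyLength: number
): void {
    tracker.lastInputTokens = usage.inputTokens;
    tracker.lastOutputTokens = usage.outputTokens;
    tracker.historyLengthAtLastCall = historyLength;
}

// Everything past the recorded index is "new" and needs the length/4 estimate
function newMessages<T>(history: T[], tracker: ActualsTracker): T[] {
    return history.slice(tracker.historyLengthAtLastCall);
}
```

Resetting lastInputTokens to null (and the index to 0) on compaction would give the "fall back to estimation" behavior from the edge-case table.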

Legacy Edge Cases (keeping for reference)

  1. No LLM call yet (new session)

    • Fall back to pure estimation
    • All numbers are estimates with "(estimated)" label
  2. messagesDisplay comes out negative

    • Our estimates for system/tools are too high
    • Cap at 0, log warning
    • Indicates estimation needs calibration
  3. After compaction

    • Token counts reset with new session
    • compactionCount tracks how many times compacted
  4. Reasoning tokens

    • Must be sent back to LLM (fix formatter) DONE
    • Include in context calculation
    • Track separately for display

Verification: Why lastOutputTokens Is Safe to Use Directly

Verified on 2025-01-20 by analyzing AI SDK source code and our codebase

Question: Does outputTokens include content that might be pruned before the next LLM call?

Answer: No. outputTokens is safe to use directly because:

Part 1: What does outputTokens include? (AI SDK Verification)

Anthropic - verified via ai/packages/anthropic/src/__fixtures__/anthropic-json-tool.1.chunks.txt:

{"type":"message_delta","delta":{"stop_reason":"tool_use"},"usage":{"output_tokens":47}}

Tool call response reports output_tokens: 47 - includes tool calls

OpenAI - verified via ai/packages/openai/src/responses/__fixtures__/openai-shell-tool.1.chunks.txt:

{"output":[{"type":"shell_call","action":{"commands":["ls -a ~/Desktop"]}}],"usage":{"output_tokens":41}}

Shell tool call reports output_tokens: 41 - includes tool calls

Google - verified via ai/packages/google/src/google-generative-ai-language-model.test.ts lines 2274-2302:

content: { parts: [{ functionCall: { name: 'test-tool', args: { value: 'test' } } }] },
usageMetadata: { promptTokenCount: 10, candidatesTokenCount: 20, totalTokenCount: 30 }

Function call response reports candidatesTokenCount: 20 - includes tool calls

Part 2: What gets pruned in our system?

From manager.ts prepareHistory():

  • Only tool result messages (role='tool') can be pruned
  • They're marked with compactedAt timestamp
  • Replaced with placeholder: [Old tool result content cleared]

What is NEVER pruned:

  • Assistant messages (text content)
  • Assistant's tool calls
  • User messages

Verification Table

| Message Type | Pruned? | Part of outputTokens? |
|---|---|---|
| Assistant text | Never | Yes |
| Assistant tool calls | Never | Yes (verified across all providers) |
| Tool results (role='tool') | Can be pruned | No (separate messages) |

Code Evidence

  • stream-processor.ts: Tool calls stored via addToolCall() with full arguments
  • manager.ts line 279: Only msg.role === 'tool' && msg.compactedAt gets placeholder
  • No code path exists to prune assistant messages

Conclusion: The formula lastInputTokens + lastOutputTokens + newMessagesEstimate is correct because:

  • lastInputTokens reflects pruned history (API tells us exactly what was sent)
  • lastOutputTokens is the assistant's response (text + tool calls) which is stored and sent back as-is
  • All major providers (Anthropic, OpenAI, Google) include tool calls in their output token counts
  • Only tool results (separate messages) can be pruned, and those are in inputTokens

Implementation Plan

Phase 1: Fix Reasoning Storage (HIGH PRIORITY - Bug #1) COMPLETED

The root cause: stream-processor.ts collects reasoning but never persists it.

Files to modify:

  • packages/core/src/llm/executor/stream-processor.ts
  • packages/core/src/context/types.ts

Changes:

  1. Add reasoningMetadata field to AssistantMessage type:

    // In context/types.ts
    interface AssistantMessage {
      reasoning?: string;
      reasoningMetadata?: Record<string, unknown>;  // NEW - for provider round-tripping
      // ...
    }
    
  2. Capture providerMetadata from reasoning-delta events:

    // In stream-processor.ts, add field:
    private reasoningMetadata: Record<string, unknown> | undefined;
    
    // In reasoning-delta case:
    case 'reasoning-delta':
        this.reasoningText += event.text;
        // Capture provider metadata for round-tripping (OpenAI itemId, etc.)
        if (event.providerMetadata) {
            this.reasoningMetadata = event.providerMetadata;
        }
        // ... emit events
    
  3. Fix the bug - persist reasoning in updateAssistantMessage():

    // In stream-processor.ts, 'finish' case (around line 315):
    if (this.assistantMessageId) {
        await this.contextManager.updateAssistantMessage(
            this.assistantMessageId,
            {
                tokenUsage: usage,
                reasoning: this.reasoningText || undefined,           // ADD THIS
                reasoningMetadata: this.reasoningMetadata,            // ADD THIS
            }
        );
    }
    

Phase 2: Fix Reasoning Round-Trip (Bug #2) COMPLETED

Files to modify:

  • packages/core/src/llm/formatters/vercel.ts

Changes:

  1. Update formatAssistantMessage() to include reasoning:
    // In formatAssistantMessage(), before returning:
    if (msg.reasoning) {
        contentParts.push({
            type: 'reasoning',
            text: msg.reasoning,
            providerMetadata: msg.reasoningMetadata,
        });
    }
    

Verified: Vercel AI SDK's AssistantContent type supports ReasoningPart:

// packages/provider-utils/src/types/assistant-model-message.ts
export type AssistantContent = string | Array<TextPart | FilePart | ReasoningPart | ...>;

// packages/provider-utils/src/types/content-part.ts
export interface ReasoningPart {
  type: 'reasoning';
  text: string;
  providerOptions?: ProviderOptions;  // For round-tripping provider metadata
}

Phase 3: Unified Context Calculation COMPLETED

Files to modify:

  • packages/core/src/context/manager.ts - getContextTokenEstimate()
  • packages/core/src/llm/executor/turn-executor.ts - compaction check
  • packages/cli/src/cli/ink-cli/components/overlays/ContextStatsOverlay.tsx

Changes:

  1. Create shared calculateContextUsage() function:

    // New file: packages/core/src/context/context-calculator.ts
    export async function calculateContextUsage(
      contextManager: ContextManager,
      tools: ToolDefinitions,
      maxContextTokens: number,
      outputBuffer: number
    ): Promise<ContextUsage> {
      // Implement the formula above
    }
    
  2. Use in /context:

    // In DextoAgent.getContextStats()
    const usage = await calculateContextUsage(...);
    return usage;
    
  3. Use in compaction decision:

    // In turn-executor.ts
    const usage = await calculateContextUsage(...);
    if (usage.total > compactionThreshold) {
      // Compact!
    }
    

Phase 4: Message-Level Token Tracking

Already implemented! We just need to use it:

// In calculateContextUsage(), sum from messages:
const history = await contextManager.getHistory();
let totalInputFromMessages = 0;
let totalOutputFromMessages = 0;
let totalReasoningFromMessages = 0;

for (const msg of history) {
  if (msg.role === 'assistant' && msg.tokenUsage) {
    totalOutputFromMessages += msg.tokenUsage.outputTokens ?? 0;
    totalReasoningFromMessages += msg.tokenUsage.reasoningTokens ?? 0;
  }
}

Phase 5: Calibration & Logging

  1. Log estimate vs actual on every LLM call (already done, level=info)
  2. Track calibration ratio over time
  3. Consider adaptive estimation based on observed ratios
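
Item 3 could be as simple as an exponential moving average over the observed actual/estimated ratios. A sketch under that assumption; none of this exists in the codebase yet, and the smoothing factor is an arbitrary illustrative value:

```typescript
// Hypothetical adaptive calibration: smooth the observed actual/estimated
// ratio and scale future raw estimates by it.
class EstimateCalibrator {
    private ratio = 1.0;          // running actual/estimated ratio
    private readonly alpha = 0.2; // EMA smoothing factor (assumed value)

    // Feed in the estimate-vs-actual pair logged on every LLM call
    record(estimated: number, actual: number): void {
        if (estimated > 0) {
            this.ratio = this.alpha * (actual / estimated) + (1 - this.alpha) * this.ratio;
        }
    }

    // Scale a raw length/4 estimate by the learned ratio
    calibrate(rawEstimate: number): number {
        return Math.round(rawEstimate * this.ratio);
    }
}
```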

Phase 6: Future - API Token Counting

For Anthropic:

// New method in Anthropic service
async countTokens(messages: Message[], tools: Tool[]): Promise<{
  input_tokens: number;
}>

For other providers:

  • tiktoken for OpenAI
  • Gemini countTokens API
  • Fallback to estimation
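
The fallback chain could be a small registry keyed by provider. A sketch; the counter functions are stand-ins for the APIs listed above, and only the length/4 fallback is concrete here:

```typescript
// Hypothetical provider registry with a length/4 fallback for Phase 6.
type TokenCounter = (text: string) => Promise<number>;

// The only counter implemented here; real entries would wrap the
// Anthropic count_tokens endpoint, tiktoken, or Gemini's countTokens API.
const estimateFallback: TokenCounter = async (text) => Math.round(text.length / 4);

function resolveCounter(
    counters: Record<string, TokenCounter>,
    provider: string
): TokenCounter {
    return counters[provider] ?? estimateFallback;
}
```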

Data Flow Diagram

Current State (BROKEN)

┌─────────────────────────────────────────────────────────────────────┐
│                         LLM Response Stream                          │
├─────────────────────────────────────────────────────────────────────┤
│  reasoning-delta events → reasoningText accumulated ✓               │
│  text-delta events → content accumulated ✓                          │
│  finish event → usage: { inputTokens, outputTokens, ... }           │
└─────────────────────────────────────────────────────────────────────┘
                                    │
                                    ▼
┌─────────────────────────────────────────────────────────────────────┐
│              stream-processor.ts updateAssistantMessage()           │
├─────────────────────────────────────────────────────────────────────┤
│  await this.contextManager.updateAssistantMessage(                  │
│      this.assistantMessageId,                                       │
│      { tokenUsage: usage }     ← ONLY tokenUsage saved!             │
│  );                            ← reasoning NOT included! ✗          │
└─────────────────────────────────────────────────────────────────────┘
                                    │
                                    ▼
┌─────────────────────────────────────────────────────────────────────┐
│                    AssistantMessage Stored                          │
├─────────────────────────────────────────────────────────────────────┤
│  {                                                                  │
│    role: 'assistant',                                               │
│    content: [...],             ← ✓ Stored                           │
│    reasoning: undefined,       ← ✗ NEVER SET!                       │
│    tokenUsage: {...}           ← ✓ Stored                           │
│  }                                                                  │
└─────────────────────────────────────────────────────────────────────┘

Target State (FIXED)

┌─────────────────────────────────────────────────────────────────────┐
│                         LLM Response Stream                          │
├─────────────────────────────────────────────────────────────────────┤
│  reasoning-delta events → reasoningText + providerMetadata ✓        │
│  text-delta events → content accumulated ✓                          │
│  finish event → usage: { inputTokens, outputTokens, ... }           │
└─────────────────────────────────────────────────────────────────────┘
                                    │
                                    ▼
┌─────────────────────────────────────────────────────────────────────┐
│              stream-processor.ts updateAssistantMessage()           │
├─────────────────────────────────────────────────────────────────────┤
│  await this.contextManager.updateAssistantMessage(                  │
│      this.assistantMessageId,                                       │
│      {                                                              │
│          tokenUsage: usage,                                         │
│          reasoning: this.reasoningText,           ← NEW             │
│          reasoningMetadata: this.reasoningMetadata ← NEW            │
│      }                                                              │
│  );                                                                 │
└─────────────────────────────────────────────────────────────────────┘
                                    │
                                    ▼
┌─────────────────────────────────────────────────────────────────────┐
│                    AssistantMessage Stored                          │
├─────────────────────────────────────────────────────────────────────┤
│  {                                                                  │
│    role: 'assistant',                                               │
│    content: [...],                                                  │
│    reasoning: 'Let me think...',    ← ✓ Now stored                  │
│    reasoningMetadata: { openai: {...} },  ← ✓ For round-trip        │
│    tokenUsage: { inputTokens, outputTokens, reasoningTokens }       │
│  }                                                                  │
└─────────────────────────────────────────────────────────────────────┘
                                    │
                                    ▼
┌─────────────────────────────────────────────────────────────────────┐
│                    Next LLM Call (Formatter)                        │
├─────────────────────────────────────────────────────────────────────┤
│  formatAssistantMessage() includes:                                 │
│    - content (text parts)              ✓ Already done               │
│    - toolCalls                         ✓ Already done               │
│    - reasoning + providerMetadata      ✓ NEW - enables round-trip   │
└─────────────────────────────────────────────────────────────────────┘
                                    │
                                    ▼
┌─────────────────────────────────────────────────────────────────────┐
│                    /context Calculation                             │
├─────────────────────────────────────────────────────────────────────┤
│  currentTotal = lastInput + lastOutput + newMessagesEstimate        │
│                                                                     │
│  Breakdown:                                                         │
│    systemPrompt = estimate (length/4)                               │
│    tools = estimate (length/4)                                      │
│    messages = currentTotal - systemPrompt - tools (back-calc)       │
│    reasoning = sum(msg.tokenUsage.reasoningTokens) (for display)    │
│                                                                     │
│  freeSpace = maxTokens - currentTotal - outputBuffer                │
└─────────────────────────────────────────────────────────────────────┘
                                    │
                                    ▼
┌─────────────────────────────────────────────────────────────────────┐
│                    Compaction Decision                              │
├─────────────────────────────────────────────────────────────────────┤
│  SAME FORMULA as /context!                                          │
│                                                                     │
│  if (currentTotal > compactionThreshold) {                          │
│    triggerCompaction();                                             │
│  }                                                                  │
└─────────────────────────────────────────────────────────────────────┘

Testing Strategy

Unit Tests

  1. Reasoning storage test (Phase 1)

    • Mock LLM stream with reasoning-delta events
    • Verify stream-processor.ts calls updateAssistantMessage() with reasoning
    • Verify reasoningMetadata is captured from providerMetadata
  2. Reasoning round-trip test (Phase 2)

    • Create AssistantMessage with reasoning and reasoningMetadata
    • Call formatAssistantMessage()
    • Verify output contains reasoning part with providerMetadata
  3. Token calculation test (Phase 3)

    • Mock message with known tokenUsage
    • Verify calculation matches expected
  4. Edge case tests

    • New session (no actuals) - falls back to estimation
    • Negative messagesDisplay (capped at 0)
    • Post-compaction state
    • Empty reasoning (should not create empty reasoning part)
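
Unit test #2 can be exercised against a simplified stand-in formatter. Everything below is a hypothetical sketch of the shape such a test could take; it is NOT the actual vercel.ts code, and the type names are illustrative:

```typescript
// Stand-in for formatAssistantMessage() to illustrate the round-trip test.
interface StoredAssistantMessage {
    content: string;
    reasoning?: string;
    reasoningMetadata?: Record<string, unknown>;
}

type ContentPart =
    | { type: 'text'; text: string }
    | { type: 'reasoning'; text: string; providerOptions?: Record<string, unknown> };

function formatAssistantMessageSketch(msg: StoredAssistantMessage): ContentPart[] {
    const parts: ContentPart[] = [];
    // Empty reasoning must NOT create an empty reasoning part (edge case #4)
    if (msg.reasoning) {
        parts.push({
            type: 'reasoning',
            text: msg.reasoning,
            providerOptions: msg.reasoningMetadata, // round-trip provider metadata
        });
    }
    parts.push({ type: 'text', text: msg.content });
    return parts;
}
```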

Integration Tests

  1. Full reasoning flow test

    • Enable extended thinking on Claude
    • Send message that triggers reasoning
    • Verify reasoning persisted to message
    • Send follow-up message
    • Verify reasoning sent back to LLM (check formatted messages)
  2. Token tracking test

    • Send message
    • Verify tokenUsage stored on message
    • Open /context
    • Verify numbers use actual from last call
  3. Compaction alignment test

    • Fill context near threshold
    • Verify /context and compaction trigger at same point

Success Criteria

  1. Numbers add up: Total = SystemPrompt + Tools + Messages
  2. Consistency: /context and compaction use same calculation
  3. Reasoning works: Traces sent back to LLM correctly
  4. Calibration visible: Logs show estimate vs actual ratio
  5. Provider compatibility: Works with Anthropic, OpenAI, Google, etc.

Appendix: Verification Against Other Implementations

This plan was verified against actual implementations on 2025-01-20.

OpenCode Verification (~/Projects/external/opencode)

| Claim | Verified | Evidence |
|---|---|---|
| Stores reasoning as ReasoningPart | Yes | message-v2.ts lines 78-89 |
| Includes providerMetadata for round-tripping | Yes | message-v2.ts lines 554-560 |
| toModelMessage() sends reasoning back | Yes | message-v2.ts lines 435-569 |
| Tracks reasoning tokens separately | Yes | session/index.ts line 432, schemas throughout |
| Handles provider-specific metadata | Yes | openai-responses-language-model.ts lines 520-538 |

OpenCode approach: Full round-trip of reasoning with provider metadata. This is our reference implementation.

Gemini-CLI Verification (~/Projects/external/gemini-cli)

| Claim in Original Plan | Actual Behavior | Status |
|---|---|---|
| "Parts with thought: true included when sending history back" | WRONG - They filter OUT thoughts at line 815 | Corrected |
| Uses thought: true flag | Correct | - |
| Tracks thoughtsTokenCount | Correct - chatRecordingService.ts line 278 | - |

Gemini-CLI approach: Track thought tokens for cost/display but do NOT round-trip them. This is a simpler approach but requires Google-specific handling.

Why We Follow OpenCode

  1. Same SDK: Both use Vercel AI SDK
  2. Provider-agnostic: Works across all providers without special-casing
  3. Future-proof: Preserves metadata for providers that need it
  4. Simpler code: No provider-specific filtering logic

Dexto Implementation Verification

| Component | Current State | Bug |
|---|---|---|
| stream-processor.ts | Accumulates reasoningText but doesn't persist it | Bug #1 |
| vercel.ts formatter | Ignores msg.reasoning | Bug #2 (blocked by #1) |
| AssistantMessage type | Has reasoning?: string field | Ready |
| Per-message tokenUsage | Stored via updateAssistantMessage() | Working |
| lastActualInputTokens | Set after each LLM call | Working |
| Compaction calculation | Uses estimateMessagesTokens() only | Different from /context |
| /context calculation | Uses full estimation (system + tools + messages) | Different from compaction |