# Context Window Calculation Analysis

## Problem Statement

Our `/context` overlay shows inconsistent numbers:

- **Total shown**: 122.4k tokens (from the API's actual count)
- **Breakdown sum**: ~73k tokens (our length/4 estimates)
- **Free space**: Calculated from the breakdown, not the actual total

This leads to confusing UX where the numbers don't add up. Additionally, our compaction decision uses a different calculation than `/context`, leading to inconsistency.

---

## Critical Finding #1: Reasoning Tokens Not Sent Back to LLM

### Current State (Dexto)

**We have the type but DON'T actually store reasoning:**

```typescript
// AssistantMessage in context/types.ts
interface AssistantMessage {
    reasoning?: string; // Field EXISTS but is never populated!
    tokenUsage?: TokenUsage;
    // ...
}
```

**Two separate bugs:**

1. **`stream-processor.ts` never persists reasoning text:**

   ```typescript
   // Line 24: Reasoning IS accumulated during streaming
   private reasoningText: string = '';

   // Lines 97-108: Accumulated from reasoning-delta events
   case 'reasoning-delta':
       this.reasoningText += event.text; // ✓ Collected

   // BUT lines 314-320: Only tokenUsage is persisted!
   await this.contextManager.updateAssistantMessage(
       this.assistantMessageId,
       { tokenUsage: usage } // ✗ No reasoning field!
   );
   ```

2. **`formatAssistantMessage()` in `vercel.ts` ignores `msg.reasoning`:**
   - Only extracts `msg.content` (text parts) and `msg.toolCalls`
   - Even if reasoning WAS stored, it wouldn't be sent back

**Result:** Reasoning is collected → emitted to events → but never persisted or round-tripped.

### How OpenCode Handles It (Correctly)

```typescript
// In toModelMessage() - opencode/src/session/message-v2.ts
if (part.type === "reasoning") {
    assistantMessage.parts.push({
        type: "reasoning",
        text: part.text,
        providerMetadata: part.metadata, // Critical for round-tripping!
    })
}
```

OpenCode:

1. Stores reasoning as `ReasoningPart` in message parts
2.
   Includes `providerMetadata` (contains thought signatures for Gemini, etc.)
3. Sends reasoning back in the `toModelMessage()` conversion
4. Tracks `reasoning` tokens separately in token usage

### How Gemini-CLI Handles It (Different Approach)

```typescript
// Uses thought: true flag on parts from model
{ text: 'Hmm', thought: true }

// BUT they explicitly FILTER OUT thoughts before storing in history!
// geminiChat.ts line 815:
modelResponseParts.push(
    ...content.parts.filter((part) => !part.thought), // Filter OUT thoughts
);

// Token tracking still captures thoughtsTokenCount from API response
// chatRecordingService.ts line 278:
tokens.thoughts = respUsageMetadata.thoughtsTokenCount ?? 0;
```

**Key difference:** Gemini-CLI tracks thought tokens for display/cost but does NOT round-trip them. This works because Google's API doesn't require thought history for context continuity.

### Why We Follow OpenCode's Approach

1. **We use the Vercel AI SDK** like OpenCode, not Google's native SDK
2. **Provider-agnostic**: OpenCode's approach works across all providers
3. **No provider-specific logic**: We shouldn't special-case Google's behavior
4. **Context continuity**: Some providers (especially via the AI SDK) may need reasoning for proper state

### Impact of Current Bugs

1. **Context continuity broken**: Reasoning traces are lost between turns
2. **Token counting incorrect**: Reasoning tokens are used but not tracked in context
3. **Provider metadata lost**: Cannot round-trip provider-specific metadata (e.g., OpenAI item IDs)

---

## Critical Finding #2: Token Usage Storage

### What We Track

**Session level** (`session-manager.ts`):

```typescript
sessionData.tokenUsage = {
    inputTokens: 0,
    outputTokens: 0,
    reasoningTokens: 0,
    cacheReadTokens: 0,
    cacheWriteTokens: 0,
    totalTokens: 0,
};
```

**Message level** (`AssistantMessage`):

```typescript
interface AssistantMessage {
    tokenUsage?: TokenUsage; // Available but...
}
```

### Current Flow

1.
   `stream-processor.ts` creates the assistant message with empty metadata:

   ```typescript
   await this.contextManager.addAssistantMessage('', [], {});
   ```

2. After streaming completes, we DO update with token usage:

   ```typescript
   await this.contextManager.updateAssistantMessage(
       this.assistantMessageId,
       { tokenUsage: usage }
   );
   ```

**So we HAVE the data on each message**; we just don't use it for context calculation!

---

## Critical Finding #3: Estimate vs Actual Mismatch

### The Problem

```
API actual inputTokens: 122.4k
Our length/4 estimate:   73.0k
Difference:              49.4k (actual is ~68% higher than our estimate)
```

### Why So Different?

1. **Tokenizers don't split evenly by characters**
   - Code tokenizes differently than prose
   - JSON schemas are verbose when tokenized
   - Special characters and whitespace handling vary
2. **We're comparing different things**
   - `actualTokens` = from the last LLM call (includes everything sent)
   - `breakdown estimate` = calculated now on the current history
3. **Context has grown since the last call**
   - The last call's `inputTokens` doesn't include the response that followed
   - New user messages have been added since

---

## How Other Tools Handle This

### Claude Code (Anthropic)

**Uses the `/v1/messages/count_tokens` API for exact counts!**

```javascript
// From cli.js (minified)
countTokens(A, Q) {
    return this._client.post("/v1/messages/count_tokens", { body: A, ...Q })
}
```

**Categories tracked:**

- System prompt
- System tools
- Memory files
- Skills
- MCP tools (with deferred loading)
- Agents
- Messages (with sub-breakdown)
- Free space
- Autocompact buffer

**Free space calculation:**

```javascript
// YA = sum of all category tokens (excluding deferred)
let YA = k.reduce((CA, _A) => CA + (_A.isDeferred ? 0 : _A.tokens), 0)

// WA = buffer (autocompact or compact)
let WA = autocompactEnabled ?
    (maxTokens - contextUsed) : 500;

// Free space
let wA = Math.max(0, maxTokens - YA - WA)
```

### gemini-cli

**Hybrid approach:**

```typescript
// Sync estimation (fast)
estimateTokenCountSync(parts): number {
    // ASCII: ~4 chars per token (0.25 tokens/char)
    // Non-ASCII/CJK: ~1-2 chars per token (1.3 tokens/char)
}

// API counting (when needed)
if (hasMedia) { /* use Gemini countTokens API */ }
else { /* use sync estimation */ }
```

**Token tracking from API response:**

```typescript
{
    input: promptTokenCount,
    output: candidatesTokenCount,
    cached: cachedContentTokenCount,
    thoughts: thoughtsTokenCount, // Reasoning!
    tool: toolUsePromptTokenCount,
    total: totalTokenCount
}
```

### opencode

**Simple estimation + detailed tracking:**

```typescript
Token.estimate(input: string): number {
    return Math.round(input.length / 4)
}

// But tracks actuals per message:
StepFinishPart {
    tokens: {
        input: number,
        output: number,
        reasoning: number,
        cache: { read: number, write: number }
    }
}
```

---

## Current Architecture Issues

### 1. Reasoning Pipeline (BROKEN - Two Bugs)

**Current (broken):**

```
LLM Response → reasoning-delta events received
    ↓
stream-processor.ts → accumulates reasoningText ✓
    ↓
updateAssistantMessage() → ONLY saves tokenUsage, NOT reasoning ✗
    ↓
AssistantMessage.reasoning = undefined (never set!)
    ↓
formatAssistantMessage() → has nothing to format anyway
    ↓
Reasoning NOT sent back to LLM ❌
```

**Should be (following OpenCode):**

```
LLM Response → reasoning-delta events received (with providerMetadata)
    ↓
stream-processor.ts → accumulates reasoningText AND reasoningMetadata
    ↓
updateAssistantMessage() → saves reasoning + reasoningMetadata + tokenUsage
    ↓
AssistantMessage.reasoning = "thinking..." ✓
AssistantMessage.reasoningMetadata = { openai: { itemId: "..." } } ✓
    ↓
formatAssistantMessage() → includes reasoning part with providerMetadata
    ↓
Reasoning sent back to LLM ✓
```

### 2. Token Calculation (/context)

**Current:**

```typescript
// Uses length/4 estimate for everything
systemPromptTokens = estimateStringTokens(systemPrompt); // length/4
messagesTokens = estimateMessagesTokens(preparedHistory); // length/4
toolsTokens = estimateToolTokens(tools); // length/4

total = systemPromptTokens + messagesTokens + toolsTokens;
freeSpace = maxTokens - total - outputBuffer;
```

**Problem:** The total doesn't match the API's actual count.

### 3. Compaction Decision

**Current (`turn-executor.ts`):**

```typescript
const estimatedTokens = estimateMessagesTokens(prepared.preparedHistory);
if (estimatedTokens > compactionThreshold) {
    // Compact!
}
```

**Problem:** Uses a different calculation than `/context`, and both are wrong!

---

## Proposed Solution

### Principle: Single Source of Truth

1. **Use actual token counts from the API as ground truth**
2. **Track tokens per message for accurate history calculation**
3. **Estimate only what we cannot measure**
4. **Same formula for `/context` AND compaction decisions**

---

## THE FORMULA (Precise Specification)

### Core Formula

```
estimatedNextInput = lastInputTokens + lastOutputTokens + newMessagesEstimate
```

### Variable Definitions

| Variable | Definition | Source | When Updated |
|----------|------------|--------|--------------|
| `lastInputTokens` | Tokens we SENT in the most recent LLM call | `tokenUsage.inputTokens` from API response | After EVERY LLM call |
| `lastOutputTokens` | Tokens the LLM RETURNED in its response | `tokenUsage.outputTokens` from API response | After EVERY LLM call |
| `newMessagesEstimate` | Estimate for messages added AFTER the last LLM call | `length/4` heuristic | Calculated on demand |

### What Counts as "New Messages"?
Messages added to history AFTER `lastInputTokens` was recorded:

- **Tool results** (role='tool') from the last assistant's tool calls
- **New user messages** typed since the last LLM call
- **Any injected system messages** added between calls

### Example Flow

```
┌─────────────────────────────────────────────────────────────────┐
│ Turn 1: User asks "What's the weather in NYC?"                  │
├─────────────────────────────────────────────────────────────────┤
│ LLM Call:                                                       │
│   inputTokens  = 5000 (system + tools + user message)           │
│   outputTokens = 100  (assistant: "I'll check" + tool_call)     │
│                                                                 │
│ After call: UPDATE lastInputTokens=5000, lastOutputTokens=100   │
└─────────────────────────────────────────────────────────────────┘
        │
        ▼
┌─────────────────────────────────────────────────────────────────┐
│ Tool executes, result added to history                          │
│   Tool result: "NYC: 72°F, sunny" (role='tool')                 │
│                                                                 │
│ This is a NEW MESSAGE (added after lastInputTokens recorded)    │
└─────────────────────────────────────────────────────────────────┘
        │
        ▼
┌─────────────────────────────────────────────────────────────────┐
│ Before Turn 2: Calculate estimated context                      │
├─────────────────────────────────────────────────────────────────┤
│   lastInputTokens     = 5000 (from Turn 1)                      │
│   lastOutputTokens    = 100  (from Turn 1)                      │
│   newMessagesEstimate = estimate(tool_result) ≈ 20              │
│                                                                 │
│   estimatedNextInput = 5000 + 100 + 20 = 5120                   │
└─────────────────────────────────────────────────────────────────┘
        │
        ▼
┌─────────────────────────────────────────────────────────────────┐
│ Turn 2: LLM processes tool result                               │
├─────────────────────────────────────────────────────────────────┤
│ LLM Call:                                                       │
│   inputTokens  = 5115 (ACTUAL - this is our ground truth!)      │
│   outputTokens = 50   (assistant: "The weather is 72°F...")     │
│                                                                 │
│ VERIFICATION: estimated=5120, actual=5115, error=+5 (+0.1%)     │
│                                                                 │
│ After call: UPDATE lastInputTokens=5115, lastOutputTokens=50    │
└─────────────────────────────────────────────────────────────────┘
```

### Verification Metrics

On EVERY LLM call, log the accuracy of our previous estimate:

```typescript
// Before LLM call
const estimated = lastInputTokens + lastOutputTokens + newMessagesEstimate;

// After LLM call, compare to actual
const actual = response.tokenUsage.inputTokens;
const error = estimated - actual;
const errorPercent = (error / actual) * 100;

logger.info(
    `Context estimate: estimated=${estimated}, actual=${actual}, ` +
    `error=${error > 0 ? '+' : ''}${error} (${errorPercent.toFixed(1)}%)`
);
```

### Breakdown for Display (Back-Calculation)

For the `/context` overlay we show a breakdown. Since we only know the TOTAL accurately, we back-calculate messages:

```typescript
const total = lastInputTokens + lastOutputTokens + newMessagesEstimate;

// These are estimates (we can't measure them directly)
const systemPromptEstimate = estimateTokens(systemPrompt); // length/4
const toolsEstimate = estimateToolsTokens(tools); // length/4

// Back-calculate messages so the math adds up
let messagesDisplay = total - systemPromptEstimate - toolsEstimate;

// If negative, our estimates are too high - cap at 0 and log a warning
if (messagesDisplay < 0) {
    logger.warn(
        `Back-calculated messages negative (${messagesDisplay}), estimates may be too high`
    );
    messagesDisplay = 0;
}
```

### Edge Cases

| Scenario | Behavior |
|----------|----------|
| **No LLM call yet** | `lastInputTokens=null`, fall back to pure estimation, show "(estimated)" label |
| **After compaction** | History changed significantly; set `lastInputTokens=null`, fall back to estimation until the next call |
| **messagesDisplay negative** | Cap at 0, log warning; indicates system/tools estimates are too high |
| **System prompt changed** | Next estimate may be off, but the next actual will correct it |
| **Tools changed (MCP)** | Same as above; self-correcting after the next call |

### What /context Should Display

```
Context Usage: 52,100 / 200,000 tokens (26%)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Breakdown:
  System prompt:   4,000 tokens (estimated)
  Tools:           8,000 tokens (estimated)
  Messages:       40,100 tokens (back-calculated)
  ─────────────────────────────
  Total:          52,100 tokens

Calculation basis:
  Last actual input:  50,000 tokens
  Last output:         2,000 tokens
  New since then:        100 tokens (estimated)
  Last estimate accuracy: +0.6% error

Free space: 131,900 tokens (after 16,000 output buffer)
```

### Implementation Checklist

- [ ] Store `lastInputTokens` and `lastOutputTokens` after each LLM call
- [ ] Track which messages are "new" since the last LLM call (needs message timestamp or index tracking)
- [ ] Calculate `newMessagesEstimate` only for messages added after the last call
- [ ] Log verification metrics on every LLM call
- [ ] Update the `/context` overlay to show this breakdown
- [ ] Handle edge cases (no call yet, after compaction)
- [ ] Use the SAME formula for compaction decisions

---

### Legacy Edge Cases (keeping for reference)

1. **No LLM call yet (new session)**
   - Fall back to pure estimation
   - All numbers are estimates with an "(estimated)" label
2. **messagesDisplay comes out negative**
   - Our estimates for system/tools are too high
   - Cap at 0, log a warning
   - Indicates estimation needs calibration
3. **After compaction**
   - Token counts reset with the new session
   - `compactionCount` tracks how many times we compacted
4. **Reasoning tokens**
   - Must be sent back to the LLM (fix formatter) ✅ DONE
   - Include in context calculation
   - Track separately for display

### Verification: Why `lastOutputTokens` Is Safe to Use Directly

*Verified on 2025-01-20 by analyzing AI SDK source code and our codebase*

**Question:** Does `outputTokens` include content that might be pruned before the next LLM call?

**Answer:** No.
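The intuition: pruning only rewrites old tool results (separate `role='tool'` messages), while the assistant's own output is kept verbatim, so counted output tokens never disappear from history. A toy sketch of that pruning rule; the `Msg` shape and `pruneForContext` are illustrative stand-ins, not Dexto's actual `manager.ts` code:

```typescript
// Hypothetical minimal message shape for illustration only.
interface Msg {
    role: 'user' | 'assistant' | 'tool';
    content: string;
    compactedAt?: number; // set when a tool result has been compacted
}

const PLACEHOLDER = '[Old tool result content cleared]';

// Only tool results marked with compactedAt get the placeholder;
// assistant and user messages pass through untouched.
function pruneForContext(history: Msg[]): Msg[] {
    return history.map((msg) =>
        msg.role === 'tool' && msg.compactedAt
            ? { ...msg, content: PLACEHOLDER }
            : msg
    );
}

const pruned = pruneForContext([
    { role: 'assistant', content: 'Calling weather tool' },
    { role: 'tool', content: 'NYC: 72°F, sunny', compactedAt: 1737350000 },
]);
// pruned[0] is untouched; pruned[1].content is now the placeholder
```

The design point the sketch makes concrete: the pruning predicate never matches `role === 'assistant'`, which is why the detailed argument below can treat `lastOutputTokens` as stable.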
`outputTokens` is safe to use directly because:

#### Part 1: What does `outputTokens` include? (AI SDK Verification)

**Anthropic** - verified via `ai/packages/anthropic/src/__fixtures__/anthropic-json-tool.1.chunks.txt`:

```json
{"type":"message_delta","delta":{"stop_reason":"tool_use"},"usage":{"output_tokens":47}}
```

The tool call response reports `output_tokens: 47` - **includes tool calls** ✅

**OpenAI** - verified via `ai/packages/openai/src/responses/__fixtures__/openai-shell-tool.1.chunks.txt`:

```json
{"output":[{"type":"shell_call","action":{"commands":["ls -a ~/Desktop"]}}],"usage":{"output_tokens":41}}
```

The shell tool call reports `output_tokens: 41` - **includes tool calls** ✅

**Google** - verified via `ai/packages/google/src/google-generative-ai-language-model.test.ts` lines 2274-2302:

```typescript
content: {
    parts: [{ functionCall: { name: 'test-tool', args: { value: 'test' } } }]
},
usageMetadata: { promptTokenCount: 10, candidatesTokenCount: 20, totalTokenCount: 30 }
```

The function call response reports `candidatesTokenCount: 20` - **includes tool calls** ✅

#### Part 2: What gets pruned in our system?

From `manager.ts` `prepareHistory()`:

- Only **tool result messages** (role='tool') can be pruned
- They're marked with a `compactedAt` timestamp
- Replaced with a placeholder: `[Old tool result content cleared]`

**What is NEVER pruned:**

- Assistant messages (text content)
- The assistant's tool calls
- User messages

#### Verification Table

| Message Type | Pruned? | Part of outputTokens? |
|--------------|---------|-----------------------|
| Assistant text | ❌ Never | ✅ Yes |
| Assistant tool calls | ❌ Never | ✅ Yes (verified across all providers) |
| Tool results (role='tool') | ✅ Can be pruned | ❌ No (separate messages) |

#### Code Evidence

- `stream-processor.ts`: Tool calls are stored via `addToolCall()` with full arguments
- `manager.ts` line 279: Only `msg.role === 'tool' && msg.compactedAt` gets the placeholder
- No code path exists to prune assistant messages

**Conclusion:** The formula `lastInputTokens + lastOutputTokens + newMessagesEstimate` is correct because:

- `lastInputTokens` reflects the pruned history (the API tells us exactly what was sent)
- `lastOutputTokens` is the assistant's response (text + tool calls), which is stored and sent back as-is
- All major providers (Anthropic, OpenAI, Google) include tool calls in their output token counts
- Only tool results (separate messages) can be pruned, and those are in `inputTokens`

---

## Implementation Plan

### Phase 1: Fix Reasoning Storage (HIGH PRIORITY - Bug #1) ✅ COMPLETED

**The root cause:** `stream-processor.ts` collects reasoning but never persists it.

**Files to modify:**

- `packages/core/src/llm/executor/stream-processor.ts`
- `packages/core/src/context/types.ts`

**Changes:**

1. Add a `reasoningMetadata` field to the `AssistantMessage` type:

   ```typescript
   // In context/types.ts
   interface AssistantMessage {
       reasoning?: string;
       reasoningMetadata?: Record<string, unknown>; // NEW - for provider round-tripping
       // ...
   }
   ```

2. Capture `providerMetadata` from reasoning-delta events:

   ```typescript
   // In stream-processor.ts, add field:
   private reasoningMetadata: Record<string, unknown> | undefined;

   // In reasoning-delta case:
   case 'reasoning-delta':
       this.reasoningText += event.text;
       // Capture provider metadata for round-tripping (OpenAI itemId, etc.)
       if (event.providerMetadata) {
           this.reasoningMetadata = event.providerMetadata;
       }
       // ... emit events
   ```

3.
   **Fix the bug** - persist reasoning in `updateAssistantMessage()`:

   ```typescript
   // In stream-processor.ts, 'finish' case (around line 315):
   if (this.assistantMessageId) {
       await this.contextManager.updateAssistantMessage(
           this.assistantMessageId,
           {
               tokenUsage: usage,
               reasoning: this.reasoningText || undefined, // ADD THIS
               reasoningMetadata: this.reasoningMetadata,  // ADD THIS
           }
       );
   }
   ```

### Phase 2: Fix Reasoning Round-Trip (Bug #2) ✅ COMPLETED

**Files to modify:**

- `packages/core/src/llm/formatters/vercel.ts`

**Changes:**

1. Update `formatAssistantMessage()` to include reasoning:

   ```typescript
   // In formatAssistantMessage(), before returning:
   if (msg.reasoning) {
       contentParts.push({
           type: 'reasoning',
           text: msg.reasoning,
           providerMetadata: msg.reasoningMetadata,
       });
   }
   ```

**Verified:** The Vercel AI SDK's `AssistantContent` type supports `ReasoningPart`:

```typescript
// packages/provider-utils/src/types/assistant-model-message.ts
export type AssistantContent =
    | string
    | Array<TextPart | FilePart | ReasoningPart | ToolCallPart | ToolResultPart>;

// packages/provider-utils/src/types/content-part.ts
export interface ReasoningPart {
    type: 'reasoning';
    text: string;
    providerOptions?: ProviderOptions; // For round-tripping provider metadata
}
```

### Phase 3: Unified Context Calculation ✅ COMPLETED

**Files to modify:**

- `packages/core/src/context/manager.ts` - `getContextTokenEstimate()`
- `packages/core/src/llm/executor/turn-executor.ts` - compaction check
- `packages/cli/src/cli/ink-cli/components/overlays/ContextStatsOverlay.tsx`

**Changes:**

1. Create a shared `calculateContextUsage()` function:

   ```typescript
   // New file: packages/core/src/context/context-calculator.ts
   export async function calculateContextUsage(
       contextManager: ContextManager,
       tools: ToolDefinitions,
       maxContextTokens: number,
       outputBuffer: number
   ): Promise<ContextUsage> {
       // Implement the formula above
   }
   ```

2. Use it in `/context`:

   ```typescript
   // In DextoAgent.getContextStats()
   const usage = await calculateContextUsage(...);
   return usage;
   ```

3.
   Use it in the compaction decision:

   ```typescript
   // In turn-executor.ts
   const usage = await calculateContextUsage(...);
   if (usage.total > compactionThreshold) {
       // Compact!
   }
   ```

### Phase 4: Message-Level Token Tracking

**Already implemented!** We just need to use it:

```typescript
// In calculateContextUsage(), sum from messages:
const history = await contextManager.getHistory();

let totalInputFromMessages = 0;
let totalOutputFromMessages = 0;
let totalReasoningFromMessages = 0;

for (const msg of history) {
    if (msg.role === 'assistant' && msg.tokenUsage) {
        totalOutputFromMessages += msg.tokenUsage.outputTokens ?? 0;
        totalReasoningFromMessages += msg.tokenUsage.reasoningTokens ?? 0;
    }
}
```

### Phase 5: Calibration & Logging

1. Log estimate vs actual on every LLM call (already done, level=info)
2. Track the calibration ratio over time
3. Consider adaptive estimation based on observed ratios

### Phase 6: Future - API Token Counting

**For Anthropic:**

```typescript
// New method in the Anthropic service
async countTokens(messages: Message[], tools: Tool[]): Promise<{
    input_tokens: number;
}>
```

**For other providers:**

- tiktoken for OpenAI
- Gemini countTokens API
- Fallback to estimation

---

## Data Flow Diagram

### Current State (BROKEN)

```
┌─────────────────────────────────────────────────────────────────────┐
│ LLM Response Stream                                                 │
├─────────────────────────────────────────────────────────────────────┤
│ reasoning-delta events → reasoningText accumulated ✓                │
│ text-delta events      → content accumulated ✓                      │
│ finish event           → usage: { inputTokens, outputTokens, ... }  │
└─────────────────────────────────────────────────────────────────────┘
        │
        ▼
┌─────────────────────────────────────────────────────────────────────┐
│ stream-processor.ts updateAssistantMessage()                        │
├─────────────────────────────────────────────────────────────────────┤
│ await this.contextManager.updateAssistantMessage(                   │
│     this.assistantMessageId,                                        │
│     { tokenUsage: usage }    ← ONLY tokenUsage saved!               │
│ );                           ← reasoning NOT included! ✗            │
└─────────────────────────────────────────────────────────────────────┘
        │
        ▼
┌─────────────────────────────────────────────────────────────────────┐
│ AssistantMessage Stored                                             │
├─────────────────────────────────────────────────────────────────────┤
│ {                                                                   │
│   role: 'assistant',                                                │
│   content: [...],        ← ✓ Stored                                 │
│   reasoning: undefined,  ← ✗ NEVER SET!                             │
│   tokenUsage: {...}      ← ✓ Stored                                 │
│ }                                                                   │
└─────────────────────────────────────────────────────────────────────┘
```

### Target State (FIXED)

```
┌─────────────────────────────────────────────────────────────────────┐
│ LLM Response Stream                                                 │
├─────────────────────────────────────────────────────────────────────┤
│ reasoning-delta events → reasoningText + providerMetadata ✓         │
│ text-delta events      → content accumulated ✓                      │
│ finish event           → usage: { inputTokens, outputTokens, ... }  │
└─────────────────────────────────────────────────────────────────────┘
        │
        ▼
┌─────────────────────────────────────────────────────────────────────┐
│ stream-processor.ts updateAssistantMessage()                        │
├─────────────────────────────────────────────────────────────────────┤
│ await this.contextManager.updateAssistantMessage(                   │
│     this.assistantMessageId,                                        │
│     {                                                               │
│         tokenUsage: usage,                                          │
│         reasoning: this.reasoningText,          ← NEW               │
│         reasoningMetadata: this.reasoningMetadata ← NEW             │
│     }                                                               │
│ );                                                                  │
└─────────────────────────────────────────────────────────────────────┘
        │
        ▼
┌─────────────────────────────────────────────────────────────────────┐
│ AssistantMessage Stored                                             │
├─────────────────────────────────────────────────────────────────────┤
│ {                                                                   │
│   role: 'assistant',                                                │
│   content: [...],                                                   │
│   reasoning: 'Let me think...',                 ← ✓ Now stored      │
│   reasoningMetadata: { openai: { itemId: '...' } }, ← ✓ Round-trip  │
│   tokenUsage: { inputTokens, outputTokens, reasoningTokens }        │
│ }                                                                   │
└─────────────────────────────────────────────────────────────────────┘
        │
        ▼
┌─────────────────────────────────────────────────────────────────────┐
│ Next LLM Call (Formatter)                                           │
├─────────────────────────────────────────────────────────────────────┤
│ formatAssistantMessage() includes:                                  │
│   - content (text parts)            ✓ Already done                  │
│   - toolCalls                       ✓ Already done                  │
│   - reasoning + providerMetadata    ✓ NEW - enables round-trip      │
└─────────────────────────────────────────────────────────────────────┘
        │
        ▼
┌─────────────────────────────────────────────────────────────────────┐
│ /context Calculation                                                │
├─────────────────────────────────────────────────────────────────────┤
│ currentTotal = lastInput + lastOutput + newMessagesEstimate         │
│                                                                     │
│ Breakdown:                                                          │
│   systemPrompt = estimate (length/4)                                │
│   tools        = estimate (length/4)                                │
│   messages     = currentTotal - systemPrompt - tools (back-calc)    │
│   reasoning    = sum(msg.tokenUsage.reasoningTokens) (for display)  │
│                                                                     │
│ freeSpace = maxTokens - currentTotal - outputBuffer                 │
└─────────────────────────────────────────────────────────────────────┘
        │
        ▼
┌─────────────────────────────────────────────────────────────────────┐
│ Compaction Decision                                                 │
├─────────────────────────────────────────────────────────────────────┤
│ SAME FORMULA as /context!                                           │
│                                                                     │
│ if (currentTotal > compactionThreshold) {                           │
│     triggerCompaction();                                            │
│ }                                                                   │
└─────────────────────────────────────────────────────────────────────┘
```

---

## Testing Strategy

### Unit Tests

1. **Reasoning storage test (Phase 1)**
   - Mock an LLM stream with reasoning-delta events
   - Verify `stream-processor.ts` calls `updateAssistantMessage()` with reasoning
   - Verify `reasoningMetadata` is captured from `providerMetadata`
2.
   **Reasoning round-trip test (Phase 2)**
   - Create an `AssistantMessage` with `reasoning` and `reasoningMetadata`
   - Call `formatAssistantMessage()`
   - Verify the output contains a reasoning part with `providerMetadata`
3. **Token calculation test (Phase 3)**
   - Mock a message with known tokenUsage
   - Verify the calculation matches the expected value
4. **Edge case tests**
   - New session (no actuals): falls back to estimation
   - Negative messagesDisplay (capped at 0)
   - Post-compaction state
   - Empty reasoning (should not create an empty reasoning part)

### Integration Tests

1. **Full reasoning flow test**
   - Enable extended thinking on Claude
   - Send a message that triggers reasoning
   - Verify reasoning is persisted to the message
   - Send a follow-up message
   - Verify reasoning is sent back to the LLM (check the formatted messages)
2. **Token tracking test**
   - Send a message
   - Verify tokenUsage is stored on the message
   - Open /context
   - Verify the numbers use the actual count from the last call
3. **Compaction alignment test**
   - Fill the context near the threshold
   - Verify /context and compaction trigger at the same point

---

## Success Criteria

1. **Numbers add up**: Total = SystemPrompt + Tools + Messages
2. **Consistency**: /context and compaction use the same calculation
3. **Reasoning works**: Traces are sent back to the LLM correctly
4. **Calibration visible**: Logs show the estimate vs actual ratio
5. **Provider compatibility**: Works with Anthropic, OpenAI, Google, etc.
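To make criteria 1 and 2 concrete, the core formula can be sketched as a single pure function. This is a minimal illustration only; the names `LastCall`, `ContextUsage`, and `estimateTokens` are hypothetical placeholders, not the real module proposed in Phase 3:

```typescript
// Hypothetical shapes for illustration; not Dexto's actual types.
interface LastCall {
    inputTokens: number;
    outputTokens: number;
}

interface ContextUsage {
    total: number;
    freeSpace: number;
    estimated: boolean; // true when no actuals are available to anchor on
}

// The length/4 heuristic used throughout the plan.
function estimateTokens(text: string): number {
    return Math.round(text.length / 4);
}

function calculateContextUsage(
    lastCall: LastCall | null, // null before the first call / after compaction
    newMessages: string[],     // messages added since lastCall was recorded
    maxContextTokens: number,
    outputBuffer: number
): ContextUsage {
    const newMessagesEstimate = newMessages.reduce(
        (sum, m) => sum + estimateTokens(m),
        0
    );
    // Core formula: estimatedNextInput = lastInput + lastOutput + newEstimate,
    // with a pure-estimation fallback when there is no previous call.
    const total = lastCall
        ? lastCall.inputTokens + lastCall.outputTokens + newMessagesEstimate
        : newMessagesEstimate;
    return {
        total,
        freeSpace: Math.max(0, maxContextTokens - total - outputBuffer),
        estimated: lastCall === null,
    };
}

// Turn 1 actuals from the example flow plus one 16-char tool result (≈ 4 tokens):
const usage = calculateContextUsage(
    { inputTokens: 5000, outputTokens: 100 },
    ['NYC: 72°F, sunny'],
    200_000,
    16_000
);
// usage.total = 5000 + 100 + 4 = 5104
```

Because the function is pure, `/context` and the compaction check can call the same code path and are guaranteed to agree, which is the whole point of the "single source of truth" principle.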
---

## Appendix: Verification Against Other Implementations

*This plan was verified against actual implementations on 2025-01-20.*

### OpenCode Verification (~/Projects/external/opencode)

| Claim | Verified | Evidence |
|-------|----------|----------|
| Stores reasoning as `ReasoningPart` | ✅ | `message-v2.ts` lines 78-89 |
| Includes `providerMetadata` for round-tripping | ✅ | `message-v2.ts` lines 554-560 |
| `toModelMessage()` sends reasoning back | ✅ | `message-v2.ts` lines 435-569 |
| Tracks reasoning tokens separately | ✅ | `session/index.ts` line 432, schemas throughout |
| Handles provider-specific metadata | ✅ | `openai-responses-language-model.ts` lines 520-538 |

**OpenCode approach:** Full round-trip of reasoning with provider metadata. This is our reference implementation.

### Gemini-CLI Verification (~/Projects/external/gemini-cli)

| Claim in Original Plan | Actual Behavior | Status |
|------------------------|-----------------|--------|
| "Parts with thought: true included when sending history back" | **WRONG** - They filter OUT thoughts at line 815 | ❌ Corrected |
| Uses `thought: true` flag | ✅ Correct | ✅ |
| Tracks `thoughtsTokenCount` | ✅ Correct - `chatRecordingService.ts` line 278 | ✅ |

**Gemini-CLI approach:** Track thought tokens for cost/display but do NOT round-trip them. This is simpler but requires Google-specific handling.

### Why We Follow OpenCode

1. **Same SDK**: Both use the Vercel AI SDK
2. **Provider-agnostic**: Works across all providers without special-casing
3. **Future-proof**: Preserves metadata for providers that need it
4.
   **Simpler code**: No provider-specific filtering logic

### Dexto Implementation Verification

| Component | Current State | Bug |
|-----------|---------------|-----|
| `stream-processor.ts` | Accumulates `reasoningText` but doesn't persist it | **Bug #1** |
| `vercel.ts` formatter | Ignores `msg.reasoning` | **Bug #2** (blocked by #1) |
| `AssistantMessage` type | Has a `reasoning?: string` field | ✅ Ready |
| Per-message `tokenUsage` | Stored via `updateAssistantMessage()` | ✅ Working |
| `lastActualInputTokens` | Set after each LLM call | ✅ Working |
| Compaction calculation | Uses `estimateMessagesTokens()` only | Different from /context |
| `/context` calculation | Uses full estimation (system + tools + messages) | Different from compaction |
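The "Verification Metrics" step that closes the gap between the two calculations above can also be expressed as a pure function, which makes it trivial to unit-test. A hedged sketch; `EstimateCheck` and `checkEstimate` are illustrative names, not existing Dexto code:

```typescript
// Hypothetical result shape for the estimate-vs-actual check.
interface EstimateCheck {
    estimated: number;
    actual: number;
    error: number;        // positive = we over-estimated
    errorPercent: number; // error relative to the actual count
}

// Computes the accuracy of the previous estimate once the API reports
// the actual inputTokens for the call.
function checkEstimate(
    lastInputTokens: number,
    lastOutputTokens: number,
    newMessagesEstimate: number,
    actualInputTokens: number
): EstimateCheck {
    const estimated = lastInputTokens + lastOutputTokens + newMessagesEstimate;
    const error = estimated - actualInputTokens;
    return {
        estimated,
        actual: actualInputTokens,
        error,
        errorPercent: (error / actualInputTokens) * 100,
    };
}

// Turn 2 from the example flow: estimated 5000 + 100 + 20, actual 5115.
const check = checkEstimate(5000, 100, 20, 5115);
// check.estimated = 5120, check.error = +5 (about +0.1%)
```

Logging this struct on every call (Phase 5) gives the calibration data needed to decide whether the length/4 heuristic for `newMessagesEstimate` ever needs an adaptive correction factor.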