# Context Window Calculation Analysis
## Problem Statement

Our `/context` overlay shows inconsistent numbers:

- **Total shown**: 122.4k tokens (from the API's actual count)
- **Breakdown sum**: ~73k tokens (our length/4 estimates)
- **Free space**: calculated from the breakdown, not the actual total

This leads to a confusing UX where the numbers don't add up.

Additionally, our compaction decision uses a different calculation than `/context`, adding another inconsistency.

---
## Critical Finding #1: Reasoning Tokens Not Sent Back to the LLM

### Current State (Dexto)

**We have the type but DON'T actually store reasoning:**

```typescript
// AssistantMessage in context/types.ts
interface AssistantMessage {
    reasoning?: string; // Field EXISTS but is never populated!
    tokenUsage?: TokenUsage;
    // ...
}
```

**Two separate bugs:**

1. **`stream-processor.ts` never persists reasoning text:**

```typescript
// Line 24: reasoning IS accumulated during streaming
private reasoningText: string = '';

// Lines 97-108: accumulated from reasoning-delta events
case 'reasoning-delta':
    this.reasoningText += event.text; // ✓ Collected

// BUT lines 314-320: only tokenUsage is persisted!
await this.contextManager.updateAssistantMessage(
    this.assistantMessageId,
    { tokenUsage: usage } // ✗ No reasoning field!
);
```

2. **`formatAssistantMessage()` in `vercel.ts` ignores `msg.reasoning`:**
   - Only extracts `msg.content` (text parts) and `msg.toolCalls`
   - Even if reasoning WAS stored, it wouldn't be sent back

**Result:** Reasoning is collected → emitted to events → but never persisted or round-tripped.
### How OpenCode Handles It (Correctly)

```typescript
// In toModelMessage() - opencode/src/session/message-v2.ts
if (part.type === "reasoning") {
    assistantMessage.parts.push({
        type: "reasoning",
        text: part.text,
        providerMetadata: part.metadata, // Critical for round-tripping!
    })
}
```

OpenCode:
1. Stores reasoning as `ReasoningPart` in message parts
2. Includes `providerMetadata` (contains thought signatures for Gemini, etc.)
3. Sends reasoning back in the `toModelMessage()` conversion
4. Tracks `reasoning` tokens separately in token usage

### How Gemini-CLI Handles It (Different Approach)

```typescript
// Uses a thought: true flag on parts from the model
{ text: 'Hmm', thought: true }

// BUT they explicitly FILTER OUT thoughts before storing in history!
// geminiChat.ts line 815:
modelResponseParts.push(
    ...content.parts.filter((part) => !part.thought), // Filter OUT thoughts
);

// Token tracking still captures thoughtsTokenCount from the API response
// chatRecordingService.ts line 278:
tokens.thoughts = respUsageMetadata.thoughtsTokenCount ?? 0;
```

**Key difference:** Gemini-CLI tracks thought tokens for display/cost but does NOT round-trip them.
This works because Google's API doesn't require thought history for context continuity.
### Why We Follow OpenCode's Approach

1. **We use the Vercel AI SDK** like OpenCode, not Google's native SDK
2. **Provider-agnostic**: OpenCode's approach works across all providers
3. **No provider-specific logic**: we shouldn't special-case Google's behavior
4. **Context continuity**: some providers (especially via the AI SDK) may need reasoning for proper state

### Impact of Current Bugs

1. **Context continuity broken**: reasoning traces are lost between turns
2. **Token counting incorrect**: reasoning tokens are used but not tracked in context
3. **Provider metadata lost**: cannot round-trip provider-specific metadata (e.g., OpenAI item IDs)

---
## Critical Finding #2: Token Usage Storage

### What We Track

**Session Level** (`session-manager.ts`):

```typescript
sessionData.tokenUsage = {
    inputTokens: 0,
    outputTokens: 0,
    reasoningTokens: 0,
    cacheReadTokens: 0,
    cacheWriteTokens: 0,
    totalTokens: 0,
};
```

**Message Level** (`AssistantMessage`):

```typescript
interface AssistantMessage {
    tokenUsage?: TokenUsage; // Available but...
}
```

### Current Flow

1. `stream-processor.ts` creates the assistant message with empty metadata:

```typescript
await this.contextManager.addAssistantMessage('', [], {});
```

2. After streaming completes, we DO update with token usage:

```typescript
await this.contextManager.updateAssistantMessage(
    this.assistantMessageId,
    { tokenUsage: usage }
);
```

**So we HAVE the data on each message**; we just don't use it for context calculation!

---
## Critical Finding #3: Estimate vs Actual Mismatch

### The Problem

```
API actual inputTokens: 122.4k
Our length/4 estimate:  73.0k
Difference:             49.4k (actual is ~67% higher than our estimate!)
```

### Why So Different?

1. **Tokenizers don't split evenly by characters**
   - Code tokenizes differently than prose
   - JSON schemas are verbose when tokenized
   - Special-character and whitespace handling varies

2. **We're comparing different things**
   - `actualTokens` = from the last LLM call (includes everything sent)
   - `breakdown estimate` = calculated now on current history

3. **Context has grown since the last call**
   - The last call's `inputTokens` doesn't include the response that followed
   - New user messages have been added since
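Point 1 above is why a flat length/4 drifts so badly on mixed content. A character-class-aware heuristic like gemini-cli's narrows the gap; this is an illustrative sketch (the function names are mine, not from any codebase):

```typescript
// Sketch of a character-class-aware token estimator (hypothetical helper).
// ASCII text averages ~4 chars per token (0.25 tokens/char), while CJK and
// other non-ASCII text is closer to ~1.3 tokens per char.
function estimateTokensWeighted(text: string): number {
    let tokens = 0;
    for (const ch of text) {
        tokens += ch.codePointAt(0)! < 128 ? 0.25 : 1.3;
    }
    return Math.round(tokens);
}

// Flat length/4 for comparison
function estimateTokensNaive(text: string): number {
    return Math.round(text.length / 4);
}
```

The two agree on pure ASCII but diverge sharply on CJK-heavy or symbol-heavy text, which is one source of the mismatch above.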
---

## How Other Tools Handle This

### Claude Code (Anthropic)

**Uses the `/v1/messages/count_tokens` API for exact counts!**

```javascript
// From cli.js (minified)
countTokens(A, Q) {
    return this._client.post("/v1/messages/count_tokens", { body: A, ...Q })
}
```

**Categories tracked:**
- System prompt
- System tools
- Memory files
- Skills
- MCP tools (with deferred loading)
- Agents
- Messages (with sub-breakdown)
- Free space
- Autocompact buffer

**Free space calculation:**
```javascript
// YA = sum of all category tokens (excluding deferred)
let YA = k.reduce((CA, _A) => CA + (_A.isDeferred ? 0 : _A.tokens), 0)

// WA = buffer (autocompact or compact)
let WA = autocompactEnabled ? (maxTokens - contextUsed) : 500;

// Free space
let wA = Math.max(0, maxTokens - YA - WA)
```
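De-minified, that logic reads roughly as below. The names (`categories`, `buffer`) are my reconstruction of `YA`/`WA`/`wA`; treat this as a sketch of the math, not Anthropic's actual code:

```typescript
interface Category { tokens: number; isDeferred: boolean }

// Reconstruction of the free-space math from the minified source (names guessed).
function freeSpace(
    categories: Category[],
    maxTokens: number,
    autocompactEnabled: boolean,
    contextUsed: number,
): number {
    // Sum all non-deferred category tokens (YA in the minified source)
    const used = categories.reduce((sum, c) => sum + (c.isDeferred ? 0 : c.tokens), 0);
    // Reserve a buffer (WA): the autocompact headroom, or a flat 500
    const buffer = autocompactEnabled ? maxTokens - contextUsed : 500;
    // Free space never goes negative (wA)
    return Math.max(0, maxTokens - used - buffer);
}
```

Note that deferred items (e.g., MCP tools with deferred loading) are excluded from the sum, so they don't eat into displayed free space.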
### gemini-cli

**Hybrid approach:**

```typescript
// Sync estimation (fast)
estimateTokenCountSync(parts): number {
    // ASCII: ~4 chars per token (0.25 tokens/char)
    // Non-ASCII/CJK: ~1-2 chars per token (1.3 tokens/char)
}

// API counting (when needed)
if (hasMedia) {
    use Gemini countTokens API
} else {
    use sync estimation
}
```

**Token tracking from API response:**

```typescript
{
    input: promptTokenCount,
    output: candidatesTokenCount,
    cached: cachedContentTokenCount,
    thoughts: thoughtsTokenCount, // Reasoning!
    tool: toolUsePromptTokenCount,
    total: totalTokenCount
}
```

### opencode

**Simple estimation + detailed tracking:**

```typescript
Token.estimate(input: string): number {
    return Math.round(input.length / 4)
}

// But tracks actuals per message:
StepFinishPart {
    tokens: {
        input: number,
        output: number,
        reasoning: number,
        cache: { read: number, write: number }
    }
}
```

---
## Current Architecture Issues

### 1. Reasoning Pipeline (BROKEN - Two Bugs)

**Current (broken):**

```
LLM Response → reasoning-delta events received
        ↓
stream-processor.ts → accumulates reasoningText ✓
        ↓
updateAssistantMessage() → ONLY saves tokenUsage, NOT reasoning ✗
        ↓
AssistantMessage.reasoning = undefined (never set!)
        ↓
formatAssistantMessage() → has nothing to format anyway
        ↓
Reasoning NOT sent back to LLM ❌
```

**Should be (following OpenCode):**

```
LLM Response → reasoning-delta events received (with providerMetadata)
        ↓
stream-processor.ts → accumulates reasoningText AND reasoningMetadata
        ↓
updateAssistantMessage() → saves reasoning + reasoningMetadata + tokenUsage
        ↓
AssistantMessage.reasoning = "thinking..." ✓
AssistantMessage.reasoningMetadata = { openai: { itemId: "..." } } ✓
        ↓
formatAssistantMessage() → includes reasoning part with providerMetadata
        ↓
Reasoning sent back to LLM ✓
```

### 2. Token Calculation (/context)

**Current:**

```typescript
// Uses length/4 estimates for everything
systemPromptTokens = estimateStringTokens(systemPrompt); // length/4
messagesTokens = estimateMessagesTokens(preparedHistory); // length/4
toolsTokens = estimateToolTokens(tools); // length/4

total = systemPromptTokens + messagesTokens + toolsTokens;
freeSpace = maxTokens - total - outputBuffer;
```

**Problem:** The total doesn't match the API's actual count.

### 3. Compaction Decision

**Current (`turn-executor.ts`):**

```typescript
const estimatedTokens = estimateMessagesTokens(prepared.preparedHistory);
if (estimatedTokens > compactionThreshold) {
    // Compact!
}
```

**Problem:** Uses a different calculation than `/context`, and both are wrong!

---
## Proposed Solution

### Principle: Single Source of Truth

1. **Use actual token counts from the API as ground truth**
2. **Track tokens per message for accurate history calculation**
3. **Estimate only what we cannot measure**
4. **Use the same formula for `/context` AND compaction decisions**

---

## THE FORMULA (Precise Specification)

### Core Formula

```
estimatedNextInput = lastInputTokens + lastOutputTokens + newMessagesEstimate
```

### Variable Definitions

| Variable | Definition | Source | When Updated |
|----------|------------|--------|--------------|
| `lastInputTokens` | Tokens we SENT in the most recent LLM call | `tokenUsage.inputTokens` from the API response | After EVERY LLM call |
| `lastOutputTokens` | Tokens the LLM RETURNED in its response | `tokenUsage.outputTokens` from the API response | After EVERY LLM call |
| `newMessagesEstimate` | Estimate for messages added AFTER the last LLM call | length/4 heuristic | Calculated on demand |
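Put together, the formula plus its no-ground-truth fallback is only a few lines. This sketch uses illustrative names (`LastCall`, `estimateNextInput`), not the actual Dexto API:

```typescript
// Sketch of the core formula with the "no LLM call yet" fallback.
interface LastCall { inputTokens: number; outputTokens: number }

function estimateNextInput(
    lastCall: LastCall | null,   // null before the first LLM call / after compaction
    newMessagesEstimate: number, // length/4 estimate of messages added since
    fullHistoryEstimate: number, // length/4 estimate of everything (fallback)
): number {
    if (lastCall === null) {
        // No ground truth yet: fall back to pure estimation
        return fullHistoryEstimate;
    }
    return lastCall.inputTokens + lastCall.outputTokens + newMessagesEstimate;
}
```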
### What Counts as "New Messages"?

Messages added to history AFTER `lastInputTokens` was recorded:

- **Tool results** (role='tool') from the last assistant's tool calls
- **New user messages** typed since the last LLM call
- **Any injected system messages** added between calls
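One simple way to identify those messages is to record the history length at the moment token usage arrives, then treat everything past that index as "new". Index-based tracking is an assumption here (timestamps would work too):

```typescript
// Sketch: index-based tracking of "new" messages (hypothetical helper).
interface TrackedHistory<M> {
    messages: M[];
    lastCallIndex: number | null; // history length when lastInputTokens was recorded
}

function newMessagesSinceLastCall<M>(h: TrackedHistory<M>): M[] {
    // Before any LLM call, every message is "new"
    if (h.lastCallIndex === null) return h.messages;
    return h.messages.slice(h.lastCallIndex);
}
```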
### Example Flow

```
┌─────────────────────────────────────────────────────────────────┐
│ Turn 1: User asks "What's the weather in NYC?"                  │
├─────────────────────────────────────────────────────────────────┤
│ LLM Call:                                                       │
│   inputTokens = 5000  (system + tools + user message)           │
│   outputTokens = 100  (assistant: "I'll check" + tool_call)     │
│                                                                 │
│ After call: UPDATE lastInputTokens=5000, lastOutputTokens=100   │
└─────────────────────────────────────────────────────────────────┘
        │
        ▼
┌─────────────────────────────────────────────────────────────────┐
│ Tool executes, result added to history                          │
│   Tool result: "NYC: 72°F, sunny" (role='tool')                 │
│                                                                 │
│ This is a NEW MESSAGE (added after lastInputTokens recorded)    │
└─────────────────────────────────────────────────────────────────┘
        │
        ▼
┌─────────────────────────────────────────────────────────────────┐
│ Before Turn 2: Calculate estimated context                      │
├─────────────────────────────────────────────────────────────────┤
│   lastInputTokens = 5000    (from Turn 1)                       │
│   lastOutputTokens = 100    (from Turn 1)                       │
│   newMessagesEstimate = estimate(tool_result) ≈ 20              │
│                                                                 │
│   estimatedNextInput = 5000 + 100 + 20 = 5120                   │
└─────────────────────────────────────────────────────────────────┘
        │
        ▼
┌─────────────────────────────────────────────────────────────────┐
│ Turn 2: LLM processes tool result                               │
├─────────────────────────────────────────────────────────────────┤
│ LLM Call:                                                       │
│   inputTokens = 5115  (ACTUAL - this is our ground truth!)      │
│   outputTokens = 50   (assistant: "The weather is 72°F...")     │
│                                                                 │
│ VERIFICATION: estimated=5120, actual=5115, error=+5 (+0.1%)     │
│                                                                 │
│ After call: UPDATE lastInputTokens=5115, lastOutputTokens=50    │
└─────────────────────────────────────────────────────────────────┘
```
### Verification Metrics

On EVERY LLM call, log the accuracy of our previous estimate:

```typescript
// Before the LLM call
const estimated = lastInputTokens + lastOutputTokens + newMessagesEstimate;

// After the LLM call, compare to actual
const actual = response.tokenUsage.inputTokens;
const error = estimated - actual;
const errorPercent = (error / actual) * 100;

logger.info(`Context estimate: estimated=${estimated}, actual=${actual}, error=${error > 0 ? '+' : ''}${error} (${errorPercent.toFixed(1)}%)`);
```
### Breakdown for Display (Back-Calculation)

For the `/context` overlay, we show a breakdown. Since we only know the TOTAL accurately, we back-calculate messages:

```typescript
const total = lastInputTokens + lastOutputTokens + newMessagesEstimate;

// These are estimates (we can't measure them directly)
const systemPromptEstimate = estimateTokens(systemPrompt); // length/4
const toolsEstimate = estimateToolsTokens(tools); // length/4

// Back-calculate messages so the math adds up
// (let, not const: we may clamp it below)
let messagesDisplay = total - systemPromptEstimate - toolsEstimate;

// If negative, our estimates are too high - cap at 0 and log a warning
if (messagesDisplay < 0) {
    logger.warn(`Back-calculated messages negative (${messagesDisplay}), estimates may be too high`);
    messagesDisplay = 0;
}
```
### Edge Cases

| Scenario | Behavior |
|----------|----------|
| **No LLM call yet** | `lastInputTokens=null`, fall back to pure estimation, show "(estimated)" label |
| **After compaction** | History changed significantly; set `lastInputTokens=null`, fall back to estimation until the next call |
| **messagesDisplay negative** | Cap at 0, log warning - indicates system/tools estimates are too high |
| **System prompt changed** | Next estimate may be off, but the next actual will correct it |
| **Tools changed (MCP)** | Same as above - self-correcting after the next call |

### What /context Should Display

```
Context Usage: 52,100 / 200,000 tokens (26%)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Breakdown:
  System prompt:   4,000 tokens  (estimated)
  Tools:           8,000 tokens  (estimated)
  Messages:       40,100 tokens  (back-calculated)
  ─────────────────────────────
  Total:          52,100 tokens

Calculation basis:
  Last actual input:  50,000 tokens
  Last output:         2,000 tokens
  New since then:        100 tokens (estimated)

Last estimate accuracy: +0.6% error

Free space: 131,900 tokens (after 16,000 output buffer)
```

### Implementation Checklist

- [ ] Store `lastInputTokens` and `lastOutputTokens` after each LLM call
- [ ] Track which messages are "new" since the last LLM call (needs message timestamp or index tracking)
- [ ] Calculate `newMessagesEstimate` only for messages added after the last call
- [ ] Log verification metrics on every LLM call
- [ ] Update the `/context` overlay to show this breakdown
- [ ] Handle edge cases (no call yet, after compaction)
- [ ] Use the SAME formula for compaction decisions
---

### Legacy Edge Cases (keeping for reference)

1. **No LLM call yet (new session)**
   - Fall back to pure estimation
   - All numbers are estimates with an "(estimated)" label

2. **messagesDisplay comes out negative**
   - Our estimates for system/tools are too high
   - Cap at 0, log a warning
   - Indicates estimation needs calibration

3. **After compaction**
   - Token counts reset with the new session
   - `compactionCount` tracks how many times we've compacted

4. **Reasoning tokens**
   - Must be sent back to the LLM (fix formatter) ✅ DONE
   - Include in context calculation
   - Track separately for display
### Verification: Why `lastOutputTokens` Is Safe to Use Directly

*Verified on 2025-01-20 by analyzing AI SDK source code and our codebase*

**Question:** Does `outputTokens` include content that might be pruned before the next LLM call?

**Answer:** No. `outputTokens` is safe to use directly because:

#### Part 1: What does `outputTokens` include? (AI SDK Verification)

**Anthropic** - verified via `ai/packages/anthropic/src/__fixtures__/anthropic-json-tool.1.chunks.txt`:

```json
{"type":"message_delta","delta":{"stop_reason":"tool_use"},"usage":{"output_tokens":47}}
```

The tool call response reports `output_tokens: 47` - **includes tool calls** ✅

**OpenAI** - verified via `ai/packages/openai/src/responses/__fixtures__/openai-shell-tool.1.chunks.txt`:

```json
{"output":[{"type":"shell_call","action":{"commands":["ls -a ~/Desktop"]}}],"usage":{"output_tokens":41}}
```

The shell tool call reports `output_tokens: 41` - **includes tool calls** ✅

**Google** - verified via `ai/packages/google/src/google-generative-ai-language-model.test.ts` lines 2274-2302:

```typescript
content: { parts: [{ functionCall: { name: 'test-tool', args: { value: 'test' } } }] },
usageMetadata: { promptTokenCount: 10, candidatesTokenCount: 20, totalTokenCount: 30 }
```

The function call response reports `candidatesTokenCount: 20` - **includes tool calls** ✅

#### Part 2: What gets pruned in our system?

From `manager.ts` `prepareHistory()`:

- Only **tool result messages** (role='tool') can be pruned
- They're marked with a `compactedAt` timestamp
- Replaced with a placeholder: `[Old tool result content cleared]`

**What is NEVER pruned:**

- Assistant messages (text content)
- The assistant's tool calls
- User messages

#### Verification Table

| Message Type | Pruned? | Part of outputTokens? |
|-------------|---------|----------------------|
| Assistant text | ❌ Never | ✅ Yes |
| Assistant tool calls | ❌ Never | ✅ Yes (verified across all providers) |
| Tool results (role='tool') | ✅ Can be pruned | ❌ No (separate messages) |

#### Code Evidence

- `stream-processor.ts`: tool calls stored via `addToolCall()` with full arguments
- `manager.ts` line 279: only `msg.role === 'tool' && msg.compactedAt` gets the placeholder
- No code path exists to prune assistant messages

**Conclusion:** The formula `lastInputTokens + lastOutputTokens + newMessagesEstimate` is correct because:

- `lastInputTokens` reflects pruned history (the API tells us exactly what was sent)
- `lastOutputTokens` is the assistant's response (text + tool calls), which is stored and sent back as-is
- All major providers (Anthropic, OpenAI, Google) include tool calls in their output token counts
- Only tool results (separate messages) can be pruned, and those are in `inputTokens`

---
## Implementation Plan

### Phase 1: Fix Reasoning Storage (HIGH PRIORITY - Bug #1) ✅ COMPLETED

**The root cause:** `stream-processor.ts` collects reasoning but never persists it.

**Files to modify:**
- `packages/core/src/llm/executor/stream-processor.ts`
- `packages/core/src/context/types.ts`

**Changes:**

1. Add a `reasoningMetadata` field to the `AssistantMessage` type:

```typescript
// In context/types.ts
interface AssistantMessage {
    reasoning?: string;
    reasoningMetadata?: Record<string, unknown>; // NEW - for provider round-tripping
    // ...
}
```

2. Capture `providerMetadata` from reasoning-delta events:

```typescript
// In stream-processor.ts, add field:
private reasoningMetadata: Record<string, unknown> | undefined;

// In the reasoning-delta case:
case 'reasoning-delta':
    this.reasoningText += event.text;
    // Capture provider metadata for round-tripping (OpenAI itemId, etc.)
    if (event.providerMetadata) {
        this.reasoningMetadata = event.providerMetadata;
    }
    // ... emit events
```

3. **Fix the bug** - persist reasoning in `updateAssistantMessage()`:

```typescript
// In stream-processor.ts, 'finish' case (around line 315):
if (this.assistantMessageId) {
    await this.contextManager.updateAssistantMessage(
        this.assistantMessageId,
        {
            tokenUsage: usage,
            reasoning: this.reasoningText || undefined, // ADD THIS
            reasoningMetadata: this.reasoningMetadata, // ADD THIS
        }
    );
}
```
### Phase 2: Fix Reasoning Round-Trip (Bug #2) ✅ COMPLETED

**Files to modify:**
- `packages/core/src/llm/formatters/vercel.ts`

**Changes:**

1. Update `formatAssistantMessage()` to include reasoning:

```typescript
// In formatAssistantMessage(), before returning:
if (msg.reasoning) {
    contentParts.push({
        type: 'reasoning',
        text: msg.reasoning,
        providerMetadata: msg.reasoningMetadata,
    });
}
```

**Verified:** The Vercel AI SDK's `AssistantContent` type supports `ReasoningPart`:

```typescript
// packages/provider-utils/src/types/assistant-model-message.ts
export type AssistantContent = string | Array<TextPart | FilePart | ReasoningPart | ...>;

// packages/provider-utils/src/types/content-part.ts
export interface ReasoningPart {
    type: 'reasoning';
    text: string;
    providerOptions?: ProviderOptions; // For round-tripping provider metadata
}
```
### Phase 3: Unified Context Calculation ✅ COMPLETED

**Files to modify:**
- `packages/core/src/context/manager.ts` - `getContextTokenEstimate()`
- `packages/core/src/llm/executor/turn-executor.ts` - compaction check
- `packages/cli/src/cli/ink-cli/components/overlays/ContextStatsOverlay.tsx`

**Changes:**

1. Create a shared `calculateContextUsage()` function:

```typescript
// New file: packages/core/src/context/context-calculator.ts
export async function calculateContextUsage(
    contextManager: ContextManager,
    tools: ToolDefinitions,
    maxContextTokens: number,
    outputBuffer: number
): Promise<ContextUsage> {
    // Implement the formula above
}
```

2. Use it in `/context`:

```typescript
// In DextoAgent.getContextStats()
const usage = await calculateContextUsage(...);
return usage;
```

3. Use it in the compaction decision:

```typescript
// In turn-executor.ts
const usage = await calculateContextUsage(...);
if (usage.total > compactionThreshold) {
    // Compact!
}
```
### Phase 4: Message-Level Token Tracking

**Already implemented!** We just need to use it:

```typescript
// In calculateContextUsage(), sum from messages.
// (Input tokens are NOT summed per message: each call's inputTokens
// already includes all prior history, so summing would double-count.)
const history = await contextManager.getHistory();
let totalOutputFromMessages = 0;
let totalReasoningFromMessages = 0;

for (const msg of history) {
    if (msg.role === 'assistant' && msg.tokenUsage) {
        totalOutputFromMessages += msg.tokenUsage.outputTokens ?? 0;
        totalReasoningFromMessages += msg.tokenUsage.reasoningTokens ?? 0;
    }
}
```
### Phase 5: Calibration & Logging

1. Log estimate vs actual on every LLM call (already done, level=info)
2. Track the calibration ratio over time
3. Consider adaptive estimation based on observed ratios
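Point 3 could be as simple as smoothing the observed actual/estimated ratio with an exponential moving average and applying it as a multiplier on future estimates. Everything below is a sketch under assumed names:

```typescript
// Sketch of adaptive calibration via an exponential moving average.
// alpha = 0.2 weights recent calls more heavily than older ones.
class EstimateCalibrator {
    private ratio = 1.0; // observed actual/estimated, smoothed

    record(estimated: number, actual: number, alpha = 0.2): void {
        if (estimated <= 0) return; // avoid divide-by-zero on empty estimates
        this.ratio = (1 - alpha) * this.ratio + alpha * (actual / estimated);
    }

    calibrate(estimate: number): number {
        return Math.round(estimate * this.ratio);
    }
}
```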
### Phase 6: Future - API Token Counting

**For Anthropic:**

```typescript
// New method in the Anthropic service
async countTokens(messages: Message[], tools: Tool[]): Promise<{
    input_tokens: number;
}>
```

**For other providers:**
- tiktoken for OpenAI
- Gemini countTokens API
- Fallback to estimation
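The dispatch could be a thin wrapper that prefers an exact counter when one is registered and degrades to length/4 otherwise. The Anthropic `count_tokens` endpoint and Gemini `countTokens` API are real, but the wrapper shape below is an assumption:

```typescript
// Sketch of provider dispatch with estimation fallback. The exact counters
// (Anthropic count_tokens, tiktoken, Gemini countTokens) are assumed to be
// registered elsewhere; only the length/4 fallback is concrete here.
type Counter = (text: string) => Promise<number>;

async function countTokensFor(
    provider: string,
    text: string,
    counters: Partial<Record<string, Counter>>,
): Promise<number> {
    const exact = counters[provider];
    if (exact) {
        try {
            return await exact(text);
        } catch {
            // Network/API failure: degrade gracefully to estimation
        }
    }
    return Math.round(text.length / 4); // length/4 fallback
}
```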
---

## Data Flow Diagram

### Current State (BROKEN)

```
┌─────────────────────────────────────────────────────────────────────┐
│ LLM Response Stream                                                 │
├─────────────────────────────────────────────────────────────────────┤
│ reasoning-delta events → reasoningText accumulated ✓                │
│ text-delta events → content accumulated ✓                           │
│ finish event → usage: { inputTokens, outputTokens, ... }            │
└─────────────────────────────────────────────────────────────────────┘
        │
        ▼
┌─────────────────────────────────────────────────────────────────────┐
│ stream-processor.ts updateAssistantMessage()                        │
├─────────────────────────────────────────────────────────────────────┤
│ await this.contextManager.updateAssistantMessage(                   │
│     this.assistantMessageId,                                        │
│     { tokenUsage: usage }    ← ONLY tokenUsage saved!               │
│ );                           ← reasoning NOT included! ✗            │
└─────────────────────────────────────────────────────────────────────┘
        │
        ▼
┌─────────────────────────────────────────────────────────────────────┐
│ AssistantMessage Stored                                             │
├─────────────────────────────────────────────────────────────────────┤
│ {                                                                   │
│   role: 'assistant',                                                │
│   content: [...],          ← ✓ Stored                               │
│   reasoning: undefined,    ← ✗ NEVER SET!                           │
│   tokenUsage: {...}        ← ✓ Stored                               │
│ }                                                                   │
└─────────────────────────────────────────────────────────────────────┘
```

### Target State (FIXED)

```
┌─────────────────────────────────────────────────────────────────────┐
│ LLM Response Stream                                                 │
├─────────────────────────────────────────────────────────────────────┤
│ reasoning-delta events → reasoningText + providerMetadata ✓         │
│ text-delta events → content accumulated ✓                           │
│ finish event → usage: { inputTokens, outputTokens, ... }            │
└─────────────────────────────────────────────────────────────────────┘
        │
        ▼
┌─────────────────────────────────────────────────────────────────────┐
│ stream-processor.ts updateAssistantMessage()                        │
├─────────────────────────────────────────────────────────────────────┤
│ await this.contextManager.updateAssistantMessage(                   │
│     this.assistantMessageId,                                        │
│     {                                                               │
│         tokenUsage: usage,                                          │
│         reasoning: this.reasoningText,             ← NEW            │
│         reasoningMetadata: this.reasoningMetadata  ← NEW            │
│     }                                                               │
│ );                                                                  │
└─────────────────────────────────────────────────────────────────────┘
        │
        ▼
┌─────────────────────────────────────────────────────────────────────┐
│ AssistantMessage Stored                                             │
├─────────────────────────────────────────────────────────────────────┤
│ {                                                                   │
│   role: 'assistant',                                                │
│   content: [...],                                                   │
│   reasoning: 'Let me think...',                    ← ✓ Now stored   │
│   reasoningMetadata: { openai: { itemId: '...' } } ← ✓ Round-trip   │
│   tokenUsage: { inputTokens, outputTokens, reasoningTokens }        │
│ }                                                                   │
└─────────────────────────────────────────────────────────────────────┘
        │
        ▼
┌─────────────────────────────────────────────────────────────────────┐
│ Next LLM Call (Formatter)                                           │
├─────────────────────────────────────────────────────────────────────┤
│ formatAssistantMessage() includes:                                  │
│   - content (text parts)           ✓ Already done                   │
│   - toolCalls                      ✓ Already done                   │
│   - reasoning + providerMetadata   ✓ NEW - enables round-trip       │
└─────────────────────────────────────────────────────────────────────┘
        │
        ▼
┌─────────────────────────────────────────────────────────────────────┐
│ /context Calculation                                                │
├─────────────────────────────────────────────────────────────────────┤
│ currentTotal = lastInput + lastOutput + newMessagesEstimate         │
│                                                                     │
│ Breakdown:                                                          │
│   systemPrompt = estimate (length/4)                                │
│   tools = estimate (length/4)                                       │
│   messages = currentTotal - systemPrompt - tools (back-calc)        │
│   reasoning = sum(msg.tokenUsage.reasoningTokens) (for display)     │
│                                                                     │
│ freeSpace = maxTokens - currentTotal - outputBuffer                 │
└─────────────────────────────────────────────────────────────────────┘
        │
        ▼
┌─────────────────────────────────────────────────────────────────────┐
│ Compaction Decision                                                 │
├─────────────────────────────────────────────────────────────────────┤
│ SAME FORMULA as /context!                                           │
│                                                                     │
│ if (currentTotal > compactionThreshold) {                           │
│     triggerCompaction();                                            │
│ }                                                                   │
└─────────────────────────────────────────────────────────────────────┘
```
---

## Testing Strategy

### Unit Tests

1. **Reasoning storage test (Phase 1)**
   - Mock an LLM stream with reasoning-delta events
   - Verify `stream-processor.ts` calls `updateAssistantMessage()` with reasoning
   - Verify `reasoningMetadata` is captured from `providerMetadata`

2. **Reasoning round-trip test (Phase 2)**
   - Create an `AssistantMessage` with `reasoning` and `reasoningMetadata`
   - Call `formatAssistantMessage()`
   - Verify the output contains a reasoning part with `providerMetadata`

3. **Token calculation test (Phase 3)**
   - Mock a message with known tokenUsage
   - Verify the calculation matches the expected value

4. **Edge case tests**
   - New session (no actuals) - falls back to estimation
   - Negative messagesDisplay (capped at 0)
   - Post-compaction state
   - Empty reasoning (should not create an empty reasoning part)
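Test 1 can be prototyped against a minimal in-memory model of the fix before wiring it to the real `stream-processor.ts`. Everything here (the event shapes, `MiniStreamProcessor`) is an illustrative stand-in, not the production API:

```typescript
// Self-contained sketch of the Phase 1 behavior under test: a minimal
// processor that accumulates reasoning deltas and persists them on finish.
type Update = {
    tokenUsage: unknown;
    reasoning?: string;
    reasoningMetadata?: Record<string, unknown>;
};

class MiniStreamProcessor {
    private reasoningText = '';
    private reasoningMetadata: Record<string, unknown> | undefined;
    public persisted: Update | null = null; // stands in for updateAssistantMessage()

    handle(event: { type: string; text?: string; providerMetadata?: Record<string, unknown>; usage?: unknown }): void {
        if (event.type === 'reasoning-delta') {
            this.reasoningText += event.text ?? '';
            if (event.providerMetadata) this.reasoningMetadata = event.providerMetadata;
        } else if (event.type === 'finish') {
            // The fix under test: persist reasoning alongside tokenUsage
            this.persisted = {
                tokenUsage: event.usage,
                reasoning: this.reasoningText || undefined,
                reasoningMetadata: this.reasoningMetadata,
            };
        }
    }
}
```

A real test would make the same assertions against `updateAssistantMessage()` via a mocked `ContextManager`.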
### Integration Tests

1. **Full reasoning flow test**
   - Enable extended thinking on Claude
   - Send a message that triggers reasoning
   - Verify reasoning is persisted to the message
   - Send a follow-up message
   - Verify reasoning is sent back to the LLM (check formatted messages)

2. **Token tracking test**
   - Send a message
   - Verify tokenUsage is stored on the message
   - Open /context
   - Verify the numbers use the actual from the last call

3. **Compaction alignment test**
   - Fill context near the threshold
   - Verify /context and compaction trigger at the same point

---

## Success Criteria

1. **Numbers add up**: Total = SystemPrompt + Tools + Messages
2. **Consistency**: /context and compaction use the same calculation
3. **Reasoning works**: traces are sent back to the LLM correctly
4. **Calibration visible**: logs show the estimate vs actual ratio
5. **Provider compatibility**: works with Anthropic, OpenAI, Google, etc.
---

## Appendix: Verification Against Other Implementations

*This plan was verified against actual implementations on 2025-01-20.*

### OpenCode Verification (~/Projects/external/opencode)

| Claim | Verified | Evidence |
|-------|----------|----------|
| Stores reasoning as `ReasoningPart` | ✅ | `message-v2.ts` lines 78-89 |
| Includes `providerMetadata` for round-tripping | ✅ | `message-v2.ts` lines 554-560 |
| `toModelMessage()` sends reasoning back | ✅ | `message-v2.ts` lines 435-569 |
| Tracks reasoning tokens separately | ✅ | `session/index.ts` line 432, schemas throughout |
| Handles provider-specific metadata | ✅ | `openai-responses-language-model.ts` lines 520-538 |

**OpenCode approach:** full round-trip of reasoning with provider metadata. This is our reference implementation.

### Gemini-CLI Verification (~/Projects/external/gemini-cli)

| Claim in Original Plan | Actual Behavior | Status |
|------------------------|-----------------|--------|
| "Parts with thought: true included when sending history back" | **WRONG** - they filter OUT thoughts at line 815 | ❌ Corrected |
| Uses `thought: true` flag | ✅ Correct | ✅ |
| Tracks `thoughtsTokenCount` | ✅ Correct - `chatRecordingService.ts` line 278 | ✅ |

**Gemini-CLI approach:** track thought tokens for cost/display but do NOT round-trip them.
This is simpler but requires Google-specific handling.

### Why We Follow OpenCode

1. **Same SDK**: both use the Vercel AI SDK
2. **Provider-agnostic**: works across all providers without special-casing
3. **Future-proof**: preserves metadata for providers that need it
4. **Simpler code**: no provider-specific filtering logic

### Dexto Implementation Verification

| Component | Current State | Bug |
|-----------|---------------|-----|
| `stream-processor.ts` | Accumulates `reasoningText` but doesn't persist it | **Bug #1** |
| `vercel.ts` formatter | Ignores `msg.reasoning` | **Bug #2** (blocked by #1) |
| `AssistantMessage` type | Has the `reasoning?: string` field | ✅ Ready |
| Per-message `tokenUsage` | Stored via `updateAssistantMessage()` | ✅ Working |
| `lastActualInputTokens` | Set after each LLM call | ✅ Working |
| Compaction calculation | Uses `estimateMessagesTokens()` only | Different from /context |
| `/context` calculation | Uses full estimation (system + tools + messages) | Different from compaction |
|