feat: Add intelligent auto-router and enhanced integrations

- Add intelligent-router.sh hook for automatic agent routing
- Add AUTO-TRIGGER-SUMMARY.md documentation
- Add FINAL-INTEGRATION-SUMMARY.md documentation
- Complete Prometheus integration (6 commands + 4 tools)
- Complete Dexto integration (12 commands + 5 tools)
- Enhanced Ralph with access to all agents
- Fix /clawd command (removed disable-model-invocation)
- Update hooks.json to v5 with intelligent routing
- 291 total skills now available
- All 21 commands with automatic routing

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

---

# Context Window Calculation Analysis
## Problem Statement
Our `/context` overlay shows inconsistent numbers:
- **Total shown**: 122.4k tokens (from API's actual count)
- **Breakdown sum**: ~73k tokens (our length/4 estimates)
- **Free space**: Calculated from breakdown, not actual total
This leads to confusing UX where numbers don't add up.
Additionally, our compaction decision uses a different calculation than `/context`, leading to inconsistency.
---
## Critical Finding #1: Reasoning Tokens Not Sent Back to LLM
### Current State (Dexto)
**We have the type but DON'T actually store reasoning:**
```typescript
// AssistantMessage in context/types.ts
interface AssistantMessage {
reasoning?: string; // Field EXISTS but is never populated!
tokenUsage?: TokenUsage;
// ...
}
```
**Two separate bugs:**
1. **`stream-processor.ts` never persists reasoning text:**
```typescript
// Line 24: Reasoning IS accumulated during streaming
private reasoningText: string = '';
// Lines 97-108: Accumulated from reasoning-delta events
case 'reasoning-delta':
this.reasoningText += event.text; // ✓ Collected
// BUT lines 314-320: Only tokenUsage is persisted!
await this.contextManager.updateAssistantMessage(
this.assistantMessageId,
{ tokenUsage: usage } // ✗ No reasoning field!
);
```
2. **`formatAssistantMessage()` in `vercel.ts` ignores `msg.reasoning`:**
- Only extracts `msg.content` (text parts) and `msg.toolCalls`
- Even if reasoning WAS stored, it wouldn't be sent back
**Result:** Reasoning is collected → emitted to events → but never persisted or round-tripped.
### How OpenCode Handles It (Correctly)
```typescript
// In toModelMessage() - opencode/src/session/message-v2.ts
if (part.type === "reasoning") {
assistantMessage.parts.push({
type: "reasoning",
text: part.text,
providerMetadata: part.metadata, // Critical for round-tripping!
})
}
```
OpenCode:
1. Stores reasoning as `ReasoningPart` in message parts
2. Includes `providerMetadata` (contains thought signatures for Gemini, etc.)
3. Sends reasoning back in `toModelMessage()` conversion
4. Tracks `reasoning` tokens separately in token usage
### How Gemini-CLI Handles It (Different Approach)
```typescript
// Uses thought: true flag on parts from model
{ text: 'Hmm', thought: true }
// BUT they explicitly FILTER OUT thoughts before storing in history!
// geminiChat.ts line 815:
modelResponseParts.push(
...content.parts.filter((part) => !part.thought), // Filter OUT thoughts
);
// Token tracking still captures thoughtsTokenCount from API response
// chatRecordingService.ts line 278:
tokens.thoughts = respUsageMetadata.thoughtsTokenCount ?? 0;
```
**Key difference:** Gemini-CLI tracks thought tokens for display/cost but does NOT round-trip them.
This works because Google's API doesn't require thought history for context continuity.
### Why We Follow OpenCode's Approach
1. **We use Vercel AI SDK** like OpenCode, not Google's native SDK
2. **Provider-agnostic**: OpenCode's approach works across all providers
3. **No provider-specific logic**: We shouldn't special-case Google's behavior
4. **Context continuity**: Some providers (especially via AI SDK) may need reasoning for proper state
### Impact of Current Bugs
1. **Context continuity broken**: Reasoning traces lost between turns
2. **Token counting incorrect**: Reasoning tokens used but not tracked in context
3. **Provider metadata lost**: Cannot round-trip provider-specific metadata (e.g., OpenAI item IDs)
---
## Critical Finding #2: Token Usage Storage
### What We Track
**Session Level** (`session-manager.ts`):
```typescript
sessionData.tokenUsage = {
inputTokens: 0,
outputTokens: 0,
reasoningTokens: 0,
cacheReadTokens: 0,
cacheWriteTokens: 0,
totalTokens: 0,
};
```
**Message Level** (`AssistantMessage`):
```typescript
interface AssistantMessage {
tokenUsage?: TokenUsage; // Available but...
}
```
### Current Flow
1. `stream-processor.ts` creates assistant message with empty metadata:
```typescript
await this.contextManager.addAssistantMessage('', [], {});
```
2. After streaming completes, we DO update with token usage:
```typescript
await this.contextManager.updateAssistantMessage(
this.assistantMessageId,
{ tokenUsage: usage }
);
```
**So we HAVE the data on each message**; we just don't use it for context calculation!
---
## Critical Finding #3: Estimate vs Actual Mismatch
### The Problem
```
API actual inputTokens: 122.4k
Our length/4 estimate: 73.0k
Difference: 49.4k (actual is ~68% higher than our estimate!)
```
### Why So Different?
1. **Tokenizers don't split evenly by characters** (see the illustration after this list)
- Code tokenizes differently than prose
- JSON schemas are verbose when tokenized
- Special characters, whitespace handling varies
2. **We're comparing different things**
- `actualTokens` = from last LLM call (includes everything sent)
- `breakdown estimate` = calculated now on current history
3. **Context has grown since last call**
- Last call's `inputTokens` doesn't include the response that followed
- New user messages added since
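To make point 1 concrete, here is a rough illustration (not a real tokenizer; the schema string and numbers are only indicative):
```typescript
// Why length/4 skews low for structured content: JSON is punctuation-dense,
// and most BPE tokenizers emit separate tokens for braces, quotes, colons,
// and key fragments, so tokens-per-char runs well above 1/4.
const schema = '{"type":"object","properties":{"city":{"type":"string"}},"required":["city"]}';

const lengthEstimate = Math.round(schema.length / 4); // 77 chars -> ~19 tokens
console.log({ chars: schema.length, lengthEstimate });
// Real tokenizers typically produce noticeably more than this for schema-heavy
// strings; prose tracks the length/4 heuristic far better than JSON does.
```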
---
## How Other Tools Handle This
### Claude Code (Anthropic)
**Uses `/v1/messages/count_tokens` API for exact counts!**
```javascript
// From cli.js (minified)
countTokens(A,Q) {
return this._client.post("/v1/messages/count_tokens", { body: A, ...Q })
}
```
**Categories tracked:**
- System prompt
- System tools
- Memory files
- Skills
- MCP tools (with deferred loading)
- Agents
- Messages (with sub-breakdown)
- Free space
- Autocompact buffer
**Free space calculation:**
```javascript
// YA = sum of all category tokens (excluding deferred)
let YA = k.reduce((CA, _A) => CA + (_A.isDeferred ? 0 : _A.tokens), 0)
// WA = buffer (autocompact or compact)
let WA = autocompactEnabled ? (maxTokens - contextUsed) : 500;
// Free space
let wA = Math.max(0, maxTokens - YA - WA)
```
### gemini-cli
**Hybrid approach:**
```typescript
// Sync estimation (fast)
estimateTokenCountSync(parts): number {
// ASCII: ~4 chars per token (0.25 tokens/char)
// Non-ASCII/CJK: more than one token per char (~1.3 tokens/char)
}
// API counting (when needed)
if (hasMedia) {
use Gemini countTokens API
} else {
use sync estimation
}
```
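As a minimal sketch of that sync heuristic (the function body and exact thresholds are our paraphrase, not gemini-cli's actual code):
```typescript
// Paraphrase of the sync heuristic described above - illustrative only.
function estimateTokenCountSync(text: string): number {
  let tokens = 0;
  for (const ch of text) {
    // ASCII: ~0.25 tokens/char; non-ASCII (e.g., CJK): ~1.3 tokens/char
    tokens += ch.charCodeAt(0) < 128 ? 0.25 : 1.3;
  }
  return Math.ceil(tokens);
}

estimateTokenCountSync('hello world');    // 11 chars -> 3
estimateTokenCountSync('こんにちは世界');  // 7 chars -> 10
```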
**Token tracking from API response:**
```typescript
{
input: promptTokenCount,
output: candidatesTokenCount,
cached: cachedContentTokenCount,
thoughts: thoughtsTokenCount, // Reasoning!
tool: toolUsePromptTokenCount,
total: totalTokenCount
}
```
### opencode
**Simple estimation + detailed tracking:**
```typescript
Token.estimate(input: string): number {
return Math.round(input.length / 4)
}
// But tracks actuals per message:
StepFinishPart {
tokens: {
input: number,
output: number,
reasoning: number,
cache: { read: number, write: number }
}
}
```
---
## Current Architecture Issues
### 1. Reasoning Pipeline (BROKEN - Two Bugs)
**Current (broken):**
```
LLM Response → reasoning-delta events received
stream-processor.ts → accumulates reasoningText ✓
updateAssistantMessage() → ONLY saves tokenUsage, NOT reasoning ✗
AssistantMessage.reasoning = undefined (never set!)
formatAssistantMessage() → has nothing to format anyway
Reasoning NOT sent back to LLM ❌
```
**Should be (following OpenCode):**
```
LLM Response → reasoning-delta events received (with providerMetadata)
stream-processor.ts → accumulates reasoningText AND reasoningMetadata
updateAssistantMessage() → saves reasoning + reasoningMetadata + tokenUsage
AssistantMessage.reasoning = "thinking..." ✓
AssistantMessage.reasoningMetadata = { openai: { itemId: "..." } } ✓
formatAssistantMessage() → includes reasoning part with providerMetadata
Reasoning sent back to LLM ✓
```
### 2. Token Calculation (/context)
**Current:**
```typescript
// Uses length/4 estimate for everything
systemPromptTokens = estimateStringTokens(systemPrompt); // length/4
messagesTokens = estimateMessagesTokens(preparedHistory); // length/4
toolsTokens = estimateToolTokens(tools); // length/4
total = systemPromptTokens + messagesTokens + toolsTokens;
freeSpace = maxTokens - total - outputBuffer;
```
**Problem:** Total doesn't match API's actual count.
### 3. Compaction Decision
**Current (`turn-executor.ts`):**
```typescript
const estimatedTokens = estimateMessagesTokens(prepared.preparedHistory);
if (estimatedTokens > compactionThreshold) {
// Compact!
}
```
**Problem:** Uses different calculation than `/context`, and both are wrong!
---
## Proposed Solution
### Principle: Single Source of Truth
1. **Use actual token counts from API as ground truth**
2. **Track tokens per message for accurate history calculation**
3. **Estimate only what we cannot measure**
4. **Same formula for `/context` AND compaction decisions**
---
## THE FORMULA (Precise Specification)
### Core Formula
```
estimatedNextInput = lastInputTokens + lastOutputTokens + newMessagesEstimate
```
### Variable Definitions
| Variable | Definition | Source | When Updated |
|----------|------------|--------|--------------|
| `lastInputTokens` | Tokens we SENT in the most recent LLM call | `tokenUsage.inputTokens` from API response | After EVERY LLM call |
| `lastOutputTokens` | Tokens the LLM RETURNED in its response | `tokenUsage.outputTokens` from API response | After EVERY LLM call |
| `newMessagesEstimate` | Estimate for messages added AFTER the last LLM call | `length/4` heuristic | Calculated on demand |
### What Counts as "New Messages"?
Messages added to history AFTER `lastInputTokens` was recorded:
- **Tool results** (role='tool') from the last assistant's tool calls
- **New user messages** typed since last LLM call
- **Any injected system messages** added between calls
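Putting the formula and its fallback together as a minimal sketch (names and shapes are illustrative, not actual Dexto code):
```typescript
// Core formula: estimatedNextInput = lastInputTokens + lastOutputTokens + newMessagesEstimate
function estimateNextInput(
  lastInputTokens: number | null,  // from the most recent API response; null if no call yet
  lastOutputTokens: number | null,
  newMessages: string[]            // serialized messages added since that call
): number {
  const estimate = (s: string) => Math.round(s.length / 4);
  const newMessagesEstimate = newMessages.reduce((sum, m) => sum + estimate(m), 0);

  // Edge case: no LLM call yet (or reset after compaction) - pure estimation
  if (lastInputTokens === null || lastOutputTokens === null) {
    return newMessagesEstimate;
  }
  return lastInputTokens + lastOutputTokens + newMessagesEstimate;
}
```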
### Example Flow
```
┌─────────────────────────────────────────────────────────────────┐
│ Turn 1: User asks "What's the weather in NYC?" │
├─────────────────────────────────────────────────────────────────┤
│ LLM Call: │
│ inputTokens = 5000 (system + tools + user message) │
│ outputTokens = 100 (assistant: "I'll check" + tool_call) │
│ │
│ After call: UPDATE lastInputTokens=5000, lastOutputTokens=100 │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ Tool executes, result added to history │
│ Tool result: "NYC: 72°F, sunny" (role='tool') │
│ │
│ This is a NEW MESSAGE (added after lastInputTokens recorded) │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ Before Turn 2: Calculate estimated context │
├─────────────────────────────────────────────────────────────────┤
│ lastInputTokens = 5000 (from Turn 1) │
│ lastOutputTokens = 100 (from Turn 1) │
│ newMessagesEstimate = estimate(tool_result) ≈ 20 │
│ │
│ estimatedNextInput = 5000 + 100 + 20 = 5120 │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ Turn 2: LLM processes tool result │
├─────────────────────────────────────────────────────────────────┤
│ LLM Call: │
│ inputTokens = 5115 (ACTUAL - this is our ground truth!) │
│ outputTokens = 50 (assistant: "The weather is 72°F...") │
│ │
│ VERIFICATION: estimated=5120, actual=5115, error=+5 (+0.1%) │
│ │
│ After call: UPDATE lastInputTokens=5115, lastOutputTokens=50 │
└─────────────────────────────────────────────────────────────────┘
```
### Verification Metrics
On EVERY LLM call, log the accuracy of our previous estimate:
```typescript
// Before LLM call
const estimated = lastInputTokens + lastOutputTokens + newMessagesEstimate;
// After LLM call, compare to actual
const actual = response.tokenUsage.inputTokens;
const error = estimated - actual;
const errorPercent = (error / actual) * 100;
logger.info(`Context estimate: estimated=${estimated}, actual=${actual}, error=${error > 0 ? '+' : ''}${error} (${errorPercent.toFixed(1)}%)`);
```
### Breakdown for Display (Back-Calculation)
For `/context` overlay, we show a breakdown. Since we only know the TOTAL accurately, we back-calculate messages:
```typescript
const total = lastInputTokens + lastOutputTokens + newMessagesEstimate;
// These are estimates (we can't measure them directly)
const systemPromptEstimate = estimateTokens(systemPrompt); // length/4
const toolsEstimate = estimateToolsTokens(tools); // length/4
// Back-calculate messages so the math adds up
let messagesDisplay = total - systemPromptEstimate - toolsEstimate;
// If negative, our estimates are too high - cap at 0 and log warning
if (messagesDisplay < 0) {
logger.warn(`Back-calculated messages negative (${messagesDisplay}), estimates may be too high`);
messagesDisplay = 0;
}
```
### Edge Cases
| Scenario | Behavior |
|----------|----------|
| **No LLM call yet** | `lastInputTokens=null`, fall back to pure estimation, show "(estimated)" label |
| **After compaction** | History changed significantly, set `lastInputTokens=null`, fall back to estimation until next call |
| **messagesDisplay negative** | Cap at 0, log warning - indicates system/tools estimates too high |
| **System prompt changed** | Next estimate may be off, but next actual will correct it |
| **Tools changed (MCP)** | Same as above - self-correcting after next call |
### What /context Should Display
```
Context Usage: 52,100 / 200,000 tokens (26%)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Breakdown:
System prompt: 4,000 tokens (estimated)
Tools: 8,000 tokens (estimated)
Messages: 40,100 tokens (back-calculated)
─────────────────────────────
Total: 52,100 tokens
Calculation basis:
Last actual input: 50,000 tokens
Last output: 2,000 tokens
New since then: 100 tokens (estimated)
Last estimate accuracy: +0.6% error
Free space: 131,900 tokens (after 16,000 output buffer)
```
### Implementation Checklist
- [ ] Store `lastInputTokens` and `lastOutputTokens` after each LLM call
- [ ] Track which messages are "new" since last LLM call (needs timestamp or index tracking; see the sketch after this list)
- [ ] Calculate `newMessagesEstimate` only for messages added after last call
- [ ] Log verification metrics on every LLM call
- [ ] Update `/context` overlay to show this breakdown
- [ ] Handle edge cases (no call yet, after compaction)
- [ ] Use SAME formula for compaction decisions
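For the index-tracking item above, one possible shape (field names are hypothetical):
```typescript
// Hypothetical: record the history length at each LLM call; anything past
// that index is "new" and gets the length/4 estimate.
interface ContextTokenState {
  lastInputTokens: number | null;   // null until first call; reset on compaction
  lastOutputTokens: number | null;
  historyLengthAtLastCall: number;
}

function newMessagesSince<T>(history: T[], state: ContextTokenState): T[] {
  return history.slice(state.historyLengthAtLastCall);
}

// After every LLM call:
//   state.lastInputTokens = usage.inputTokens;
//   state.lastOutputTokens = usage.outputTokens;
//   state.historyLengthAtLastCall = history.length;
// After compaction: null out the actuals and fall back to estimation.
```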
---
### Legacy Edge Cases (keeping for reference)
1. **No LLM call yet (new session)**
- Fall back to pure estimation
- All numbers are estimates with "(estimated)" label
2. **messagesDisplay comes out negative**
- Our estimates for system/tools are too high
- Cap at 0, log warning
- Indicates estimation needs calibration
3. **After compaction**
- Token counts reset with new session
- `compactionCount` tracks how many times compacted
4. **Reasoning tokens**
- Must be sent back to LLM (fix formatter) ✅ DONE
- Include in context calculation
- Track separately for display
### Verification: Why `lastOutputTokens` Is Safe to Use Directly
*Verified on 2025-01-20 by analyzing AI SDK source code and our codebase*
**Question:** Does `outputTokens` include content that might be pruned before the next LLM call?
**Answer:** No. `outputTokens` is safe to use directly because:
#### Part 1: What does `outputTokens` include? (AI SDK Verification)
**Anthropic** - verified via `ai/packages/anthropic/src/__fixtures__/anthropic-json-tool.1.chunks.txt`:
```json
{"type":"message_delta","delta":{"stop_reason":"tool_use"},"usage":{"output_tokens":47}}
```
Tool call response reports `output_tokens: 47` - **includes tool calls** ✅
**OpenAI** - verified via `ai/packages/openai/src/responses/__fixtures__/openai-shell-tool.1.chunks.txt`:
```json
{"output":[{"type":"shell_call","action":{"commands":["ls -a ~/Desktop"]}}],"usage":{"output_tokens":41}}
```
Shell tool call reports `output_tokens: 41` - **includes tool calls** ✅
**Google** - verified via `ai/packages/google/src/google-generative-ai-language-model.test.ts` lines 2274-2302:
```typescript
content: { parts: [{ functionCall: { name: 'test-tool', args: { value: 'test' } } }] },
usageMetadata: { promptTokenCount: 10, candidatesTokenCount: 20, totalTokenCount: 30 }
```
Function call response reports `candidatesTokenCount: 20` - **includes tool calls** ✅
#### Part 2: What gets pruned in our system?
From `manager.ts` `prepareHistory()`:
- Only **tool result messages** (role='tool') can be pruned
- They're marked with `compactedAt` timestamp
- Replaced with placeholder: `[Old tool result content cleared]`
**What is NEVER pruned:**
- Assistant messages (text content)
- Assistant's tool calls
- User messages
#### Verification Table
| Message Type | Pruned? | Part of outputTokens? |
|-------------|---------|----------------------|
| Assistant text | ❌ Never | ✅ Yes |
| Assistant tool calls | ❌ Never | ✅ Yes (verified across all providers) |
| Tool results (role='tool') | ✅ Can be pruned | ❌ No (separate messages) |
#### Code Evidence
- `stream-processor.ts`: Tool calls stored via `addToolCall()` with full arguments
- `manager.ts` line 279: Only `msg.role === 'tool' && msg.compactedAt` gets placeholder
- No code path exists to prune assistant messages
**Conclusion:** The formula `lastInputTokens + lastOutputTokens + newMessagesEstimate` is correct because:
- `lastInputTokens` reflects pruned history (API tells us exactly what was sent)
- `lastOutputTokens` is the assistant's response (text + tool calls) which is stored and sent back as-is
- All major providers (Anthropic, OpenAI, Google) include tool calls in their output token counts
- Only tool results (separate messages) can be pruned, and those are in `inputTokens`
---
## Implementation Plan
### Phase 1: Fix Reasoning Storage (HIGH PRIORITY - Bug #1) ✅ COMPLETED
**The root cause:** `stream-processor.ts` collects reasoning but never persists it.
**Files to modify:**
- `packages/core/src/llm/executor/stream-processor.ts`
- `packages/core/src/context/types.ts`
**Changes:**
1. Add `reasoningMetadata` field to `AssistantMessage` type:
```typescript
// In context/types.ts
interface AssistantMessage {
reasoning?: string;
reasoningMetadata?: Record<string, unknown>; // NEW - for provider round-tripping
// ...
}
```
2. Capture `providerMetadata` from reasoning-delta events:
```typescript
// In stream-processor.ts, add field:
private reasoningMetadata: Record<string, unknown> | undefined;
// In reasoning-delta case:
case 'reasoning-delta':
this.reasoningText += event.text;
// Capture provider metadata for round-tripping (OpenAI itemId, etc.)
if (event.providerMetadata) {
this.reasoningMetadata = event.providerMetadata;
}
// ... emit events
```
3. **Fix the bug** - persist reasoning in `updateAssistantMessage()`:
```typescript
// In stream-processor.ts, 'finish' case (around line 315):
if (this.assistantMessageId) {
await this.contextManager.updateAssistantMessage(
this.assistantMessageId,
{
tokenUsage: usage,
reasoning: this.reasoningText || undefined, // ADD THIS
reasoningMetadata: this.reasoningMetadata, // ADD THIS
}
);
}
```
### Phase 2: Fix Reasoning Round-Trip (Bug #2) ✅ COMPLETED
**Files to modify:**
- `packages/core/src/llm/formatters/vercel.ts`
**Changes:**
1. Update `formatAssistantMessage()` to include reasoning:
```typescript
// In formatAssistantMessage(), before returning:
if (msg.reasoning) {
contentParts.push({
type: 'reasoning',
text: msg.reasoning,
providerOptions: msg.reasoningMetadata, // ReasoningPart uses providerOptions (see type below)
});
}
```
**Verified:** Vercel AI SDK's `AssistantContent` type supports `ReasoningPart`:
```typescript
// packages/provider-utils/src/types/assistant-model-message.ts
export type AssistantContent = string | Array<TextPart | FilePart | ReasoningPart | ...>;
// packages/provider-utils/src/types/content-part.ts
export interface ReasoningPart {
type: 'reasoning';
text: string;
providerOptions?: ProviderOptions; // For round-tripping provider metadata
}
```
### Phase 3: Unified Context Calculation ✅ COMPLETED
**Files to modify:**
- `packages/core/src/context/manager.ts` - `getContextTokenEstimate()`
- `packages/core/src/llm/executor/turn-executor.ts` - compaction check
- `packages/cli/src/cli/ink-cli/components/overlays/ContextStatsOverlay.tsx`
**Changes:**
1. Create shared `calculateContextUsage()` function:
```typescript
// New file: packages/core/src/context/context-calculator.ts
export async function calculateContextUsage(
contextManager: ContextManager,
tools: ToolDefinitions,
maxContextTokens: number,
outputBuffer: number
): Promise<ContextUsage> {
// Implement the formula above
}
```
2. Use in `/context`:
```typescript
// In DextoAgent.getContextStats()
const usage = await calculateContextUsage(...);
return usage;
```
3. Use in compaction decision:
```typescript
// In turn-executor.ts
const usage = await calculateContextUsage(...);
if (usage.total > compactionThreshold) {
// Compact!
}
```
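The `ContextUsage` return type isn't pinned down above; one possible shape (field names are assumptions):
```typescript
// Hypothetical result shape for the shared calculator.
interface ContextUsage {
  total: number;          // lastInput + lastOutput + newMessagesEstimate
  breakdown: {
    systemPrompt: number; // length/4 estimate
    tools: number;        // length/4 estimate
    messages: number;     // back-calculated: max(0, total - systemPrompt - tools)
  };
  freeSpace: number;      // maxContextTokens - total - outputBuffer
  isEstimated: boolean;   // true when no actuals yet (new session / post-compaction)
}
```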
### Phase 4: Message-Level Token Tracking
**Already implemented!** We just need to use it:
```typescript
// In calculateContextUsage(), sum from messages:
const history = await contextManager.getHistory();
// Note: per-message inputTokens are NOT summed - each call's inputTokens already covers the full history, so summing would double-count.
let totalOutputFromMessages = 0;
let totalReasoningFromMessages = 0;
for (const msg of history) {
if (msg.role === 'assistant' && msg.tokenUsage) {
totalOutputFromMessages += msg.tokenUsage.outputTokens ?? 0;
totalReasoningFromMessages += msg.tokenUsage.reasoningTokens ?? 0;
}
}
```
### Phase 5: Calibration & Logging
1. Log estimate vs actual on every LLM call (already done, level=info)
2. Track calibration ratio over time
3. Consider adaptive estimation based on observed ratios (see the sketch below)
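A sketch of what adaptive estimation could look like (a design option, not committed code): keep an exponential moving average of actual/estimated and scale the length/4 heuristic by it.
```typescript
// Hypothetical calibrator: EMA of (actual / estimated) applied to future estimates.
class EstimateCalibrator {
  private ratio = 1.0;                  // start uncorrected
  constructor(private alpha = 0.2) {}   // EMA smoothing factor

  record(estimated: number, actual: number): void {
    if (estimated > 0) {
      this.ratio = this.alpha * (actual / estimated) + (1 - this.alpha) * this.ratio;
    }
  }

  calibrate(rawEstimate: number): number {
    return Math.round(rawEstimate * this.ratio);
  }
}

// After each LLM call: calibrator.record(estimatedNextInput, usage.inputTokens);
// When estimating:     calibrator.calibrate(newMessagesEstimate);
```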
### Phase 6: Future - API Token Counting
**For Anthropic:**
```typescript
// New method in Anthropic service
async countTokens(messages: Message[], tools: Tool[]): Promise<{
input_tokens: number;
}>
```
**For other providers:**
- tiktoken for OpenAI
- Gemini countTokens API
- Fallback to estimation
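For the Anthropic path, a sketch using `@anthropic-ai/sdk`'s `messages.countTokens()` (the wrapper function and wiring are assumptions; the endpoint is the same `/v1/messages/count_tokens` Claude Code uses above):
```typescript
import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

// Exact input-token count for a prospective request, via /v1/messages/count_tokens.
async function countInputTokens(
  model: string,
  system: string,
  messages: Anthropic.MessageParam[],
  tools: Anthropic.Tool[]
): Promise<number> {
  const result = await client.messages.countTokens({ model, system, messages, tools });
  return result.input_tokens;
}

// Usage (model name illustrative):
// const n = await countInputTokens('claude-sonnet-4-20250514', systemPrompt, history, toolDefs);
```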
---
## Data Flow Diagram
### Current State (BROKEN)
```
┌─────────────────────────────────────────────────────────────────────┐
│ LLM Response Stream │
├─────────────────────────────────────────────────────────────────────┤
│ reasoning-delta events → reasoningText accumulated ✓ │
│ text-delta events → content accumulated ✓ │
│ finish event → usage: { inputTokens, outputTokens, ... } │
└─────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────┐
│ stream-processor.ts updateAssistantMessage() │
├─────────────────────────────────────────────────────────────────────┤
│ await this.contextManager.updateAssistantMessage( │
│ this.assistantMessageId, │
│ { tokenUsage: usage } ← ONLY tokenUsage saved! │
│ ); ← reasoning NOT included! ✗ │
└─────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────┐
│ AssistantMessage Stored │
├─────────────────────────────────────────────────────────────────────┤
│ { │
│ role: 'assistant', │
│ content: [...], ← ✓ Stored │
│ reasoning: undefined, ← ✗ NEVER SET! │
│ tokenUsage: {...} ← ✓ Stored │
│ } │
└─────────────────────────────────────────────────────────────────────┘
```
### Target State (FIXED)
```
┌─────────────────────────────────────────────────────────────────────┐
│ LLM Response Stream │
├─────────────────────────────────────────────────────────────────────┤
│ reasoning-delta events → reasoningText + providerMetadata ✓ │
│ text-delta events → content accumulated ✓ │
│ finish event → usage: { inputTokens, outputTokens, ... } │
└─────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────┐
│ stream-processor.ts updateAssistantMessage() │
├─────────────────────────────────────────────────────────────────────┤
│ await this.contextManager.updateAssistantMessage( │
│ this.assistantMessageId, │
│ { │
│ tokenUsage: usage, │
│ reasoning: this.reasoningText, ← NEW │
│ reasoningMetadata: this.reasoningMetadata ← NEW │
│ } │
│ ); │
└─────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────┐
│ AssistantMessage Stored │
├─────────────────────────────────────────────────────────────────────┤
│ { │
│ role: 'assistant', │
│ content: [...], │
│ reasoning: 'Let me think...', ← ✓ Now stored │
│   reasoningMetadata: { openai: { itemId: '...' } }, ← ✓ For round-trip │
│ tokenUsage: { inputTokens, outputTokens, reasoningTokens } │
│ } │
└─────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────┐
│ Next LLM Call (Formatter) │
├─────────────────────────────────────────────────────────────────────┤
│ formatAssistantMessage() includes: │
│ - content (text parts) ✓ Already done │
│ - toolCalls ✓ Already done │
│ - reasoning + providerMetadata ✓ NEW - enables round-trip │
└─────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────┐
│ /context Calculation │
├─────────────────────────────────────────────────────────────────────┤
│ currentTotal = lastInput + lastOutput + newMessagesEstimate │
│ │
│ Breakdown: │
│ systemPrompt = estimate (length/4) │
│ tools = estimate (length/4) │
│ messages = currentTotal - systemPrompt - tools (back-calc) │
│ reasoning = sum(msg.tokenUsage.reasoningTokens) (for display) │
│ │
│ freeSpace = maxTokens - currentTotal - outputBuffer │
└─────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────┐
│ Compaction Decision │
├─────────────────────────────────────────────────────────────────────┤
│ SAME FORMULA as /context! │
│ │
│ if (currentTotal > compactionThreshold) { │
│ triggerCompaction(); │
│ } │
└─────────────────────────────────────────────────────────────────────┘
```
---
## Testing Strategy
### Unit Tests
1. **Reasoning storage test (Phase 1)**
- Mock LLM stream with reasoning-delta events
- Verify `stream-processor.ts` calls `updateAssistantMessage()` with reasoning
- Verify `reasoningMetadata` is captured from `providerMetadata`
2. **Reasoning round-trip test (Phase 2)**
- Create `AssistantMessage` with `reasoning` and `reasoningMetadata`
- Call `formatAssistantMessage()`
- Verify output contains reasoning part with `providerMetadata`
3. **Token calculation test (Phase 3)**
- Mock message with known tokenUsage
- Verify calculation matches expected (see the sketch after this list)
4. **Edge case tests**
- New session (no actuals) - falls back to estimation
- Negative messagesDisplay (capped at 0)
- Post-compaction state
- Empty reasoning (should not create empty reasoning part)
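A sketch of the token-calculation test from item 3, assuming Vitest and pure helpers that mirror the formula (the helpers are stand-ins, not the real `calculateContextUsage`):
```typescript
import { describe, it, expect } from 'vitest';

// Stand-in helpers mirroring the formula and back-calculation logic.
const computeTotal = (lastInput: number, lastOutput: number, newEstimate: number) =>
  lastInput + lastOutput + newEstimate;

const backCalcMessages = (total: number, systemEst: number, toolsEst: number) =>
  Math.max(0, total - systemEst - toolsEst);

describe('context calculation', () => {
  it('matches the worked example from the formula section', () => {
    expect(computeTotal(5000, 100, 20)).toBe(5120);
  });

  it('caps back-calculated messages at zero when estimates run high', () => {
    expect(backCalcMessages(1000, 800, 400)).toBe(0);
  });
});
```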
### Integration Tests
1. **Full reasoning flow test**
- Enable extended thinking on Claude
- Send message that triggers reasoning
- Verify reasoning persisted to message
- Send follow-up message
- Verify reasoning sent back to LLM (check formatted messages)
2. **Token tracking test**
- Send message
- Verify tokenUsage stored on message
- Open /context
- Verify numbers use actual from last call
3. **Compaction alignment test**
- Fill context near threshold
- Verify /context and compaction trigger at same point
---
## Success Criteria
1. **Numbers add up**: Total = SystemPrompt + Tools + Messages
2. **Consistency**: /context and compaction use same calculation
3. **Reasoning works**: Traces sent back to LLM correctly
4. **Calibration visible**: Logs show estimate vs actual ratio
5. **Provider compatibility**: Works with Anthropic, OpenAI, Google, etc.
---
## Appendix: Verification Against Other Implementations
*This plan was verified against actual implementations on 2025-01-20.*
### OpenCode Verification (~/Projects/external/opencode)
| Claim | Verified | Evidence |
|-------|----------|----------|
| Stores reasoning as `ReasoningPart` | ✅ | `message-v2.ts` lines 78-89 |
| Includes `providerMetadata` for round-tripping | ✅ | `message-v2.ts` lines 554-560 |
| `toModelMessage()` sends reasoning back | ✅ | `message-v2.ts` lines 435-569 |
| Tracks reasoning tokens separately | ✅ | `session/index.ts` line 432, schemas throughout |
| Handles provider-specific metadata | ✅ | `openai-responses-language-model.ts` lines 520-538 |
**OpenCode approach:** Full round-trip of reasoning with provider metadata. This is our reference implementation.
### Gemini-CLI Verification (~/Projects/external/gemini-cli)
| Claim in Original Plan | Actual Behavior | Status |
|------------------------|-----------------|--------|
| "Parts with thought: true included when sending history back" | **WRONG** - They filter OUT thoughts at line 815 | ❌ Corrected |
| Uses `thought: true` flag | ✅ Correct | ✅ |
| Tracks `thoughtsTokenCount` | ✅ Correct - `chatRecordingService.ts` line 278 | ✅ |
**Gemini-CLI approach:** Track thought tokens for cost/display but do NOT round-trip them.
This is a simpler approach but requires Google-specific handling.
### Why We Follow OpenCode
1. **Same SDK**: Both use Vercel AI SDK
2. **Provider-agnostic**: Works across all providers without special-casing
3. **Future-proof**: Preserves metadata for providers that need it
4. **Simpler code**: No provider-specific filtering logic
### Dexto Implementation Verification
| Component | Current State | Bug |
|-----------|---------------|-----|
| `stream-processor.ts` | Accumulates `reasoningText` but doesn't persist | **Bug #1** |
| `vercel.ts` formatter | Ignores `msg.reasoning` | **Bug #2** (blocked by #1) |
| `AssistantMessage` type | Has `reasoning?: string` field | ✅ Ready |
| Per-message `tokenUsage` | Stored via `updateAssistantMessage()` | ✅ Working |
| `lastActualInputTokens` | Set after each LLM call | ✅ Working |
| Compaction calculation | Uses `estimateMessagesTokens()` only | Different from /context |
| `/context` calculation | Uses full estimation (system + tools + messages) | Different from compaction |