# Context Window Calculation Analysis
## Problem Statement

Our `/context` overlay shows inconsistent numbers:

- **Total shown**: 122.4k tokens (from the API's actual count)
- **Breakdown sum**: ~73k tokens (our length/4 estimates)
- **Free space**: calculated from the breakdown, not the actual total

This leads to a confusing UX where the numbers don't add up.

Additionally, our compaction decision uses a different calculation than `/context`, adding another inconsistency.

---
## Critical Finding #1: Reasoning Tokens Not Sent Back to the LLM

### Current State (Dexto)

**We have the type but DON'T actually store reasoning:**

```typescript
// AssistantMessage in context/types.ts
interface AssistantMessage {
    reasoning?: string; // Field EXISTS but is never populated!
    tokenUsage?: TokenUsage;
    // ...
}
```

**Two separate bugs:**

1. **`stream-processor.ts` never persists reasoning text:**

```typescript
// Line 24: reasoning IS accumulated during streaming
private reasoningText: string = '';

// Lines 97-108: accumulated from reasoning-delta events
case 'reasoning-delta':
    this.reasoningText += event.text; // ✓ Collected

// BUT lines 314-320: only tokenUsage is persisted!
await this.contextManager.updateAssistantMessage(
    this.assistantMessageId,
    { tokenUsage: usage } // ✗ No reasoning field!
);
```

2. **`formatAssistantMessage()` in `vercel.ts` ignores `msg.reasoning`:**
   - Only extracts `msg.content` (text parts) and `msg.toolCalls`
   - Even if reasoning WAS stored, it wouldn't be sent back

**Result:** Reasoning is collected → emitted to events → but never persisted or round-tripped.
### How OpenCode Handles It (Correctly)

```typescript
// In toModelMessage() - opencode/src/session/message-v2.ts
if (part.type === "reasoning") {
    assistantMessage.parts.push({
        type: "reasoning",
        text: part.text,
        providerMetadata: part.metadata, // Critical for round-tripping!
    })
}
```

OpenCode:
1. Stores reasoning as `ReasoningPart` in message parts
2. Includes `providerMetadata` (contains thought signatures for Gemini, etc.)
3. Sends reasoning back in the `toModelMessage()` conversion
4. Tracks `reasoning` tokens separately in token usage

### How Gemini-CLI Handles It (Different Approach)

```typescript
// Uses a thought: true flag on parts from the model
{ text: 'Hmm', thought: true }

// BUT they explicitly FILTER OUT thoughts before storing in history!
// geminiChat.ts line 815:
modelResponseParts.push(
    ...content.parts.filter((part) => !part.thought), // Filter OUT thoughts
);

// Token tracking still captures thoughtsTokenCount from the API response
// chatRecordingService.ts line 278:
tokens.thoughts = respUsageMetadata.thoughtsTokenCount ?? 0;
```

**Key difference:** Gemini-CLI tracks thought tokens for display/cost but does NOT round-trip them.
This works because Google's API doesn't require thought history for context continuity.
### Why We Follow OpenCode's Approach

1. **We use the Vercel AI SDK** like OpenCode, not Google's native SDK
2. **Provider-agnostic**: OpenCode's approach works across all providers
3. **No provider-specific logic**: we shouldn't special-case Google's behavior
4. **Context continuity**: some providers (especially via the AI SDK) may need reasoning for proper state

### Impact of Current Bugs

1. **Context continuity broken**: reasoning traces are lost between turns
2. **Token counting incorrect**: reasoning tokens are used but not tracked in context
3. **Provider metadata lost**: cannot round-trip provider-specific metadata (e.g., OpenAI item IDs)

---
## Critical Finding #2: Token Usage Storage

### What We Track

**Session Level** (`session-manager.ts`):

```typescript
sessionData.tokenUsage = {
    inputTokens: 0,
    outputTokens: 0,
    reasoningTokens: 0,
    cacheReadTokens: 0,
    cacheWriteTokens: 0,
    totalTokens: 0,
};
```

**Message Level** (`AssistantMessage`):

```typescript
interface AssistantMessage {
    tokenUsage?: TokenUsage; // Available but...
}
```

### Current Flow

1. `stream-processor.ts` creates the assistant message with empty metadata:

```typescript
await this.contextManager.addAssistantMessage('', [], {});
```

2. After streaming completes, we DO update with token usage:

```typescript
await this.contextManager.updateAssistantMessage(
    this.assistantMessageId,
    { tokenUsage: usage }
);
```

**So we HAVE the data on each message**; we just don't use it for context calculation!

---
## Critical Finding #3: Estimate vs Actual Mismatch

### The Problem

```
API actual inputTokens: 122.4k
Our length/4 estimate:  73.0k
Difference:             49.4k (actual is ~67% higher than our estimate!)
```

### Why So Different?

1. **Tokenizers don't split evenly by characters**
   - Code tokenizes differently than prose
   - JSON schemas are verbose when tokenized
   - Special-character and whitespace handling varies

2. **We're comparing different things**
   - `actualTokens` = from the last LLM call (includes everything sent)
   - `breakdown estimate` = calculated now on current history

3. **Context has grown since the last call**
   - The last call's `inputTokens` doesn't include the response that followed
   - New user messages have been added since
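Point 1 above is why a flat length/4 drifts so badly on mixed content. A character-class-aware heuristic like gemini-cli's narrows the gap; this is an illustrative sketch (the function names are mine, not from any codebase):

```typescript
// Sketch of a character-class-aware token estimator (hypothetical helper).
// ASCII text averages ~4 chars per token (0.25 tokens/char), while CJK and
// other non-ASCII text is closer to ~1.3 tokens per char.
function estimateTokensWeighted(text: string): number {
    let tokens = 0;
    for (const ch of text) {
        tokens += ch.codePointAt(0)! < 128 ? 0.25 : 1.3;
    }
    return Math.round(tokens);
}

// Flat length/4 for comparison
function estimateTokensNaive(text: string): number {
    return Math.round(text.length / 4);
}
```

The two agree on pure ASCII but diverge sharply on CJK-heavy or symbol-heavy text, which is one source of the mismatch above.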
---

## How Other Tools Handle This

### Claude Code (Anthropic)

**Uses the `/v1/messages/count_tokens` API for exact counts!**

```javascript
// From cli.js (minified)
countTokens(A, Q) {
    return this._client.post("/v1/messages/count_tokens", { body: A, ...Q })
}
```

**Categories tracked:**
- System prompt
- System tools
- Memory files
- Skills
- MCP tools (with deferred loading)
- Agents
- Messages (with sub-breakdown)
- Free space
- Autocompact buffer

**Free space calculation:**
```javascript
// YA = sum of all category tokens (excluding deferred)
let YA = k.reduce((CA, _A) => CA + (_A.isDeferred ? 0 : _A.tokens), 0)

// WA = buffer (autocompact or compact)
let WA = autocompactEnabled ? (maxTokens - contextUsed) : 500;

// Free space
let wA = Math.max(0, maxTokens - YA - WA)
```
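De-minified, that logic reads roughly as below. The names (`categories`, `buffer`) are my reconstruction of `YA`/`WA`/`wA`; treat this as a sketch of the math, not Anthropic's actual code:

```typescript
interface Category { tokens: number; isDeferred: boolean }

// Reconstruction of the free-space math from the minified source (names guessed).
function freeSpace(
    categories: Category[],
    maxTokens: number,
    autocompactEnabled: boolean,
    contextUsed: number,
): number {
    // Sum all non-deferred category tokens (YA in the minified source)
    const used = categories.reduce((sum, c) => sum + (c.isDeferred ? 0 : c.tokens), 0);
    // Reserve a buffer (WA): the autocompact headroom, or a flat 500
    const buffer = autocompactEnabled ? maxTokens - contextUsed : 500;
    // Free space never goes negative (wA)
    return Math.max(0, maxTokens - used - buffer);
}
```

Note that deferred items (e.g., MCP tools with deferred loading) are excluded from the sum, so they don't eat into displayed free space.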
### gemini-cli

**Hybrid approach:**

```typescript
// Sync estimation (fast)
estimateTokenCountSync(parts): number {
    // ASCII: ~4 chars per token (0.25 tokens/char)
    // Non-ASCII/CJK: ~1-2 chars per token (1.3 tokens/char)
}

// API counting (when needed)
if (hasMedia) {
    use Gemini countTokens API
} else {
    use sync estimation
}
```

**Token tracking from API response:**

```typescript
{
    input: promptTokenCount,
    output: candidatesTokenCount,
    cached: cachedContentTokenCount,
    thoughts: thoughtsTokenCount, // Reasoning!
    tool: toolUsePromptTokenCount,
    total: totalTokenCount
}
```

### opencode

**Simple estimation + detailed tracking:**

```typescript
Token.estimate(input: string): number {
    return Math.round(input.length / 4)
}

// But tracks actuals per message:
StepFinishPart {
    tokens: {
        input: number,
        output: number,
        reasoning: number,
        cache: { read: number, write: number }
    }
}
```

---
## Current Architecture Issues

### 1. Reasoning Pipeline (BROKEN - Two Bugs)

**Current (broken):**

```
LLM Response → reasoning-delta events received
        ↓
stream-processor.ts → accumulates reasoningText ✓
        ↓
updateAssistantMessage() → ONLY saves tokenUsage, NOT reasoning ✗
        ↓
AssistantMessage.reasoning = undefined (never set!)
        ↓
formatAssistantMessage() → has nothing to format anyway
        ↓
Reasoning NOT sent back to LLM ❌
```

**Should be (following OpenCode):**

```
LLM Response → reasoning-delta events received (with providerMetadata)
        ↓
stream-processor.ts → accumulates reasoningText AND reasoningMetadata
        ↓
updateAssistantMessage() → saves reasoning + reasoningMetadata + tokenUsage
        ↓
AssistantMessage.reasoning = "thinking..." ✓
AssistantMessage.reasoningMetadata = { openai: { itemId: "..." } } ✓
        ↓
formatAssistantMessage() → includes reasoning part with providerMetadata
        ↓
Reasoning sent back to LLM ✓
```

### 2. Token Calculation (/context)

**Current:**

```typescript
// Uses length/4 estimates for everything
systemPromptTokens = estimateStringTokens(systemPrompt); // length/4
messagesTokens = estimateMessagesTokens(preparedHistory); // length/4
toolsTokens = estimateToolTokens(tools); // length/4

total = systemPromptTokens + messagesTokens + toolsTokens;
freeSpace = maxTokens - total - outputBuffer;
```

**Problem:** The total doesn't match the API's actual count.

### 3. Compaction Decision

**Current (`turn-executor.ts`):**

```typescript
const estimatedTokens = estimateMessagesTokens(prepared.preparedHistory);
if (estimatedTokens > compactionThreshold) {
    // Compact!
}
```

**Problem:** Uses a different calculation than `/context`, and both are wrong!

---
## Proposed Solution

### Principle: Single Source of Truth

1. **Use actual token counts from the API as ground truth**
2. **Track tokens per message for accurate history calculation**
3. **Estimate only what we cannot measure**
4. **Use the same formula for `/context` AND compaction decisions**

---

## THE FORMULA (Precise Specification)

### Core Formula

```
estimatedNextInput = lastInputTokens + lastOutputTokens + newMessagesEstimate
```

### Variable Definitions

| Variable | Definition | Source | When Updated |
|----------|------------|--------|--------------|
| `lastInputTokens` | Tokens we SENT in the most recent LLM call | `tokenUsage.inputTokens` from the API response | After EVERY LLM call |
| `lastOutputTokens` | Tokens the LLM RETURNED in its response | `tokenUsage.outputTokens` from the API response | After EVERY LLM call |
| `newMessagesEstimate` | Estimate for messages added AFTER the last LLM call | length/4 heuristic | Calculated on demand |
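Put together, the formula plus its no-ground-truth fallback is only a few lines. This sketch uses illustrative names (`LastCall`, `estimateNextInput`), not the actual Dexto API:

```typescript
// Sketch of the core formula with the "no LLM call yet" fallback.
interface LastCall { inputTokens: number; outputTokens: number }

function estimateNextInput(
    lastCall: LastCall | null,   // null before the first LLM call / after compaction
    newMessagesEstimate: number, // length/4 estimate of messages added since
    fullHistoryEstimate: number, // length/4 estimate of everything (fallback)
): number {
    if (lastCall === null) {
        // No ground truth yet: fall back to pure estimation
        return fullHistoryEstimate;
    }
    return lastCall.inputTokens + lastCall.outputTokens + newMessagesEstimate;
}
```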
### What Counts as "New Messages"?

Messages added to history AFTER `lastInputTokens` was recorded:

- **Tool results** (role='tool') from the last assistant's tool calls
- **New user messages** typed since the last LLM call
- **Any injected system messages** added between calls
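One simple way to identify those messages is to record the history length at the moment token usage arrives, then treat everything past that index as "new". Index-based tracking is an assumption here (timestamps would work too):

```typescript
// Sketch: index-based tracking of "new" messages (hypothetical helper).
interface TrackedHistory<M> {
    messages: M[];
    lastCallIndex: number | null; // history length when lastInputTokens was recorded
}

function newMessagesSinceLastCall<M>(h: TrackedHistory<M>): M[] {
    // Before any LLM call, every message is "new"
    if (h.lastCallIndex === null) return h.messages;
    return h.messages.slice(h.lastCallIndex);
}
```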
### Example Flow

```
┌─────────────────────────────────────────────────────────────────┐
│ Turn 1: User asks "What's the weather in NYC?"                  │
├─────────────────────────────────────────────────────────────────┤
│ LLM Call:                                                       │
│   inputTokens = 5000  (system + tools + user message)           │
│   outputTokens = 100  (assistant: "I'll check" + tool_call)     │
│                                                                 │
│ After call: UPDATE lastInputTokens=5000, lastOutputTokens=100   │
└─────────────────────────────────────────────────────────────────┘
        │
        ▼
┌─────────────────────────────────────────────────────────────────┐
│ Tool executes, result added to history                          │
│   Tool result: "NYC: 72°F, sunny" (role='tool')                 │
│                                                                 │
│ This is a NEW MESSAGE (added after lastInputTokens recorded)    │
└─────────────────────────────────────────────────────────────────┘
        │
        ▼
┌─────────────────────────────────────────────────────────────────┐
│ Before Turn 2: Calculate estimated context                      │
├─────────────────────────────────────────────────────────────────┤
│   lastInputTokens = 5000    (from Turn 1)                       │
│   lastOutputTokens = 100    (from Turn 1)                       │
│   newMessagesEstimate = estimate(tool_result) ≈ 20              │
│                                                                 │
│   estimatedNextInput = 5000 + 100 + 20 = 5120                   │
└─────────────────────────────────────────────────────────────────┘
        │
        ▼
┌─────────────────────────────────────────────────────────────────┐
│ Turn 2: LLM processes tool result                               │
├─────────────────────────────────────────────────────────────────┤
│ LLM Call:                                                       │
│   inputTokens = 5115  (ACTUAL - this is our ground truth!)      │
│   outputTokens = 50   (assistant: "The weather is 72°F...")     │
│                                                                 │
│ VERIFICATION: estimated=5120, actual=5115, error=+5 (+0.1%)     │
│                                                                 │
│ After call: UPDATE lastInputTokens=5115, lastOutputTokens=50    │
└─────────────────────────────────────────────────────────────────┘
```
### Verification Metrics

On EVERY LLM call, log the accuracy of our previous estimate:

```typescript
// Before the LLM call
const estimated = lastInputTokens + lastOutputTokens + newMessagesEstimate;

// After the LLM call, compare to actual
const actual = response.tokenUsage.inputTokens;
const error = estimated - actual;
const errorPercent = (error / actual) * 100;

logger.info(`Context estimate: estimated=${estimated}, actual=${actual}, error=${error > 0 ? '+' : ''}${error} (${errorPercent.toFixed(1)}%)`);
```
### Breakdown for Display (Back-Calculation)

For the `/context` overlay, we show a breakdown. Since we only know the TOTAL accurately, we back-calculate messages:

```typescript
const total = lastInputTokens + lastOutputTokens + newMessagesEstimate;

// These are estimates (we can't measure them directly)
const systemPromptEstimate = estimateTokens(systemPrompt); // length/4
const toolsEstimate = estimateToolsTokens(tools); // length/4

// Back-calculate messages so the math adds up
// (let, not const: we may clamp it below)
let messagesDisplay = total - systemPromptEstimate - toolsEstimate;

// If negative, our estimates are too high - cap at 0 and log a warning
if (messagesDisplay < 0) {
    logger.warn(`Back-calculated messages negative (${messagesDisplay}), estimates may be too high`);
    messagesDisplay = 0;
}
```
### Edge Cases

| Scenario | Behavior |
|----------|----------|
| **No LLM call yet** | `lastInputTokens=null`, fall back to pure estimation, show "(estimated)" label |
| **After compaction** | History changed significantly; set `lastInputTokens=null`, fall back to estimation until the next call |
| **messagesDisplay negative** | Cap at 0, log warning - indicates system/tools estimates are too high |
| **System prompt changed** | Next estimate may be off, but the next actual will correct it |
| **Tools changed (MCP)** | Same as above - self-correcting after the next call |

### What /context Should Display

```
Context Usage: 52,100 / 200,000 tokens (26%)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Breakdown:
  System prompt:   4,000 tokens  (estimated)
  Tools:           8,000 tokens  (estimated)
  Messages:       40,100 tokens  (back-calculated)
  ─────────────────────────────
  Total:          52,100 tokens

Calculation basis:
  Last actual input:  50,000 tokens
  Last output:         2,000 tokens
  New since then:        100 tokens (estimated)

Last estimate accuracy: +0.6% error

Free space: 131,900 tokens (after 16,000 output buffer)
```

### Implementation Checklist

- [ ] Store `lastInputTokens` and `lastOutputTokens` after each LLM call
- [ ] Track which messages are "new" since the last LLM call (needs message timestamp or index tracking)
- [ ] Calculate `newMessagesEstimate` only for messages added after the last call
- [ ] Log verification metrics on every LLM call
- [ ] Update the `/context` overlay to show this breakdown
- [ ] Handle edge cases (no call yet, after compaction)
- [ ] Use the SAME formula for compaction decisions
---

### Legacy Edge Cases (keeping for reference)

1. **No LLM call yet (new session)**
   - Fall back to pure estimation
   - All numbers are estimates with an "(estimated)" label

2. **messagesDisplay comes out negative**
   - Our estimates for system/tools are too high
   - Cap at 0, log a warning
   - Indicates estimation needs calibration

3. **After compaction**
   - Token counts reset with the new session
   - `compactionCount` tracks how many times we've compacted

4. **Reasoning tokens**
   - Must be sent back to the LLM (fix formatter) ✅ DONE
   - Include in context calculation
   - Track separately for display
### Verification: Why `lastOutputTokens` Is Safe to Use Directly

*Verified on 2025-01-20 by analyzing AI SDK source code and our codebase*

**Question:** Does `outputTokens` include content that might be pruned before the next LLM call?

**Answer:** No. `outputTokens` is safe to use directly because:

#### Part 1: What does `outputTokens` include? (AI SDK Verification)

**Anthropic** - verified via `ai/packages/anthropic/src/__fixtures__/anthropic-json-tool.1.chunks.txt`:

```json
{"type":"message_delta","delta":{"stop_reason":"tool_use"},"usage":{"output_tokens":47}}
```

The tool call response reports `output_tokens: 47` - **includes tool calls** ✅

**OpenAI** - verified via `ai/packages/openai/src/responses/__fixtures__/openai-shell-tool.1.chunks.txt`:

```json
{"output":[{"type":"shell_call","action":{"commands":["ls -a ~/Desktop"]}}],"usage":{"output_tokens":41}}
```

The shell tool call reports `output_tokens: 41` - **includes tool calls** ✅

**Google** - verified via `ai/packages/google/src/google-generative-ai-language-model.test.ts` lines 2274-2302:

```typescript
content: { parts: [{ functionCall: { name: 'test-tool', args: { value: 'test' } } }] },
usageMetadata: { promptTokenCount: 10, candidatesTokenCount: 20, totalTokenCount: 30 }
```

The function call response reports `candidatesTokenCount: 20` - **includes tool calls** ✅

#### Part 2: What gets pruned in our system?

From `manager.ts` `prepareHistory()`:

- Only **tool result messages** (role='tool') can be pruned
- They're marked with a `compactedAt` timestamp
- Replaced with a placeholder: `[Old tool result content cleared]`

**What is NEVER pruned:**

- Assistant messages (text content)
- The assistant's tool calls
- User messages

#### Verification Table

| Message Type | Pruned? | Part of outputTokens? |
|-------------|---------|----------------------|
| Assistant text | ❌ Never | ✅ Yes |
| Assistant tool calls | ❌ Never | ✅ Yes (verified across all providers) |
| Tool results (role='tool') | ✅ Can be pruned | ❌ No (separate messages) |

#### Code Evidence

- `stream-processor.ts`: tool calls stored via `addToolCall()` with full arguments
- `manager.ts` line 279: only `msg.role === 'tool' && msg.compactedAt` gets the placeholder
- No code path exists to prune assistant messages

**Conclusion:** The formula `lastInputTokens + lastOutputTokens + newMessagesEstimate` is correct because:

- `lastInputTokens` reflects pruned history (the API tells us exactly what was sent)
- `lastOutputTokens` is the assistant's response (text + tool calls), which is stored and sent back as-is
- All major providers (Anthropic, OpenAI, Google) include tool calls in their output token counts
- Only tool results (separate messages) can be pruned, and those are in `inputTokens`

---
## Implementation Plan

### Phase 1: Fix Reasoning Storage (HIGH PRIORITY - Bug #1) ✅ COMPLETED

**The root cause:** `stream-processor.ts` collects reasoning but never persists it.

**Files to modify:**
- `packages/core/src/llm/executor/stream-processor.ts`
- `packages/core/src/context/types.ts`

**Changes:**

1. Add a `reasoningMetadata` field to the `AssistantMessage` type:

```typescript
// In context/types.ts
interface AssistantMessage {
    reasoning?: string;
    reasoningMetadata?: Record<string, unknown>; // NEW - for provider round-tripping
    // ...
}
```

2. Capture `providerMetadata` from reasoning-delta events:

```typescript
// In stream-processor.ts, add field:
private reasoningMetadata: Record<string, unknown> | undefined;

// In the reasoning-delta case:
case 'reasoning-delta':
    this.reasoningText += event.text;
    // Capture provider metadata for round-tripping (OpenAI itemId, etc.)
    if (event.providerMetadata) {
        this.reasoningMetadata = event.providerMetadata;
    }
    // ... emit events
```

3. **Fix the bug** - persist reasoning in `updateAssistantMessage()`:

```typescript
// In stream-processor.ts, 'finish' case (around line 315):
if (this.assistantMessageId) {
    await this.contextManager.updateAssistantMessage(
        this.assistantMessageId,
        {
            tokenUsage: usage,
            reasoning: this.reasoningText || undefined, // ADD THIS
            reasoningMetadata: this.reasoningMetadata, // ADD THIS
        }
    );
}
```
### Phase 2: Fix Reasoning Round-Trip (Bug #2) ✅ COMPLETED

**Files to modify:**
- `packages/core/src/llm/formatters/vercel.ts`

**Changes:**

1. Update `formatAssistantMessage()` to include reasoning:

```typescript
// In formatAssistantMessage(), before returning:
if (msg.reasoning) {
    contentParts.push({
        type: 'reasoning',
        text: msg.reasoning,
        providerMetadata: msg.reasoningMetadata,
    });
}
```

**Verified:** The Vercel AI SDK's `AssistantContent` type supports `ReasoningPart`:

```typescript
// packages/provider-utils/src/types/assistant-model-message.ts
export type AssistantContent = string | Array<TextPart | FilePart | ReasoningPart | ...>;

// packages/provider-utils/src/types/content-part.ts
export interface ReasoningPart {
    type: 'reasoning';
    text: string;
    providerOptions?: ProviderOptions; // For round-tripping provider metadata
}
```
### Phase 3: Unified Context Calculation ✅ COMPLETED

**Files to modify:**
- `packages/core/src/context/manager.ts` - `getContextTokenEstimate()`
- `packages/core/src/llm/executor/turn-executor.ts` - compaction check
- `packages/cli/src/cli/ink-cli/components/overlays/ContextStatsOverlay.tsx`

**Changes:**

1. Create a shared `calculateContextUsage()` function:

```typescript
// New file: packages/core/src/context/context-calculator.ts
export async function calculateContextUsage(
    contextManager: ContextManager,
    tools: ToolDefinitions,
    maxContextTokens: number,
    outputBuffer: number
): Promise<ContextUsage> {
    // Implement the formula above
}
```

2. Use it in `/context`:

```typescript
// In DextoAgent.getContextStats()
const usage = await calculateContextUsage(...);
return usage;
```

3. Use it in the compaction decision:

```typescript
// In turn-executor.ts
const usage = await calculateContextUsage(...);
if (usage.total > compactionThreshold) {
    // Compact!
}
```
### Phase 4: Message-Level Token Tracking

**Already implemented!** We just need to use it:

```typescript
// In calculateContextUsage(), sum from messages.
// (Input tokens are NOT summed per message: each call's inputTokens
// already includes all prior history, so summing would double-count.)
const history = await contextManager.getHistory();
let totalOutputFromMessages = 0;
let totalReasoningFromMessages = 0;

for (const msg of history) {
    if (msg.role === 'assistant' && msg.tokenUsage) {
        totalOutputFromMessages += msg.tokenUsage.outputTokens ?? 0;
        totalReasoningFromMessages += msg.tokenUsage.reasoningTokens ?? 0;
    }
}
```
### Phase 5: Calibration & Logging

1. Log estimate vs actual on every LLM call (already done, level=info)
2. Track the calibration ratio over time
3. Consider adaptive estimation based on observed ratios
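Point 3 could be as simple as smoothing the observed actual/estimated ratio with an exponential moving average and applying it as a multiplier on future estimates. Everything below is a sketch under assumed names:

```typescript
// Sketch of adaptive calibration via an exponential moving average.
// alpha = 0.2 weights recent calls more heavily than older ones.
class EstimateCalibrator {
    private ratio = 1.0; // observed actual/estimated, smoothed

    record(estimated: number, actual: number, alpha = 0.2): void {
        if (estimated <= 0) return; // avoid divide-by-zero on empty estimates
        this.ratio = (1 - alpha) * this.ratio + alpha * (actual / estimated);
    }

    calibrate(estimate: number): number {
        return Math.round(estimate * this.ratio);
    }
}
```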
### Phase 6: Future - API Token Counting

**For Anthropic:**

```typescript
// New method in the Anthropic service
async countTokens(messages: Message[], tools: Tool[]): Promise<{
    input_tokens: number;
}>
```

**For other providers:**
- tiktoken for OpenAI
- Gemini countTokens API
- Fallback to estimation
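The dispatch could be a thin wrapper that prefers an exact counter when one is registered and degrades to length/4 otherwise. The Anthropic `count_tokens` endpoint and Gemini `countTokens` API are real, but the wrapper shape below is an assumption:

```typescript
// Sketch of provider dispatch with estimation fallback. The exact counters
// (Anthropic count_tokens, tiktoken, Gemini countTokens) are assumed to be
// registered elsewhere; only the length/4 fallback is concrete here.
type Counter = (text: string) => Promise<number>;

async function countTokensFor(
    provider: string,
    text: string,
    counters: Partial<Record<string, Counter>>,
): Promise<number> {
    const exact = counters[provider];
    if (exact) {
        try {
            return await exact(text);
        } catch {
            // Network/API failure: degrade gracefully to estimation
        }
    }
    return Math.round(text.length / 4); // length/4 fallback
}
```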
---

## Data Flow Diagram

### Current State (BROKEN)

```
┌─────────────────────────────────────────────────────────────────────┐
│ LLM Response Stream                                                 │
├─────────────────────────────────────────────────────────────────────┤
│ reasoning-delta events → reasoningText accumulated ✓                │
│ text-delta events → content accumulated ✓                           │
│ finish event → usage: { inputTokens, outputTokens, ... }            │
└─────────────────────────────────────────────────────────────────────┘
        │
        ▼
┌─────────────────────────────────────────────────────────────────────┐
│ stream-processor.ts updateAssistantMessage()                        │
├─────────────────────────────────────────────────────────────────────┤
│ await this.contextManager.updateAssistantMessage(                   │
│     this.assistantMessageId,                                        │
│     { tokenUsage: usage }    ← ONLY tokenUsage saved!               │
│ );                           ← reasoning NOT included! ✗            │
└─────────────────────────────────────────────────────────────────────┘
        │
        ▼
┌─────────────────────────────────────────────────────────────────────┐
│ AssistantMessage Stored                                             │
├─────────────────────────────────────────────────────────────────────┤
│ {                                                                   │
│   role: 'assistant',                                                │
│   content: [...],          ← ✓ Stored                               │
│   reasoning: undefined,    ← ✗ NEVER SET!                           │
│   tokenUsage: {...}        ← ✓ Stored                               │
│ }                                                                   │
└─────────────────────────────────────────────────────────────────────┘
```

### Target State (FIXED)

```
┌─────────────────────────────────────────────────────────────────────┐
│ LLM Response Stream                                                 │
├─────────────────────────────────────────────────────────────────────┤
│ reasoning-delta events → reasoningText + providerMetadata ✓         │
│ text-delta events → content accumulated ✓                           │
│ finish event → usage: { inputTokens, outputTokens, ... }            │
└─────────────────────────────────────────────────────────────────────┘
        │
        ▼
┌─────────────────────────────────────────────────────────────────────┐
│ stream-processor.ts updateAssistantMessage()                        │
├─────────────────────────────────────────────────────────────────────┤
│ await this.contextManager.updateAssistantMessage(                   │
│     this.assistantMessageId,                                        │
│     {                                                               │
│         tokenUsage: usage,                                          │
│         reasoning: this.reasoningText,             ← NEW            │
│         reasoningMetadata: this.reasoningMetadata  ← NEW            │
│     }                                                               │
│ );                                                                  │
└─────────────────────────────────────────────────────────────────────┘
        │
        ▼
┌─────────────────────────────────────────────────────────────────────┐
│ AssistantMessage Stored                                             │
├─────────────────────────────────────────────────────────────────────┤
│ {                                                                   │
│   role: 'assistant',                                                │
│   content: [...],                                                   │
│   reasoning: 'Let me think...',                    ← ✓ Now stored   │
│   reasoningMetadata: { openai: { itemId: '...' } } ← ✓ Round-trip   │
│   tokenUsage: { inputTokens, outputTokens, reasoningTokens }        │
│ }                                                                   │
└─────────────────────────────────────────────────────────────────────┘
        │
        ▼
┌─────────────────────────────────────────────────────────────────────┐
│ Next LLM Call (Formatter)                                           │
├─────────────────────────────────────────────────────────────────────┤
│ formatAssistantMessage() includes:                                  │
│   - content (text parts)           ✓ Already done                   │
│   - toolCalls                      ✓ Already done                   │
│   - reasoning + providerMetadata   ✓ NEW - enables round-trip       │
└─────────────────────────────────────────────────────────────────────┘
        │
        ▼
┌─────────────────────────────────────────────────────────────────────┐
│ /context Calculation                                                │
├─────────────────────────────────────────────────────────────────────┤
│ currentTotal = lastInput + lastOutput + newMessagesEstimate         │
│                                                                     │
│ Breakdown:                                                          │
│   systemPrompt = estimate (length/4)                                │
│   tools = estimate (length/4)                                       │
│   messages = currentTotal - systemPrompt - tools (back-calc)        │
│   reasoning = sum(msg.tokenUsage.reasoningTokens) (for display)     │
│                                                                     │
│ freeSpace = maxTokens - currentTotal - outputBuffer                 │
└─────────────────────────────────────────────────────────────────────┘
        │
        ▼
┌─────────────────────────────────────────────────────────────────────┐
│ Compaction Decision                                                 │
├─────────────────────────────────────────────────────────────────────┤
│ SAME FORMULA as /context!                                           │
│                                                                     │
│ if (currentTotal > compactionThreshold) {                           │
│     triggerCompaction();                                            │
│ }                                                                   │
└─────────────────────────────────────────────────────────────────────┘
```
---

## Testing Strategy

### Unit Tests

1. **Reasoning storage test (Phase 1)**
   - Mock an LLM stream with reasoning-delta events
   - Verify `stream-processor.ts` calls `updateAssistantMessage()` with reasoning
   - Verify `reasoningMetadata` is captured from `providerMetadata`

2. **Reasoning round-trip test (Phase 2)**
   - Create an `AssistantMessage` with `reasoning` and `reasoningMetadata`
   - Call `formatAssistantMessage()`
   - Verify the output contains a reasoning part with `providerMetadata`

3. **Token calculation test (Phase 3)**
   - Mock a message with known tokenUsage
   - Verify the calculation matches the expected value

4. **Edge case tests**
   - New session (no actuals) - falls back to estimation
   - Negative messagesDisplay (capped at 0)
   - Post-compaction state
   - Empty reasoning (should not create an empty reasoning part)
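Test 1 can be prototyped against a minimal in-memory model of the fix before wiring it to the real `stream-processor.ts`. Everything here (the event shapes, `MiniStreamProcessor`) is an illustrative stand-in, not the production API:

```typescript
// Self-contained sketch of the Phase 1 behavior under test: a minimal
// processor that accumulates reasoning deltas and persists them on finish.
type Update = {
    tokenUsage: unknown;
    reasoning?: string;
    reasoningMetadata?: Record<string, unknown>;
};

class MiniStreamProcessor {
    private reasoningText = '';
    private reasoningMetadata: Record<string, unknown> | undefined;
    public persisted: Update | null = null; // stands in for updateAssistantMessage()

    handle(event: { type: string; text?: string; providerMetadata?: Record<string, unknown>; usage?: unknown }): void {
        if (event.type === 'reasoning-delta') {
            this.reasoningText += event.text ?? '';
            if (event.providerMetadata) this.reasoningMetadata = event.providerMetadata;
        } else if (event.type === 'finish') {
            // The fix under test: persist reasoning alongside tokenUsage
            this.persisted = {
                tokenUsage: event.usage,
                reasoning: this.reasoningText || undefined,
                reasoningMetadata: this.reasoningMetadata,
            };
        }
    }
}
```

A real test would make the same assertions against `updateAssistantMessage()` via a mocked `ContextManager`.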
### Integration Tests

1. **Full reasoning flow test**
   - Enable extended thinking on Claude
   - Send a message that triggers reasoning
   - Verify reasoning is persisted to the message
   - Send a follow-up message
   - Verify reasoning is sent back to the LLM (check formatted messages)

2. **Token tracking test**
   - Send a message
   - Verify tokenUsage is stored on the message
   - Open /context
   - Verify the numbers use the actual from the last call

3. **Compaction alignment test**
   - Fill context near the threshold
   - Verify /context and compaction trigger at the same point

---

## Success Criteria

1. **Numbers add up**: Total = SystemPrompt + Tools + Messages
2. **Consistency**: /context and compaction use the same calculation
3. **Reasoning works**: traces are sent back to the LLM correctly
4. **Calibration visible**: logs show the estimate vs actual ratio
5. **Provider compatibility**: works with Anthropic, OpenAI, Google, etc.
---

## Appendix: Verification Against Other Implementations

*This plan was verified against actual implementations on 2025-01-20.*

### OpenCode Verification (~/Projects/external/opencode)

| Claim | Verified | Evidence |
|-------|----------|----------|
| Stores reasoning as `ReasoningPart` | ✅ | `message-v2.ts` lines 78-89 |
| Includes `providerMetadata` for round-tripping | ✅ | `message-v2.ts` lines 554-560 |
| `toModelMessage()` sends reasoning back | ✅ | `message-v2.ts` lines 435-569 |
| Tracks reasoning tokens separately | ✅ | `session/index.ts` line 432, schemas throughout |
| Handles provider-specific metadata | ✅ | `openai-responses-language-model.ts` lines 520-538 |

**OpenCode approach:** full round-trip of reasoning with provider metadata. This is our reference implementation.

### Gemini-CLI Verification (~/Projects/external/gemini-cli)

| Claim in Original Plan | Actual Behavior | Status |
|------------------------|-----------------|--------|
| "Parts with thought: true included when sending history back" | **WRONG** - they filter OUT thoughts at line 815 | ❌ Corrected |
| Uses `thought: true` flag | ✅ Correct | ✅ |
| Tracks `thoughtsTokenCount` | ✅ Correct - `chatRecordingService.ts` line 278 | ✅ |

**Gemini-CLI approach:** track thought tokens for cost/display but do NOT round-trip them.
This is simpler but requires Google-specific handling.

### Why We Follow OpenCode

1. **Same SDK**: both use the Vercel AI SDK
2. **Provider-agnostic**: works across all providers without special-casing
3. **Future-proof**: preserves metadata for providers that need it
4. **Simpler code**: no provider-specific filtering logic

### Dexto Implementation Verification

| Component | Current State | Bug |
|-----------|---------------|-----|
| `stream-processor.ts` | Accumulates `reasoningText` but doesn't persist it | **Bug #1** |
| `vercel.ts` formatter | Ignores `msg.reasoning` | **Bug #2** (blocked by #1) |
| `AssistantMessage` type | Has the `reasoning?: string` field | ✅ Ready |
| Per-message `tokenUsage` | Stored via `updateAssistantMessage()` | ✅ Working |
| `lastActualInputTokens` | Set after each LLM call | ✅ Working |
| Compaction calculation | Uses `estimateMessagesTokens()` only | Different from /context |
| `/context` calculation | Uses full estimation (system + tools + messages) | Different from compaction |
|