feat: Add intelligent auto-router and enhanced integrations
- Add intelligent-router.sh hook for automatic agent routing
- Add AUTO-TRIGGER-SUMMARY.md documentation
- Add FINAL-INTEGRATION-SUMMARY.md documentation
- Complete Prometheus integration (6 commands + 4 tools)
- Complete Dexto integration (12 commands + 5 tools)
- Enhanced Ralph with access to all agents
- Fix /clawd command (removed disable-model-invocation)
- Update hooks.json to v5 with intelligent routing
- 291 total skills now available
- All 21 commands with automatic routing

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
dexto/feature-plans/context-calculation.md
@@ -0,0 +1,949 @@
# Context Window Calculation Analysis

## Problem Statement

Our `/context` overlay shows inconsistent numbers:
- **Total shown**: 122.4k tokens (from API's actual count)
- **Breakdown sum**: ~73k tokens (our length/4 estimates)
- **Free space**: Calculated from breakdown, not actual total

This leads to confusing UX where numbers don't add up.

Additionally, our compaction decision uses a different calculation than `/context`, leading to inconsistency.

---

## Critical Finding #1: Reasoning Tokens Not Sent Back to LLM

### Current State (Dexto)

**We have the type but DON'T actually store reasoning:**
```typescript
// AssistantMessage in context/types.ts
interface AssistantMessage {
    reasoning?: string; // Field EXISTS but is never populated!
    tokenUsage?: TokenUsage;
    // ...
}
```

**Two separate bugs:**

1. **`stream-processor.ts` never persists reasoning text:**
   ```typescript
   // Line 24: Reasoning IS accumulated during streaming
   private reasoningText: string = '';

   // Lines 97-108: Accumulated from reasoning-delta events
   case 'reasoning-delta':
       this.reasoningText += event.text; // ✓ Collected

   // BUT lines 314-320: Only tokenUsage is persisted!
   await this.contextManager.updateAssistantMessage(
       this.assistantMessageId,
       { tokenUsage: usage } // ✗ No reasoning field!
   );
   ```

2. **`formatAssistantMessage()` in `vercel.ts` ignores `msg.reasoning`:**
   - Only extracts `msg.content` (text parts) and `msg.toolCalls`
   - Even if reasoning WAS stored, it wouldn't be sent back

**Result:** Reasoning is collected → emitted to events → but never persisted or round-tripped.

### How OpenCode Handles It (Correctly)

```typescript
// In toModelMessage() - opencode/src/session/message-v2.ts
if (part.type === "reasoning") {
    assistantMessage.parts.push({
        type: "reasoning",
        text: part.text,
        providerMetadata: part.metadata, // Critical for round-tripping!
    })
}
```

OpenCode:
1. Stores reasoning as `ReasoningPart` in message parts
2. Includes `providerMetadata` (contains thought signatures for Gemini, etc.)
3. Sends reasoning back in `toModelMessage()` conversion
4. Tracks `reasoning` tokens separately in token usage

### How Gemini-CLI Handles It (Different Approach)

```typescript
// Uses thought: true flag on parts from model
{ text: 'Hmm', thought: true }

// BUT they explicitly FILTER OUT thoughts before storing in history!
// geminiChat.ts line 815:
modelResponseParts.push(
    ...content.parts.filter((part) => !part.thought), // Filter OUT thoughts
);

// Token tracking still captures thoughtsTokenCount from API response
// chatRecordingService.ts line 278:
tokens.thoughts = respUsageMetadata.thoughtsTokenCount ?? 0;
```

**Key difference:** Gemini-CLI tracks thought tokens for display/cost but does NOT round-trip them.
This works because Google's API doesn't require thought history for context continuity.

### Why We Follow OpenCode's Approach

1. **We use Vercel AI SDK** like OpenCode, not Google's native SDK
2. **Provider-agnostic**: OpenCode's approach works across all providers
3. **No provider-specific logic**: We shouldn't special-case Google's behavior
4. **Context continuity**: Some providers (especially via AI SDK) may need reasoning for proper state

### Impact of Current Bugs

1. **Context continuity broken**: Reasoning traces lost between turns
2. **Token counting incorrect**: Reasoning tokens used but not tracked in context
3. **Provider metadata lost**: Cannot round-trip provider-specific metadata (e.g., OpenAI item IDs)

---

## Critical Finding #2: Token Usage Storage

### What We Track

**Session Level** (`session-manager.ts`):
```typescript
sessionData.tokenUsage = {
    inputTokens: 0,
    outputTokens: 0,
    reasoningTokens: 0,
    cacheReadTokens: 0,
    cacheWriteTokens: 0,
    totalTokens: 0,
};
```

**Message Level** (`AssistantMessage`):
```typescript
interface AssistantMessage {
    tokenUsage?: TokenUsage; // Available but...
}
```

### Current Flow

1. `stream-processor.ts` creates assistant message with empty metadata:
   ```typescript
   await this.contextManager.addAssistantMessage('', [], {});
   ```

2. After streaming completes, we DO update with token usage:
   ```typescript
   await this.contextManager.updateAssistantMessage(
       this.assistantMessageId,
       { tokenUsage: usage }
   );
   ```

**So we HAVE the data on each message**, we just don't use it for context calculation!

---

## Critical Finding #3: Estimate vs Actual Mismatch

### The Problem

```
API actual inputTokens:  122.4k
Our length/4 estimate:    73.0k
Difference:               49.4k (estimate is ~40% below actual; actual is ~68% above estimate!)
```
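The gap can be stated two ways, and the two percentages differ a lot, so it helps to compute both. A minimal sketch using the figures above; `estimateGap` and its field names are illustrative and not part of the Dexto codebase:

```typescript
// Hypothetical helper: compares a heuristic estimate against the API's actual count.
interface EstimateGap {
    difference: number;       // actual - estimate, in tokens
    shortfallPercent: number; // how far the estimate falls below the actual
    overshootPercent: number; // how far the actual exceeds the estimate
}

function estimateGap(actualTokens: number, estimatedTokens: number): EstimateGap {
    if (actualTokens <= 0 || estimatedTokens <= 0) {
        throw new RangeError('token counts must be positive');
    }
    const difference = actualTokens - estimatedTokens;
    return {
        difference,
        shortfallPercent: (difference / actualTokens) * 100,
        overshootPercent: (difference / estimatedTokens) * 100,
    };
}
```

For 122.4k actual against 73.0k estimated, the estimate falls about 40% short of the actual count, which is the same gap as the actual being about 68% larger than the estimate.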

### Why So Different?

1. **Tokenizers don't split evenly by characters**
   - Code tokenizes differently than prose
   - JSON schemas are verbose when tokenized
   - Special characters, whitespace handling varies

2. **We're comparing different things**
   - `actualTokens` = from last LLM call (includes everything sent)
   - `breakdown estimate` = calculated now on current history

3. **Context has grown since last call**
   - Last call's `inputTokens` doesn't include the response that followed
   - New user messages added since

---

## How Other Tools Handle This

### Claude Code (Anthropic)

**Uses `/v1/messages/count_tokens` API for exact counts!**

```javascript
// From cli.js (minified)
countTokens(A,Q) {
    return this._client.post("/v1/messages/count_tokens", { body: A, ...Q })
}
```

**Categories tracked:**
- System prompt
- System tools
- Memory files
- Skills
- MCP tools (with deferred loading)
- Agents
- Messages (with sub-breakdown)
- Free space
- Autocompact buffer

**Free space calculation:**
```javascript
// YA = sum of all category tokens (excluding deferred)
let YA = k.reduce((CA, _A) => CA + (_A.isDeferred ? 0 : _A.tokens), 0)

// WA = buffer (autocompact or compact)
let WA = autocompactEnabled ? (maxTokens - contextUsed) : 500;

// Free space
let wA = Math.max(0, maxTokens - YA - WA)
```
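For readability, the same arithmetic can be restated with descriptive names. This is a sketch only: the identifiers below are invented for illustration (the minified source does not reveal the originals), and the buffer is passed in as a parameter rather than derived.

```typescript
// Hypothetical, readable restatement of the free-space arithmetic above.
interface ContextCategory {
    name: string;
    tokens: number;
    isDeferred?: boolean; // deferred categories are excluded from the used total
}

function freeSpace(
    categories: ContextCategory[],
    maxTokens: number,
    bufferTokens: number,
): number {
    // Sum every category that is not deferred.
    const used = categories.reduce(
        (sum, category) => sum + (category.isDeferred ? 0 : category.tokens),
        0,
    );
    // Never report negative free space.
    return Math.max(0, maxTokens - used - bufferTokens);
}
```

Deferred entries still appear in the category list, so they can be displayed, but they do not reduce the reported free space.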

### gemini-cli

**Hybrid approach:**

```typescript
// Sync estimation (fast)
estimateTokenCountSync(parts): number {
    // ASCII: ~4 chars per token (0.25 tokens/char)
    // Non-ASCII/CJK: more than 1 token per char (1.3 tokens/char)
}

// API counting (when needed)
if (hasMedia) {
    // use Gemini countTokens API
} else {
    // use sync estimation
}
```
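The per-character weighting described in the comments can be sketched as a small standalone function. The two constants come from the comments above; the function name and the code-unit loop are assumptions for illustration, not gemini-cli's actual implementation.

```typescript
// Weights taken from the comments above; everything else is illustrative.
const ASCII_TOKENS_PER_CHAR = 0.25;
const NON_ASCII_TOKENS_PER_CHAR = 1.3;

function estimateTokensByCharClass(text: string): number {
    let tokens = 0;
    // Walks UTF-16 code units, so a character outside the BMP counts twice.
    for (let i = 0; i < text.length; i++) {
        const isAscii = text.charCodeAt(i) <= 0x7f;
        tokens += isAscii ? ASCII_TOKENS_PER_CHAR : NON_ASCII_TOKENS_PER_CHAR;
    }
    return Math.ceil(tokens);
}
```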

**Token tracking from API response:**
```typescript
{
    input: promptTokenCount,
    output: candidatesTokenCount,
    cached: cachedContentTokenCount,
    thoughts: thoughtsTokenCount, // Reasoning!
    tool: toolUsePromptTokenCount,
    total: totalTokenCount
}
```

### opencode

**Simple estimation + detailed tracking:**

```typescript
Token.estimate(input: string): number {
    return Math.round(input.length / 4)
}

// But tracks actuals per message:
StepFinishPart {
    tokens: {
        input: number,
        output: number,
        reasoning: number,
        cache: { read: number, write: number }
    }
}
```

---

## Current Architecture Issues

### 1. Reasoning Pipeline (BROKEN - Two Bugs)

**Current (broken):**
```
LLM Response → reasoning-delta events received
    ↓
stream-processor.ts → accumulates reasoningText ✓
    ↓
updateAssistantMessage() → ONLY saves tokenUsage, NOT reasoning ✗
    ↓
AssistantMessage.reasoning = undefined (never set!)
    ↓
formatAssistantMessage() → has nothing to format anyway
    ↓
Reasoning NOT sent back to LLM ❌
```

**Should be (following OpenCode):**
```
LLM Response → reasoning-delta events received (with providerMetadata)
    ↓
stream-processor.ts → accumulates reasoningText AND reasoningMetadata
    ↓
updateAssistantMessage() → saves reasoning + reasoningMetadata + tokenUsage
    ↓
AssistantMessage.reasoning = "thinking..." ✓
AssistantMessage.reasoningMetadata = { openai: { itemId: "..." } } ✓
    ↓
formatAssistantMessage() → includes reasoning part with providerMetadata
    ↓
Reasoning sent back to LLM ✓
```

### 2. Token Calculation (/context)

**Current:**
```typescript
// Uses length/4 estimate for everything
systemPromptTokens = estimateStringTokens(systemPrompt);  // length/4
messagesTokens = estimateMessagesTokens(preparedHistory); // length/4
toolsTokens = estimateToolTokens(tools);                  // length/4

total = systemPromptTokens + messagesTokens + toolsTokens;
freeSpace = maxTokens - total - outputBuffer;
```

**Problem:** Total doesn't match API's actual count.

### 3. Compaction Decision

**Current (`turn-executor.ts`):**
```typescript
const estimatedTokens = estimateMessagesTokens(prepared.preparedHistory);
if (estimatedTokens > compactionThreshold) {
    // Compact!
}
```

**Problem:** Uses different calculation than `/context`, and both are wrong!

---

## Proposed Solution

### Principle: Single Source of Truth

1. **Use actual token counts from API as ground truth**
2. **Track tokens per message for accurate history calculation**
3. **Estimate only what we cannot measure**
4. **Same formula for `/context` AND compaction decisions**

---

## THE FORMULA (Precise Specification)

### Core Formula

```
estimatedNextInput = lastInputTokens + lastOutputTokens + newMessagesEstimate
```

### Variable Definitions

| Variable | Definition | Source | When Updated |
|----------|------------|--------|--------------|
| `lastInputTokens` | Tokens we SENT in the most recent LLM call | `tokenUsage.inputTokens` from API response | After EVERY LLM call |
| `lastOutputTokens` | Tokens the LLM RETURNED in its response | `tokenUsage.outputTokens` from API response | After EVERY LLM call |
| `newMessagesEstimate` | Estimate for messages added AFTER the last LLM call | `length/4` heuristic | Calculated on demand |

### What Counts as "New Messages"?

Messages added to history AFTER `lastInputTokens` was recorded:
- **Tool results** (role='tool') from the last assistant's tool calls
- **New user messages** typed since last LLM call
- **Any injected system messages** added between calls

### Example Flow

```
┌─────────────────────────────────────────────────────────────────┐
│ Turn 1: User asks "What's the weather in NYC?"                  │
├─────────────────────────────────────────────────────────────────┤
│ LLM Call:                                                       │
│   inputTokens  = 5000 (system + tools + user message)          │
│   outputTokens = 100  (assistant: "I'll check" + tool_call)     │
│                                                                 │
│ After call: UPDATE lastInputTokens=5000, lastOutputTokens=100   │
└─────────────────────────────────────────────────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────┐
│ Tool executes, result added to history                          │
│   Tool result: "NYC: 72°F, sunny" (role='tool')                 │
│                                                                 │
│ This is a NEW MESSAGE (added after lastInputTokens recorded)    │
└─────────────────────────────────────────────────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────┐
│ Before Turn 2: Calculate estimated context                      │
├─────────────────────────────────────────────────────────────────┤
│   lastInputTokens     = 5000 (from Turn 1)                      │
│   lastOutputTokens    = 100  (from Turn 1)                      │
│   newMessagesEstimate = estimate(tool_result) ≈ 20              │
│                                                                 │
│ estimatedNextInput = 5000 + 100 + 20 = 5120                     │
└─────────────────────────────────────────────────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────┐
│ Turn 2: LLM processes tool result                               │
├─────────────────────────────────────────────────────────────────┤
│ LLM Call:                                                       │
│   inputTokens  = 5115 (ACTUAL - this is our ground truth!)     │
│   outputTokens = 50   (assistant: "The weather is 72°F...")     │
│                                                                 │
│ VERIFICATION: estimated=5120, actual=5115, error=+5 (+0.1%)     │
│                                                                 │
│ After call: UPDATE lastInputTokens=5115, lastOutputTokens=50    │
└─────────────────────────────────────────────────────────────────┘
```
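The flow above can be sketched as a single function over the stored history. This is a minimal sketch under stated assumptions: `LastCallSnapshot`, `historyLengthAfterCall`, and `estimateNextInput` are hypothetical names, messages are reduced to plain text, and index tracking is used to decide which messages are new.

```typescript
type Role = 'system' | 'user' | 'assistant' | 'tool';

interface HistoryMessage {
    role: Role;
    text: string;
}

// Actual usage from the most recent LLM call, plus the history length right
// after that call's response was stored. Field names are hypothetical.
interface LastCallSnapshot {
    inputTokens: number;
    outputTokens: number;
    historyLengthAfterCall: number;
}

// The length/4 heuristic named in the variable table.
const estimateTokens = (text: string): number => Math.ceil(text.length / 4);

function estimateNextInput(
    history: HistoryMessage[],
    last: LastCallSnapshot | null,
): number {
    // Without actuals (new session, or just compacted) every message is "new".
    const firstNewIndex = last === null ? 0 : last.historyLengthAfterCall;
    const newMessagesEstimate = history
        .slice(firstNewIndex)
        .reduce((sum, msg) => sum + estimateTokens(msg.text), 0);

    if (last === null) {
        // History only; system prompt and tools would be estimated separately.
        return newMessagesEstimate;
    }
    return last.inputTokens + last.outputTokens + newMessagesEstimate;
}
```

Before Turn 2 in the example, the snapshot would hold `inputTokens: 5000` and `outputTokens: 100`, and the tool result would be the only message past `historyLengthAfterCall`. A text-only `length/4` estimate ignores per-message overhead, so it may come out lower than the ≈20 shown in the diagram.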

### Verification Metrics

On EVERY LLM call, log the accuracy of our previous estimate:

```typescript
// Before LLM call
const estimated = lastInputTokens + lastOutputTokens + newMessagesEstimate;

// After LLM call, compare to actual
const actual = response.tokenUsage.inputTokens;
const error = estimated - actual;
const errorPercent = (error / actual) * 100;

logger.info(`Context estimate: estimated=${estimated}, actual=${actual}, error=${error > 0 ? '+' : ''}${error} (${errorPercent.toFixed(1)}%)`);
```
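One way to turn these logged pairs into a running calibration signal is an exponential moving average of `actual / estimated`. The class below is a hypothetical sketch (the name, the smoothing factor, and the choice of a moving average are all assumptions), not existing Dexto code. In the formula above only the `length/4` portion is a guess, so a ratio like this would most naturally scale `newMessagesEstimate`.

```typescript
// Hypothetical tracker for how far length/4 estimates drift from actual counts.
class EstimateCalibration {
    private ratio = 1; // 1 means estimates have matched actuals so far
    private readonly smoothing: number;

    constructor(smoothing = 0.2) {
        this.smoothing = smoothing;
    }

    // Records one (estimated, actual) pair and returns the signed error in percent.
    record(estimated: number, actual: number): number {
        if (estimated <= 0 || actual <= 0) {
            return 0; // nothing meaningful to compare; leave the ratio untouched
        }
        this.ratio =
            (1 - this.smoothing) * this.ratio + this.smoothing * (actual / estimated);
        return ((estimated - actual) / actual) * 100;
    }

    // Scales a raw estimate by the observed actual/estimated ratio.
    calibrate(rawEstimate: number): number {
        return Math.round(rawEstimate * this.ratio);
    }
}
```

A smoothing factor closer to 1 reacts faster to recent calls; a smaller one damps out single-call noise.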

### Breakdown for Display (Back-Calculation)

For `/context` overlay, we show a breakdown. Since we only know the TOTAL accurately, we back-calculate messages:

```typescript
const total = lastInputTokens + lastOutputTokens + newMessagesEstimate;

// These are estimates (we can't measure them directly)
const systemPromptEstimate = estimateTokens(systemPrompt); // length/4
const toolsEstimate = estimateToolsTokens(tools);          // length/4

// Back-calculate messages so the math adds up
// (declared with `let` because it is clamped below)
let messagesDisplay = total - systemPromptEstimate - toolsEstimate;

// If negative, our estimates are too high - cap at 0 and log warning
if (messagesDisplay < 0) {
    logger.warn(`Back-calculated messages negative (${messagesDisplay}), estimates may be too high`);
    messagesDisplay = 0;
}
```

### Edge Cases

| Scenario | Behavior |
|----------|----------|
| **No LLM call yet** | `lastInputTokens=null`, fall back to pure estimation, show "(estimated)" label |
| **After compaction** | History changed significantly, set `lastInputTokens=null`, fall back to estimation until next call |
| **messagesDisplay negative** | Cap at 0, log warning - indicates system/tools estimates too high |
| **System prompt changed** | Next estimate may be off, but next actual will correct it |
| **Tools changed (MCP)** | Same as above - self-correcting after next call |

### What /context Should Display

```
Context Usage: 52,100 / 200,000 tokens (26%)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Breakdown:
  System prompt:  4,000 tokens (estimated)
  Tools:          8,000 tokens (estimated)
  Messages:      40,100 tokens (back-calculated)
  ─────────────────────────────
  Total:         52,100 tokens

Calculation basis:
  Last actual input: 50,000 tokens
  Last output:        2,000 tokens
  New since then:       100 tokens (estimated)

Last estimate accuracy: +0.6% error

Free space: 131,900 tokens (after 16,000 output buffer)
```
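The sample overlay above can be reproduced from seven inputs. A minimal sketch, assuming a hypothetical `buildBreakdown` helper; clamping free space at zero is an added assumption, the rest follows the formula and back-calculation described in this section.

```typescript
interface BreakdownInput {
    lastInputTokens: number;
    lastOutputTokens: number;
    newMessagesEstimate: number;
    systemPromptEstimate: number;
    toolsEstimate: number;
    maxContextTokens: number;
    outputBuffer: number;
}

interface ContextBreakdown {
    total: number;
    systemPrompt: number; // estimated
    tools: number;        // estimated
    messages: number;     // back-calculated, never negative
    freeSpace: number;
    percentUsed: number;
}

function buildBreakdown(input: BreakdownInput): ContextBreakdown {
    const total =
        input.lastInputTokens + input.lastOutputTokens + input.newMessagesEstimate;
    // Back-calculate messages so the displayed rows sum to the total.
    const messages = Math.max(
        0,
        total - input.systemPromptEstimate - input.toolsEstimate,
    );
    return {
        total,
        systemPrompt: input.systemPromptEstimate,
        tools: input.toolsEstimate,
        messages,
        freeSpace: Math.max(0, input.maxContextTokens - total - input.outputBuffer),
        percentUsed: Math.round((total / input.maxContextTokens) * 100),
    };
}
```

Fed the values from the sample (50,000 + 2,000 + 100 against a 200,000 window with a 16,000 buffer), this yields a total of 52,100, messages of 40,100, and free space of 131,900.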

### Implementation Checklist

- [ ] Store `lastInputTokens` and `lastOutputTokens` after each LLM call
- [ ] Track which messages are "new" since last LLM call (need message timestamp or index tracking)
- [ ] Calculate `newMessagesEstimate` only for messages added after last call
- [ ] Log verification metrics on every LLM call
- [ ] Update `/context` overlay to show this breakdown
- [ ] Handle edge cases (no call yet, after compaction)
- [ ] Use SAME formula for compaction decisions

---

### Legacy Edge Cases (keeping for reference)

1. **No LLM call yet (new session)**
   - Fall back to pure estimation
   - All numbers are estimates with "(estimated)" label

2. **messagesDisplay comes out negative**
   - Our estimates for system/tools are too high
   - Cap at 0, log warning
   - Indicates estimation needs calibration

3. **After compaction**
   - Token counts reset with new session
   - `compactionCount` tracks how many times compacted

4. **Reasoning tokens**
   - Must be sent back to LLM (fix formatter) ✅ DONE
   - Include in context calculation
   - Track separately for display

### Verification: Why `lastOutputTokens` Is Safe to Use Directly

*Verified on 2025-01-20 by analyzing AI SDK source code and our codebase*

**Question:** Does `outputTokens` include content that might be pruned before the next LLM call?

**Answer:** No. `outputTokens` is safe to use directly because:

#### Part 1: What does `outputTokens` include? (AI SDK Verification)

**Anthropic** - verified via `ai/packages/anthropic/src/__fixtures__/anthropic-json-tool.1.chunks.txt`:
```json
{"type":"message_delta","delta":{"stop_reason":"tool_use"},"usage":{"output_tokens":47}}
```
Tool call response reports `output_tokens: 47` - **includes tool calls** ✅

**OpenAI** - verified via `ai/packages/openai/src/responses/__fixtures__/openai-shell-tool.1.chunks.txt`:
```json
{"output":[{"type":"shell_call","action":{"commands":["ls -a ~/Desktop"]}}],"usage":{"output_tokens":41}}
```
Shell tool call reports `output_tokens: 41` - **includes tool calls** ✅

**Google** - verified via `ai/packages/google/src/google-generative-ai-language-model.test.ts` lines 2274-2302:
```typescript
content: { parts: [{ functionCall: { name: 'test-tool', args: { value: 'test' } } }] },
usageMetadata: { promptTokenCount: 10, candidatesTokenCount: 20, totalTokenCount: 30 }
```
Function call response reports `candidatesTokenCount: 20` - **includes tool calls** ✅

#### Part 2: What gets pruned in our system?

From `manager.ts` `prepareHistory()`:
- Only **tool result messages** (role='tool') can be pruned
- They're marked with `compactedAt` timestamp
- Replaced with placeholder: `[Old tool result content cleared]`

**What is NEVER pruned:**
- Assistant messages (text content)
- Assistant's tool calls
- User messages
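
The pruning rule described above can be sketched as a pure function over the history. The types and the function name are hypothetical simplifications; only the placeholder string and the `role === 'tool'` plus `compactedAt` condition come from the description of `prepareHistory()`.

```typescript
// Simplified stand-in for the stored message shapes (hypothetical).
type StoredMessage =
    | { role: 'user' | 'assistant'; content: string }
    | { role: 'tool'; content: string; compactedAt?: number };

const PRUNED_PLACEHOLDER = '[Old tool result content cleared]';

// Returns the history as it would be sent: compacted tool results are swapped
// for the placeholder, every other message passes through unchanged.
function applyToolResultPruning(history: StoredMessage[]): StoredMessage[] {
    return history.map((msg) =>
        msg.role === 'tool' && msg.compactedAt !== undefined
            ? { ...msg, content: PRUNED_PLACEHOLDER }
            : msg,
    );
}
```

Because assistant and user messages never match the condition, whatever the model produced on the previous call is sent back unchanged.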

#### Verification Table

| Message Type | Pruned? | Part of outputTokens? |
|-------------|---------|----------------------|
| Assistant text | ❌ Never | ✅ Yes |
| Assistant tool calls | ❌ Never | ✅ Yes (verified across all providers) |
| Tool results (role='tool') | ✅ Can be pruned | ❌ No (separate messages) |

#### Code Evidence

- `stream-processor.ts`: Tool calls stored via `addToolCall()` with full arguments
- `manager.ts` line 279: Only `msg.role === 'tool' && msg.compactedAt` gets placeholder
- No code path exists to prune assistant messages

**Conclusion:** The formula `lastInputTokens + lastOutputTokens + newMessagesEstimate` is correct because:
- `lastInputTokens` reflects pruned history (API tells us exactly what was sent)
- `lastOutputTokens` is the assistant's response (text + tool calls) which is stored and sent back as-is
- All major providers (Anthropic, OpenAI, Google) include tool calls in their output token counts
- Only tool results (separate messages) can be pruned, and those are in `inputTokens`

---

## Implementation Plan

### Phase 1: Fix Reasoning Storage (HIGH PRIORITY - Bug #1) ✅ COMPLETED

**The root cause:** `stream-processor.ts` collects reasoning but never persists it.

**Files to modify:**
- `packages/core/src/llm/executor/stream-processor.ts`
- `packages/core/src/context/types.ts`

**Changes:**

1. Add `reasoningMetadata` field to `AssistantMessage` type:
   ```typescript
   // In context/types.ts
   interface AssistantMessage {
       reasoning?: string;
       reasoningMetadata?: Record<string, unknown>; // NEW - for provider round-tripping
       // ...
   }
   ```

2. Capture `providerMetadata` from reasoning-delta events:
   ```typescript
   // In stream-processor.ts, add field:
   private reasoningMetadata: Record<string, unknown> | undefined;

   // In reasoning-delta case:
   case 'reasoning-delta':
       this.reasoningText += event.text;
       // Capture provider metadata for round-tripping (OpenAI itemId, etc.)
       if (event.providerMetadata) {
           this.reasoningMetadata = event.providerMetadata;
       }
       // ... emit events
   ```

3. **Fix the bug** - persist reasoning in `updateAssistantMessage()`:
   ```typescript
   // In stream-processor.ts, 'finish' case (around line 315):
   if (this.assistantMessageId) {
       await this.contextManager.updateAssistantMessage(
           this.assistantMessageId,
           {
               tokenUsage: usage,
               reasoning: this.reasoningText || undefined, // ADD THIS
               reasoningMetadata: this.reasoningMetadata,  // ADD THIS
           }
       );
   }
   ```

### Phase 2: Fix Reasoning Round-Trip (Bug #2) ✅ COMPLETED

**Files to modify:**
- `packages/core/src/llm/formatters/vercel.ts`

**Changes:**

1. Update `formatAssistantMessage()` to include reasoning:
   ```typescript
   // In formatAssistantMessage(), before returning:
   if (msg.reasoning) {
       contentParts.push({
           type: 'reasoning',
           text: msg.reasoning,
           // ReasoningPart carries provider data in `providerOptions` (see type below)
           providerOptions: msg.reasoningMetadata,
       });
   }
   ```

**Verified:** Vercel AI SDK's `AssistantContent` type supports `ReasoningPart`:
```typescript
// packages/provider-utils/src/types/assistant-model-message.ts
export type AssistantContent = string | Array<TextPart | FilePart | ReasoningPart | ...>;

// packages/provider-utils/src/types/content-part.ts
export interface ReasoningPart {
    type: 'reasoning';
    text: string;
    providerOptions?: ProviderOptions; // For round-tripping provider metadata
}
```
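Putting the formatter change and the `ReasoningPart` shape together, a self-contained sketch might look like this. The types are local stand-ins for the SDK types quoted above, `toAssistantParts` is a hypothetical name, and placing the reasoning part before the text is an assumption of this sketch.

```typescript
// Local stand-ins so the sketch compiles without the SDK installed.
type ProviderOptions = Record<string, Record<string, unknown>>;

interface TextPart {
    type: 'text';
    text: string;
}

interface ReasoningPart {
    type: 'reasoning';
    text: string;
    providerOptions?: ProviderOptions;
}

type AssistantPart = TextPart | ReasoningPart;

// Hypothetical, trimmed-down view of a stored assistant message.
interface StoredAssistantMessage {
    content: string;
    reasoning?: string;
    reasoningMetadata?: ProviderOptions;
}

function toAssistantParts(msg: StoredAssistantMessage): AssistantPart[] {
    const parts: AssistantPart[] = [];
    // Empty or missing reasoning is skipped so no blank reasoning part is sent.
    if (msg.reasoning) {
        const part: ReasoningPart = { type: 'reasoning', text: msg.reasoning };
        if (msg.reasoningMetadata) {
            part.providerOptions = msg.reasoningMetadata;
        }
        parts.push(part);
    }
    if (msg.content) {
        parts.push({ type: 'text', text: msg.content });
    }
    return parts;
}
```

The empty-string check also covers the "empty reasoning should not create an empty reasoning part" edge case, since `''` is falsy.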

### Phase 3: Unified Context Calculation ✅ COMPLETED

**Files to modify:**
- `packages/core/src/context/manager.ts` - `getContextTokenEstimate()`
- `packages/core/src/llm/executor/turn-executor.ts` - compaction check
- `packages/cli/src/cli/ink-cli/components/overlays/ContextStatsOverlay.tsx`

**Changes:**

1. Create shared `calculateContextUsage()` function:
   ```typescript
   // New file: packages/core/src/context/context-calculator.ts
   export async function calculateContextUsage(
       contextManager: ContextManager,
       tools: ToolDefinitions,
       maxContextTokens: number,
       outputBuffer: number
   ): Promise<ContextUsage> {
       // Implement the formula above
   }
   ```

2. Use in `/context`:
   ```typescript
   // In DextoAgent.getContextStats()
   const usage = await calculateContextUsage(...);
   return usage;
   ```

3. Use in compaction decision:
   ```typescript
   // In turn-executor.ts
   const usage = await calculateContextUsage(...);
   if (usage.total > compactionThreshold) {
       // Compact!
   }
   ```
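The point of sharing one function is that the overlay and the compaction check read the same `total`. A small sketch with hypothetical helper names, assuming `ContextUsage` exposes at least `total` and `maxContextTokens`:

```typescript
// Assumed minimal shape; the real ContextUsage type may carry more fields.
interface ContextUsage {
    total: number;
    maxContextTokens: number;
}

// Compaction check: same comparison as in the turn-executor snippet above.
function shouldCompact(usage: ContextUsage, compactionThreshold: number): boolean {
    return usage.total > compactionThreshold;
}

// Overlay headline: derived from the same usage object, so it cannot disagree.
function formatUsageLine(usage: ContextUsage): string {
    const percent = Math.round((usage.total / usage.maxContextTokens) * 100);
    const total = usage.total.toLocaleString('en-US');
    const max = usage.maxContextTokens.toLocaleString('en-US');
    return `Context Usage: ${total} / ${max} tokens (${percent}%)`;
}
```

With both consumers taking a `ContextUsage` value, a compaction alignment test only has to vary `total` around the threshold and check that the overlay reports the same number.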

### Phase 4: Message-Level Token Tracking

**Already implemented!** We just need to use it:

```typescript
// In calculateContextUsage(), sum from messages:
const history = await contextManager.getHistory();
let totalInputFromMessages = 0;
let totalOutputFromMessages = 0;
let totalReasoningFromMessages = 0;

for (const msg of history) {
    if (msg.role === 'assistant' && msg.tokenUsage) {
        totalOutputFromMessages += msg.tokenUsage.outputTokens ?? 0;
        totalReasoningFromMessages += msg.tokenUsage.reasoningTokens ?? 0;
    }
}
```

### Phase 5: Calibration & Logging

1. Log estimate vs actual on every LLM call (already done, level=info)
2. Track calibration ratio over time
3. Consider adaptive estimation based on observed ratios

### Phase 6: Future - API Token Counting

**For Anthropic:**
```typescript
// New method in Anthropic service
async countTokens(messages: Message[], tools: Tool[]): Promise<{
    input_tokens: number;
}>
```

**For other providers:**
- tiktoken for OpenAI
- Gemini countTokens API
- Fallback to estimation

---

## Data Flow Diagram

### Current State (BROKEN)

```
┌─────────────────────────────────────────────────────────────────────┐
│ LLM Response Stream                                                 │
├─────────────────────────────────────────────────────────────────────┤
│   reasoning-delta events → reasoningText accumulated ✓              │
│   text-delta events      → content accumulated ✓                    │
│   finish event           → usage: { inputTokens, outputTokens, ... }│
└─────────────────────────────────────────────────────────────────────┘
                                  │
                                  ▼
┌─────────────────────────────────────────────────────────────────────┐
│ stream-processor.ts updateAssistantMessage()                        │
├─────────────────────────────────────────────────────────────────────┤
│   await this.contextManager.updateAssistantMessage(                 │
│     this.assistantMessageId,                                        │
│     { tokenUsage: usage }  ← ONLY tokenUsage saved!                 │
│   );                       ← reasoning NOT included! ✗              │
└─────────────────────────────────────────────────────────────────────┘
                                  │
                                  ▼
┌─────────────────────────────────────────────────────────────────────┐
│ AssistantMessage Stored                                             │
├─────────────────────────────────────────────────────────────────────┤
│ {                                                                   │
│   role: 'assistant',                                                │
│   content: [...],          ← ✓ Stored                               │
│   reasoning: undefined,    ← ✗ NEVER SET!                           │
│   tokenUsage: {...}        ← ✓ Stored                               │
│ }                                                                   │
└─────────────────────────────────────────────────────────────────────┘
|
||||
|
||||
### Target State (FIXED)
|
||||
|
||||
┌─────────────────────────────────────────────────────────────────────┐
|
||||
│ LLM Response Stream │
|
||||
├─────────────────────────────────────────────────────────────────────┤
|
||||
│ reasoning-delta events → reasoningText + providerMetadata ✓ │
|
||||
│ text-delta events → content accumulated ✓ │
|
||||
│ finish event → usage: { inputTokens, outputTokens, ... } │
|
||||
└─────────────────────────────────────────────────────────────────────┘
|
||||
│
|
||||
▼
|
||||
┌─────────────────────────────────────────────────────────────────────┐
|
||||
│ stream-processor.ts updateAssistantMessage() │
|
||||
├─────────────────────────────────────────────────────────────────────┤
|
||||
│ await this.contextManager.updateAssistantMessage( │
|
||||
│ this.assistantMessageId, │
|
||||
│ { │
|
||||
│      tokenUsage: usage,                                             │
│      reasoning: this.reasoningText,              ← NEW              │
│      reasoningMetadata: this.reasoningMetadata   ← NEW              │
│    }                                                                │
│  );                                                                 │
└─────────────────────────────────────────────────────────────────────┘
                                   │
                                   ▼
┌─────────────────────────────────────────────────────────────────────┐
│                       AssistantMessage Stored                       │
├─────────────────────────────────────────────────────────────────────┤
│  {                                                                  │
│    role: 'assistant',                                               │
│    content: [...],                                                  │
│    reasoning: 'Let me think...',                     ← ✓ Now stored │
│    reasoningMetadata: { openai: { itemId: '...' } }, ← ✓ For round-trip
│    tokenUsage: { inputTokens, outputTokens, reasoningTokens }       │
│  }                                                                  │
└─────────────────────────────────────────────────────────────────────┘
                                   │
                                   ▼
┌─────────────────────────────────────────────────────────────────────┐
│                      Next LLM Call (Formatter)                      │
├─────────────────────────────────────────────────────────────────────┤
│  formatAssistantMessage() includes:                                 │
│    - content (text parts)            ✓ Already done                 │
│    - toolCalls                       ✓ Already done                 │
│    - reasoning + providerMetadata    ✓ NEW - enables round-trip     │
└─────────────────────────────────────────────────────────────────────┘
                                   │
                                   ▼
┌─────────────────────────────────────────────────────────────────────┐
│                        /context Calculation                         │
├─────────────────────────────────────────────────────────────────────┤
│  currentTotal = lastInput + lastOutput + newMessagesEstimate        │
│                                                                     │
│  Breakdown:                                                         │
│    systemPrompt = estimate (length/4)                               │
│    tools        = estimate (length/4)                               │
│    messages     = currentTotal - systemPrompt - tools (back-calc)   │
│    reasoning    = sum(msg.tokenUsage.reasoningTokens) (for display) │
│                                                                     │
│  freeSpace = maxTokens - currentTotal - outputBuffer                │
└─────────────────────────────────────────────────────────────────────┘
                                   │
                                   ▼
┌─────────────────────────────────────────────────────────────────────┐
│                         Compaction Decision                         │
├─────────────────────────────────────────────────────────────────────┤
│  SAME FORMULA as /context!                                          │
│                                                                     │
│  if (currentTotal > compactionThreshold) {                          │
│    triggerCompaction();                                             │
│  }                                                                  │
└─────────────────────────────────────────────────────────────────────┘
```

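The last two boxes of the flow can be sketched as a single shared calculation. This is a minimal sketch, not the actual Dexto code: `ContextInputs`, `ContextUsage`, `computeContextUsage`, and `shouldCompact` are hypothetical names, and the fallback branch assumes that `newMessagesEstimate` covers the whole history when no actual counts exist yet.

```typescript
// Hypothetical shapes for illustration; the real Dexto types may differ.
interface ContextInputs {
    lastInputTokens: number | null;  // actual input tokens from the last LLM call
    lastOutputTokens: number | null; // actual output tokens from the last LLM call
    newMessagesEstimate: number;     // length/4 estimate for messages added since that call
    systemPromptEstimate: number;    // length/4 estimate
    toolsEstimate: number;           // length/4 estimate
    maxTokens: number;               // model context window
    outputBuffer: number;            // tokens reserved for the next response
}

interface ContextUsage {
    currentTotal: number;
    messages: number;
    freeSpace: number;
}

function computeContextUsage(input: ContextInputs): ContextUsage {
    const { lastInputTokens, lastOutputTokens } = input;

    // Prefer actual counts; a new session has none, so fall back to pure estimation.
    const currentTotal =
        lastInputTokens !== null && lastOutputTokens !== null
            ? lastInputTokens + lastOutputTokens + input.newMessagesEstimate
            : input.systemPromptEstimate + input.toolsEstimate + input.newMessagesEstimate;

    // Back-calculate messages so the breakdown sums to the total; cap at 0
    // in case the system prompt and tools estimates overshoot.
    const messages = Math.max(
        0,
        currentTotal - input.systemPromptEstimate - input.toolsEstimate
    );

    const freeSpace = input.maxTokens - currentTotal - input.outputBuffer;
    return { currentTotal, messages, freeSpace };
}

// The compaction decision reads the same total that /context displays.
function shouldCompact(usage: ContextUsage, compactionThreshold: number): boolean {
    return usage.currentTotal > compactionThreshold;
}
```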

---

## Testing Strategy

### Unit Tests

1. **Reasoning storage test (Phase 1)**
   - Mock an LLM stream with `reasoning-delta` events
   - Verify `stream-processor.ts` calls `updateAssistantMessage()` with reasoning
   - Verify `reasoningMetadata` is captured from `providerMetadata`

2. **Reasoning round-trip test (Phase 2)**
   - Create `AssistantMessage` with `reasoning` and `reasoningMetadata`
   - Call `formatAssistantMessage()`
   - Verify output contains reasoning part with `providerMetadata`

3. **Token calculation test (Phase 3)**
   - Mock a message with known `tokenUsage`
   - Verify the calculated total matches the expected value

4. **Edge case tests**
   - New session (no actuals): falls back to estimation
   - Negative `messagesDisplay` (capped at 0)
   - Post-compaction state
   - Empty reasoning (should not create an empty reasoning part)

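For the Phase 1 storage test, the mocked stream can be a plain array of events. The sketch below is an assumption-heavy illustration: `StreamEvent`, `ReasoningCapture`, and `collectReasoning` are hypothetical stand-ins for the stream processor's internals, and the event field names are chosen for the example rather than taken from the SDK.

```typescript
// Hypothetical event and result shapes, for illustration only.
type ProviderMetadata = Record<string, Record<string, unknown>>;

type StreamEvent =
    | { type: 'text-delta'; text: string }
    | { type: 'reasoning-delta'; text: string; providerMetadata?: ProviderMetadata }
    | { type: 'finish' };

interface ReasoningCapture {
    reasoning?: string;
    reasoningMetadata?: ProviderMetadata;
}

// Accumulates reasoning deltas and keeps the most recent provider metadata,
// producing the fields that would be passed to updateAssistantMessage().
function collectReasoning(events: StreamEvent[]): ReasoningCapture {
    let reasoningText = '';
    let reasoningMetadata: ProviderMetadata | undefined;

    for (const event of events) {
        if (event.type !== 'reasoning-delta') continue;
        reasoningText += event.text;
        if (event.providerMetadata) {
            reasoningMetadata = event.providerMetadata;
        }
    }

    // No reasoning at all: return nothing rather than an empty string.
    if (reasoningText.length === 0) return {};

    const capture: ReasoningCapture = { reasoning: reasoningText };
    if (reasoningMetadata) capture.reasoningMetadata = reasoningMetadata;
    return capture;
}
```

A test would feed a mocked event array through this step and assert on the fields handed to `updateAssistantMessage()`.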
### Integration Tests

1. **Full reasoning flow test**
   - Enable extended thinking on Claude
   - Send message that triggers reasoning
   - Verify reasoning persisted to message
   - Send follow-up message
   - Verify reasoning sent back to LLM (check formatted messages)

2. **Token tracking test**
   - Send message
   - Verify tokenUsage stored on message
   - Open /context
   - Verify numbers use actual from last call

3. **Compaction alignment test**
   - Fill context near threshold
   - Verify /context and compaction trigger at same point

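The "check formatted messages" step in the full reasoning flow test assumes the formatter emits a reasoning part only when reasoning was stored. A minimal sketch of that behavior follows; `StoredAssistantMessage`, `ReasoningPart`, `TextPart`, and `formatAssistantParts` are hypothetical names, `content` is simplified to a string, and the part shapes are illustrative rather than the SDK's exact types.

```typescript
// Hypothetical, simplified shapes; the real formatter and SDK types may differ.
type ProviderMetadata = Record<string, Record<string, unknown>>;

interface StoredAssistantMessage {
    role: 'assistant';
    content: string; // simplified: the stored message uses content parts
    reasoning?: string;
    reasoningMetadata?: ProviderMetadata;
}

interface ReasoningPart {
    type: 'reasoning';
    text: string;
    providerMetadata?: ProviderMetadata;
}

interface TextPart {
    type: 'text';
    text: string;
}

function formatAssistantParts(msg: StoredAssistantMessage): Array<ReasoningPart | TextPart> {
    const parts: Array<ReasoningPart | TextPart> = [];

    // Skip empty or whitespace-only reasoning so no empty reasoning part is sent.
    if (msg.reasoning !== undefined && msg.reasoning.trim().length > 0) {
        const part: ReasoningPart = { type: 'reasoning', text: msg.reasoning };
        if (msg.reasoningMetadata) {
            // Metadata travels with the reasoning so the provider can match it up.
            part.providerMetadata = msg.reasoningMetadata;
        }
        parts.push(part);
    }

    if (msg.content.length > 0) {
        parts.push({ type: 'text', text: msg.content });
    }
    return parts;
}
```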

---

## Success Criteria

1. **Numbers add up**: Total = SystemPrompt + Tools + Messages
2. **Consistency**: /context and compaction use same calculation
3. **Reasoning works**: Traces sent back to LLM correctly
4. **Calibration visible**: Logs show estimate vs actual ratio
5. **Provider compatibility**: Works with Anthropic, OpenAI, Google, etc.

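As one way to make the calibration criterion concrete, a log line could report the estimate, the actual count, and their ratio. This is a sketch under assumptions: `formatCalibrationLog` and the log format are hypothetical, and the sample numbers are the approximate figures from the Problem Statement (~73k estimated vs 122.4k actual).

```typescript
// Hypothetical helper; the log format is an assumption, not an existing Dexto log.
function formatCalibrationLog(estimatedTokens: number, actualTokens: number): string {
    const prefix = `context calibration: estimated=${estimatedTokens} actual=${actualTokens}`;
    if (actualTokens <= 0) {
        // No actual count yet (for example, a new session): ratio is undefined.
        return `${prefix} ratio=n/a`;
    }
    const ratio = estimatedTokens / actualTokens;
    return `${prefix} ratio=${ratio.toFixed(2)}`;
}

console.log(formatCalibrationLog(73000, 122400));
// prints "context calibration: estimated=73000 actual=122400 ratio=0.60"
```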

---

## Appendix: Verification Against Other Implementations

*This plan was verified against actual implementations on 2025-01-20.*

### OpenCode Verification (~/Projects/external/opencode)

| Claim | Verified | Evidence |
|-------|----------|----------|
| Stores reasoning as `ReasoningPart` | ✅ | `message-v2.ts` lines 78-89 |
| Includes `providerMetadata` for round-tripping | ✅ | `message-v2.ts` lines 554-560 |
| `toModelMessage()` sends reasoning back | ✅ | `message-v2.ts` lines 435-569 |
| Tracks reasoning tokens separately | ✅ | `session/index.ts` line 432, schemas throughout |
| Handles provider-specific metadata | ✅ | `openai-responses-language-model.ts` lines 520-538 |

**OpenCode approach:** Full round-trip of reasoning with provider metadata. This is our reference implementation.

### Gemini-CLI Verification (~/Projects/external/gemini-cli)

| Claim in Original Plan | Actual Behavior | Status |
|------------------------|-----------------|--------|
| "Parts with thought: true included when sending history back" | **WRONG** - They filter OUT thoughts at line 815 | ❌ Corrected |
| Uses `thought: true` flag | ✅ Correct | ✅ |
| Tracks `thoughtsTokenCount` | ✅ Correct - `chatRecordingService.ts` line 278 | ✅ |

**Gemini-CLI approach:** Track thought tokens for cost/display but do NOT round-trip them.
This is a simpler approach but requires Google-specific handling.

### Why We Follow OpenCode

1. **Same SDK**: Both use Vercel AI SDK
2. **Provider-agnostic**: Works across all providers without special-casing
3. **Future-proof**: Preserves metadata for providers that need it
4. **Simpler code**: No provider-specific filtering logic

### Dexto Implementation Verification

| Component | Current State | Bug |
|-----------|---------------|-----|
| `stream-processor.ts` | Accumulates `reasoningText` but doesn't persist | **Bug #1** |
| `vercel.ts` formatter | Ignores `msg.reasoning` | **Bug #2** (blocked by #1) |
| `AssistantMessage` type | Has `reasoning?: string` field | ✅ Ready |
| Per-message `tokenUsage` | Stored via `updateAssistantMessage()` | ✅ Working |
| `lastActualInputTokens` | Set after each LLM call | ✅ Working |
| Compaction calculation | Uses `estimateMessagesTokens()` only | Different from /context |
| `/context` calculation | Uses full estimation (system + tools + messages) | Different from compaction |