3 Commits

6 changed files with 1353 additions and 17 deletions

AI-MONITORING-DESIGN.md Normal file

@@ -0,0 +1,638 @@
# AI Monitoring — Design Specification
> **Codex Launcher v3.8.0 Feature Design**
> Self-healing nano agent that monitors proxy health, diagnoses failures, and auto-recovers sessions.
---
## 1. Problem Statement
Over 42 sessions in production, we observed these failure categories:
| # | Failure Category | Count | Example |
|---|-----------------|-------|---------|
| F1 | **parsed_tool_calls=0** — model produces unparseable output | 42 | Bare `<explore_agent>`, `<bash>` without cmd, plain English intent |
| F2 | **Stuck recovery triggered** — Intelligence Routing Layer 3 | 13 | "I need to fetch the README", "let me write the script" |
| F3 | **Sanitizer flagged suspicious cmd** — cmd still JSON after unwrap | 11 | `{/'cmd/': /'sshpass -p .../'}` — double-escaped quoting |
| F4 | **Upstream 500** — provider internal error | ~5 | `"An internal error occurred. Please try again later."` |
| F5 | **Connection timeout** — upstream unreachable | ~3 | `Connection timed out after 15002 milliseconds` |
| F6 | **Upstream 401/403** — auth failure | ~2 | Wrong API key, expired token, `upgrade_required` |
| F7 | **Stream crash** — exception mid-stream | ~2 | `BrokenPipeError`, `ConnectionResetError` during SSE |
| F8 | **Proxy port conflict** — Address already in use | ~1 | Stale process holding port |
| F9 | **Schema cache corruption** — stale content_type=array | ~1 | `ErrorAnalyzer` learned wrong schema |
| F10 | **Codex Desktop crash** — SIGKILL at ~27GB | ~1 | Issue #24048 — unbounded tool output memory |
| F11 | **Codex 300s stall** — turn state machine race | ~1 | Issue #23807: `stream disconnected` after 300s |
### The Gap
Intelligence Routing (v3.7.0) handles F1/F2/F3 **inside a single request**. But it can't:
- **Detect a dead proxy process** (F7/F8) — the proxy already crashed
- **Reconnect Codex to a restarted proxy** (F5/F7/F8) — Codex doesn't auto-reconnect
- **Switch to a backup provider** when the primary is down (F4/F5)
- **Clear corrupt caches** (F9) — requires out-of-band action
- **Restart Codex Desktop** after a crash (F10/F11)
- **Learn from failure patterns** across sessions — each failure is handled independently
### What We Need
A **separate lightweight watchdog process** that:
1. Monitors proxy health continuously
2. Detects failures the proxy can't detect itself
3. Uses a cheap AI model to diagnose novel failures
4. Takes corrective action automatically
5. Learns from past incidents to prevent repeats
---
## 2. Architecture
```
┌─────────────────────────────────────────────────────────────────────┐
│ Codex Launcher GUI │
│ ┌──────────┐ ┌──────────────┐ ┌───────────────────────────────┐ │
│ │ Proxy │ │ Codex │ │ AI Monitoring Panel │ │
│ │ Manager │ │ Launcher │ │ ┌─────────────────────┐ │ │
│ │ │ │ │ │ │ ON/OFF Toggle │ │ │
│ └────┬─────┘ └──────┬───────┘ │ │ Provider Selector │ │ │
│ │ │ │ │ Model Selector │ │ │
│ │ │ │ │ Incident Log │ │ │
│ │ │ │ │ [View Diagnostics] │ │ │
│ │ │ │ └─────────────────────┘ │ │
│ │ │ └───────────────────────────────┘ │
└───────┼───────────────┼────────────────────────────────────────────┘
│ │
▼ ▼
┌───────────────┐ ┌────────────────┐
│ translate- │ │ Codex Desktop │
│ proxy.py │ │ / CLI │
│ (port 8080) │ │ │
│ │ │ │
│ /health ──────┼──┼─► health check │
│ /responses ───┼──┼─► main API │
└───────────────┘ └────────────────┘
│ health probes + log analysis + corrective actions
┌───────┴────────────────────────────────────────────────────────────┐
│ AI Monitor Watchdog │
│ (thread in codex-launcher-gui) │
│ │
│ ┌─────────────────┐ ┌─────────────────┐ ┌──────────────────┐ │
│ │ Health Watcher │ │ Log Analyzer │ │ AI Diagnostic │ │
│ │ (every 5s) │ │ (continuous) │ │ Agent (on-call) │ │
│ │ │ │ │ │ │ │
│ │ - /health probe │ │ - tail cc-debug │ │ - Classify err │ │
│ │ - process alive │ │ - tail proxy.log│ │ - Root cause │ │
│ │ - port check │ │ - pattern match │ │ - Suggest fix │ │
│ │ - memory watch │ │ - incident DB │ │ - Execute fix │ │
│ └────────┬────────┘ └────────┬────────┘ └────────┬─────────┘ │
│ │ │ │ │
│ └────────────────────┼─────────────────────┘ │
│ ▼ │
│ ┌──────────────────────┐ │
│ │ Incident Store │ │
│ │ (JSON file) │ │
│ │ - Known patterns │ │
│ │ - Past resolutions │ │
│ │ - Success rates │ │
│ └──────────────────────┘ │
└─────────────────────────────────────────────────────────────────────┘
```
---
## 3. Three-Tier Response System
### Tier 1: Fast Path — Rule-Based Auto-Recovery (< 1 second)
Immediate reactions to **known failure patterns**. No AI needed.
```python
TIER1_RULES = [
# (trigger_pattern, action, cooldown)
# --- Proxy Health ---
("proxy_health_fail", "restart_proxy", 30),
("proxy_port_conflict", "kill_stale + restart", 60),
("proxy_memory_over_1gb", "restart_proxy", 120),
# --- Upstream Errors ---
("upstream_429", "wait_retry_after", 0),
("upstream_502_503", "retry_with_backoff", 30),
("upstream_500_repeat_3x", "switch_provider", 60),
("upstream_timeout", "retry + increase_timeout", 30),
("upstream_401_403", "alert_user_bad_key", 0),
# --- Stream Errors ---
("stream_broken_pipe", "restart_proxy", 30),
("stream_reset", "restart_proxy", 30),
("stream_idle_300s", "restart_proxy", 60),
# --- Parser Failures ---
("parsed_tool_calls_0_x3", "clear_schema_cache", 300),
("sanitizer_suspicious_5x","alert_user_model_issue", 0),
("stuck_recovery_x5", "suggest_switch_model", 0),
# --- Codex Process ---
("codex_process_dead", "alert_user_restart", 0),
("codex_memory_over_4gb", "alert_user_memory", 0),
# --- Cache Corruption ---
("schema_content_type_array", "delete_provider_caps", 0),
]
```
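The rule table above pairs each trigger with an action and a cooldown. A minimal sketch of how such a table could be evaluated follows; the `Tier1Engine` class, its `match` method, and the injectable `clock` parameter are illustrative assumptions, not part of the launcher's actual code.

```python
import time

# Hypothetical two-rule subset of the table: (trigger, action, cooldown_seconds).
RULES = [
    ("proxy_health_fail", "restart_proxy", 30),
    ("upstream_429", "wait_retry_after", 0),
]


class Tier1Engine:
    """Maps a trigger to its action while enforcing the per-rule cooldown."""

    def __init__(self, rules, clock=time.monotonic):
        self._rules = {trigger: (action, cooldown) for trigger, action, cooldown in rules}
        self._last_fired = {}  # trigger -> timestamp of the last time it fired
        self._clock = clock    # injectable so the cooldown can be tested

    def match(self, trigger):
        """Return the action for `trigger`, or None if unknown or cooling down."""
        rule = self._rules.get(trigger)
        if rule is None:
            return None
        action, cooldown = rule
        now = self._clock()
        last = self._last_fired.get(trigger)
        if last is not None and now - last < cooldown:
            return None  # same rule fired too recently
        self._last_fired[trigger] = now
        return action
```

A cooldown of `0` means the rule may fire on every occurrence, which matches entries such as `upstream_429`.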
### Tier 2: Pattern Matching — Incident Store Lookup (< 100ms)
For failures we've **seen before and resolved**, look up the fix:
```json
{
"incidents": [
{
"pattern": "cc_stream_ended_empty + explore_agent + no_url",
"fix": "synth_explore_from_last_user_urls",
"source": "FIX-23",
"success_rate": 0.85,
"last_seen": "2026-05-22T16:00:00Z",
"occurrences": 5
},
{
"pattern": "require_escalation + no_cmd",
"fix": "auto_proceed_echo",
"source": "FIX-24",
"success_rate": 1.0,
"last_seen": "2026-05-22T15:30:00Z",
"occurrences": 3
}
]
}
```
### Tier 3: AI Diagnostic — Nano Agent (2-5 seconds)
For **novel failures** that don't match any rule or pattern, invoke a cheap AI model:
```
Prompt Template (system):
─────────────────────
You are a diagnostic agent for a translation proxy that sits between
OpenAI Codex CLI/Desktop and AI providers (Command Code, OpenAI-compat,
Anthropic, etc.). You analyze error context and suggest ONE corrective action.
Available actions: restart_proxy, kill_stale_processes, clear_schema_cache,
switch_provider, increase_timeout, alert_user, ignore, retry_now,
regenerate_config, cleanup_codex_stale
Respond with ONLY a JSON object: {"action": "...", "reason": "...", "confidence": 0.0-1.0}
Prompt Template (user):
─────────────────────
INCIDENT REPORT:
Time: {timestamp}
Session: {session_id}
Proxy health: {alive/dead, port, uptime, memory_mb}
Upstream: {url, model, last_http_code, last_error}
Recent errors (last 60s):
{log_lines}
Parser state: {parsed_tool_calls, stuck_recovery_count, sanitizer_flags}
Provider: {backend_type, model}
History: {last_5_incidents_for_this_pattern}
What corrective action should be taken?
```
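Because the model is asked to reply with a bare JSON object, the watchdog presumably needs to validate that reply before acting on it. The sketch below shows one way this could look; the function name `parse_diagnosis`, the `min_confidence` threshold of 0.5, and the fallback to `alert_user` are assumptions made for illustration.

```python
import json

# Action names copied from the prompt template above.
ALLOWED_ACTIONS = {
    "restart_proxy", "kill_stale_processes", "clear_schema_cache",
    "switch_provider", "increase_timeout", "alert_user", "ignore",
    "retry_now", "regenerate_config", "cleanup_codex_stale",
}


def parse_diagnosis(raw, min_confidence=0.5):
    """Return a validated diagnosis dict; fall back to alert_user on anything unexpected."""
    fallback = {"action": "alert_user", "reason": "unparseable diagnosis", "confidence": 0.0}
    # Tolerate stray text around the object by slicing from the first "{" to the last "}".
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return fallback
    try:
        data = json.loads(raw[start:end + 1])
    except json.JSONDecodeError:
        return fallback
    if not isinstance(data, dict):
        return fallback
    action = data.get("action")
    confidence = data.get("confidence")
    if not isinstance(action, str) or action not in ALLOWED_ACTIONS:
        return fallback
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        return fallback
    if not 0.0 <= confidence <= 1.0:
        return fallback
    if confidence < min_confidence:
        # Assumed policy: low-confidence suggestions are surfaced, not executed.
        return {"action": "alert_user",
                "reason": f"low confidence for {action}",
                "confidence": float(confidence)}
    return {"action": action,
            "reason": str(data.get("reason", "")),
            "confidence": float(confidence)}
```

Rejecting any action outside the allow-list keeps a misbehaving model from triggering something the watchdog does not implement.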
---
## 4. Complete Failure Catalog
### Category A: Proxy-Level Failures (watchdog detects, auto-recovers)
| ID | Failure | Symptoms | Tier 1 Action | Log Signature |
|----|---------|----------|---------------|---------------|
| A1 | Proxy process crashed | `/health` returns connection refused | `restart_proxy` | `urllib.error.URLError: [Errno 111] Connection refused` |
| A2 | Port conflict | `Address already in use` on startup | `kill_stale + restart` | `OSError: [Errno 98] Address already in use` |
| A3 | Memory leak | Process RSS > 1GB | `restart_proxy` | `/proc/{pid}/status` VmRSS check |
| A4 | Deadlock | Health check hangs > 15s | `restart_proxy` | health probe timeout |
| A5 | Unhandled exception | Process exits with non-zero | `restart_proxy` | `SELF-REVIVE CRASH #{n}` |
| A6 | SSL/TLS error | `CERTIFICATE_VERIFY_FAILED` upstream | `alert_user` | `urllib.error.URLError: certificate verify failed` |
| A7 | DNS resolution failure | `getaddrinfo failed` | `retry_with_backoff` | `socket.gaierror: Name or service not known` |
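Row A3 relies on reading `VmRSS` from `/proc/{pid}/status`. A minimal sketch of that check is shown here, assuming a Linux host; the helper names `parse_vmrss_mb` and `read_rss_mb` are hypothetical.

```python
def parse_vmrss_mb(status_text):
    """Extract the resident set size in MB from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            # The line looks like "VmRSS:     123456 kB"; the value is in kB.
            return int(line.split()[1]) / 1024
    return None  # kernel threads, for example, have no VmRSS line


def read_rss_mb(pid):
    """Return the RSS of `pid` in MB, or None if the process cannot be inspected."""
    try:
        with open(f"/proc/{pid}/status", "r") as fh:
            return parse_vmrss_mb(fh.read())
    except OSError:
        return None
```

The watchdog could compare the result against the 1 GB threshold from the table and treat `None` as "process gone or unreadable" rather than as zero memory.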
### Category B: Upstream Provider Failures (proxy detects, watchdog analyzes)
| ID | Failure | Symptoms | Tier 1 Action | Log Signature |
|----|---------|----------|---------------|---------------|
| B1 | Rate limit (429) | Too many requests | `wait_retry_after` | `HTTP 429` + `Retry-After` header |
| B2 | Server error (5xx) | Provider down | `retry_with_backoff` | `HTTP 500/502/503` |
| B3 | Auth failure (401/403) | Bad/expired key | `alert_user_bad_key` | `HTTP 401 {"error":"invalid_api_key"}` |
| B4 | CC upgrade required (403) | Version mismatch | `update_cc_version` | `HTTP 403 upgrade_required` |
| B5 | Connection timeout | Upstream silent | `retry + increase_timeout` | `urllib.error.URLError: timed out` |
| B6 | Connection reset | Upstream dropped mid-stream | `restart_proxy` | `ConnectionResetError: Connection reset by peer` |
| B7 | Broken pipe | Client disconnected | `ignore` | `BrokenPipeError: Broken pipe` |
| B8 | Upstream 400 bad request | Malformed request | `clear_schema_cache` | `HTTP 400 {"error":"...expected string..."}` |
| B9 | Provider capacity (503) | Overloaded | `switch_provider` | `HTTP 503` after 3 retries |
| B10 | Cloudflare block (403/1010) | Bot detection | `check_browser_ua` | `HTTP 403 error 1010` |
### Category C: Parser/Format Failures (Intelligence Routing handles, watchdog tracks)
| ID | Failure | Symptoms | Auto-Fix (IR Layer) | Watchdog Escalation |
|----|---------|----------|--------------------|--------------------|
| C1 | Bare `<explore_agent>` | `parsed_tool_calls=0` | Layer 1: URL extraction | If 3x in a row → suggest model switch |
| C2 | `<require_escalation>` block | Model wants permissions | Layer 2: Auto-proceed | If 5x → suggest different provider |
| C3 | Unrecognized format | No parser matches | Layer 3: Intent synthesis | If 5x → log for AI diagnosis |
| C4 | Double-wrapped cmd | `cmd = "{\"cmd\": ...}"` | Sanitizer: unwrap | If cmd still JSON → alert |
| C5 | Suspicious cmd (JSON) | `cmd starts with {` | Sanitizer: flag | If 3x → clear cache + restart |
| C6 | Empty cmd | `cmd = ""` or `cmd = "{}"` | Sanitizer: diagnostic echo | If 3x → suggest model switch |
| C7 | Bare `{` token | Model outputs incomplete JSON | Layer 3: heuristic 5 | If persistent → AI diagnosis |
| C8 | `<bash>` without cmd | Block has sandbox but no command | Layer 3: heuristic | If 3x → AI diagnosis |
| C9 | DSML name mismatch | `name="cmd"` vs `name="command"` | DSML parser handles both | Self-test catches regression |
| C10 | Stuck model loop | Same recovery 5+ times | Layer 3 max 3x then alert | Switch model or provider |
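Several escalations in the table depend on a fault repeating "3x in a row" or "5x". One possible way to track such streaks is sketched below; the `ConsecutiveCounter` class and the example thresholds are assumptions for illustration, and a real implementation might count occurrences within a time window instead.

```python
class ConsecutiveCounter:
    """Tracks how many times the same fault ID has been seen back to back."""

    def __init__(self, thresholds):
        self._thresholds = thresholds  # fault_id -> streak length that escalates
        self._current = None
        self._streak = 0

    def observe(self, fault_id):
        """Record one signal; return True exactly when the streak hits its threshold."""
        if fault_id == self._current:
            self._streak += 1
        else:
            # A different fault breaks the streak.
            self._current = fault_id
            self._streak = 1
        threshold = self._thresholds.get(fault_id)
        return threshold is not None and self._streak == threshold
```

Returning `True` only at the exact threshold means an escalation fires once per streak instead of on every further repeat.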
### Category D: Codex Process Failures (watchdog detects, alerts user)
| ID | Failure | Symptoms | Action | Log Signature |
|----|---------|----------|--------|---------------|
| D1 | Codex process killed | PID gone from pids.json | `alert_user_restart` | Process not in `/proc/{pid}` |
| D2 | Codex memory explosion | RSS > 4GB | `alert_user_memory` | `/proc/{pid}/status` check |
| D3 | Codex 300s stall | `stream disconnected` loop | `restart_proxy` | Codex stderr: `stream disconnected` |
| D4 | Config corruption | `database disk image is malformed` | `regenerate_config` | Codex stderr: `malformed` |
| D5 | Session context overflow | `context_length_exceeded` | `alert_user_context` | Codex stderr: `context_length_exceeded` |
| D6 | WebSocket reconnect loop | `Reconnecting... N/5` | `check_proxy_health` | Codex stderr: `Reconnecting` |
### Category E: Config/State Failures (watchdog detects, auto-fixes)
| ID | Failure | Symptoms | Action | Detection |
|----|---------|----------|--------|-----------|
| E1 | Schema cache corruption | `content_type: "array"` in provider-caps.json | `delete_provider_caps` | Read file, check for known-bad values |
| E2 | Stale PID file | pids.json has dead PIDs | `cleanup_pids` | Check `/proc/{pid}` existence |
| E3 | Port from old session | config.toml has stale port | `regenerate_config` | Port in config != running port |
| E4 | OAuth token expired | Google/Gemini token refresh fails | `alert_user_reauth` | Token file `expiry_ts < now` |
| E5 | BGP all routes down | Every route returned error | `alert_user_no_provider` | All routes in cooldown |
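Row E2 describes cleaning dead PIDs out of `pids.json`. The sketch below shows a liveness check using signal 0, which is an alternative to testing for `/proc/{pid}`; the assumed shape of the PID data (a name-to-PID mapping) and both function names are hypothetical, since the actual file layout is not shown here.

```python
import os


def pid_alive(pid):
    """Return True if a process with this PID exists (POSIX only)."""
    if pid <= 0:
        return False  # 0 and negative values address process groups, not one process
    try:
        os.kill(pid, 0)  # signal 0 performs the existence check without sending anything
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def prune_dead_pids(pids):
    """Keep only entries whose process is still running."""
    return {name: pid for name, pid in pids.items() if pid_alive(pid)}
```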
---
## 5. Component Design
### 5.1 Health Watcher Thread
Runs in the GUI process as a background thread. Pings proxy `/health` endpoint every 5 seconds.
```python
import threading
import time
import urllib.request


class HealthWatcher(threading.Thread):
    def __init__(self, proxy_port, on_failure, on_recovery):
        super().__init__(daemon=True)
        self.proxy_port = proxy_port
        self.on_failure = on_failure
        self.on_recovery = on_recovery
        self.check_interval = 5  # seconds
        self.failures = 0
        self.running = True

    def run(self):
        while self.running:
            healthy = self._check_health()
            if healthy:
                if self.failures > 0:
                    self.failures = 0
                    self.on_recovery()
            else:
                self.failures += 1
                if self.failures >= 3:  # three consecutive failed probes
                    self.on_failure(self.failures)
            time.sleep(self.check_interval)

    def _check_health(self):
        """Return True only if /health answers with HTTP 200 within 5 seconds."""
        try:
            req = urllib.request.Request(f"http://localhost:{self.proxy_port}/health")
            with urllib.request.urlopen(req, timeout=5) as resp:
                return resp.status == 200
        except Exception:
            return False
```
### 5.2 Log Analyzer Thread
Tails the debug log and extracts failure signals in real time.
```python
import threading
import time

# Substring to look for -> (fault ID from the catalog, short category name)
FAILURE_SIGNALS = {
    "parsed_tool_calls=0": ("C1", "parser_empty"),
    "[STUCK-RECOVERY]": ("C3", "stuck_recovery"),
    "suspicious cmd": ("C4", "sanitizer_flag"),
    "empty cmd recovered": ("C6", "empty_cmd"),
    "HTTP 429": ("B1", "rate_limited"),
    "HTTP 500": ("B2", "server_error"),
    "HTTP 401": ("B3", "auth_failure"),
    "HTTP 403": ("B4", "forbidden"),
    "Connection refused": ("A1", "proxy_dead"),
    "Address already in use": ("A2", "port_conflict"),
    "Broken pipe": ("B7", "broken_pipe"),
    "Connection reset": ("B6", "connection_reset"),
    "timed out": ("B5", "timeout"),
    "SELF-REVIVE CRASH": ("A5", "proxy_crash"),
    "stream error": ("B6", "stream_error"),
}


class LogAnalyzer(threading.Thread):
    def __init__(self, log_path, on_signal):
        super().__init__(daemon=True)
        self.log_path = log_path
        self.on_signal = on_signal
        self.running = True

    def run(self):
        with open(self.log_path, "r") as fh:
            fh.seek(0, 2)  # seek to end so only new lines are analyzed
            while self.running:
                line = fh.readline()
                if not line:
                    time.sleep(0.5)
                    continue
                # First matching pattern wins; dicts keep insertion order.
                for pattern, (fault_id, category) in FAILURE_SIGNALS.items():
                    if pattern in line:
                        self.on_signal(fault_id, category, line.strip())
                        break
```
### 5.3 AI Diagnostic Agent
Invoked by the watchdog when a failure doesn't match Tier 1 rules or Tier 2 patterns.
```python
import json
import urllib.request


class AIDiagnosticAgent:
    def __init__(self, provider_url, model, api_key):
        self.provider_url = provider_url
        self.model = model
        self.api_key = api_key
        self.system_prompt = DIAGNOSTIC_SYSTEM_PROMPT  # see Section 5.5
        self.incident_store = IncidentStore()

    def diagnose(self, context):
        """Return (action, source_tier, success_rate_or_None)."""
        # Tier 2: check the incident store first
        pattern = self._extract_pattern(context)
        known_fix = self.incident_store.lookup(pattern)
        if known_fix and known_fix["success_rate"] > 0.7:
            return known_fix["fix"], "tier2_pattern", known_fix["success_rate"]

        # Tier 3: ask the AI model
        prompt = self._build_prompt(context)
        response = self._call_model(prompt)
        action = self._parse_response(response)

        # Learn from this incident
        if action:
            self.incident_store.record(pattern, action)
        return action, "tier3_ai", None

    def _call_model(self, prompt):
        """POST an OpenAI-style chat completion request and return the reply text."""
        body = {
            "model": self.model,
            "messages": [
                {"role": "system", "content": self.system_prompt},
                {"role": "user", "content": prompt},
            ],
            "max_tokens": 200,
            "temperature": 0.1,
        }
        req = urllib.request.Request(
            self.provider_url,
            data=json.dumps(body).encode(),
            headers={
                "Content-Type": "application/json",
                "Authorization": f"Bearer {self.api_key}",
            },
        )
        with urllib.request.urlopen(req, timeout=15) as resp:
            return json.loads(resp.read())["choices"][0]["message"]["content"]
```
### 5.4 Incident Store
JSON file that accumulates failure patterns and their resolutions.
```json
{
"version": 1,
"incidents": {
"parser_empty+explore_agent": {
"fault_ids": ["C1"],
"fix": "synth_explore_from_urls",
"source": "intelligent_routing",
"success_count": 8,
"fail_count": 1,
"last_seen": "2026-05-22T16:00:00Z",
"auto_applied": true
},
"server_error+repeat_3x": {
"fault_ids": ["B2"],
"fix": "switch_provider",
"source": "tier1_rule",
"success_count": 2,
"fail_count": 0,
"last_seen": "2026-05-22T14:00:00Z",
"auto_applied": true
}
},
"ai_diagnostic_calls": 0,
"tokens_used": 0,
"cost_usd": 0.0
}
```
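The stored entries carry `success_count` and `fail_count`, while the agent code in 5.3 reads a `success_rate`. One way to reconcile the two is to derive the rate at lookup time, as in this sketch. It keeps everything in memory, and the method names `record_outcome` and `lookup` as well as the eviction approach are assumptions; persistence to the JSON file is left out.

```python
from collections import OrderedDict


class IncidentStore:
    """In-memory pattern store with a derived success rate and LRU eviction."""

    def __init__(self, max_entries=500):
        self._entries = OrderedDict()  # oldest entry first
        self._max = max_entries

    def record_outcome(self, pattern, fix, succeeded):
        """Count one applied fix for `pattern` and mark it most recently used."""
        entry = self._entries.pop(pattern, None) or {
            "fix": fix, "success_count": 0, "fail_count": 0,
        }
        entry["fix"] = fix
        entry["success_count" if succeeded else "fail_count"] += 1
        self._entries[pattern] = entry  # re-inserting moves it to the end
        while len(self._entries) > self._max:
            self._entries.popitem(last=False)  # evict the least recently used

    def lookup(self, pattern, min_rate=0.7):
        """Return {'fix', 'success_rate'} if the rate exceeds min_rate, else None."""
        entry = self._entries.get(pattern)
        if entry is None:
            return None
        total = entry["success_count"] + entry["fail_count"]
        rate = entry["success_count"] / total if total else 0.0
        if rate <= min_rate:
            return None
        self._entries.move_to_end(pattern)
        return {"fix": entry["fix"], "success_rate": rate}
```

With the first example entry above (8 successes, 1 failure) the derived rate is 8/9, which clears the 0.7 threshold used in `diagnose`.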
### 5.5 Diagnostic Agent System Prompt
```
You are a diagnostic agent for "Codex Launcher" — a desktop app that runs a local
translation proxy between OpenAI Codex CLI/Desktop and various AI providers.
## Your Job
Analyze the incident report and recommend ONE corrective action.
## Available Actions
- restart_proxy: Kill and restart translate-proxy.py
- kill_stale_processes: Kill orphaned proxy/codex processes
- clear_schema_cache: Delete ~/.cache/codex-proxy/provider-caps.json
- switch_provider: Switch to a different configured endpoint
- increase_timeout: Increase upstream timeout for slow providers
- regenerate_config: Regenerate Codex config.toml
- cleanup_codex_stale: Run cleanup-codex-stale.sh
- alert_user: Show notification to user (can't auto-fix)
- ignore: Transient error, no action needed
- retry_now: Immediate retry without changes
## Decision Rules
- If upstream returns 401/403 with auth error → alert_user (can't fix bad keys)
- If proxy process is dead → restart_proxy
- If same error repeated 5+ times → switch_provider or alert_user
- If error is about content_type/schema → clear_schema_cache
- If "Address already in use" → kill_stale_processes then restart_proxy
- If timeout and upstream is slow → increase_timeout
- If single transient 429/502/503 → ignore (retry handles it)
- If "stream disconnected" and proxy is healthy → ignore (Codex retries)
## Response Format
Reply with ONLY a JSON object:
{"action": "...", "reason": "...", "confidence": 0.0-1.0}
No explanation, no markdown, no extra text.
```
---
## 6. GUI Integration
### AI Monitoring Panel (in Settings tab)
```
┌─────────────────────────────────────────────────────────┐
│ AI Monitoring [ON] │
│ │
│ ┌─ Diagnostic Agent ─────────────────────────────────┐ │
│ │ Provider: [OpenCode Zen ▼] │ │
│ │ Model: [Qwen3-32B ▼] │ │
│ │ API Key: [sk-•••••••••••••••••••• ] │ │
│ │ │ │
│ │ Cost this month: $0.12 (3 diagnostic calls) │ │
│ │ Tokens used: 1,847 input / 423 output │ │
│ └─────────────────────────────────────────────────────┘ │
│ │
│ ┌─ Incident Log (last 7 days) ──────────────────────┐ │
│ │ ✅ 16:00 F1 parser_empty → synth_explore (Tier 2) │ │
│ │ ⚠️ 15:30 B2 server_error → retry (Tier 1) │ │
│ │ ✅ 15:00 A1 proxy_dead → restart_proxy (Tier 1) │ │
│ │ 🤖 14:30 C3 novel_format → clear_cache (Tier 3) │ │
│ │ ... │ │
│ └────────────────────────────────────────────────────┘ │
│ │
│ [View Full Diagnostics] [Export Incident Report] │
└─────────────────────────────────────────────────────────┘
```
### Config Storage (in endpoints.json)
```json
{
"ai_monitoring": {
"enabled": true,
"provider_url": "https://opencode.ai/zen/v1/chat/completions",
"model": "Qwen/Qwen3-32B",
"api_key": "sk-...",
"tier1_enabled": true,
"tier2_enabled": true,
"tier3_enabled": true,
"auto_restart_proxy": true,
"auto_switch_provider": false,
"health_check_interval_s": 5,
"max_memory_mb": 1024,
"notification_level": "important_only"
}
}
```
### Recommended Models (by cost)
| Model | Cost/Diagnosis | Latency | Quality | Recommended For |
|-------|---------------|---------|---------|----------------|
| **Qwen3-32B** (OpenCode) | ~$0.0005 | 2-4s | Good | Default — cheapest decent model |
| **DeepSeek V4 Flash** | ~$0.0003 | 2-3s | Good | Cheapest option |
| **GPT-4o-mini** | ~$0.001 | 1-2s | Excellent | Best quality/latency |
| **Gemini 2.0 Flash** | ~$0.0002 | 1-2s | Good | Cheapest + fastest |
| **Claude Haiku 4.5** | ~$0.001 | 2-3s | Excellent | Best reasoning quality |
| **Local Ollama** (if running) | $0 | 5-15s | Varies | Zero-cost offline option |
### Cost Estimate
- Average diagnostic prompt: ~800 tokens input, ~100 tokens output
- Expected frequency: ~1-5 incidents per day that reach Tier 3
- **Monthly cost**: $0.10 - $1.50 depending on model and usage
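As a rough cross-check, the monthly figure can be derived from the per-diagnosis costs in the table and the expected daily frequency. The helper below is a hypothetical back-of-the-envelope calculation, not part of the launcher.

```python
def monthly_cost(cost_per_diagnosis, calls_per_day, days=30):
    """Estimated monthly spend in USD for Tier 3 diagnostic calls."""
    return cost_per_diagnosis * calls_per_day * days


# Cheapest and most expensive table entries at 1 and 5 Tier 3 calls per day.
low = monthly_cost(0.0002, 1)
high = monthly_cost(0.001, 5)
```

Computed this way from the table alone, the figures come out lower than the range quoted above, so both should be read as rough estimates; longer prompts or retries would raise the real cost.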
---
## 7. Watchdog Response Flow
```
Failure Detected
┌──────────────┐   YES    ┌──────────────────┐
│ Tier 1 Rule? ├─────────►│ Execute Action   │
│ (known)      │          │ Log incident     │
└──────┬───────┘          └──────────────────┘
       │ NO
┌──────────────┐   YES    ┌──────────────────┐
│ Tier 2 Match?├─────────►│ Apply Known Fix  │
│ (incident DB)│          │ Update success   │
└──────┬───────┘          └──────────────────┘
       │ NO
┌──────────────┐   YES    ┌──────────────────┐
│ AI Enabled?  ├─────────►│ Collect Context  │
│ (Tier 3)     │          │ Build Prompt     │
└──────┬───────┘          │ Call AI Model    │
       │ NO               │ Parse Response   │
       ▼                  │ Execute if auto  │
┌──────────────┐          │ Store incident   │
│ Alert User   │          └──────────────────┘
│ (can't fix)  │
└──────────────┘
```
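The flow above can be expressed as a small dispatcher that tries each tier in order. This is a sketch under the assumption that every tier is exposed as a callable returning an action name or `None`; the function name `handle_failure` and the `"fallback"` label are illustrative.

```python
def handle_failure(trigger, tier1, tier2, tier3=None):
    """Walk the tiers in order and return (action, tier_label).

    Each tier is a callable taking the trigger and returning an action or None.
    Passing tier3=None models the "AI disabled" branch of the flow.
    """
    action = tier1(trigger)
    if action is not None:
        return action, "tier1"
    action = tier2(trigger)
    if action is not None:
        return action, "tier2"
    if tier3 is not None:
        action = tier3(trigger)
        if action is not None:
            return action, "tier3"
    # Nothing matched, or the AI tier is off: hand the problem to the user.
    return "alert_user", "fallback"
```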
---
## 8. Safety Guards
1. **Rate limit AI calls** — max 1 Tier 3 call per 60 seconds, max 10 per day
2. **Never auto-execute destructive actions**: use `alert_user` for deleting files, changing API keys, or modifying source code
3. **Auto-restart cap** — max 5 proxy restarts per 10 minutes, then alert user
4. **Cost cap** — monthly AI diagnostic budget (configurable, default $2/month)
5. **Cooldown per pattern** — same failure pattern has escalating cooldown (30s → 60s → 300s → alert)
6. **User override** — any auto-action can be cancelled within 3 seconds via GUI
7. **Incident store max size** — 500 entries, LRU eviction
8. **Health check bypass** — if user manually stopped proxy, don't alert
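Guards 1 and 3 are both "at most N events per time window" limits. A sliding-window limiter is one common way to enforce them; the class below is a sketch with hypothetical names, not the launcher's implementation.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Allows at most `max_events` within any `window_s`-second window."""

    def __init__(self, max_events, window_s, clock=time.monotonic):
        self._max = max_events
        self._window = window_s
        self._clock = clock     # injectable for testing
        self._events = deque()  # timestamps of allowed events, oldest first

    def allow(self):
        """Return True and record the event if the limit is not yet reached."""
        now = self._clock()
        # Drop timestamps that have aged out of the window.
        while self._events and now - self._events[0] >= self._window:
            self._events.popleft()
        if len(self._events) >= self._max:
            return False
        self._events.append(now)
        return True
```

For example, `SlidingWindowLimiter(5, 600)` would model the restart cap, and two instances, `SlidingWindowLimiter(1, 60)` and `SlidingWindowLimiter(10, 86400)`, could be combined for the Tier 3 call limit.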
---
## 9. Implementation Plan
### Phase 1: Core Watchdog (v3.8.0)
- `HealthWatcher` thread in `codex-launcher-gui`
- `LogAnalyzer` thread tailing `cc-debug.log` and `proxy.log`
- Tier 1 rule engine covering the rules listed in Section 3
- Incident store (JSON file)
- GUI toggle (ON/OFF) in settings
- Auto-restart proxy on crash
### Phase 2: Pattern Learning (v3.8.1)
- Tier 2 incident store lookup
- Auto-learn from Intelligence Routing outcomes
- Success rate tracking per pattern
- Incident log viewer in GUI
### Phase 3: AI Diagnostic Agent (v3.9.0)
- Tier 3 AI model integration
- Provider/model selector in GUI
- Diagnostic prompt template
- Cost tracking
- Full incident report export
### Phase 4: Advanced Recovery (v4.0.0)
- Auto-switch to backup provider on repeated failure
- BGP route health monitoring
- Predictive failure detection (memory growth, latency trends)
- Codex process memory monitoring
- WebSocket reconnect assistance
---
## 10. File Changes Summary
| File | Changes |
|------|---------|
| `codex-launcher-gui` | +HealthWatcher thread, +LogAnalyzer thread, +AI Monitoring panel, +incident log viewer |
| `translate-proxy.py` | +`/monitoring` endpoint (returns health + metrics), enhanced `/health` with memory/uptime |
| `~/.cache/codex-proxy/incident-store.json` | New file — incident pattern database |
| `~/.cache/codex-proxy/monitoring.log` | New file — watchdog activity log |
| `~/.codex/endpoints.json` | +`ai_monitoring` config section |


@@ -33,6 +33,7 @@
<img src="https://img.shields.io/badge/Streaming_SSE-✓-success" />
<img src="https://img.shields.io/badge/Tool_Calls-✓-success" />
<img src="https://img.shields.io/badge/AI_Assist-✓-success" />
<img src="https://img.shields.io/badge/Intelligence_Routing-✓-success" />
<img src="https://img.shields.io/badge/Self_Revive_Watchdog-✓-success" />
</p>
@@ -130,6 +131,19 @@ A three-component system:
- **ErrorAnalyzer** — learns from 4xx errors, retries with adjusted parameters (max 2 retries)
- **Schema cache** with 24h staleness TTL for provider capabilities
### Intelligence Routing (v3.7.0)
- **Three-layer self-healing system** — the agent loop never stalls, even when the model speaks gibberish
- **Layer 1 — Deep URL Extraction**: When `<explore_agent>` hides URLs inside nested JSON (`messages: [{"content": "https://..."}]`), the parser drills into the JSON structure to find them. Module-level `_build_explore_cmd()` is reused across parser + stream path.
- **Layer 2 — Escalation Auto-Proceed**: `<require_escalation>` and `<request_escalation_permission>` blocks are detected and auto-resolved — the model doesn't get stuck waiting for permissions that don't exist.
- **Layer 3 — Intent-Based Command Synthesis**: When ALL parsers fail, 5 heuristics analyze the model's plain-text output and synthesize a working command:
  1. URL detected → `curl` it
  2. File path mentioned → `cat` or `ls` it
  3. Shell command in quotes → extract and run it
  4. "explore"/"fetch" intent → use the last URL the user mentioned
  5. "I need to"/"let me" intent → echo a diagnostic so the loop continues
- **Session URL memory** — `_last_user_urls` deque (20 entries) tracks URLs from user messages across the session, giving the synthesizer context to work with
- **54 self-test patterns** — comprehensive coverage of all three layers
### GTK Launcher (`codex-launcher-gui`)
- **Endpoint manager** — add, edit, delete, set default providers
- **Provider presets** — one-click setup for 15+ providers with pre-filled URLs and model lists
@@ -324,6 +338,83 @@ Built a cascading parser chain (`DSML → bash → explore → tool_call → XML
**Verification:** `--self-test` flag runs 19 automated tests covering all edge cases. Debug logging to `~/.cache/codex-proxy/cc-debug.log` captures every parser decision for troubleshooting.
### Phase 8: Intelligence Routing — When the Model Refuses to Speak Machine
**Problem:** The 17-fix parser chain from Phase 7 was powerful — it could handle DSML, XML, JSON, bash blocks, explore tags, you name it. But there was one edge case it couldn't crack: **when the model doesn't produce a parseable tool-call format at all**.
In production, `deepseek/deepseek-v4-flash` via Command Code kept doing things like:
```
<explore_agent>
messages: [{"content": "Understand the Z.AI-Chat-for-Android repo at https://..."}]
</explore_agent>
```
or:
```
<require_escalation>
I need elevated permissions to access the repository.
</require_escalation>
```
or just plain English: *"I need to fetch the README from the repository to understand the app structure."*
In every case, `parsed_tool_calls=0`. No tool to execute. The Codex agent loop ground to a halt. The user saw "thinking..." forever.
**The insight:** The model is trying to communicate *intent*, just not in a format we can parse. Instead of adding more regex patterns, what if we could **read the model's mind** — understand what it *wants* to do, and synthesize the command for it?
**Intelligence Routing — Three Layers of Escalation:**
```
Layer 1: "Fix the input" — Can we extract more from what the model gave us?
Layer 2: "Handle the intent" — Is the model asking for something we can auto-resolve?
Layer 3: "Read the mind" — What is the model trying to do? Just do it for it.
```
**Layer 1 — Deep URL Extraction (FIX 23):**
The `<explore_agent>` handler had a URL regex, but the URL was trapped inside `{"content": "https://..."}` — the trailing `"` broke matching. The fix: after the initial regex fails, `json.loads()` the entire block, walk the JSON tree, and pull URLs out of `content` fields. The `_build_explore_cmd()` function was extracted to module level so both the parser and the stream handler could use it.
```python
# Before: regex fails, URL lost
# After: json.loads -> iterate items -> extract content -> find URL
```
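The Layer 1 idea (decode the embedded JSON, walk the tree, collect URLs) can be sketched as follows. This is not the proxy's `_build_explore_cmd()`; the names `extract_urls` and `_walk` are hypothetical, and for simplicity the sketch scans every string in the decoded JSON, not only the `content` fields.

```python
import json
import re

# Stops at whitespace, quotes and closing brackets so a trailing '"' is not captured.
URL_RE = re.compile(r"https?://[^\s\"'<>\])}]+")


def _walk(node):
    """Recursively collect URLs from every string inside decoded JSON."""
    if isinstance(node, str):
        return URL_RE.findall(node)
    if isinstance(node, dict):
        return [u for v in node.values() for u in _walk(v)]
    if isinstance(node, list):
        return [u for v in node for u in _walk(v)]
    return []


def extract_urls(block):
    """Find URLs in a block, drilling into the first embedded JSON value if present."""
    urls = []
    candidates = [i for i in (block.find("["), block.find("{")) if i != -1]
    if candidates:
        try:
            # raw_decode parses one JSON value and ignores whatever follows it.
            payload, _ = json.JSONDecoder().raw_decode(block[min(candidates):])
            urls = _walk(payload)
        except json.JSONDecodeError:
            urls = []
    # Fall back to a plain regex scan when no JSON could be decoded.
    return urls or URL_RE.findall(block)
```

Decoding first means the URL is read from the unescaped string value, so the surrounding quote characters never reach the regex.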
**Layer 2 — Escalation Auto-Proceed (FIX 24):**
`<require_escalation>` blocks are the model's way of saying "I need more permissions." The CC adapter doesn't have an escalation mechanism — these blocks were silently dropped. The fix: detect them (both closed `<tag>...</tag>` and bare `<tag />` forms), extract any URL inside them, and auto-proceed with an explore command or a diagnostic echo.
```python
# Model: <require_escalation>Please let me run curl</require_escalation>
# Proxy: Okay, here's your curl command → exec_command synthesized
```
**Layer 3 — Intent-Based Command Synthesis (FIX 25):**
The crown jewel. When ALL parsers return empty — no DSML, no XML, no JSON, no fallback regex matches — the system doesn't give up. It analyzes the model's raw text through **5 heuristic lenses** in priority order:
| Priority | Signal | Synthesized Command |
|:--------:|--------|---------------------|
| 1 | URL in text | `curl` to fetch it |
| 2 | File path reference | `cat` or `ls` the file |
| 3 | Shell command in backticks/quotes | Extract and run it |
| 4 | "explore"/"fetch" + last user URL | Full explore command |
| 5 | "I need to"/"let me" intent | Echo diagnostic |
The system also maintains a **session URL memory** (`_last_user_urls`, a deque of the last 20 URLs from user messages) so heuristic 4 always has a URL to work with, even when the model's text doesn't contain one.
```python
# Model: "I should explore the repository to understand its structure."
# Parser: empty (no parseable format)
# Layer 3 heuristic 4: "explore" detected, pulling URL from session memory...
# Result: exec_command with full curl pipeline
```
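The five heuristics in the table could be arranged as a simple priority chain. The sketch below is an illustration only: the regular expressions, the choice of `ls -la` for paths, the `curl -sL` invocation, and the function name `synthesize` are all assumptions and not the proxy's real logic. Running text extracted from model output as a command is risky, so a real implementation would need additional sanitizing.

```python
import re
import shlex

URL_RE = re.compile(r"https?://[^\s\"'<>)]+")
# Absolute, home-relative or dot-relative paths not glued to a preceding word.
PATH_RE = re.compile(r"(?<![\w/])(?:~|\.{1,2})?/[\w./-]+")
BACKTICK_CMD_RE = re.compile(r"`([^`]+)`")


def synthesize(text, last_user_urls=()):
    """Return a shell command derived from plain-text intent, or None."""
    m = URL_RE.search(text)
    if m:  # 1. URL in text
        return f"curl -sL {shlex.quote(m.group(0))}"
    m = PATH_RE.search(text)
    if m:  # 2. file path reference
        return f"ls -la {shlex.quote(m.group(0))}"
    m = BACKTICK_CMD_RE.search(text)
    if m:  # 3. shell command in backticks
        return m.group(1)
    lowered = text.lower()
    if last_user_urls and ("explore" in lowered or "fetch" in lowered):
        # 4. explore/fetch intent: reuse the most recent user URL
        return f"curl -sL {shlex.quote(last_user_urls[-1])}"
    if "i need to" in lowered or "let me" in lowered:
        # 5. generic intent: emit a diagnostic so the loop keeps going
        return "echo " + shlex.quote("[diagnostic] no tool call parsed: " + text[:80])
    return None
```

Because the checks run in table order, a message containing both a URL and the word "explore" is handled by heuristic 1 and never reaches the session URL memory.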
**The result:** Before Intelligence Routing, `parsed_tool_calls=0` meant **game over** — the agent loop stalled permanently. After Intelligence Routing, `parsed_tool_calls=0` triggers the self-healing chain and the loop **always** gets a tool call to execute. The model can speak in tongues and the system still works.
**Test coverage:** 54 self-test patterns (up from 41), with 13 new tests specifically for Intelligence Routing layers.
---
## Architecture Deep Dive
@@ -454,6 +545,9 @@ README.md # This file
| CC tool calls have wrong args | Double-wrapped arguments | V3.5 three-tier parser + recursive unwrapping |
| Proxy crashes mid-session | Unhandled streaming error | V3.5 self-revive watchdog auto-restarts |
| CC 403 upgrade_required | Missing version header | V3.5 always sends `x-command-code-version` |
| CC explore_agent can't find URL | URL hidden inside JSON messages | V3.7 Layer 1 drills into JSON to extract URLs |
| CC agent stalls on escalation blocks | `<require_escalation>` not handled | V3.7 Layer 2 auto-proceeds past escalation requests |
| CC agent stalls — no tool calls at all | Model output format unrecognized | V3.7 Layer 3 synthesizes command from text intent |
---

Binary file not shown.


@@ -3,11 +3,11 @@ set -e
SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
-if [ -f "$SCRIPT_DIR/codex-launcher_3.7.0_all.deb" ]; then
+if [ -f "$SCRIPT_DIR/codex-launcher_3.8.0_all.deb" ]; then
-echo "Installing codex-launcher_3.7.0_all.deb ..."
+echo "Installing codex-launcher_3.8.0_all.deb ..."
-sudo dpkg -i "$SCRIPT_DIR/codex-launcher_3.7.0_all.deb"
+sudo dpkg -i "$SCRIPT_DIR/codex-launcher_3.8.0_all.deb"
echo ""
-echo "Installed v3.7.0 via .deb package."
+echo "Installed v3.8.0 via .deb package."
echo " translate-proxy.py -> /usr/bin/translate-proxy.py"
echo " codex-launcher-gui -> /usr/bin/codex-launcher-gui"
echo " cleanup-codex-stale -> /usr/bin/cleanup-codex-stale.sh"


@@ -5,7 +5,7 @@ import gi
gi.require_version("Gtk", "3.0") gi.require_version("Gtk", "3.0")
from gi.repository import Gtk, GLib from gi.repository import Gtk, GLib
import subprocess, os, signal, sys, threading, time, json, urllib.request, urllib.parse, urllib.error, tempfile, shutil import subprocess, os, signal, sys, threading, time, json, urllib.request, urllib.parse, urllib.error, tempfile, shutil
import hashlib, socket, ssl, contextlib, re import hashlib, socket, ssl, contextlib, re, collections
import base64, secrets import base64, secrets
from pathlib import Path from pathlib import Path
@@ -1123,6 +1123,524 @@ def _check_codex_auth():
except Exception as e: except Exception as e:
return ("error", str(e)) return ("error", str(e))
# ═══════════════════════════════════════════════════════════════════
# AI Monitoring — Self-Healing Watchdog
# ═══════════════════════════════════════════════════════════════════
MONITORING_FILE = Path.home() / ".cache/codex-proxy/monitoring-config.json"
INCIDENT_STORE_FILE = Path.home() / ".cache/codex-proxy/incident-store.json"
MONITORING_LOG = Path.home() / ".cache/codex-proxy/monitoring.log"
_TIER1_RULES = [
("proxy_health_fail", "restart_proxy", 30),
("proxy_port_conflict", "kill_stale_restart", 60),
("upstream_429", "wait_retry", 0),
("upstream_502_503", "retry_backoff", 30),
("upstream_500_repeat", "switch_provider", 60),
("upstream_timeout", "retry_increase_timeout",30),
("upstream_401_403", "alert_bad_key", 0),
("stream_broken_pipe", "restart_proxy", 30),
("stream_reset", "restart_proxy", 30),
("parsed_tool_calls_0_x3", "clear_schema_cache", 300),
("sanitizer_suspicious_5x","alert_model_issue", 0),
("stuck_recovery_x5", "suggest_switch_model", 0),
("codex_process_dead", "alert_restart", 0),
("schema_corrupt", "delete_provider_caps", 0),
]
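_FAILURE_SIGNALS = {
"parsed_tool_calls=0": ("C1", "parser_empty"),
# Keys are regexes (matched with re.search in _LogAnalyzerThread), so literal brackets must be escaped.
r"\[STUCK-RECOVERY\]": ("C3", "stuck_recovery"),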
"suspicious cmd": ("C4", "sanitizer_flag"),
"empty cmd recovered": ("C6", "empty_cmd"),
"HTTP 429": ("B1", "rate_limited"),
"HTTP 500": ("B2", "server_error"),
"HTTP 502": ("B2", "server_error"),
"HTTP 503": ("B2", "server_error"),
"HTTP 401": ("B3", "auth_failure"),
"HTTP 403": ("B4", "forbidden"),
"Connection refused": ("A1", "proxy_dead"),
"Address already in use": ("A2", "port_conflict"),
"Broken pipe": ("B7", "broken_pipe"),
"Connection reset": ("B6", "connection_reset"),
"timed out": ("B5", "timeout"),
"SELF-REVIVE CRASH": ("A5", "proxy_crash"),
"stream error": ("B6", "stream_error"),
"content_type.*array": ("E1", "schema_corrupt"),
}
_DIAGNOSTIC_SYSTEM_PROMPT = (
'You are a diagnostic agent for "Codex Launcher" — a desktop app that runs a local '
'translation proxy between OpenAI Codex CLI/Desktop and AI providers.\n\n'
'Analyze the incident and respond with ONLY a JSON object:\n'
'{"action": "...", "reason": "...", "confidence": 0.0-1.0}\n\n'
'Available actions: restart_proxy, kill_stale_processes, clear_schema_cache, '
'switch_provider, increase_timeout, regenerate_config, cleanup_stale, '
'alert_user, ignore, retry_now\n\n'
'Rules:\n'
'- upstream 401/403 with auth error -> alert_user\n'
'- proxy dead -> restart_proxy\n'
'- same error 5+ times -> switch_provider or alert_user\n'
'- schema/content_type error -> clear_schema_cache\n'
'- "Address already in use" -> kill_stale_processes then restart_proxy\n'
'- timeout on slow upstream -> increase_timeout\n'
'- single transient 429/502/503 -> ignore\n'
'- "stream disconnected" + proxy healthy -> ignore\n'
'- no extra text, no markdown, just the JSON object'
)
def _load_monitoring_config():
if MONITORING_FILE.exists():
try:
return json.loads(MONITORING_FILE.read_text())
except Exception:
pass
return {
"enabled": False,
"provider_url": "",
"model": "",
"api_key": "",
"health_check_interval_s": 5,
"auto_restart_proxy": True,
"auto_switch_provider": False,
}
def _save_monitoring_config(cfg):
MONITORING_FILE.parent.mkdir(parents=True, exist_ok=True)
MONITORING_FILE.write_text(json.dumps(cfg, indent=2))
def _load_incident_store():
if INCIDENT_STORE_FILE.exists():
try:
return json.loads(INCIDENT_STORE_FILE.read_text())
except Exception:
pass
return {"version": 1, "incidents": {}, "stats": {"ai_calls": 0, "tokens_used": 0}}
def _save_incident_store(store):
INCIDENT_STORE_FILE.parent.mkdir(parents=True, exist_ok=True)
INCIDENT_STORE_FILE.write_text(json.dumps(store, indent=2))
def _monitoring_log(msg):
try:
with open(str(MONITORING_LOG), "a") as f:
f.write(f"[{time.strftime('%H:%M:%S')}] {msg}\n")
except Exception:
pass
class IncidentStore:
def __init__(self):
self._store = _load_incident_store()
self._dirty = False
def lookup(self, pattern):
inc = self._store.get("incidents", {}).get(pattern)
if inc and inc.get("success_count", 0) > 0:
rate = inc["success_count"] / max(inc["success_count"] + inc.get("fail_count", 0), 1)
if rate > 0.5:
return inc
return None
def record(self, pattern, fix, success=True):
incs = self._store.setdefault("incidents", {})
inc = incs.setdefault(pattern, {
"fix": fix, "success_count": 0, "fail_count": 0,
"last_seen": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
"occurrences": 0,
})
inc["last_seen"] = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())
inc["occurrences"] = inc.get("occurrences", 0) + 1
if success:
inc["success_count"] = inc.get("success_count", 0) + 1
else:
inc["fail_count"] = inc.get("fail_count", 0) + 1
self._dirty = True
def record_ai_call(self, tokens=0):
stats = self._store.setdefault("stats", {"ai_calls": 0, "tokens_used": 0})
stats["ai_calls"] = stats.get("ai_calls", 0) + 1
stats["tokens_used"] = stats.get("tokens_used", 0) + tokens
self._dirty = True
def flush(self):
if self._dirty:
_save_incident_store(self._store)
self._dirty = False
@property
def stats(self):
return self._store.get("stats", {"ai_calls": 0, "tokens_used": 0})
class AIDiagnosticAgent:
def __init__(self, provider_url, model, api_key):
self.provider_url = provider_url
self.model = model
self.api_key = api_key
self.incident_store = IncidentStore()
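    def diagnose(self, context):
        pattern = self._extract_pattern(context)
        known = self.incident_store.lookup(pattern)
        if known:
            _monitoring_log(f"Tier 2 HIT: pattern={pattern} fix={known['fix']}")
            return {"action": known["fix"], "reason": "known_pattern", "confidence": 0.9, "tier": 2}
        action = self._call_model(context)
        # Learn only from a real model verdict. A failed call comes back as
        # "alert_user" with reason "ai_diag_failed: ..."; recording that as a
        # successful fix would make every later lookup() hit it at Tier 2.
        failed = str(action.get("reason", "")).startswith("ai_diag_failed") if action else True
        if not failed:
            self.incident_store.record(pattern, action.get("action", "unknown"))
        self.incident_store.flush()
        return action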
def _extract_pattern(self, context):
parts = []
for k in sorted(context.get("signals", [])):
parts.append(k)
if context.get("http_code"):
parts.append(f"http_{context['http_code']}")
return "+".join(parts[:3]) or "unknown"
def _call_model(self, context):
prompt = (
f"INCIDENT REPORT:\n"
f"Time: {time.strftime('%Y-%m-%dT%H:%M:%SZ', time.gmtime())}\n"
f"Proxy health: {context.get('proxy_alive', 'unknown')}\n"
f"Upstream: {context.get('upstream_url', 'unknown')}\n"
f"Model: {context.get('model', 'unknown')}\n"
f"Last HTTP code: {context.get('http_code', 'n/a')}\n"
f"Recent signals: {context.get('signals', [])}\n"
f"Recent log tail:\n{context.get('log_tail', '')[:1500]}\n"
)
body = {
"model": self.model,
"messages": [
{"role": "system", "content": _DIAGNOSTIC_SYSTEM_PROMPT},
{"role": "user", "content": prompt},
],
"max_tokens": 200,
"temperature": 0.1,
}
try:
req = urllib.request.Request(
self.provider_url,
data=json.dumps(body).encode(),
headers={
"Content-Type": "application/json",
"Authorization": f"Bearer {self.api_key}",
},
)
resp = urllib.request.urlopen(req, timeout=15)
result = json.loads(resp.read())
text = result["choices"][0]["message"]["content"].strip()
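# Prefer the provider-reported usage; fall back to the fixed 800-token estimate.
self.incident_store.record_ai_call(tokens=(result.get("usage") or {}).get("total_tokens", 800))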
action = json.loads(text)
action["tier"] = 3
_monitoring_log(f"Tier 3 AI: action={action.get('action')} reason={action.get('reason')}")
return action
except Exception as e:
_monitoring_log(f"Tier 3 AI FAILED: {e}")
return {"action": "alert_user", "reason": f"ai_diag_failed: {e}", "confidence": 0.0, "tier": 3}
class HealthWatcher(threading.Thread):
def __init__(self, on_failure, on_recovery, on_signal, on_action):
super().__init__(daemon=True)
self.cfg = _load_monitoring_config()
self.on_failure = on_failure
self.on_recovery = on_recovery
self.on_signal = on_signal
self.on_action = on_action
self.failures = 0
self.running = False
self._signal_counts = collections.defaultdict(int)
self._last_actions = {}
self._restart_count = 0
self._last_restart_time = 0
    def run(self):
        self.running = True
        self.incident_store = IncidentStore()
        self._log_analyzer = _LogAnalyzerThread(self._on_log_signal)
        self._log_analyzer.start()
        while self.running:
            # Re-read the config every cycle so GUI changes apply without a restart.
            self.cfg = _load_monitoring_config()
            if not self.cfg.get("enabled"):
                time.sleep(5)
                continue
            port = self._get_proxy_port()
            if port:
                healthy = self._check_health(port)
                if healthy:
                    if self.failures > 0:
                        self.failures = 0
                        self.on_recovery()
                else:
                    self.failures += 1
                    # Surface each failed probe to the UI callback.
                    self.on_failure(self.failures)
                    # Act only after three consecutive failed probes.
                    if self.failures >= 3:
                        self._handle_failure("proxy_health_fail")
            self.incident_store.flush()
            interval = self.cfg.get("health_check_interval_s", 5)
            time.sleep(interval)
def stop(self):
self.running = False
if hasattr(self, '_log_analyzer'):
self._log_analyzer.running = False
def _get_proxy_port(self):
try:
cfg_path = Path.home() / ".cache/codex-proxy/proxy-config.json"
if cfg_path.exists():
d = json.loads(cfg_path.read_text())
return d.get("port")
except Exception:
pass
return None
def _check_health(self, port):
try:
req = urllib.request.Request(f"http://localhost:{port}/health")
resp = urllib.request.urlopen(req, timeout=5)
return resp.status == 200
except Exception:
return False
    def _on_log_signal(self, fault_id, category, line):
        self._signal_counts[category] += 1
        self.on_signal(fault_id, category, line[:200])
        count = self._signal_counts[category]
        # category -> (threshold, trigger). Trigger names must match _TIER1_RULES,
        # otherwise _handle_failure() skips Tier 1 and falls through to Tier 2/3.
        thresholds = {
            "proxy_dead": (2, "proxy_health_fail"),
            "port_conflict": (2, "proxy_port_conflict"),
            "server_error": (3, "upstream_500_repeat"),
            "timeout": (3, "upstream_timeout"),
            "sanitizer_flag": (5, "sanitizer_suspicious_5x"),
            "stuck_recovery": (5, "stuck_recovery_x5"),
            "parser_empty": (3, "parsed_tool_calls_0_x3"),
            "schema_corrupt": (1, "schema_corrupt"),
        }
        if category in thresholds:
            threshold, trigger = thresholds[category]
            if count >= threshold:
                self._handle_failure(trigger)
def _handle_failure(self, trigger):
now = time.time()
for rule_trigger, action, cooldown in _TIER1_RULES:
if rule_trigger == trigger:
last_t = self._last_actions.get(action, 0)
if now - last_t < cooldown:
return
self._last_actions[action] = now
_monitoring_log(f"Tier 1: trigger={trigger} action={action}")
self.on_action(action, trigger)
self.incident_store.record(trigger, action, success=True)
return
self._try_tier2_3(trigger)
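    def _try_tier2_3(self, trigger):
        cfg = self.cfg
        if not cfg.get("provider_url") or not cfg.get("model") or not cfg.get("api_key"):
            _monitoring_log(f"No AI configured for Tier 2/3 — alerting user for trigger={trigger}")
            self.on_action("alert_user", trigger)
            return
        agent = AIDiagnosticAgent(cfg["provider_url"], cfg["model"], cfg["api_key"])
        context = {
            "signals": [trigger],
            "proxy_alive": self.failures == 0,
            "log_tail": self._get_recent_log(),
        }
        result = agent.diagnose(context)
        if result:
            action = result.get("action", "alert_user")
            _monitoring_log(f"Tier {result.get('tier', '?')}: action={action}")
            self.on_action(action, trigger)

    def _get_recent_log(self, max_bytes=4096):
        # Assumed helper: _try_tier2_3 calls it, but no definition appears in this diff.
        # Returns the tail of proxy.log, one of the files _LogAnalyzerThread follows.
        try:
            path = Path.home() / ".cache/codex-proxy/proxy.log"
            with open(path, "rb") as f:
                f.seek(0, 2)
                f.seek(max(f.tell() - max_bytes, 0))
                return f.read().decode("utf-8", "replace")
        except Exception:
            return ""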
class _LogAnalyzerThread(threading.Thread):
def __init__(self, on_signal):
super().__init__(daemon=True)
self.on_signal = on_signal
self.running = False
def run(self):
self.running = True
log_paths = [
str(Path.home() / ".cache/codex-proxy/cc-debug.log"),
str(Path.home() / ".cache/codex-proxy/proxy.log"),
]
fhs = {}
for p in log_paths:
try:
f = open(p, "r")
f.seek(0, 2)
fhs[p] = f
except Exception:
pass
while self.running:
activity = False
for p, fh in list(fhs.items()):
try:
line = fh.readline()
if line:
activity = True
for pattern, (fault_id, category) in _FAILURE_SIGNALS.items():
if re.search(pattern, line):
self.on_signal(fault_id, category, line.strip())
break
except Exception:
pass
if not activity:
time.sleep(0.5)
class AIMonitoringWindow(Gtk.Window):
def __init__(self, parent=None):
super().__init__(title="AI Monitoring")
self.set_transient_for(parent)
self.set_default_size(580, 520)
self.set_border_width(12)
self._cfg = _load_monitoring_config()
self._store = _load_incident_store()
vbox = Gtk.Box(orientation=Gtk.Orientation.VERTICAL, spacing=8)
self.add(vbox)
hdr = Gtk.Box(spacing=8)
vbox.pack_start(hdr, False, False, 0)
lbl = Gtk.Label()
lbl.set_markup("<b>AI Monitoring</b>")
lbl.set_use_markup(True)
hdr.pack_start(lbl, False, False, 0)
self._toggle = Gtk.Switch()
self._toggle.set_active(self._cfg.get("enabled", False))
self._toggle.connect("state-set", self._on_toggle)
hdr.pack_end(self._toggle, False, False, 0)
lbl2 = Gtk.Label(label="Enabled")
hdr.pack_end(lbl2, False, False, 0)
frame = Gtk.Frame(label="Diagnostic Agent")
vbox.pack_start(frame, False, False, 0)
grid = Gtk.Grid(column_spacing=8, row_spacing=6, margin=8)
frame.add(grid)
grid.attach(Gtk.Label(label="Provider URL:", halign=Gtk.Align.END), 0, 0, 1, 1)
self._url_entry = Gtk.Entry(hexpand=True)
self._url_entry.set_text(self._cfg.get("provider_url", ""))
self._url_entry.set_placeholder_text("https://api.openai.com/v1/chat/completions")
grid.attach(self._url_entry, 1, 0, 2, 1)
grid.attach(Gtk.Label(label="Model:", halign=Gtk.Align.END), 0, 1, 1, 1)
self._model_entry = Gtk.Entry(hexpand=True)
self._model_entry.set_text(self._cfg.get("model", ""))
self._model_entry.set_placeholder_text("gpt-4o-mini or Qwen/Qwen3-32B")
grid.attach(self._model_entry, 1, 1, 2, 1)
grid.attach(Gtk.Label(label="API Key:", halign=Gtk.Align.END), 0, 2, 1, 1)
self._key_entry = Gtk.Entry(hexpand=True, visibility=False)
self._key_entry.set_text(self._cfg.get("api_key", ""))
self._key_entry.set_placeholder_text("sk-...")
grid.attach(self._key_entry, 1, 2, 1, 1)
self._reveal_btn = Gtk.ToggleButton(label="Show")
self._reveal_btn.connect("toggled", lambda b: self._key_entry.set_visibility(b.get_active()))
grid.attach(self._reveal_btn, 2, 2, 1, 1)
grid.attach(Gtk.Label(label="Health Check:", halign=Gtk.Align.END), 0, 3, 1, 1)
adj = Gtk.Adjustment(value=self._cfg.get("health_check_interval_s", 5), lower=2, upper=30, step_increment=1)
self._interval_spin = Gtk.SpinButton(adjustment=adj)
self._interval_spin.set_numeric(True)
grid.attach(self._interval_spin, 1, 3, 1, 1)
grid.attach(Gtk.Label(label="seconds"), 2, 3, 1, 1)
opts_box = Gtk.Box(spacing=12, margin_top=4)
grid.attach(opts_box, 0, 4, 3, 1)
self._auto_restart_cb = Gtk.CheckButton(label="Auto-restart proxy on crash")
self._auto_restart_cb.set_active(self._cfg.get("auto_restart_proxy", True))
opts_box.pack_start(self._auto_restart_cb, False, False, 0)
self._auto_switch_cb = Gtk.CheckButton(label="Auto-switch provider on repeated failure")
self._auto_switch_cb.set_active(self._cfg.get("auto_switch_provider", False))
opts_box.pack_start(self._auto_switch_cb, False, False, 0)
save_btn = Gtk.Button(label="Save Configuration")
save_btn.get_style_context().add_class("suggested-action")
save_btn.connect("clicked", self._on_save)
grid.attach(save_btn, 0, 5, 3, 1)
stats_box = Gtk.Box(spacing=16)
vbox.pack_start(stats_box, False, False, 0)
stats = self._store.get("stats", {"ai_calls": 0, "tokens_used": 0})
self._stats_lbl = Gtk.Label()
self._stats_lbl.set_markup(
f"<small>AI diagnostic calls: <b>{stats.get('ai_calls', 0)}</b> | "
f"Tokens used: <b>{stats.get('tokens_used', 0):,}</b> | "
f"Known patterns: <b>{len(self._store.get('incidents', {}))}</b></small>"
)
self._stats_lbl.set_use_markup(True)
stats_box.pack_start(self._stats_lbl, False, False, 0)
frame2 = Gtk.Frame(label="Recent Incidents")
vbox.pack_start(frame2, True, True, 0)
sw = Gtk.ScrolledWindow()
sw.set_policy(Gtk.PolicyType.AUTOMATIC, Gtk.PolicyType.AUTOMATIC)
frame2.add(sw)
self._inc_buf = Gtk.TextBuffer()
tv = Gtk.TextView(buffer=self._inc_buf)
tv.set_editable(False)
tv.set_cursor_visible(False)
tv.set_wrap_mode(Gtk.WrapMode.WORD_CHAR)
sw.add(tv)
self._refresh_incidents()
bb = Gtk.Box(spacing=8)
vbox.pack_start(bb, False, False, 0)
view_btn = Gtk.Button(label="View Monitoring Log")
view_btn.connect("clicked", lambda b: subprocess.Popen(["xdg-open", str(MONITORING_LOG)]))
bb.pack_start(view_btn, False, False, 0)
clear_btn = Gtk.Button(label="Clear Incident Store")
clear_btn.connect("clicked", self._on_clear_store)
bb.pack_start(clear_btn, False, False, 0)
close_btn = Gtk.Button(label="Close")
close_btn.connect("clicked", lambda b: self.destroy())
bb.pack_end(close_btn, False, False, 0)
self.show_all()
def _on_toggle(self, switch, state):
self._cfg["enabled"] = state
_save_monitoring_config(self._cfg)
def _on_save(self, btn):
self._cfg["provider_url"] = self._url_entry.get_text().strip()
self._cfg["model"] = self._model_entry.get_text().strip()
self._cfg["api_key"] = self._key_entry.get_text().strip()
self._cfg["health_check_interval_s"] = int(self._interval_spin.get_value())
self._cfg["auto_restart_proxy"] = self._auto_restart_cb.get_active()
self._cfg["auto_switch_provider"] = self._auto_switch_cb.get_active()
_save_monitoring_config(self._cfg)
self._inc_buf.set_text("Configuration saved.\n")
def _on_clear_store(self, btn):
_save_incident_store({"version": 1, "incidents": {}, "stats": {"ai_calls": 0, "tokens_used": 0}})
self._store = {"version": 1, "incidents": {}, "stats": {"ai_calls": 0, "tokens_used": 0}}
self._refresh_incidents()
def _refresh_incidents(self):
lines = []
for pattern, inc in sorted(self._store.get("incidents", {}).items(),
key=lambda x: x[1].get("last_seen", ""), reverse=True):
sc = inc.get("success_count", 0)
fc = inc.get("fail_count", 0)
rate = sc / max(sc + fc, 1)
bar = "+" * min(int(rate * 10), 10) + "-" * (10 - min(int(rate * 10), 10))
lines.append(
f"[{inc.get('last_seen', '?')[:16]}] {pattern}\n"
f" fix={inc.get('fix', '?')} success_rate={rate:.0%} [{bar}] "
f"seen={inc.get('occurrences', 0)}x\n"
)
if not lines:
lines.append("No incidents recorded yet.\n")
lines.append("\nEnable AI Monitoring and use Codex to populate the store.\n")
self._inc_buf.set_text("\n".join(lines))
# ═══════════════════════════════════════════════════════════════════ # ═══════════════════════════════════════════════════════════════════
# Main window # Main window
# ═══════════════════════════════════════════════════════════════════ # ═══════════════════════════════════════════════════════════════════
@@ -1143,7 +1661,7 @@ class LauncherWin(Gtk.Window):
# header row # header row
hdr = Gtk.Box(spacing=8) hdr = Gtk.Box(spacing=8)
vbox.pack_start(hdr, False, False, 0) vbox.pack_start(hdr, False, False, 0)
lbl = Gtk.Label(label="<b>Codex Launcher v3.7.0</b>") lbl = Gtk.Label(label="<b>Codex Launcher v3.8.0</b>")
lbl.set_use_markup(True) lbl.set_use_markup(True)
hdr.pack_start(lbl, False, False, 0) hdr.pack_start(lbl, False, False, 0)
changelog_btn = Gtk.Button(label="Changelog") changelog_btn = Gtk.Button(label="Changelog")
@@ -1161,6 +1679,9 @@ class LauncherWin(Gtk.Window):
bgp_btn = Gtk.Button(label="AI BGP") bgp_btn = Gtk.Button(label="AI BGP")
bgp_btn.connect("clicked", lambda b: self._open_bgp()) bgp_btn.connect("clicked", lambda b: self._open_bgp())
hdr.pack_end(bgp_btn, False, False, 0) hdr.pack_end(bgp_btn, False, False, 0)
mon_btn = Gtk.Button(label="AI Monitor")
mon_btn.connect("clicked", lambda b: self._open_monitoring())
hdr.pack_end(mon_btn, False, False, 0)
mgr_btn = Gtk.Button(label="Manage Endpoints") mgr_btn = Gtk.Button(label="Manage Endpoints")
mgr_btn.connect("clicked", lambda b: self._open_mgr()) mgr_btn.connect("clicked", lambda b: self._open_mgr())
hdr.pack_end(mgr_btn, False, False, 0) hdr.pack_end(mgr_btn, False, False, 0)
@@ -1310,6 +1831,7 @@ class LauncherWin(Gtk.Window):
self.show_all() self.show_all()
self._rebuild_combo() self._rebuild_combo()
self._log_dependency_status() self._log_dependency_status()
self._start_watcher()
# ── helpers ────────────────────────────────────────────────── # ── helpers ──────────────────────────────────────────────────
@@ -1464,6 +1986,77 @@ class LauncherWin(Gtk.Window):
d = Gtk.MessageDialog(self, 0, Gtk.MessageType.ERROR, Gtk.ButtonsType.OK, f"Error: {e}") d = Gtk.MessageDialog(self, 0, Gtk.MessageType.ERROR, Gtk.ButtonsType.OK, f"Error: {e}")
d.run(); d.destroy() d.run(); d.destroy()
def _open_monitoring(self):
try:
self._monitoring_window = AIMonitoringWindow(self)
self._monitoring_window.connect("destroy", lambda *_: setattr(self, "_monitoring_window", None))
except Exception as e:
import traceback; traceback.print_exc()
d = Gtk.MessageDialog(self, 0, Gtk.MessageType.ERROR, Gtk.ButtonsType.OK, f"Error: {e}")
d.run(); d.destroy()
def _start_watcher(self):
cfg = _load_monitoring_config()
if not cfg.get("enabled"):
return
self._watcher = HealthWatcher(
on_failure=self._on_watcher_failure,
on_recovery=self._on_watcher_recovery,
on_signal=self._on_watcher_signal,
on_action=self._on_watcher_action,
)
self._watcher.start()
self.log("AI Monitoring: watchdog started")
def _on_watcher_failure(self, count):
GLib.idle_add(self.log, f"[AI Monitor] Proxy unresponsive (failures={count})")
def _on_watcher_recovery(self):
GLib.idle_add(self.log, "[AI Monitor] Proxy recovered")
def _on_watcher_signal(self, fault_id, category, line):
pass
def _on_watcher_action(self, action, trigger):
cfg = _load_monitoring_config()
if action == "restart_proxy" and cfg.get("auto_restart_proxy"):
GLib.idle_add(self.log, f"[AI Monitor] Auto-restarting proxy (trigger: {trigger})")
GLib.idle_add(self._restart_proxy_from_watcher)
elif action == "clear_schema_cache":
try:
cap_file = Path.home() / ".cache/codex-proxy/provider-caps.json"
if cap_file.exists():
cap_file.unlink()
GLib.idle_add(self.log, "[AI Monitor] Cleared corrupt schema cache")
except Exception as e:
GLib.idle_add(self.log, f"[AI Monitor] Failed to clear cache: {e}")
elif action == "delete_provider_caps":
try:
cap_file = Path.home() / ".cache/codex-proxy/provider-caps.json"
if cap_file.exists():
cap_file.unlink()
GLib.idle_add(self.log, "[AI Monitor] Deleted corrupted provider-caps.json")
except Exception as e:
GLib.idle_add(self.log, f"[AI Monitor] Failed: {e}")
elif action == "kill_stale_restart":
GLib.idle_add(self.log, f"[AI Monitor] Killing stale processes + restarting (trigger: {trigger})")
self._kill()
GLib.idle_add(self._restart_proxy_from_watcher)
else:
GLib.idle_add(self.log, f"[AI Monitor] Alert: {action} (trigger: {trigger})")
def _restart_proxy_from_watcher(self):
try:
ep_name = load_endpoints().get("default")
if not ep_name:
return
for ep in load_endpoints().get("endpoints", []):
if ep.get("name") == ep_name:
self._start_proxy(ep)
break
except Exception as e:
self.log(f"[AI Monitor] Proxy restart failed: {e}")
def _open_usage(self): def _open_usage(self):
try: try:
self._usage_window = UsageWindow(self) self._usage_window = UsageWindow(self)


@@ -3410,10 +3410,20 @@ class Handler(http.server.BaseHTTPRequestHandler):
if self.path in ("/v1/models", "/models"): if self.path in ("/v1/models", "/models"):
self.send_json(200, {"object": "list", "data": MODELS}) self.send_json(200, {"object": "list", "data": MODELS})
elif self.path in ("/health", "/v1/health"): elif self.path in ("/health", "/v1/health"):
import resource as _res
_mem_mb = 0
try:
_mem_mb = _res.getrusage(_res.RUSAGE_SELF).ru_maxrss / 1024
except Exception:
pass
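# dir() inside a method lists only local names, so the module-level _START_TIME must be looked up via globals().
_uptime = time.time() - _START_TIME if '_START_TIME' in globals() else 0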
self.send_json(200, {"ok": True, "backend": BACKEND, self.send_json(200, {"ok": True, "backend": BACKEND,
"target_url": TARGET_URL, "target_url": TARGET_URL,
"models": [m.get("id") for m in MODELS], "models": [m.get("id") for m in MODELS],
"bgp_routes": len(BGP_ROUTES)}) "bgp_routes": len(BGP_ROUTES),
"uptime_s": round(_uptime, 1),
"memory_mb": round(_mem_mb, 1),
"requests_total": _STATS.get("requests", 0)})
else: else:
self.send_error(404) self.send_error(404)
@@ -4753,7 +4763,8 @@ def _handle_shutdown_signal(sig, frame):
SERVER.shutdown() SERVER.shutdown()
def main(): def main():
global SERVER global SERVER, _START_TIME
_START_TIME = time.time()
_init_runtime() _init_runtime()
signal.signal(signal.SIGTERM, _handle_shutdown_signal) signal.signal(signal.SIGTERM, _handle_shutdown_signal)
signal.signal(signal.SIGINT, _handle_shutdown_signal) signal.signal(signal.SIGINT, _handle_shutdown_signal)