diff --git a/AI-MONITORING-DESIGN.md b/AI-MONITORING-DESIGN.md new file mode 100644 index 0000000..a7314c7 --- /dev/null +++ b/AI-MONITORING-DESIGN.md @@ -0,0 +1,638 @@ +# AI Monitoring — Design Specification + +> **Codex Launcher v3.8.0 Feature Design** +> Self-healing nano agent that monitors proxy health, diagnoses failures, and auto-recovers sessions. + +--- + +## 1. Problem Statement + +Over 42 sessions in production, we observed these failure categories: + +| # | Failure Category | Count | Example | +|---|-----------------|-------|---------| +| F1 | **parsed_tool_calls=0** — model produces unparseable output | 42 | Bare ``, `` without cmd, plain English intent | +| F2 | **Stuck recovery triggered** — Intelligence Routing Layer 3 | 13 | "I need to fetch the README", "let me write the script" | +| F3 | **Sanitizer flagged suspicious cmd** — cmd still JSON after unwrap | 11 | `{/'cmd/': /'sshpass -p .../'}` — double-escaped quoting | +| F4 | **Upstream 500** — provider internal error | ~5 | `"An internal error occurred. Please try again later."` | +| F5 | **Connection timeout** — upstream unreachable | ~3 | `Connection timed out after 15002 milliseconds` | +| F6 | **Upstream 401/403** — auth failure | ~2 | Wrong API key, expired token, `upgrade_required` | +| F7 | **Stream crash** — exception mid-stream | ~2 | `BrokenPipeError`, `ConnectionResetError` during SSE | +| F8 | **Proxy port conflict** — Address already in use | ~1 | Stale process holding port | +| F9 | **Schema cache corruption** — stale content_type=array | ~1 | `ErrorAnalyzer` learned wrong schema | +| F10 | **Codex Desktop crash** — SIGKILL at ~27GB | ~1 | Issue #24048 — unbounded tool output memory | +| F11 | **Codex 300s stall** — turn state machine race | ~1 | Issue #23807 — `stream disconnected` after 300s | + +### The Gap + +Intelligence Routing (v3.7.0) handles F1/F2/F3 **inside a single request**. 
But it can't: + +- **Detect a dead proxy process** (F7/F8) — the proxy already crashed +- **Reconnect Codex to a restarted proxy** (F5/F7/F8) — Codex doesn't auto-reconnect +- **Switch to a backup provider** when the primary is down (F4/F5) +- **Clear corrupt caches** (F9) — requires out-of-band action +- **Restart Codex Desktop** after a crash (F10/F11) +- **Learn from failure patterns** across sessions — each failure is handled independently + +### What We Need + +A **separate lightweight watchdog process** that: +1. Monitors proxy health continuously +2. Detects failures the proxy can't detect itself +3. Uses a cheap AI model to diagnose novel failures +4. Takes corrective action automatically +5. Learns from past incidents to prevent repeats + +--- + +## 2. Architecture + +``` +┌─────────────────────────────────────────────────────────────────────┐ +│ Codex Launcher GUI │ +│ ┌──────────┐ ┌──────────────┐ ┌───────────────────────────────┐ │ +│ │ Proxy │ │ Codex │ │ AI Monitoring Panel │ │ +│ │ Manager │ │ Launcher │ │ ┌─────────────────────┐ │ │ +│ │ │ │ │ │ │ ON/OFF Toggle │ │ │ +│ └────┬─────┘ └──────┬───────┘ │ │ Provider Selector │ │ │ +│ │ │ │ │ Model Selector │ │ │ +│ │ │ │ │ Incident Log │ │ │ +│ │ │ │ │ [View Diagnostics] │ │ │ +│ │ │ │ └─────────────────────┘ │ │ +│ │ │ └───────────────────────────────┘ │ +└───────┼───────────────┼────────────────────────────────────────────┘ + │ │ + ▼ ▼ +┌───────────────┐ ┌────────────────┐ +│ translate- │ │ Codex Desktop │ +│ proxy.py │ │ / CLI │ +│ (port 8080) │ │ │ +│ │ │ │ +│ /health ──────┼──┼─► health check │ +│ /responses ───┼──┼─► main API │ +└───────────────┘ └────────────────┘ + ▲ + │ health probes + log analysis + corrective actions + │ +┌───────┴────────────────────────────────────────────────────────────┐ +│ AI Monitor Watchdog │ +│ (thread in codex-launcher-gui) │ +│ │ +│ ┌─────────────────┐ ┌─────────────────┐ ┌──────────────────┐ │ +│ │ Health Watcher │ │ Log Analyzer │ │ AI Diagnostic │ │ +│ │ (every 
5s) │ │ (continuous) │ │ Agent (on-call) │ │ +│ │ │ │ │ │ │ │ +│ │ - /health probe │ │ - tail cc-debug │ │ - Classify err │ │ +│ │ - process alive │ │ - tail proxy.log│ │ - Root cause │ │ +│ │ - port check │ │ - pattern match │ │ - Suggest fix │ │ +│ │ - memory watch │ │ - incident DB │ │ - Execute fix │ │ +│ └────────┬────────┘ └────────┬────────┘ └────────┬─────────┘ │ +│ │ │ │ │ +│ └────────────────────┼─────────────────────┘ │ +│ ▼ │ +│ ┌──────────────────────┐ │ +│ │ Incident Store │ │ +│ │ (JSON file) │ │ +│ │ - Known patterns │ │ +│ │ - Past resolutions │ │ +│ │ - Success rates │ │ +│ └──────────────────────┘ │ +└─────────────────────────────────────────────────────────────────────┘ +``` + +--- + +## 3. Three-Tier Response System + +### Tier 1: Fast Path — Rule-Based Auto-Recovery (< 1 second) + +Immediate reactions to **known failure patterns**. No AI needed. + +```python +TIER1_RULES = [ + # (trigger_pattern, action, cooldown) + + # --- Proxy Health --- + ("proxy_health_fail", "restart_proxy", 30), + ("proxy_port_conflict", "kill_stale + restart", 60), + ("proxy_memory_over_1gb", "restart_proxy", 120), + + # --- Upstream Errors --- + ("upstream_429", "wait_retry_after", 0), + ("upstream_502_503", "retry_with_backoff", 30), + ("upstream_500_repeat_3x", "switch_provider", 60), + ("upstream_timeout", "retry + increase_timeout", 30), + ("upstream_401_403", "alert_user_bad_key", 0), + + # --- Stream Errors --- + ("stream_broken_pipe", "restart_proxy", 30), + ("stream_reset", "restart_proxy", 30), + ("stream_idle_300s", "restart_proxy", 60), + + # --- Parser Failures --- + ("parsed_tool_calls_0_x3", "clear_schema_cache", 300), + ("sanitizer_suspicious_5x","alert_user_model_issue", 0), + ("stuck_recovery_x5", "suggest_switch_model", 0), + + # --- Codex Process --- + ("codex_process_dead", "alert_user_restart", 0), + ("codex_memory_over_4gb", "alert_user_memory", 0), + + # --- Cache Corruption --- + ("schema_content_type_array", "delete_provider_caps", 0), +] +``` 
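A minimal sketch of how the `TIER1_RULES` table above might be consumed: a dispatcher looks up the trigger, enforces that rule's cooldown, and returns the action name to run. The `Tier1Engine` class, its injectable `clock` parameter, and the three-rule excerpt are illustrative assumptions, not code from the launcher.

```python
import time

# Excerpt of the TIER1_RULES table above: (trigger_pattern, action, cooldown_s).
RULES_EXCERPT = [
    ("proxy_health_fail", "restart_proxy", 30),
    ("upstream_429", "wait_retry_after", 0),
    ("parsed_tool_calls_0_x3", "clear_schema_cache", 300),
]


class Tier1Engine:
    """Hypothetical dispatcher: maps a trigger to an action, honoring cooldowns."""

    def __init__(self, rules, clock=time.monotonic):
        self._rules = {trigger: (action, cooldown) for trigger, action, cooldown in rules}
        self._clock = clock      # injectable so tests can control time
        self._last_fired = {}    # trigger -> timestamp of the last dispatch

    def dispatch(self, trigger):
        """Return the action name to run, or None if unknown or cooling down."""
        rule = self._rules.get(trigger)
        if rule is None:
            return None          # no Tier 1 rule; caller falls through to Tier 2
        action, cooldown = rule
        now = self._clock()
        last = self._last_fired.get(trigger)
        if last is not None and now - last < cooldown:
            return None          # same trigger fired too recently
        self._last_fired[trigger] = now
        return action


if __name__ == "__main__":
    engine = Tier1Engine(RULES_EXCERPT)
    print(engine.dispatch("proxy_health_fail"))  # first occurrence fires
    print(engine.dispatch("proxy_health_fail"))  # suppressed by the 30s cooldown
```

Returning `None` for both "no rule" and "cooling down" keeps the sketch short; a real engine would likely distinguish the two so that only unmatched triggers fall through to Tier 2.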
+ +### Tier 2: Pattern Matching — Incident Store Lookup (< 100ms) + +For failures we've **seen before and resolved**, look up the fix: + +```json +{ + "incidents": [ + { + "pattern": "cc_stream_ended_empty + explore_agent + no_url", + "fix": "synth_explore_from_last_user_urls", + "source": "FIX-23", + "success_rate": 0.85, + "last_seen": "2026-05-22T16:00:00Z", + "occurrences": 5 + }, + { + "pattern": "require_escalation + no_cmd", + "fix": "auto_proceed_echo", + "source": "FIX-24", + "success_rate": 1.0, + "last_seen": "2026-05-22T15:30:00Z", + "occurrences": 3 + } + ] +} +``` + +### Tier 3: AI Diagnostic — Nano Agent (2-5 seconds) + +For **novel failures** that don't match any rule or pattern, invoke a cheap AI model: + +``` +Prompt Template (system): +───────────────────── +You are a diagnostic agent for a translation proxy that sits between +OpenAI Codex CLI/Desktop and AI providers (Command Code, OpenAI-compat, +Anthropic, etc.). You analyze error context and suggest ONE corrective action. + +Available actions: restart_proxy, kill_stale_processes, clear_schema_cache, +switch_provider, increase_timeout, alert_user, ignore, retry_now, +regenerate_config, cleanup_codex_stale + +Respond with ONLY a JSON object: {"action": "...", "reason": "...", "confidence": 0.0-1.0} + +Prompt Template (user): +───────────────────── +INCIDENT REPORT: +Time: {timestamp} +Session: {session_id} +Proxy health: {alive/dead, port, uptime, memory_mb} +Upstream: {url, model, last_http_code, last_error} +Recent errors (last 60s): +{log_lines} +Parser state: {parsed_tool_calls, stuck_recovery_count, sanitizer_flags} +Provider: {backend_type, model} +History: {last_5_incidents_for_this_pattern} + +What corrective action should be taken? +``` + +--- + +## 4. 
Complete Failure Catalog
+
+### Category A: Proxy-Level Failures (watchdog detects, auto-recovers)
+
+| ID | Failure | Symptoms | Tier 1 Action | Log Signature |
+|----|---------|----------|---------------|---------------|
+| A1 | Proxy process crashed | `/health` returns connection refused | `restart_proxy` | `urllib.error.URLError: [Errno 111] Connection refused` |
+| A2 | Port conflict | `Address already in use` on startup | `kill_stale + restart` | `OSError: [Errno 98] Address already in use` |
+| A3 | Memory leak | Process RSS > 1GB | `restart_proxy` | `/proc/{pid}/status` VmRSS check |
+| A4 | Deadlock | Health check hangs > 15s | `restart_proxy` | health probe timeout |
+| A5 | Unhandled exception | Process exits with non-zero | `restart_proxy` | `SELF-REVIVE CRASH #{n}` |
+| A6 | SSL/TLS error | `CERTIFICATE_VERIFY_FAILED` upstream | `alert_user` | `urllib.error.URLError: certificate verify failed` |
+| A7 | DNS resolution failure | `getaddrinfo failed` | `retry_with_backoff` | `socket.gaierror: Name or service not known` |
+
+### Category B: Upstream Provider Failures (proxy detects, watchdog analyzes)
+
+| ID | Failure | Symptoms | Tier 1 Action | Log Signature |
+|----|---------|----------|---------------|---------------|
+| B1 | Rate limit (429) | Too many requests | `wait_retry_after` | `HTTP 429` + `Retry-After` header |
+| B2 | Server error (5xx) | Provider down | `retry_with_backoff` | `HTTP 500/502/503` |
+| B3 | Auth failure (401/403) | Bad/expired key | `alert_user_bad_key` | `HTTP 401 {"error":"invalid_api_key"}` |
+| B4 | CC upgrade required (403) | Version mismatch | `update_cc_version` | `HTTP 403 upgrade_required` |
+| B5 | Connection timeout | Upstream silent | `retry + increase_timeout` | `urllib.error.URLError: timed out` |
+| B6 | Connection reset | Upstream dropped mid-stream | `restart_proxy` | `ConnectionResetError: Connection reset by peer` |
+| B7 | Broken pipe | Client disconnected | `ignore` | `BrokenPipeError: Broken pipe` |
+| B8 | Upstream 400 bad request | Malformed request | `clear_schema_cache` | `HTTP 400 {"error":"...expected string..."}` |
+| B9 | Provider capacity (503) | Overloaded | `switch_provider` | `HTTP 503` after 3 retries |
+| B10 | Cloudflare block (403/1010) | Bot detection | `check_browser_ua` | `HTTP 403 error 1010` |
+
+### Category C: Parser/Format Failures (Intelligence Routing handles, watchdog tracks)
+
+| ID | Failure | Symptoms | Auto-Fix (IR Layer) | Watchdog Escalation |
+|----|---------|----------|--------------------|--------------------|
+| C1 | Bare `` | `parsed_tool_calls=0` | Layer 1: URL extraction | If 3x in a row → suggest model switch |
+| C2 | `` block | Model wants permissions | Layer 2: Auto-proceed | If 5x → suggest different provider |
+| C3 | Unrecognized format | No parser matches | Layer 3: Intent synthesis | If 5x → log for AI diagnosis |
+| C4 | Double-wrapped cmd | `cmd = "{\"cmd\": ...}"` | Sanitizer: unwrap | If cmd still JSON → alert |
+| C5 | Suspicious cmd (JSON) | `cmd starts with {` | Sanitizer: flag | If 3x → clear cache + restart |
+| C6 | Empty cmd | `cmd = ""` or `cmd = "{}"` | Sanitizer: diagnostic echo | If 3x → suggest model switch |
+| C7 | Bare `{` token | Model outputs incomplete JSON | Layer 3: heuristic 5 | If persistent → AI diagnosis |
+| C8 | `` without cmd | Block has sandbox but no command | Layer 3: heuristic | If 3x → AI diagnosis |
+| C9 | DSML name mismatch | `name="cmd"` vs `name="command"` | DSML parser handles both | Self-test catches regression |
+| C10 | Stuck model loop | Same recovery 5+ times | Layer 3 max 3x then alert | Switch model or provider |
+
+### Category D: Codex Process Failures (watchdog detects, alerts user)
+
+| ID | Failure | Symptoms | Action | Log Signature |
+|----|---------|----------|--------|---------------|
+| D1 | Codex process killed | PID gone from pids.json | `alert_user_restart` | Process not in `/proc/{pid}` |
+| D2 | Codex memory explosion | RSS > 4GB | `alert_user_memory` | `/proc/{pid}/status` check |
+| D3 | Codex 300s stall | `stream disconnected` loop | `restart_proxy` | Codex stderr: `stream disconnected` |
+| D4 | Config corruption | `database disk image is malformed` | `regenerate_config` | Codex stderr: `malformed` |
+| D5 | Session context overflow | `context_length_exceeded` | `alert_user_context` | Codex stderr: `context_length_exceeded` |
+| D6 | WebSocket reconnect loop | `Reconnecting... N/5` | `check_proxy_health` | Codex stderr: `Reconnecting` |
+
+### Category E: Config/State Failures (watchdog detects, auto-fixes)
+
+| ID | Failure | Symptoms | Action | Detection |
+|----|---------|----------|--------|-----------|
+| E1 | Schema cache corruption | `content_type: "array"` in provider-caps.json | `delete_provider_caps` | Read file, check for known-bad values |
+| E2 | Stale PID file | pids.json has dead PIDs | `cleanup_pids` | Check `/proc/{pid}` existence |
+| E3 | Port from old session | config.toml has stale port | `regenerate_config` | Port in config != running port |
+| E4 | OAuth token expired | Google/Gemini token refresh fails | `alert_user_reauth` | Token file `expiry_ts < now` |
+| E5 | BGP all routes down | Every route returned error | `alert_user_no_provider` | All routes in cooldown |
+
+---
+
+## 5. Component Design
+
+### 5.1 Health Watcher Thread
+
+Runs in the GUI process as a background thread. Pings proxy `/health` endpoint every 5 seconds.
+
+```python
+import threading
+import time
+import urllib.request
+
+class HealthWatcher(threading.Thread):
+    def __init__(self, proxy_port, on_failure, on_recovery):
+        super().__init__(daemon=True)
+        self.proxy_port = proxy_port
+        self.on_failure = on_failure
+        self.on_recovery = on_recovery
+        self.check_interval = 5  # seconds
+        self.failures = 0
+        self.running = True
+
+    def run(self):
+        while self.running:
+            healthy = self._check_health()
+            if healthy:
+                # Signal recovery only if a failure was actually reported.
+                if self.failures >= 3:
+                    self.on_recovery()
+                self.failures = 0
+            else:
+                self.failures += 1
+                if self.failures >= 3:  # three consecutive misses, roughly 15s
+                    self.on_failure(self.failures)
+            time.sleep(self.check_interval)
+
+    def _check_health(self):
+        try:
+            url = f"http://localhost:{self.proxy_port}/health"
+            # Non-2xx responses raise HTTPError and count as unhealthy.
+            with urllib.request.urlopen(url, timeout=5) as resp:
+                return resp.status == 200
+        except Exception:
+            return False
+```
+
+### 5.2 Log Analyzer Thread
+
+Tails the debug log and extracts failure signals in real time.
+
+```python
+import threading
+import time
+
+# Log substring -> (fault ID from the failure catalog, signal category).
+FAILURE_SIGNALS = {
+    "parsed_tool_calls=0": ("C1", "parser_empty"),
+    "[STUCK-RECOVERY]": ("C3", "stuck_recovery"),
+    "suspicious cmd": ("C4", "sanitizer_flag"),
+    "empty cmd recovered": ("C6", "empty_cmd"),
+    "HTTP 429": ("B1", "rate_limited"),
+    "HTTP 500": ("B2", "server_error"),
+    "HTTP 401": ("B3", "auth_failure"),
+    "HTTP 403": ("B4", "forbidden"),
+    "Connection refused": ("A1", "proxy_dead"),
+    "Address already in use": ("A2", "port_conflict"),
+    "Broken pipe": ("B7", "broken_pipe"),
+    "Connection reset": ("B6", "connection_reset"),
+    "timed out": ("B5", "timeout"),
+    "SELF-REVIVE CRASH": ("A5", "proxy_crash"),
+    "stream error": ("B6", "stream_error"),
+}
+
+class LogAnalyzer(threading.Thread):
+    def __init__(self, log_path, on_signal):
+        super().__init__(daemon=True)
+        self.log_path = log_path
+        self.on_signal = on_signal
+        self.running = True
+
+    def run(self):
+        # The context manager closes the file when the thread stops.
+        with open(self.log_path, "r") as fh:
+            fh.seek(0, 2)  # seek to end; only new lines are analyzed
+            while self.running:
+                line = fh.readline()
+                if not line:
+                    time.sleep(0.5)
+                    continue
+                # First matching pattern wins (dict insertion order).
+                for pattern, (fault_id, category) in FAILURE_SIGNALS.items():
+                    if pattern in line:
+                        self.on_signal(fault_id, category, line.strip())
+                        break
+```
+
+### 5.3 AI Diagnostic Agent
+
+Invoked by the watchdog when a failure doesn't match Tier 1 rules or Tier 2 patterns.
+
+```python
+import json
+import urllib.request
+
+# IncidentStore and the _extract_pattern, _build_prompt and _parse_response
+# helpers are not shown in this section.
+class AIDiagnosticAgent:
+    def __init__(self, provider_url, model, api_key):
+        self.provider_url = provider_url
+        self.model = model
+        self.api_key = api_key
+        self.system_prompt = DIAGNOSTIC_SYSTEM_PROMPT  # defined below
+        self.incident_store = IncidentStore()
+
+    def diagnose(self, context):
+        # Tier 2: Check incident store first
+        pattern = self._extract_pattern(context)
+        known_fix = self.incident_store.lookup(pattern)
+        if known_fix and known_fix["success_rate"] > 0.7:
+            return known_fix["fix"], "tier2_pattern", known_fix["success_rate"]
+
+        # Tier 3: Ask AI
+        prompt = self._build_prompt(context)
+        response = self._call_model(prompt)
+        action = self._parse_response(response)
+
+        # Learn from this incident
+        if action:
+            self.incident_store.record(pattern, action)
+
+        return action, "tier3_ai", None
+
+    def _call_model(self, prompt):
+        body = {
+            "model": self.model,
+            "messages": [
+                {"role": "system", "content": self.system_prompt},
+                {"role": "user", "content": prompt}
+            ],
+            "max_tokens": 200,
+            "temperature": 0.1,
+        }
+        req = urllib.request.Request(
+            self.provider_url,
+            data=json.dumps(body).encode(),
+            headers={
+                "Content-Type": "application/json",
+                "Authorization": f"Bearer {self.api_key}",
+            }
+        )
+        with urllib.request.urlopen(req, timeout=15) as resp:
+            return json.loads(resp.read())["choices"][0]["message"]["content"]
+```
+
+### 5.4 Incident Store
+
+JSON file that accumulates failure patterns and their resolutions.
+ +```json +{ + "version": 1, + "incidents": { + "parser_empty+explore_agent": { + "fault_ids": ["C1"], + "fix": "synth_explore_from_urls", + "source": "intelligent_routing", + "success_count": 8, + "fail_count": 1, + "last_seen": "2026-05-22T16:00:00Z", + "auto_applied": true + }, + "server_error+repeat_3x": { + "fault_ids": ["B2"], + "fix": "switch_provider", + "source": "tier1_rule", + "success_count": 2, + "fail_count": 0, + "last_seen": "2026-05-22T14:00:00Z", + "auto_applied": true + } + }, + "ai_diagnostic_calls": 0, + "tokens_used": 0, + "cost_usd": 0.0 +} +``` + +### 5.5 Diagnostic Agent System Prompt + +``` +You are a diagnostic agent for "Codex Launcher" — a desktop app that runs a local +translation proxy between OpenAI Codex CLI/Desktop and various AI providers. + +## Your Job +Analyze the incident report and recommend ONE corrective action. + +## Available Actions +- restart_proxy: Kill and restart translate-proxy.py +- kill_stale_processes: Kill orphaned proxy/codex processes +- clear_schema_cache: Delete ~/.cache/codex-proxy/provider-caps.json +- switch_provider: Switch to a different configured endpoint +- increase_timeout: Increase upstream timeout for slow providers +- regenerate_config: Regenerate Codex config.toml +- cleanup_codex_stale: Run cleanup-codex-stale.sh +- alert_user: Show notification to user (can't auto-fix) +- ignore: Transient error, no action needed +- retry_now: Immediate retry without changes + +## Decision Rules +- If upstream returns 401/403 with auth error → alert_user (can't fix bad keys) +- If proxy process is dead → restart_proxy +- If same error repeated 5+ times → switch_provider or alert_user +- If error is about content_type/schema → clear_schema_cache +- If "Address already in use" → kill_stale_processes then restart_proxy +- If timeout and upstream is slow → increase_timeout +- If single transient 429/502/503 → ignore (retry handles it) +- If "stream disconnected" and proxy is healthy → ignore (Codex retries) + +## 
Response Format +Reply with ONLY a JSON object: +{"action": "...", "reason": "...", "confidence": 0.0-1.0} + +No explanation, no markdown, no extra text. +``` + +--- + +## 6. GUI Integration + +### AI Monitoring Panel (in Settings tab) + +``` +┌─────────────────────────────────────────────────────────┐ +│ AI Monitoring [ON] │ +│ │ +│ ┌─ Diagnostic Agent ─────────────────────────────────┐ │ +│ │ Provider: [OpenCode Zen ▼] │ │ +│ │ Model: [Qwen3-32B ▼] │ │ +│ │ API Key: [sk-•••••••••••••••••••• ] │ │ +│ │ │ │ +│ │ Cost this month: $0.12 (3 diagnostic calls) │ │ +│ │ Tokens used: 1,847 input / 423 output │ │ +│ └─────────────────────────────────────────────────────┘ │ +│ │ +│ ┌─ Incident Log (last 7 days) ──────────────────────┐ │ +│ │ ✅ 16:00 F1 parser_empty → synth_explore (Tier 2) │ │ +│ │ ⚠️ 15:30 B2 server_error → retry (Tier 1) │ │ +│ │ ✅ 15:00 A1 proxy_dead → restart_proxy (Tier 1) │ │ +│ │ 🤖 14:30 C3 novel_format → clear_cache (Tier 3) │ │ +│ │ ... │ │ +│ └────────────────────────────────────────────────────┘ │ +│ │ +│ [View Full Diagnostics] [Export Incident Report] │ +└─────────────────────────────────────────────────────────┘ +``` + +### Config Storage (in endpoints.json) + +```json +{ + "ai_monitoring": { + "enabled": true, + "provider_url": "https://opencode.ai/zen/v1/chat/completions", + "model": "Qwen/Qwen3-32B", + "api_key": "sk-...", + "tier1_enabled": true, + "tier2_enabled": true, + "tier3_enabled": true, + "auto_restart_proxy": true, + "auto_switch_provider": false, + "health_check_interval_s": 5, + "max_memory_mb": 1024, + "notification_level": "important_only" + } +} +``` + +### Recommended Models (by cost) + +| Model | Cost/Diagnosis | Latency | Quality | Recommended For | +|-------|---------------|---------|---------|----------------| +| **Qwen3-32B** (OpenCode) | ~$0.0005 | 2-4s | Good | Default — cheapest decent model | +| **DeepSeek V4 Flash** | ~$0.0003 | 2-3s | Good | Cheapest option | +| **GPT-4o-mini** | ~$0.001 | 1-2s | Excellent | Best 
quality/latency |
+| **Gemini 2.0 Flash** | ~$0.0002 | 1-2s | Good | Cheapest + fastest |
+| **Claude Haiku 4.5** | ~$0.001 | 2-3s | Excellent | Best reasoning quality |
+| **Local Ollama** (if running) | $0 | 5-15s | Varies | Zero-cost offline option |
+
+### Cost Estimate
+
+- Average diagnostic prompt: ~800 tokens input, ~100 tokens output
+- Expected frequency: ~1-5 incidents per day that reach Tier 3
+- **Monthly cost**: roughly $0.01 - $0.15 at the per-diagnosis costs listed above (30-150 calls at $0.0002-$0.001 each), depending on model and usage
+
+---
+
+## 7. Watchdog Response Flow
+
+```
+Failure Detected
+       │
+       ▼
+┌──────────────┐   YES    ┌──────────────────┐
+│ Tier 1 Rule? ├─────────►│ Execute Action   │
+│ (known)      │          │ Log incident     │
+└──────┬───────┘          └──────────────────┘
+       │ NO
+       ▼
+┌──────────────┐   YES    ┌──────────────────┐
+│ Tier 2 Match?├─────────►│ Apply Known Fix  │
+│ (incident DB)│          │ Update success   │
+└──────┬───────┘          └──────────────────┘
+       │ NO
+       ▼
+┌──────────────┐   YES    ┌──────────────────┐
+│ AI Enabled?  ├─────────►│ Collect Context  │
+│ (Tier 3)     │          │ Build Prompt     │
+└──────┬───────┘          │ Call AI Model    │
+       │ NO               │ Parse Response   │
+       ▼                  │ Execute if auto  │
+┌──────────────┐          │ Store incident   │
+│ Alert User   │          └──────────────────┘
+│ (can't fix)  │
+└──────────────┘
+```
+
+---
+
+## 8. Safety Guards
+
+1. **Rate limit AI calls** — max 1 Tier 3 call per 60 seconds, max 10 per day
+2. **Never auto-execute destructive actions** — `alert_user` for: delete files, change API keys, modify source code
+3. **Auto-restart cap** — max 5 proxy restarts per 10 minutes, then alert user
+4. **Cost cap** — monthly AI diagnostic budget (configurable, default $2/month)
+5. **Cooldown per pattern** — same failure pattern has escalating cooldown (30s → 60s → 300s → alert)
+6. **User override** — any auto-action can be cancelled within 3 seconds via GUI
+7. **Incident store max size** — 500 entries, LRU eviction
+8. **Health check bypass** — if user manually stopped proxy, don't alert
+
+---
+
+## 9. Implementation Plan
+
+### Phase 1: Core Watchdog (v3.8.0)
+- `HealthWatcher` thread in `codex-launcher-gui`
+- `LogAnalyzer` thread tailing `cc-debug.log` and `proxy.log`
+- Tier 1 rule engine with all 20+ rules
+- Incident store (JSON file)
+- GUI toggle (ON/OFF) in settings
+- Auto-restart proxy on crash
+
+### Phase 2: Pattern Learning (v3.8.1)
+- Tier 2 incident store lookup
+- Auto-learn from Intelligence Routing outcomes
+- Success rate tracking per pattern
+- Incident log viewer in GUI
+
+### Phase 3: AI Diagnostic Agent (v3.9.0)
+- Tier 3 AI model integration
+- Provider/model selector in GUI
+- Diagnostic prompt template
+- Cost tracking
+- Full incident report export
+
+### Phase 4: Advanced Recovery (v4.0.0)
+- Auto-switch to backup provider on repeated failure
+- BGP route health monitoring
+- Predictive failure detection (memory growth, latency trends)
+- Codex process memory monitoring
+- WebSocket reconnect assistance
+
+---
+
+## 10. File Changes Summary
+
+| File | Changes |
+|------|---------|
+| `codex-launcher-gui` | +HealthWatcher thread, +LogAnalyzer thread, +AI Monitoring panel, +incident log viewer |
+| `translate-proxy.py` | +`/monitoring` endpoint (returns health + metrics), enhanced `/health` with memory/uptime |
+| `~/.cache/codex-proxy/incident-store.json` | New file — incident pattern database |
+| `~/.cache/codex-proxy/monitoring.log` | New file — watchdog activity log |
+| `~/.codex/endpoints.json` | +`ai_monitoring` config section |
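
The `incident-store.json` file listed above could be managed along the following lines. This is a sketch under stated assumptions: the `IncidentStore` class and its `record`/`lookup` signatures are hypothetical, the field names follow the section 5.4 schema (`success_count`, `fail_count`, `last_seen`), the 0.7 threshold mirrors the check in `AIDiagnosticAgent.diagnose`, and the 500-entry cap from Safety Guard 7 is approximated by evicting the least recently recorded pattern.

```python
import json
import os
import tempfile
from datetime import datetime, timezone

MAX_ENTRIES = 500  # Safety Guard 7: incident store max size


class IncidentStore:
    """Hypothetical helper around the incident-store.json schema from section 5.4."""

    def __init__(self, path):
        self.path = path
        self.data = {"version": 1, "incidents": {}}
        if os.path.exists(path):
            with open(path, "r", encoding="utf-8") as fh:
                self.data = json.load(fh)

    def record(self, pattern, fix, succeeded):
        """Count one outcome for a pattern and persist the store."""
        incidents = self.data["incidents"]
        entry = incidents.pop(pattern, None) or {"success_count": 0, "fail_count": 0}
        entry["fix"] = fix
        entry["success_count" if succeeded else "fail_count"] += 1
        entry["last_seen"] = datetime.now(timezone.utc).isoformat()
        incidents[pattern] = entry           # re-insert so it becomes the newest key
        while len(incidents) > MAX_ENTRIES:  # evict the least recently recorded entry
            del incidents[next(iter(incidents))]
        self._save()

    def lookup(self, pattern, min_rate=0.7):
        """Return the stored fix if its observed success rate exceeds min_rate."""
        entry = self.data["incidents"].get(pattern)
        if entry is None:
            return None
        total = entry["success_count"] + entry["fail_count"]
        rate = entry["success_count"] / total if total else 0.0
        return entry["fix"] if rate > min_rate else None

    def _save(self):
        # Write to a temp file first so a crash cannot leave a half-written store.
        directory = os.path.dirname(os.path.abspath(self.path))
        fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(self.data, fh, indent=2)
        os.replace(tmp, self.path)
```

Deriving the success rate from the two counters at lookup time avoids storing a third field that could drift out of sync. Eviction here relies on dictionary insertion order, so a lookup does not refresh an entry; a stricter LRU would also move entries on read.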