docs: AI Monitoring design spec v3.8.0 — self-healing watchdog with 3-tier response system

2026-05-22 22:22:30 +04:00
parent f49489c099
commit 4334540f33
1 changed files with 638 additions and 0 deletions
--- a/AI-MONITORING-DESIGN.md
+++ b/AI-MONITORING-DESIGN.md
@@ -0,0 +1,638 @@
+# AI Monitoring — Design Specification
+
+> **Codex Launcher v3.8.0 Feature Design**
+> Self-healing nano agent that monitors proxy health, diagnoses failures, and auto-recovers sessions.
+
+---
+
+## 1. Problem Statement
+
+Over 42 sessions in production, we observed these failure categories:
+
+| # | Failure Category | Count | Example |
+|---|-----------------|-------|---------|
+| F1 | **parsed_tool_calls=0** — model produces unparseable output | 42 | Bare `<explore_agent>`, `<bash>` without cmd, plain English intent |
+| F2 | **Stuck recovery triggered** — Intelligence Routing Layer 3 | 13 | "I need to fetch the README", "let me write the script" |
+| F3 | **Sanitizer flagged suspicious cmd** — cmd still JSON after unwrap | 11 | `{/'cmd/': /'sshpass -p .../'}` — double-escaped quoting |
+| F4 | **Upstream 500** — provider internal error | ~5 | `"An internal error occurred. Please try again later."` |
+| F5 | **Connection timeout** — upstream unreachable | ~3 | `Connection timed out after 15002 milliseconds` |
+| F6 | **Upstream 401/403** — auth failure | ~2 | Wrong API key, expired token, `upgrade_required` |
+| F7 | **Stream crash** — exception mid-stream | ~2 | `BrokenPipeError`, `ConnectionResetError` during SSE |
+| F8 | **Proxy port conflict** — Address already in use | ~1 | Stale process holding port |
+| F9 | **Schema cache corruption** — stale content_type=array | ~1 | `ErrorAnalyzer` learned wrong schema |
+| F10 | **Codex Desktop crash** — SIGKILL at ~27GB | ~1 | Issue #24048 — unbounded tool output memory |
+| F11 | **Codex 300s stall** — turn state machine race | ~1 | Issue #23807 — `stream disconnected` after 300s |
+
+### The Gap
+
+Intelligence Routing (v3.7.0) handles F1/F2/F3 **inside a single request**. But it can't:
+
+- **Detect a dead proxy process** (F7/F8) — the proxy already crashed
+- **Reconnect Codex to a restarted proxy** (F5/F7/F8) — Codex doesn't auto-reconnect
+- **Switch to a backup provider** when the primary is down (F4/F5)
+- **Clear corrupt caches** (F9) — requires out-of-band action
+- **Restart Codex Desktop** after a crash (F10/F11)
+- **Learn from failure patterns** across sessions — each failure is handled independently
+
+### What We Need
+
+A **separate lightweight watchdog**, running outside the proxy process as background threads in the GUI, that:
+1. Monitors proxy health continuously
+2. Detects failures the proxy can't detect itself
+3. Uses a cheap AI model to diagnose novel failures
+4. Takes corrective action automatically
+5. Learns from past incidents to prevent repeats
+
+---
+
+## 2. Architecture
+
+```
+┌─────────────────────────────────────────────────────────────────────┐
+│                        Codex Launcher GUI                            │
+│  ┌──────────┐  ┌──────────────┐  ┌───────────────────────────────┐ │
+│  │  Proxy   │  │   Codex      │  │   AI Monitoring Panel         │ │
+│  │  Manager │  │   Launcher   │  │   ┌─────────────────────┐     │ │
+│  │          │  │              │  │   │ ON/OFF Toggle        │     │ │
+│  └────┬─────┘  └──────┬───────┘  │   │ Provider Selector    │     │ │
+│       │               │          │   │ Model Selector        │     │ │
+│       │               │          │   │ Incident Log          │     │ │
+│       │               │          │   │ [View Diagnostics]    │     │ │
+│       │               │          │   └─────────────────────┘     │ │
+│       │               │          └───────────────────────────────┘ │
+└───────┼───────────────┼────────────────────────────────────────────┘
+        │               │
+        ▼               ▼
+┌───────────────┐  ┌────────────────┐
+│ translate-    │  │  Codex Desktop  │
+│ proxy.py      │  │  / CLI          │
+│ (port 8080)   │  │                 │
+│               │  │                 │
+│ /health ──────┼──┼─► health check  │
+│ /responses ───┼──┼─► main API      │
+└───────────────┘  └────────────────┘
+        ▲
+        │ health probes + log analysis + corrective actions
+        │
+┌───────┴────────────────────────────────────────────────────────────┐
+│                     AI Monitor Watchdog                             │
+│                    (thread in codex-launcher-gui)                   │
+│                                                                     │
+│  ┌─────────────────┐  ┌─────────────────┐  ┌──────────────────┐  │
+│  │  Health Watcher  │  │  Log Analyzer   │  │  AI Diagnostic   │  │
+│  │  (every 5s)      │  │  (continuous)    │  │  Agent (on-call) │  │
+│  │                  │  │                  │  │                  │  │
+│  │  - /health probe │  │  - tail cc-debug │  │  - Classify err  │  │
+│  │  - process alive │  │  - tail proxy.log│  │  - Root cause    │  │
+│  │  - port check    │  │  - pattern match │  │  - Suggest fix   │  │
+│  │  - memory watch  │  │  - incident DB   │  │  - Execute fix   │  │
+│  └────────┬────────┘  └────────┬────────┘  └────────┬─────────┘  │
+│           │                    │                     │             │
+│           └────────────────────┼─────────────────────┘             │
+│                                ▼                                   │
+│                    ┌──────────────────────┐                        │
+│                    │  Incident Store      │                        │
+│                    │  (JSON file)         │                        │
+│                    │  - Known patterns    │                        │
+│                    │  - Past resolutions  │                        │
+│                    │  - Success rates     │                        │
+│                    └──────────────────────┘                        │
+└─────────────────────────────────────────────────────────────────────┘
+```
+
+---
+
+## 3. Three-Tier Response System
+
+### Tier 1: Fast Path — Rule-Based Auto-Recovery (< 1 second)
+
+Immediate reactions to **known failure patterns**. No AI needed.
+
+```python
+TIER1_RULES = [
+    # (trigger_pattern, action, cooldown)
+    
+    # --- Proxy Health ---
+    ("proxy_health_fail",      "restart_proxy",           30),
+    ("proxy_port_conflict",    "kill_stale + restart",     60),
+    ("proxy_memory_over_1gb",  "restart_proxy",           120),
+    
+    # --- Upstream Errors ---
+    ("upstream_429",           "wait_retry_after",          0),
+    ("upstream_502_503",       "retry_with_backoff",       30),
+    ("upstream_500_repeat_3x", "switch_provider",          60),
+    ("upstream_timeout",       "retry + increase_timeout", 30),
+    ("upstream_401_403",       "alert_user_bad_key",        0),
+    
+    # --- Stream Errors ---
+    ("stream_broken_pipe",     "restart_proxy",            30),
+    ("stream_reset",           "restart_proxy",            30),
+    ("stream_idle_300s",       "restart_proxy",            60),
+    
+    # --- Parser Failures ---
+    ("parsed_tool_calls_0_x3", "clear_schema_cache",      300),
+    ("sanitizer_suspicious_5x","alert_user_model_issue",    0),
+    ("stuck_recovery_x5",      "suggest_switch_model",      0),
+    
+    # --- Codex Process ---
+    ("codex_process_dead",     "alert_user_restart",         0),
+    ("codex_memory_over_4gb",  "alert_user_memory",          0),
+    
+    # --- Cache Corruption ---
+    ("schema_content_type_array", "delete_provider_caps",     0),
+]
+```
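The rule table above pairs each trigger with an action and a cooldown. A minimal sketch of how a dispatcher might honor those cooldowns follows; the `Tier1Engine` class, its `dispatch` method, and the status strings are hypothetical names for illustration, not part of the spec.

```python
import time

# Two entries copied from TIER1_RULES, re-keyed by trigger for direct lookup.
RULES = {
    "proxy_health_fail": ("restart_proxy", 30),
    "upstream_429": ("wait_retry_after", 0),
}


class Tier1Engine:
    """Hypothetical dispatcher that suppresses a rule while it is cooling down."""

    def __init__(self, rules, clock=time.monotonic):
        self.rules = rules
        self.clock = clock        # injectable so the cooldown logic can be tested
        self.last_fired = {}      # trigger -> time the action was last dispatched

    def dispatch(self, trigger):
        """Return ("fire", action), ("cooldown", None) or ("no_rule", None)."""
        rule = self.rules.get(trigger)
        if rule is None:
            return ("no_rule", None)   # caller falls through to Tier 2
        action, cooldown = rule
        now = self.clock()
        last = self.last_fired.get(trigger)
        if last is not None and now - last < cooldown:
            return ("cooldown", None)
        self.last_fired[trigger] = now
        return ("fire", action)
```

Separating "no rule" from "cooldown" lets the caller escalate unknown triggers to Tier 2 without also escalating triggers that are merely rate-limited.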
+
+### Tier 2: Pattern Matching — Incident Store Lookup (< 100ms)
+
+For failures we've **seen before and resolved**, look up the fix:
+
+```json
+{
+  "incidents": [
+    {
+      "pattern": "cc_stream_ended_empty + explore_agent + no_url",
+      "fix": "synth_explore_from_last_user_urls",
+      "source": "FIX-23",
+      "success_rate": 0.85,
+      "last_seen": "2026-05-22T16:00:00Z",
+      "occurrences": 5
+    },
+    {
+      "pattern": "require_escalation + no_cmd",
+      "fix": "auto_proceed_echo",
+      "source": "FIX-24",
+      "success_rate": 1.0,
+      "last_seen": "2026-05-22T15:30:00Z",
+      "occurrences": 3
+    }
+  ]
+}
+```
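A lookup over the list-shaped store above could be as small as the sketch below. The function name `lookup_fix` is hypothetical; the 0.7 threshold is borrowed from the `diagnose()` code in Section 5.3.

```python
def lookup_fix(store, pattern, min_success_rate=0.7):
    """Return the stored fix for `pattern`, or None if unknown or unreliable."""
    for incident in store.get("incidents", []):
        if incident["pattern"] == pattern:
            if incident["success_rate"] > min_success_rate:
                return incident["fix"]
            return None  # known pattern, but the fix has not proven reliable
    return None


# The two incidents from the JSON example above, reduced to the fields used here.
STORE = {
    "incidents": [
        {
            "pattern": "cc_stream_ended_empty + explore_agent + no_url",
            "fix": "synth_explore_from_last_user_urls",
            "success_rate": 0.85,
        },
        {
            "pattern": "require_escalation + no_cmd",
            "fix": "auto_proceed_echo",
            "success_rate": 1.0,
        },
    ]
}
```

With this data, `lookup_fix(STORE, "require_escalation + no_cmd")` yields `"auto_proceed_echo"`, while an unseen pattern yields `None` and would fall through to Tier 3.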
+
+### Tier 3: AI Diagnostic — Nano Agent (2-5 seconds)
+
+For **novel failures** that don't match any rule or pattern, invoke a cheap AI model:
+
+```
+Prompt Template (system):
+─────────────────────
+You are a diagnostic agent for a translation proxy that sits between
+OpenAI Codex CLI/Desktop and AI providers (Command Code, OpenAI-compat,
+Anthropic, etc.). You analyze error context and suggest ONE corrective action.
+
+Available actions: restart_proxy, kill_stale_processes, clear_schema_cache,
+switch_provider, increase_timeout, alert_user, ignore, retry_now,
+regenerate_config, cleanup_codex_stale
+
+Respond with ONLY a JSON object: {"action": "...", "reason": "...", "confidence": 0.0-1.0}
+
+Prompt Template (user):
+─────────────────────
+INCIDENT REPORT:
+Time: {timestamp}
+Session: {session_id}
+Proxy health: {alive/dead, port, uptime, memory_mb}
+Upstream: {url, model, last_http_code, last_error}
+Recent errors (last 60s):
+{log_lines}
+Parser state: {parsed_tool_calls, stuck_recovery_count, sanitizer_flags}
+Provider: {backend_type, model}
+History: {last_5_incidents_for_this_pattern}
+
+What corrective action should be taken?
+```
+
+---
+
+## 4. Complete Failure Catalog
+
+### Category A: Proxy-Level Failures (watchdog detects, auto-recovers)
+
+| ID | Failure | Symptoms | Tier 1 Action | Log Signature |
+|----|---------|----------|---------------|---------------|
+| A1 | Proxy process crashed | `/health` returns connection refused | `restart_proxy` | `urllib.error.URLError: [Errno 111] Connection refused` |
+| A2 | Port conflict | `Address already in use` on startup | `kill_stale + restart` | `OSError: [Errno 98] Address already in use` |
+| A3 | Memory leak | Process RSS > 1GB | `restart_proxy` | `/proc/{pid}/status` VmRSS check |
+| A4 | Deadlock | Health check hangs > 15s | `restart_proxy` | health probe timeout |
+| A5 | Unhandled exception | Process exits with non-zero | `restart_proxy` | `SELF-REVIVE CRASH #{n}` |
+| A6 | SSL/TLS error | `CERTIFICATE_VERIFY_FAILED` upstream | `alert_user` | `urllib.error.URLError: certificate verify failed` |
+| A7 | DNS resolution failure | `getaddrinfo failed` | `retry_with_backoff` | `socket.gaierror: Name or service not known` |
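For A3, the `VmRSS` check might look like the sketch below. It assumes a Linux `/proc` filesystem; the helper names are hypothetical, and the 1024 MB default mirrors the `max_memory_mb` example value in Section 6.

```python
def parse_vmrss_mb(status_text):
    """Extract the resident set size in MB from /proc/<pid>/status content."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            # The line looks like "VmRSS:     123456 kB".
            return int(line.split()[1]) / 1024
    return None  # kernel threads and zombies have no VmRSS line


def read_rss_mb(pid):
    """Read the RSS of a live process, or None if it is gone or /proc is absent."""
    try:
        with open(f"/proc/{pid}/status") as fh:
            return parse_vmrss_mb(fh.read())
    except OSError:
        return None


def over_memory_limit(rss_mb, limit_mb=1024):
    """True only when a reading exists and exceeds the limit."""
    return rss_mb is not None and rss_mb > limit_mb
```

Keeping the parser separate from the file read makes the threshold logic testable without a running proxy.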
+
+### Category B: Upstream Provider Failures (proxy detects, watchdog analyzes)
+
+| ID | Failure | Symptoms | Tier 1 Action | Log Signature |
+|----|---------|----------|---------------|---------------|
+| B1 | Rate limit (429) | Too many requests | `wait_retry_after` | `HTTP 429` + `Retry-After` header |
+| B2 | Server error (5xx) | Provider down | `retry_with_backoff` | `HTTP 500/502/503` |
+| B3 | Auth failure (401/403) | Bad/expired key | `alert_user_bad_key` | `HTTP 401 {"error":"invalid_api_key"}` |
+| B4 | CC upgrade required (403) | Version mismatch | `update_cc_version` | `HTTP 403 upgrade_required` |
+| B5 | Connection timeout | Upstream silent | `retry + increase_timeout` | `urllib.error.URLError: timed out` |
+| B6 | Connection reset | Upstream dropped mid-stream | `restart_proxy` | `ConnectionResetError: Connection reset by peer` |
+| B7 | Broken pipe | Client disconnected | `ignore` | `BrokenPipeError: Broken pipe` |
+| B8 | Upstream 400 bad request | Malformed request | `clear_schema_cache` | `HTTP 400 {"error":"...expected string..."}` |
+| B9 | Provider capacity (503) | Overloaded | `switch_provider` | `HTTP 503` after 3 retries |
+| B10 | Cloudflare block (403/1010) | Bot detection | `check_browser_ua` | `HTTP 403 error 1010` |
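The `wait_retry_after` and `retry_with_backoff` actions could share one delay calculation, sketched below. The function name and the `base` and `cap` defaults are assumptions; the spec does not state backoff parameters.

```python
def backoff_delay(attempt, retry_after=None, base=1.0, cap=60.0):
    """Seconds to wait before retry number `attempt` (0-based).

    A numeric Retry-After value wins; otherwise use capped exponential backoff.
    """
    if retry_after is not None:
        try:
            return max(0.0, float(retry_after))
        except ValueError:
            # Retry-After may also be an HTTP date; this sketch does not parse
            # that form and falls back to exponential backoff instead.
            pass
    return min(cap, base * (2 ** attempt))
```

For example, `backoff_delay(3)` gives 8.0 seconds, and `backoff_delay(2, "5")` gives 5.0 because the provider's own hint takes precedence.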
+
+### Category C: Parser/Format Failures (Intelligence Routing handles, watchdog tracks)
+
+| ID | Failure | Symptoms | Auto-Fix (IR Layer) | Watchdog Escalation |
+|----|---------|----------|--------------------|--------------------|
+| C1 | Bare `<explore_agent>` | `parsed_tool_calls=0` | Layer 1: URL extraction | If 3x in a row → suggest model switch |
+| C2 | `<require_escalation>` block | Model wants permissions | Layer 2: Auto-proceed | If 5x → suggest different provider |
+| C3 | Unrecognized format | No parser matches | Layer 3: Intent synthesis | If 5x → log for AI diagnosis |
+| C4 | Double-wrapped cmd | `cmd = "{\"cmd\": ...}"` | Sanitizer: unwrap | If cmd still JSON → alert |
+| C5 | Suspicious cmd (JSON) | `cmd starts with {` | Sanitizer: flag | If 3x → clear cache + restart |
+| C6 | Empty cmd | `cmd = ""` or `cmd = "{}"` | Sanitizer: diagnostic echo | If 3x → suggest model switch |
+| C7 | Bare `{` token | Model outputs incomplete JSON | Layer 3: heuristic 5 | If persistent → AI diagnosis |
+| C8 | `<bash>` without cmd | Block has sandbox but no command | Layer 3: heuristic | If 3x → AI diagnosis |
+| C9 | DSML name mismatch | `name="cmd"` vs `name="command"` | DSML parser handles both | Self-test catches regression |
+| C10 | Stuck model loop | Same recovery 5+ times | Layer 3 max 3x then alert | Switch model or provider |
+
+### Category D: Codex Process Failures (watchdog detects, alerts user)
+
+| ID | Failure | Symptoms | Action | Log Signature |
+|----|---------|----------|--------|---------------|
+| D1 | Codex process killed | PID gone from pids.json | `alert_user_restart` | Process not in `/proc/{pid}` |
+| D2 | Codex memory explosion | RSS > 4GB | `alert_user_memory` | `/proc/{pid}/status` check |
+| D3 | Codex 300s stall | `stream disconnected` loop | `restart_proxy` | Codex stderr: `stream disconnected` |
+| D4 | Config corruption | `database disk image is malformed` | `regenerate_config` | Codex stderr: `malformed` |
+| D5 | Session context overflow | `context_length_exceeded` | `alert_user_context` | Codex stderr: `context_length_exceeded` |
+| D6 | WebSocket reconnect loop | `Reconnecting... N/5` | `check_proxy_health` | Codex stderr: `Reconnecting` |
+
+### Category E: Config/State Failures (watchdog detects, auto-fixes)
+
+| ID | Failure | Symptoms | Action | Detection |
+|----|---------|----------|--------|-----------|
+| E1 | Schema cache corruption | `content_type: "array"` in provider-caps.json | `delete_provider_caps` | Read file, check for known-bad values |
+| E2 | Stale PID file | pids.json has dead PIDs | `cleanup_pids` | Check `/proc/{pid}` existence |
+| E3 | Port from old session | config.toml has stale port | `regenerate_config` | Port in config != running port |
+| E4 | OAuth token expired | Google/Gemini token refresh fails | `alert_user_reauth` | Token file `expiry_ts < now` |
+| E5 | BGP all routes down | Every route returned error | `alert_user_no_provider` | All routes in cooldown |
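The E2 check reduces to testing `/proc/{pid}` for each recorded PID. The sketch below takes a plain list of integers, since the layout of `pids.json` is not specified here; both function names are hypothetical.

```python
import os


def pid_alive(pid):
    """Linux-only liveness check, matching the /proc/{pid} test in the table."""
    return os.path.exists(f"/proc/{pid}")


def split_pids(pids, is_alive=pid_alive):
    """Partition recorded PIDs into (alive, dead), preserving order."""
    alive, dead = [], []
    for pid in pids:
        (alive if is_alive(pid) else dead).append(pid)
    return alive, dead
```

A `cleanup_pids` action would then rewrite the PID file with only the `alive` list.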
+
+---
+
+## 5. Component Design
+
+### 5.1 Health Watcher Thread
+
+Runs in the GUI process as a background thread. Probes the proxy's `/health` endpoint every 5 seconds.
+
+```python
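+import threading
+import time
+import urllib.request
+
+
+class HealthWatcher(threading.Thread):
+    """Polls the proxy's /health endpoint and reports failures and recoveries."""
+
+    def __init__(self, proxy_port, on_failure, on_recovery):
+        super().__init__(daemon=True)
+        self.proxy_port = proxy_port
+        self.on_failure = on_failure
+        self.on_recovery = on_recovery
+        self.check_interval = 5     # seconds between probes
+        self.failure_threshold = 3  # consecutive failed probes before escalating
+        self.failures = 0
+        self.running = True         # set to False to stop the loop
+
+    def run(self):
+        while self.running:
+            if self._check_health():
+                # Announce recovery only if the failure was actually reported.
+                if self.failures >= self.failure_threshold:
+                    self.on_recovery()
+                self.failures = 0
+            else:
+                self.failures += 1
+                # Fires at the third consecutive failure (roughly 15s) and on
+                # every failed probe after that; cooldowns limit the actions.
+                if self.failures >= self.failure_threshold:
+                    self.on_failure(self.failures)
+            time.sleep(self.check_interval)
+
+    def _check_health(self):
+        url = f"http://localhost:{self.proxy_port}/health"
+        try:
+            # The context manager closes the response so sockets don't pile up.
+            with urllib.request.urlopen(url, timeout=5) as resp:
+                return resp.status == 200
+        except Exception:
+            # Connection refused, timeout, or HTTP error all count as unhealthy.
+            return False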
+```
+
+### 5.2 Log Analyzer Thread
+
+Tails the debug log and extracts failure signals in real time.
+
+```python
+import threading
+import time
+
+# Substring -> (fault ID, category). Checked in insertion order; first match wins.
+FAILURE_SIGNALS = {
+    "parsed_tool_calls=0":      ("C1", "parser_empty"),
+    "[STUCK-RECOVERY]":         ("C3", "stuck_recovery"),
+    "suspicious cmd":           ("C4", "sanitizer_flag"),
+    "empty cmd recovered":      ("C6", "empty_cmd"),
+    "HTTP 429":                 ("B1", "rate_limited"),
+    "HTTP 500":                 ("B2", "server_error"),
+    "HTTP 401":                 ("B3", "auth_failure"),
+    "HTTP 403":                 ("B4", "forbidden"),  # a 403 may also be B3 or B10
+    "Connection refused":       ("A1", "proxy_dead"),
+    "Address already in use":   ("A2", "port_conflict"),
+    "Broken pipe":              ("B7", "broken_pipe"),
+    "Connection reset":         ("B6", "connection_reset"),
+    "timed out":                ("B5", "timeout"),
+    "SELF-REVIVE CRASH":        ("A5", "proxy_crash"),
+    "stream error":             ("B6", "stream_error"),
+}
+
+
+class LogAnalyzer(threading.Thread):
+    def __init__(self, log_path, on_signal):
+        super().__init__(daemon=True)
+        self.log_path = log_path
+        self.on_signal = on_signal
+        self.running = True  # set to False to stop the loop
+
+    def run(self):
+        # The context manager closes the file when the loop exits.
+        with open(self.log_path, "r", errors="replace") as fh:
+            fh.seek(0, 2)  # start at end of file; only new lines are analyzed
+            while self.running:
+                line = fh.readline()
+                if not line:
+                    time.sleep(0.5)  # no new data yet
+                    continue
+                for pattern, (fault_id, category) in FAILURE_SIGNALS.items():
+                    if pattern in line:
+                        self.on_signal(fault_id, category, line.strip())
+                        break
+```
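Several Tier 1 triggers (`parsed_tool_calls_0_x3`, `stuck_recovery_x5`) depend on repeated signals rather than a single log line. One way to derive them from the `on_signal` callback is a streak counter, sketched below. The `StreakCounter` class, the `to_trigger` helper, and the category-to-trigger mapping are assumptions for illustration.

```python
class StreakCounter:
    """Counts consecutive occurrences of the same signal category."""

    def __init__(self):
        self.category = None
        self.count = 0

    def observe(self, category):
        if category == self.category:
            self.count += 1
        else:
            self.category = category  # a different signal breaks the streak
            self.count = 1
        return self.count


# category -> (streak length, Tier 1 trigger name); assumed mapping.
THRESHOLDS = {
    "parser_empty": (3, "parsed_tool_calls_0_x3"),
    "stuck_recovery": (5, "stuck_recovery_x5"),
}


def to_trigger(counter, category):
    """Return a Tier 1 trigger exactly when the streak reaches its threshold."""
    streak = counter.observe(category)
    threshold = THRESHOLDS.get(category)
    if threshold is not None and streak == threshold[0]:
        return threshold[1]
    return None
```

Comparing with `==` rather than `>=` makes the trigger fire once per streak, leaving repeat handling to the rule cooldowns.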
+
+### 5.3 AI Diagnostic Agent
+
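+Invoked by the watchdog when a failure matches no Tier 1 rule. `diagnose()` consults the incident store (Tier 2) first and calls the model (Tier 3) only when no known fix has a success rate above 0.7.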
+
+```python
+import json
+import urllib.request
+
+
+class AIDiagnosticAgent:
+    def __init__(self, provider_url, model, api_key):
+        self.provider_url = provider_url
+        self.model = model
+        self.api_key = api_key
+        self.system_prompt = DIAGNOSTIC_SYSTEM_PROMPT  # see Section 5.5
+        self.incident_store = IncidentStore()          # see Section 5.4
+
+    def diagnose(self, context):
+        """Return (action, source_tier, success_rate_or_None)."""
+        # _extract_pattern, _build_prompt and _parse_response are not shown here.
+        # Tier 2: Check incident store first
+        pattern = self._extract_pattern(context)
+        known_fix = self.incident_store.lookup(pattern)
+        if known_fix and known_fix["success_rate"] > 0.7:
+            return known_fix["fix"], "tier2_pattern", known_fix["success_rate"]
+
+        # Tier 3: Ask AI
+        prompt = self._build_prompt(context)
+        response = self._call_model(prompt)
+        action = self._parse_response(response)
+
+        # Record the suggestion so the pattern can be matched next time
+        if action:
+            self.incident_store.record(pattern, action)
+
+        return action, "tier3_ai", None
+
+    def _call_model(self, prompt):
+        body = {
+            "model": self.model,
+            "messages": [
+                {"role": "system", "content": self.system_prompt},
+                {"role": "user", "content": prompt}
+            ],
+            "max_tokens": 200,
+            "temperature": 0.1,
+        }
+        req = urllib.request.Request(
+            self.provider_url,
+            data=json.dumps(body).encode(),  # a request with data is sent as POST
+            headers={
+                "Content-Type": "application/json",
+                "Authorization": f"Bearer {self.api_key}",
+            }
+        )
+        # urlopen raises urllib.error.URLError on network or HTTP failure.
+        with urllib.request.urlopen(req, timeout=15) as resp:
+            return json.loads(resp.read())["choices"][0]["message"]["content"]
+```
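The `_parse_response` helper is not shown in this spec. A cautious version might validate the model's reply against the action list from the prompt before anything is executed, as in the sketch below. The function name `parse_diagnosis` and its rejection rules are assumptions.

```python
import json

# Action names copied from the Tier 3 prompt template.
ALLOWED_ACTIONS = {
    "restart_proxy", "kill_stale_processes", "clear_schema_cache",
    "switch_provider", "increase_timeout", "alert_user", "ignore",
    "retry_now", "regenerate_config", "cleanup_codex_stale",
}


def parse_diagnosis(text):
    """Return a validated {"action", "reason", "confidence"} dict, or None."""
    try:
        data = json.loads(text.strip())
    except (json.JSONDecodeError, AttributeError):
        return None  # not JSON at all, or not a string
    if not isinstance(data, dict):
        return None
    action = data.get("action")
    if not isinstance(action, str) or action not in ALLOWED_ACTIONS:
        return None  # never pass an unknown action to the executor
    confidence = data.get("confidence")
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        return None
    if not 0.0 <= confidence <= 1.0:
        return None
    return {
        "action": action,
        "reason": str(data.get("reason", "")),
        "confidence": float(confidence),
    }
```

Returning `None` for anything malformed lets the caller fall back to `alert_user` instead of acting on a reply it cannot trust.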
+
+### 5.4 Incident Store
+
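+JSON file that accumulates failure patterns and their resolutions. Unlike the simplified Tier 2 example in Section 3, entries here are keyed by pattern and store raw success and failure counts rather than a precomputed `success_rate`.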
+
+```json
+{
+  "version": 1,
+  "incidents": {
+    "parser_empty+explore_agent": {
+      "fault_ids": ["C1"],
+      "fix": "synth_explore_from_urls",
+      "source": "intelligent_routing",
+      "success_count": 8,
+      "fail_count": 1,
+      "last_seen": "2026-05-22T16:00:00Z",
+      "auto_applied": true
+    },
+    "server_error+repeat_3x": {
+      "fault_ids": ["B2"],
+      "fix": "switch_provider",
+      "source": "tier1_rule",
+      "success_count": 2,
+      "fail_count": 0,
+      "last_seen": "2026-05-22T14:00:00Z",
+      "auto_applied": true
+    }
+  },
+  "ai_diagnostic_calls": 0,
+  "tokens_used": 0,
+  "cost_usd": 0.0
+}
+```
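Because this schema stores `success_count` and `fail_count`, the `success_rate` that `diagnose()` compares against 0.7 has to be derived. The sketch below shows one way to do that in memory. The class name `IncidentStoreSketch` and its methods are hypothetical, and file persistence is omitted.

```python
class IncidentStoreSketch:
    """In-memory stand-in for the incident store, keyed by pattern."""

    def __init__(self, incidents=None):
        self.incidents = incidents if incidents is not None else {}

    def record_outcome(self, pattern, fix, succeeded):
        """Count one applied fix as a success or a failure."""
        entry = self.incidents.setdefault(
            pattern, {"fix": fix, "success_count": 0, "fail_count": 0}
        )
        entry["success_count" if succeeded else "fail_count"] += 1

    def success_rate(self, pattern):
        """successes / attempts, or None when there is no data yet."""
        entry = self.incidents.get(pattern)
        if entry is None:
            return None
        total = entry["success_count"] + entry["fail_count"]
        return entry["success_count"] / total if total else None
```

With the first example entry above (8 successes, 1 failure) the derived rate is 8/9, about 0.89, which clears the 0.7 bar.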
+
+### 5.5 Diagnostic Agent System Prompt
+
+```
+You are a diagnostic agent for "Codex Launcher" — a desktop app that runs a local
+translation proxy between OpenAI Codex CLI/Desktop and various AI providers.
+
+## Your Job
+Analyze the incident report and recommend ONE corrective action.
+
+## Available Actions
+- restart_proxy: Kill and restart translate-proxy.py
+- kill_stale_processes: Kill orphaned proxy/codex processes
+- clear_schema_cache: Delete ~/.cache/codex-proxy/provider-caps.json
+- switch_provider: Switch to a different configured endpoint
+- increase_timeout: Increase upstream timeout for slow providers
+- regenerate_config: Regenerate Codex config.toml
+- cleanup_codex_stale: Run cleanup-codex-stale.sh
+- alert_user: Show notification to user (can't auto-fix)
+- ignore: Transient error, no action needed
+- retry_now: Immediate retry without changes
+
+## Decision Rules
+- If upstream returns 401/403 with auth error → alert_user (can't fix bad keys)
+- If proxy process is dead → restart_proxy
+- If same error repeated 5+ times → switch_provider or alert_user
+- If error is about content_type/schema → clear_schema_cache
+- If "Address already in use" → kill_stale_processes then restart_proxy
+- If timeout and upstream is slow → increase_timeout
+- If single transient 429/502/503 → ignore (retry handles it)
+- If "stream disconnected" and proxy is healthy → ignore (Codex retries)
+
+## Response Format
+Reply with ONLY a JSON object:
+{"action": "...", "reason": "...", "confidence": 0.0-1.0}
+
+No explanation, no markdown, no extra text.
+```
+
+---
+
+## 6. GUI Integration
+
+### AI Monitoring Panel (in Settings tab)
+
+```
+┌─────────────────────────────────────────────────────────┐
+│  AI Monitoring                                    [ON]  │
+│                                                          │
+│  ┌─ Diagnostic Agent ─────────────────────────────────┐ │
+│  │ Provider: [OpenCode Zen          ▼]                │ │
+│  │ Model:    [Qwen3-32B              ▼]                │ │
+│  │ API Key:  [sk-•••••••••••••••••••• ]                │ │
+│  │                                                     │ │
+│  │ Cost this month: $0.12 (3 diagnostic calls)         │ │
+│  │ Tokens used: 1,847 input / 423 output               │ │
+│  └─────────────────────────────────────────────────────┘ │
+│                                                          │
+│  ┌─ Incident Log (last 7 days) ──────────────────────┐  │
+│  │ ✅ 16:00 C1 parser_empty → synth_explore (Tier 2) │  │
+│  │ ⚠️ 15:30 B2 server_error → retry (Tier 1)         │  │
+│  │ ✅ 15:00 A1 proxy_dead → restart_proxy (Tier 1)    │  │
+│  │ 🤖 14:30 C3 novel_format → clear_cache (Tier 3)   │  │
+│  │ ...                                               │  │
+│  └────────────────────────────────────────────────────┘  │
+│                                                          │
+│  [View Full Diagnostics]  [Export Incident Report]       │
+└─────────────────────────────────────────────────────────┘
+```
+
+### Config Storage (in endpoints.json)
+
+```json
+{
+  "ai_monitoring": {
+    "enabled": true,
+    "provider_url": "https://opencode.ai/zen/v1/chat/completions",
+    "model": "Qwen/Qwen3-32B",
+    "api_key": "sk-...",
+    "tier1_enabled": true,
+    "tier2_enabled": true,
+    "tier3_enabled": true,
+    "auto_restart_proxy": true,
+    "auto_switch_provider": false,
+    "health_check_interval_s": 5,
+    "max_memory_mb": 1024,
+    "notification_level": "important_only"
+  }
+}
+```
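Reading this section defensively might look like the sketch below, which overlays the user's values on fallback values and ignores unknown keys. The fallbacks are copied from the example above and are not confirmed defaults; `load_monitoring_config` is a hypothetical name.

```python
import json

# Fallback values taken from the example config above (assumed, not confirmed).
FALLBACKS = {
    "enabled": True,
    "auto_restart_proxy": True,
    "auto_switch_provider": False,
    "health_check_interval_s": 5,
    "max_memory_mb": 1024,
    "notification_level": "important_only",
}


def load_monitoring_config(raw_json):
    """Overlay the ai_monitoring section on FALLBACKS, dropping unknown keys."""
    section = json.loads(raw_json).get("ai_monitoring", {})
    return {key: section.get(key, value) for key, value in FALLBACKS.items()}
```

A file that lacks the `ai_monitoring` section entirely then yields the fallback values unchanged.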
+
+### Recommended Models (by cost)
+
+| Model | Cost/Diagnosis | Latency | Quality | Recommended For |
+|-------|---------------|---------|---------|----------------|
+| **Qwen3-32B** (OpenCode) | ~$0.0005 | 2-4s | Good | Default: low cost, adequate quality |
+| **DeepSeek V4 Flash** | ~$0.0003 | 2-3s | Good | Low-cost alternative |
+| **GPT-4o-mini** | ~$0.001 | 1-2s | Excellent | Best quality/latency |
+| **Gemini 2.0 Flash** | ~$0.0002 | 1-2s | Good | Cheapest hosted option, low latency |
+| **Claude Haiku 4.5** | ~$0.001 | 2-3s | Excellent | Best reasoning quality |
+| **Local Ollama** (if running) | $0 | 5-15s | Varies | Zero-cost offline option |
+
+### Cost Estimate
+
+- Average diagnostic prompt: ~800 tokens input, ~100 tokens output
+- Expected frequency: ~1-5 incidents per day that reach Tier 3
+- **Monthly cost**: roughly $0.01 - $0.15 at 30-150 Tier 3 calls per month and the per-diagnosis costs listed above, depending on model and usage
+
+---
+
+## 7. Watchdog Response Flow
+
+```
+Failure Detected
+      │
+      ▼
+┌─────────────┐    YES    ┌──────────────────┐
+│ Tier 1 Rule? ├─────────►│ Execute Action    │
+│ (known)      │           │ Log incident      │
+└──────┬───────┘           └──────────────────┘
+       │ NO
+       ▼
+┌─────────────┐    YES    ┌──────────────────┐
+│ Tier 2 Match?├─────────►│ Apply Known Fix   │
+│ (incident DB)│           │ Update success    │
+└──────┬───────┘           └──────────────────┘
+       │ NO
+       ▼
+┌─────────────┐   YES     ┌──────────────────┐
+│ AI Enabled?  ├─────────►│ Collect Context   │
+│ (Tier 3)     │           │ Build Prompt      │
+└──────┬───────┘           │ Call AI Model     │
+       │ NO                │ Parse Response    │
+       ▼                   │ Execute if auto   │
+┌─────────────┐           │ Store incident    │
+│ Alert User   │           └──────────────────┘
+│ (can't fix)  │
+└─────────────┘
+```
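The flow above can be condensed into a single function, sketched here. The `respond` name, the dictionary-based tier lookups, and the fallback to `alert_user` when the model returns nothing usable are assumptions made for illustration.

```python
def respond(trigger, tier1_rules, tier2_fixes, ai_enabled, diagnose):
    """Walk the tiers in order and return (action, source)."""
    if trigger in tier1_rules:
        return tier1_rules[trigger], "tier1"
    if trigger in tier2_fixes:
        return tier2_fixes[trigger], "tier2"
    if ai_enabled:
        action = diagnose(trigger)  # expected to return an action name or None
        if action is not None:
            return action, "tier3"
    # Nothing matched, AI is disabled, or the AI gave no usable answer.
    return "alert_user", "fallback"
```

For example, `respond("proxy_health_fail", {"proxy_health_fail": "restart_proxy"}, {}, False, None)` returns `("restart_proxy", "tier1")` without touching the later tiers.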
+
+---
+
+## 8. Safety Guards
+
+1. **Rate limit AI calls** — max 1 Tier 3 call per 60 seconds, max 10 per day
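+2. **Never auto-execute destructive actions** — deleting user files, changing API keys, or modifying source code always goes through `alert_user`; the only files the watchdog deletes on its own are regenerable caches such as `provider-caps.json`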
+3. **Auto-restart cap** — max 5 proxy restarts per 10 minutes, then alert user
+4. **Cost cap** — monthly AI diagnostic budget (configurable, default $2/month)
+5. **Cooldown per pattern** — same failure pattern has escalating cooldown (30s → 60s → 300s → alert)
+6. **User override** — any auto-action can be cancelled within 3 seconds via GUI
+7. **Incident store max size** — 500 entries, LRU eviction
+8. **Health check bypass** — if user manually stopped proxy, don't alert
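Guards 1 and 3 are both "at most N events per time window" limits. A sliding-window counter, sketched below, could serve either one. The class name `SlidingWindowCap` is hypothetical, and timestamps are passed in explicitly to keep the example deterministic.

```python
from collections import deque


class SlidingWindowCap:
    """Allow at most `limit` events within any `window_s`-second window."""

    def __init__(self, limit, window_s):
        self.limit = limit
        self.window_s = window_s
        self.events = deque()  # timestamps of allowed events, oldest first

    def allow(self, now):
        # Drop events that have aged out of the window.
        while self.events and now - self.events[0] >= self.window_s:
            self.events.popleft()
        if len(self.events) >= self.limit:
            return False  # cap reached; caller should alert the user instead
        self.events.append(now)
        return True


# Guard 3: at most 5 proxy restarts per 10 minutes.
restart_cap = SlidingWindowCap(limit=5, window_s=600)
```

The same class with `limit=1, window_s=60` would cover the per-minute part of the Tier 3 rate limit.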
+
+---
+
+## 9. Implementation Plan
+
+### Phase 1: Core Watchdog (v3.8.0)
+- `HealthWatcher` thread in `codex-launcher-gui`
+- `LogAnalyzer` thread tailing `cc-debug.log` and `proxy.log`
+- Tier 1 rule engine covering the rules listed in Section 3 (17 at present)
+- Incident store (JSON file)
+- GUI toggle (ON/OFF) in settings
+- Auto-restart proxy on crash
+
+### Phase 2: Pattern Learning (v3.8.1)
+- Tier 2 incident store lookup
+- Auto-learn from Intelligence Routing outcomes
+- Success rate tracking per pattern
+- Incident log viewer in GUI
+
+### Phase 3: AI Diagnostic Agent (v3.9.0)
+- Tier 3 AI model integration
+- Provider/model selector in GUI
+- Diagnostic prompt template
+- Cost tracking
+- Full incident report export
+
+### Phase 4: Advanced Recovery (v4.0.0)
+- Auto-switch to backup provider on repeated failure
+- BGP route health monitoring
+- Predictive failure detection (memory growth, latency trends)
+- Codex process memory monitoring
+- WebSocket reconnect assistance
+
+---
+
+## 10. File Changes Summary
+
+| File | Changes |
+|------|---------|
+| `codex-launcher-gui` | +HealthWatcher thread, +LogAnalyzer thread, +AI Monitoring panel, +incident log viewer |
+| `translate-proxy.py` | +`/monitoring` endpoint (returns health + metrics), enhanced `/health` with memory/uptime |
+| `~/.cache/codex-proxy/incident-store.json` | New file — incident pattern database |
+| `~/.cache/codex-proxy/monitoring.log` | New file — watchdog activity log |
+| `~/.codex/endpoints.json` | +`ai_monitoring` config section |