v3.8.0: AI Monitoring — self-healing watchdog with 3-tier response system

- HealthWatcher thread: monitors proxy /health every 5s
- LogAnalyzer thread: tails cc-debug.log for 18 failure signal patterns
- Tier 1 rule engine: 14 rules for instant auto-recovery (< 1s)
- Tier 2 incident store: JSON pattern database with success rates
- Tier 3 AI diagnostic agent: calls configurable provider/model for novel failures
- AIMonitoringWindow GUI: ON/OFF toggle, provider/model/API key selector, incident log
- 30 fault types catalogued across 5 categories (A-E)
- Enhanced /health endpoint with memory_mb, uptime_s, requests_total
- Auto-restart proxy, auto-clear schema cache, kill stale processes
- Safety: rate-limited AI calls, restart caps, cooldowns per pattern
- AI Monitoring design spec (AI-MONITORING-DESIGN.md)
- 54 self-test patterns passing
docs: AI Monitoring design spec v3.8.0 — self-healing watchdog with 3-tier response system
2026-05-22 22:36:16 +04:00 · 2026-05-22 22:22:30 +04:00 · 2026-05-22 16:35:08 +04:00
6 changed files with 1353 additions and 17 deletions
--- a/AI-MONITORING-DESIGN.md
+++ b/AI-MONITORING-DESIGN.md
@@ -0,0 +1,638 @@
 # AI Monitoring — Design Specification
 > **Codex Launcher v3.8.0 Feature Design**
 > Self-healing nano agent that monitors proxy health, diagnoses failures, and auto-recovers sessions.
 ---
 ## 1. Problem Statement
 Over 42 sessions in production, we observed these failure categories:
 | # | Failure Category | Count | Example |
 |---|-----------------|-------|---------|
 | F1 | **parsed_tool_calls=0** — model produces unparseable output | 42 | Bare `<explore_agent>`, `<bash>` without cmd, plain English intent |
 | F2 | **Stuck recovery triggered** — Intelligence Routing Layer 3 | 13 | "I need to fetch the README", "let me write the script" |
 | F3 | **Sanitizer flagged suspicious cmd** — cmd still JSON after unwrap | 11 | `{/'cmd/': /'sshpass -p .../'}` — double-escaped quoting |
 | F4 | **Upstream 500** — provider internal error | ~5 | `"An internal error occurred. Please try again later."` |
 | F5 | **Connection timeout** — upstream unreachable | ~3 | `Connection timed out after 15002 milliseconds` |
 | F6 | **Upstream 401/403** — auth failure | ~2 | Wrong API key, expired token, `upgrade_required` |
 | F7 | **Stream crash** — exception mid-stream | ~2 | `BrokenPipeError`, `ConnectionResetError` during SSE |
 | F8 | **Proxy port conflict** — Address already in use | ~1 | Stale process holding port |
 | F9 | **Schema cache corruption** — stale content_type=array | ~1 | `ErrorAnalyzer` learned wrong schema |
 | F10 | **Codex Desktop crash** — SIGKILL at ~27GB | ~1 | Issue #24048 — unbounded tool output memory |
 | F11 | **Codex 300s stall** — turn state machine race | ~1 | Issue #23807 — `stream disconnected` after 300s |
 ### The Gap
 Intelligence Routing (v3.7.0) handles F1/F2/F3 **inside a single request**. But it can't:
 - **Detect a dead proxy process** (F7/F8) — the proxy already crashed
 - **Reconnect Codex to a restarted proxy** (F5/F7/F8) — Codex doesn't auto-reconnect
 - **Switch to a backup provider** when the primary is down (F4/F5)
 - **Clear corrupt caches** (F9) — requires out-of-band action
 - **Restart Codex Desktop** after a crash (F10/F11)
 - **Learn from failure patterns** across sessions — each failure is handled independently
 ### What We Need
 A **separate lightweight watchdog process** that:
 1. Monitors proxy health continuously
 2. Detects failures the proxy can't detect itself
 3. Uses a cheap AI model to diagnose novel failures
 4. Takes corrective action automatically
 5. Learns from past incidents to prevent repeats
 ---
 ## 2. Architecture
 ```
 ┌─────────────────────────────────────────────────────────────────────┐
 │                        Codex Launcher GUI                            │
 │  ┌──────────┐  ┌──────────────┐  ┌───────────────────────────────┐ │
 │  │  Proxy   │  │   Codex      │  │   AI Monitoring Panel         │ │
 │  │  Manager │  │   Launcher   │  │   ┌─────────────────────┐     │ │
 │  │          │  │              │  │   │ ON/OFF Toggle        │     │ │
 │  └────┬─────┘  └──────┬───────┘  │   │ Provider Selector    │     │ │
 │       │               │          │   │ Model Selector        │     │ │
 │       │               │          │   │ Incident Log          │     │ │
 │       │               │          │   │ [View Diagnostics]    │     │ │
 │       │               │          │   └─────────────────────┘     │ │
 │       │               │          └───────────────────────────────┘ │
 └───────┼───────────────┼────────────────────────────────────────────┘
        │               │
        ▼               ▼
 ┌───────────────┐  ┌────────────────┐
 │ translate-    │  │  Codex Desktop  │
 │ proxy.py      │  │  / CLI          │
 │ (port 8080)   │  │                 │
 │               │  │                 │
 │ /health ──────┼──┼─► health check  │
 │ /responses ───┼──┼─► main API      │
 └───────────────┘  └────────────────┘
        ▲
        │ health probes + log analysis + corrective actions
        │
 ┌───────┴────────────────────────────────────────────────────────────┐
 │                     AI Monitor Watchdog                             │
 │                    (thread in codex-launcher-gui)                   │
 │                                                                     │
 │  ┌─────────────────┐  ┌─────────────────┐  ┌──────────────────┐  │
 │  │  Health Watcher  │  │  Log Analyzer   │  │  AI Diagnostic   │  │
 │  │  (every 5s)      │  │  (continuous)    │  │  Agent (on-call) │  │
 │  │                  │  │                  │  │                  │  │
 │  │  - /health probe │  │  - tail cc-debug │  │  - Classify err  │  │
 │  │  - process alive │  │  - tail proxy.log│  │  - Root cause    │  │
 │  │  - port check    │  │  - pattern match │  │  - Suggest fix   │  │
 │  │  - memory watch  │  │  - incident DB   │  │  - Execute fix   │  │
 │  └────────┬────────┘  └────────┬────────┘  └────────┬─────────┘  │
 │           │                    │                     │             │
 │           └────────────────────┼─────────────────────┘             │
 │                                ▼                                   │
 │                    ┌──────────────────────┐                        │
 │                    │  Incident Store      │                        │
 │                    │  (JSON file)         │                        │
 │                    │  - Known patterns    │                        │
 │                    │  - Past resolutions  │                        │
 │                    │  - Success rates     │                        │
 │                    └──────────────────────┘                        │
 └─────────────────────────────────────────────────────────────────────┘
 ```
 ---
 ## 3. Three-Tier Response System
 ### Tier 1: Fast Path — Rule-Based Auto-Recovery (< 1 second)
 Immediate reactions to **known failure patterns**. No AI needed.
 ```python
 TIER1_RULES = [
    # (trigger_pattern, action, cooldown in seconds)
    # --- Proxy Health ---
    ("proxy_health_fail",      "restart_proxy",           30),
    ("proxy_port_conflict",    "kill_stale + restart",     60),
    ("proxy_memory_over_1gb",  "restart_proxy",           120),
    # --- Upstream Errors ---
    ("upstream_429",           "wait_retry_after",          0),
    ("upstream_502_503",       "retry_with_backoff",       30),
    ("upstream_500_repeat_3x", "switch_provider",          60),
    ("upstream_timeout",       "retry + increase_timeout", 30),
    ("upstream_401_403",       "alert_user_bad_key",        0),
    # --- Stream Errors ---
    ("stream_broken_pipe",     "restart_proxy",            30),
    ("stream_reset",           "restart_proxy",            30),
    ("stream_idle_300s",       "restart_proxy",            60),
    # --- Parser Failures ---
    ("parsed_tool_calls_0_x3", "clear_schema_cache",      300),
    ("sanitizer_suspicious_5x","alert_user_model_issue",    0),
    ("stuck_recovery_x5",      "suggest_switch_model",      0),
    # --- Codex Process ---
    ("codex_process_dead",     "alert_user_restart",         0),
    ("codex_memory_over_4gb",  "alert_user_memory",          0),
    # --- Cache Corruption ---
    ("schema_content_type_array", "delete_provider_caps",     0),
 ]
 ```
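A minimal sketch of how a rule table like the one above could be evaluated with per-trigger cooldowns. `Tier1Engine`, `match`, and the injectable `clock` are hypothetical names chosen for illustration, not taken from the implementation, and only two rules are repeated to keep the example short.

```python
import time

# Subset of the table above: (trigger_pattern, action, cooldown in seconds).
RULES = [
    ("proxy_health_fail", "restart_proxy", 30),
    ("upstream_429", "wait_retry_after", 0),
]


class Tier1Engine:
    """Hypothetical rule engine: maps a trigger to an action, honoring cooldowns."""

    def __init__(self, rules, clock=time.monotonic):
        self.rules = {trigger: (action, cooldown) for trigger, action, cooldown in rules}
        self.clock = clock
        self.last_fired = {}  # trigger -> time the rule last fired

    def match(self, trigger):
        """Return the action for `trigger`, or None if unknown or cooling down."""
        rule = self.rules.get(trigger)
        if rule is None:
            return None  # not a Tier 1 failure; the caller falls through to Tier 2
        action, cooldown = rule
        now = self.clock()
        last = self.last_fired.get(trigger)
        if last is not None and now - last < cooldown:
            return None  # the same trigger fired too recently
        self.last_fired[trigger] = now
        return action
```

Passing the clock in as a parameter keeps the cooldown logic testable without sleeping.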
 ### Tier 2: Pattern Matching — Incident Store Lookup (< 100ms)
 For failures we've **seen before and resolved**, look up the fix:
 ```json
 {
  "incidents": [
    {
      "pattern": "cc_stream_ended_empty + explore_agent + no_url",
      "fix": "synth_explore_from_last_user_urls",
      "source": "FIX-23",
      "success_rate": 0.85,
      "last_seen": "2026-05-22T16:00:00Z",
      "occurrences": 5
    },
    {
      "pattern": "require_escalation + no_cmd",
      "fix": "auto_proceed_echo",
      "source": "FIX-24",
      "success_rate": 1.0,
      "last_seen": "2026-05-22T15:30:00Z",
      "occurrences": 3
    }
  ]
 }
 ```
 ### Tier 3: AI Diagnostic — Nano Agent (2-5 seconds)
 For **novel failures** that don't match any rule or pattern, invoke a cheap AI model:
 ```
 Prompt Template (system):
 ─────────────────────
 You are a diagnostic agent for a translation proxy that sits between
 OpenAI Codex CLI/Desktop and AI providers (Command Code, OpenAI-compat,
 Anthropic, etc.). You analyze error context and suggest ONE corrective action.
 Available actions: restart_proxy, kill_stale_processes, clear_schema_cache,
 switch_provider, increase_timeout, alert_user, ignore, retry_now,
 regenerate_config, cleanup_codex_stale
 Respond with ONLY a JSON object: {"action": "...", "reason": "...", "confidence": 0.0-1.0}
 Prompt Template (user):
 ─────────────────────
 INCIDENT REPORT:
 Time: {timestamp}
 Session: {session_id}
 Proxy health: {alive/dead, port, uptime, memory_mb}
 Upstream: {url, model, last_http_code, last_error}
 Recent errors (last 60s):
 {log_lines}
 Parser state: {parsed_tool_calls, stuck_recovery_count, sanitizer_flags}
 Provider: {backend_type, model}
 History: {last_5_incidents_for_this_pattern}
 What corrective action should be taken?
 ```
 ---
 ## 4. Complete Failure Catalog
 ### Category A: Proxy-Level Failures (watchdog detects, auto-recovers)
 | ID | Failure | Symptoms | Tier 1 Action | Log Signature |
 |----|---------|----------|---------------|---------------|
 | A1 | Proxy process crashed | `/health` returns connection refused | `restart_proxy` | `urllib.error.URLError: [Errno 111] Connection refused` |
 | A2 | Port conflict | `Address already in use` on startup | `kill_stale + restart` | `OSError: [Errno 98] Address already in use` |
 | A3 | Memory leak | Process RSS > 1GB | `restart_proxy` | `/proc/{pid}/status` VmRSS check |
 | A4 | Deadlock | Health check hangs > 15s | `restart_proxy` | health probe timeout |
 | A5 | Unhandled exception | Process exits with non-zero | `restart_proxy` | `SELF-REVIVE CRASH #{n}` |
 | A6 | SSL/TLS error | `CERTIFICATE_VERIFY_FAILED` upstream | `alert_user` | `urllib.error.URLError: certificate verify failed` |
 | A7 | DNS resolution failure | `getaddrinfo failed` | `retry_with_backoff` | `socket.gaierror: Name or service not known` |
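The A3 memory check relies on the `VmRSS` field of `/proc/{pid}/status`. A sketch of that check is shown here, assuming Linux; the function names are hypothetical and the 1024 MB default is taken from the Tier 1 rule `proxy_memory_over_1gb`.

```python
def parse_vmrss_mb(status_text):
    """Extract resident set size in MB from /proc/<pid>/status text, or None."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            # Linux reports the value in kB, e.g. "VmRSS:     123456 kB".
            return int(line.split()[1]) / 1024
    return None


def process_rss_mb(pid):
    """Read RSS for a live process; None if the process is gone (Linux only)."""
    try:
        with open(f"/proc/{pid}/status") as fh:
            return parse_vmrss_mb(fh.read())
    except OSError:
        return None


def memory_exceeded(pid, limit_mb=1024):
    """True when the process exists and its RSS is above the limit."""
    rss = process_rss_mb(pid)
    return rss is not None and rss > limit_mb
```

A missing `/proc/{pid}` entry returns `None` rather than raising, so the same helper can double as the liveness probe used for D1 and E2.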
 ### Category B: Upstream Provider Failures (proxy detects, watchdog analyzes)
 | ID | Failure | Symptoms | Tier 1 Action | Log Signature |
 |----|---------|----------|---------------|---------------|
 | B1 | Rate limit (429) | Too many requests | `wait_retry_after` | `HTTP 429` + `Retry-After` header |
 | B2 | Server error (5xx) | Provider down | `retry_with_backoff` | `HTTP 500/502/503` |
 | B3 | Auth failure (401/403) | Bad/expired key | `alert_user_bad_key` | `HTTP 401 {"error":"invalid_api_key"}` |
 | B4 | CC upgrade required (403) | Version mismatch | `update_cc_version` | `HTTP 403 upgrade_required` |
 | B5 | Connection timeout | Upstream silent | `retry + increase_timeout` | `urllib.error.URLError: timed out` |
 | B6 | Connection reset | Upstream dropped mid-stream | `restart_proxy` | `ConnectionResetError: Connection reset by peer` |
 | B7 | Broken pipe | Client disconnected | `ignore` | `BrokenPipeError: Broken pipe` |
 | B8 | Upstream 400 bad request | Malformed request | `clear_schema_cache` | `HTTP 400 {"error":"...expected string..."}` |
 | B9 | Provider capacity (503) | Overloaded | `switch_provider` | `HTTP 503` after 3 retries |
 | B10 | Cloudflare block (403/1010) | Bot detection | `check_browser_ua` | `HTTP 403 error 1010` |
 ### Category C: Parser/Format Failures (Intelligence Routing handles, watchdog tracks)
 | ID | Failure | Symptoms | Auto-Fix (IR Layer) | Watchdog Escalation |
 |----|---------|----------|--------------------|--------------------|
 | C1 | Bare `<explore_agent>` | `parsed_tool_calls=0` | Layer 1: URL extraction | If 3x in a row → suggest model switch |
 | C2 | `<require_escalation>` block | Model wants permissions | Layer 2: Auto-proceed | If 5x → suggest different provider |
 | C3 | Unrecognized format | No parser matches | Layer 3: Intent synthesis | If 5x → log for AI diagnosis |
 | C4 | Double-wrapped cmd | `cmd = "{\"cmd\": ...}"` | Sanitizer: unwrap | If cmd still JSON → alert |
 | C5 | Suspicious cmd (JSON) | `cmd starts with {` | Sanitizer: flag | If 3x → clear cache + restart |
 | C6 | Empty cmd | `cmd = ""` or `cmd = "{}"` | Sanitizer: diagnostic echo | If 3x → suggest model switch |
 | C7 | Bare `{` token | Model outputs incomplete JSON | Layer 3: heuristic 5 | If persistent → AI diagnosis |
 | C8 | `<bash>` without cmd | Block has sandbox but no command | Layer 3: heuristic | If 3x → AI diagnosis |
 | C9 | DSML name mismatch | `name="cmd"` vs `name="command"` | DSML parser handles both | Self-test catches regression |
 | C10 | Stuck model loop | Same recovery 5+ times | Layer 3 max 3x then alert | Switch model or provider |
 ### Category D: Codex Process Failures (watchdog detects, alerts user)
 | ID | Failure | Symptoms | Action | Log Signature |
 |----|---------|----------|--------|---------------|
 | D1 | Codex process killed | PID gone from pids.json | `alert_user_restart` | Process not in `/proc/{pid}` |
 | D2 | Codex memory explosion | RSS > 4GB | `alert_user_memory` | `/proc/{pid}/status` check |
 | D3 | Codex 300s stall | `stream disconnected` loop | `restart_proxy` | Codex stderr: `stream disconnected` |
 | D4 | Config corruption | `database disk image is malformed` | `regenerate_config` | Codex stderr: `malformed` |
 | D5 | Session context overflow | `context_length_exceeded` | `alert_user_context` | Codex stderr: `context_length_exceeded` |
 | D6 | WebSocket reconnect loop | `Reconnecting... N/5` | `check_proxy_health` | Codex stderr: `Reconnecting` |
 ### Category E: Config/State Failures (watchdog detects, auto-fixes)
 | ID | Failure | Symptoms | Action | Detection |
 |----|---------|----------|--------|-----------|
 | E1 | Schema cache corruption | `content_type: "array"` in provider-caps.json | `delete_provider_caps` | Read file, check for known-bad values |
 | E2 | Stale PID file | pids.json has dead PIDs | `cleanup_pids` | Check `/proc/{pid}` existence |
 | E3 | Port from old session | config.toml has stale port | `regenerate_config` | Port in config != running port |
 | E4 | OAuth token expired | Google/Gemini token refresh fails | `alert_user_reauth` | Token file `expiry_ts < now` |
 | E5 | BGP all routes down | Every route returned error | `alert_user_no_provider` | All routes in cooldown |
 ---
 ## 5. Component Design
 ### 5.1 Health Watcher Thread
 Runs in the GUI process as a background thread. Pings proxy `/health` endpoint every 5 seconds.
 ```python
 import threading
 import time
 import urllib.request
 class HealthWatcher(threading.Thread):
    def __init__(self, proxy_port, on_failure, on_recovery):
        super().__init__(daemon=True)
        self.proxy_port = proxy_port
        self.on_failure = on_failure
        self.on_recovery = on_recovery
        self.check_interval = 5  # seconds
        self.failures = 0
        self.running = True
    def run(self):
        while self.running:
            healthy = self._check_health()
            if healthy:
                if self.failures > 0:
                    self.failures = 0
                    self.on_recovery()
            else:
                self.failures += 1
                if self.failures >= 3:  # 15s of consecutive failures
                    self.on_failure(self.failures)
            time.sleep(self.check_interval)
    def _check_health(self):
        try:
            req = urllib.request.Request(f"http://localhost:{self.proxy_port}/health")
            resp = urllib.request.urlopen(req, timeout=5)
            return resp.status == 200
        except Exception:
            return False
 ```
 ### 5.2 Log Analyzer Thread
 Tails the debug log and extracts failure signals in real-time.
 ```python
 import threading
 import time
 FAILURE_SIGNALS = {
    "parsed_tool_calls=0":      ("C1", "parser_empty"),
    "[STUCK-RECOVERY]":         ("C3", "stuck_recovery"),
    "suspicious cmd":           ("C4", "sanitizer_flag"),
    "empty cmd recovered":      ("C6", "empty_cmd"),
    "HTTP 429":                 ("B1", "rate_limited"),
    "HTTP 500":                 ("B2", "server_error"),
    "HTTP 401":                 ("B3", "auth_failure"),
    "HTTP 403":                 ("B4", "forbidden"),
    "Connection refused":       ("A1", "proxy_dead"),
    "Address already in use":   ("A2", "port_conflict"),
    "Broken pipe":              ("B7", "broken_pipe"),
    "Connection reset":         ("B6", "connection_reset"),
    "timed out":                ("B5", "timeout"),
    "SELF-REVIVE CRASH":        ("A5", "proxy_crash"),
    "stream error":             ("B6", "stream_error"),
 }
 class LogAnalyzer(threading.Thread):
    def __init__(self, log_path, on_signal):
        super().__init__(daemon=True)
        self.log_path = log_path
        self.on_signal = on_signal
        self.running = True
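    def run(self):
        # The context manager closes the file handle when the thread stops.
        with open(self.log_path, "r") as fh:
            fh.seek(0, 2)  # seek to end so only new lines are analyzed
            while self.running:
                line = fh.readline()
                if not line:
                    time.sleep(0.5)
                    continue
                # First matching pattern wins; dict order sets the priority.
                for pattern, (fault_id, category) in FAILURE_SIGNALS.items():
                    if pattern in line:
                        self.on_signal(fault_id, category, line.strip())
                        break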
 ```
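Several escalations in the failure catalog fire only after a signal repeats ("3x in a row", "5x"). One way an `on_signal` callback could track that is sketched here. `SignalCounter` is a hypothetical helper, and it assumes "in a row" means consecutive signals of the same category.

```python
from collections import defaultdict


class SignalCounter:
    """Hypothetical helper: counts consecutive occurrences of each signal category."""

    def __init__(self, thresholds):
        self.thresholds = thresholds     # category -> streak length that escalates
        self.streaks = defaultdict(int)  # category -> current consecutive count

    def observe(self, category):
        """Record one signal; return True when its streak reaches the threshold."""
        # A signal from a different category breaks every other streak.
        for other in list(self.streaks):
            if other != category:
                self.streaks[other] = 0
        self.streaks[category] += 1
        threshold = self.thresholds.get(category)
        if threshold is not None and self.streaks[category] >= threshold:
            self.streaks[category] = 0  # the next escalation needs a fresh streak
            return True
        return False
```

With `SignalCounter({"parser_empty": 3})`, the third consecutive `parser_empty` signal returns `True`, which is the point where the watchdog would act.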
 ### 5.3 AI Diagnostic Agent
 Invoked by the watchdog when a failure doesn't match Tier 1 rules or Tier 2 patterns.
 ```python
 import json
 import urllib.request
 class AIDiagnosticAgent:
    def __init__(self, provider_url, model, api_key):
        self.provider_url = provider_url
        self.model = model
        self.api_key = api_key
        self.system_prompt = DIAGNOSTIC_SYSTEM_PROMPT  # defined below
        self.incident_store = IncidentStore()
    def diagnose(self, context):
        # Tier 2: Check incident store first
        pattern = self._extract_pattern(context)
        known_fix = self.incident_store.lookup(pattern)
        if known_fix and known_fix["success_rate"] > 0.7:
            return known_fix["fix"], "tier2_pattern", known_fix["success_rate"]
        # Tier 3: Ask AI
        prompt = self._build_prompt(context)
        response = self._call_model(prompt)
        action = self._parse_response(response)
        # Learn from this incident
        if action:
            self.incident_store.record(pattern, action)
        return action, "tier3_ai", None
    def _call_model(self, prompt):
        body = {
            "model": self.model,
            "messages": [
                {"role": "system", "content": self.system_prompt},
                {"role": "user", "content": prompt}
            ],
            "max_tokens": 200,
            "temperature": 0.1,
        }
        req = urllib.request.Request(
            self.provider_url,
            data=json.dumps(body).encode(),
            headers={
                "Content-Type": "application/json",
                "Authorization": f"Bearer {self.api_key}",
            }
        )
        # Close the HTTP response once the body has been read.
        with urllib.request.urlopen(req, timeout=15) as resp:
            return json.loads(resp.read())["choices"][0]["message"]["content"]
 ```
 ### 5.4 Incident Store
 JSON file that accumulates failure patterns and their resolutions.
 ```json
 {
  "version": 1,
  "incidents": {
    "parser_empty+explore_agent": {
      "fault_ids": ["C1"],
      "fix": "synth_explore_from_urls",
      "source": "intelligent_routing",
      "success_count": 8,
      "fail_count": 1,
      "last_seen": "2026-05-22T16:00:00Z",
      "auto_applied": true
    },
    "server_error+repeat_3x": {
      "fault_ids": ["B2"],
      "fix": "switch_provider",
      "source": "tier1_rule",
      "success_count": 2,
      "fail_count": 0,
      "last_seen": "2026-05-22T14:00:00Z",
      "auto_applied": true
    }
  },
  "ai_diagnostic_calls": 0,
  "tokens_used": 0,
  "cost_usd": 0.0
 }
 ```
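The `lookup` and `record` calls used by the diagnostic agent could be backed by a store like the one sketched here. This is an in-memory illustration under assumptions: the success rate is derived from `success_count` and `fail_count`, the `succeeded` flag on `record` is an added parameter, and the 500-entry LRU limit follows the safety guards in Section 8. File persistence is left out.

```python
import json
from collections import OrderedDict


class IncidentStore:
    """Sketch of a store using the schema above; persistence details are assumed."""

    def __init__(self, max_entries=500):
        self.max_entries = max_entries
        self.incidents = OrderedDict()  # pattern -> entry, least recently used first

    def lookup(self, pattern):
        """Return {'fix', 'success_rate'} for a known pattern, else None."""
        entry = self.incidents.get(pattern)
        if entry is None:
            return None
        self.incidents.move_to_end(pattern)  # mark as recently used
        total = entry["success_count"] + entry["fail_count"]
        rate = entry["success_count"] / total if total else 0.0
        return {"fix": entry["fix"], "success_rate": rate}

    def record(self, pattern, fix, succeeded=True):
        """Add or update an entry, evicting the least recently used when full."""
        entry = self.incidents.setdefault(
            pattern, {"fix": fix, "success_count": 0, "fail_count": 0}
        )
        entry["fix"] = fix
        entry["success_count" if succeeded else "fail_count"] += 1
        self.incidents.move_to_end(pattern)
        while len(self.incidents) > self.max_entries:
            self.incidents.popitem(last=False)

    def to_json(self):
        """Serialize in the same top-level shape as the file shown above."""
        return json.dumps({"version": 1, "incidents": self.incidents}, indent=2)
```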
 ### 5.5 Diagnostic Agent System Prompt
 ```
 You are a diagnostic agent for "Codex Launcher" — a desktop app that runs a local
 translation proxy between OpenAI Codex CLI/Desktop and various AI providers.
 ## Your Job
 Analyze the incident report and recommend ONE corrective action.
 ## Available Actions
 - restart_proxy: Kill and restart translate-proxy.py
 - kill_stale_processes: Kill orphaned proxy/codex processes
 - clear_schema_cache: Delete ~/.cache/codex-proxy/provider-caps.json
 - switch_provider: Switch to a different configured endpoint
 - increase_timeout: Increase upstream timeout for slow providers
 - regenerate_config: Regenerate Codex config.toml
 - cleanup_codex_stale: Run cleanup-codex-stale.sh
 - alert_user: Show notification to user (can't auto-fix)
 - ignore: Transient error, no action needed
 - retry_now: Immediate retry without changes
 ## Decision Rules
 - If upstream returns 401/403 with auth error → alert_user (can't fix bad keys)
 - If proxy process is dead → restart_proxy
 - If same error repeated 5+ times → switch_provider or alert_user
 - If error is about content_type/schema → clear_schema_cache
 - If "Address already in use" → kill_stale_processes then restart_proxy
 - If timeout and upstream is slow → increase_timeout
 - If single transient 429/502/503 → ignore (retry handles it)
 - If "stream disconnected" and proxy is healthy → ignore (Codex retries)
 ## Response Format
 Reply with ONLY a JSON object:
 {"action": "...", "reason": "...", "confidence": 0.0-1.0}
 No explanation, no markdown, no extra text.
 ```
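Because the model's reply is executed as an action, it is worth validating before use. The sketch below shows one possible `_parse_response`-style check: it accepts only the actions listed in the prompt and falls back to `alert_user` otherwise. The function name and the `min_confidence` threshold of 0.5 are assumptions, not part of the spec.

```python
import json

ALLOWED_ACTIONS = {
    "restart_proxy", "kill_stale_processes", "clear_schema_cache",
    "switch_provider", "increase_timeout", "regenerate_config",
    "cleanup_codex_stale", "alert_user", "ignore", "retry_now",
}


def parse_diagnosis(text, min_confidence=0.5):
    """Return (action, reason, confidence); fall back to alert_user when unsure."""
    fallback = ("alert_user", "unparseable or low-confidence diagnosis", 0.0)
    try:
        data = json.loads(text.strip())
    except (json.JSONDecodeError, AttributeError):
        return fallback
    if not isinstance(data, dict):
        return fallback
    action = data.get("action")
    confidence = data.get("confidence")
    if not isinstance(action, str) or action not in ALLOWED_ACTIONS:
        return fallback  # never execute an action the prompt did not offer
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        return fallback
    if not 0.0 <= confidence <= 1.0 or confidence < min_confidence:
        return fallback
    return action, str(data.get("reason", "")), float(confidence)
```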
 ---
 ## 6. GUI Integration
 ### AI Monitoring Panel (in Settings tab)
 ```
 ┌─────────────────────────────────────────────────────────┐
 │  AI Monitoring                                    [ON]  │
 │                                                          │
 │  ┌─ Diagnostic Agent ─────────────────────────────────┐ │
 │  │ Provider: [OpenCode Zen          ▼]                │ │
 │  │ Model:    [Qwen3-32B              ▼]                │ │
 │  │ API Key:  [sk-•••••••••••••••••••• ]                │ │
 │  │                                                     │ │
 │  │ Cost this month: $0.12 (3 diagnostic calls)         │ │
 │  │ Tokens used: 1,847 input / 423 output               │ │
 │  └─────────────────────────────────────────────────────┘ │
 │                                                          │
 │  ┌─ Incident Log (last 7 days) ──────────────────────┐  │
 │  │ ✅ 16:00 C1 parser_empty → synth_explore (Tier 2) │  │
 │  │ ⚠️ 15:30 B2 server_error → retry (Tier 1)         │  │
 │  │ ✅ 15:00 A1 proxy_dead → restart_proxy (Tier 1)    │  │
 │  │ 🤖 14:30 C3 novel_format → clear_cache (Tier 3)   │  │
 │  │ ...                                               │  │
 │  └────────────────────────────────────────────────────┘  │
 │                                                          │
 │  [View Full Diagnostics]  [Export Incident Report]       │
 └─────────────────────────────────────────────────────────┘
 ```
 ### Config Storage (in endpoints.json)
 ```json
 {
  "ai_monitoring": {
    "enabled": true,
    "provider_url": "https://opencode.ai/zen/v1/chat/completions",
    "model": "Qwen/Qwen3-32B",
    "api_key": "sk-...",
    "tier1_enabled": true,
    "tier2_enabled": true,
    "tier3_enabled": true,
    "auto_restart_proxy": true,
    "auto_switch_provider": false,
    "health_check_interval_s": 5,
    "max_memory_mb": 1024,
    "notification_level": "important_only"
  }
 }
 ```
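A sketch of how the `ai_monitoring` section might be read with fallbacks. The defaults mirror the example above, except `enabled`, which is assumed to default to off. The function name and the rule that Tier 3 is disabled without a provider, model, and key are illustrative assumptions.

```python
import json

# Defaults mirror the example above, except "enabled", which is assumed
# to default to off so that monitoring is opt-in.
MONITORING_DEFAULTS = {
    "enabled": False,
    "tier1_enabled": True,
    "tier2_enabled": True,
    "tier3_enabled": True,
    "auto_restart_proxy": True,
    "auto_switch_provider": False,
    "health_check_interval_s": 5,
    "max_memory_mb": 1024,
    "notification_level": "important_only",
}


def load_monitoring_config(endpoints_json_text):
    """Merge the ai_monitoring section over defaults; tolerate a missing section."""
    data = json.loads(endpoints_json_text)
    section = data.get("ai_monitoring", {})
    config = {**MONITORING_DEFAULTS, **section}
    # Tier 3 cannot run without a provider URL, a model, and an API key.
    if not all(config.get(key) for key in ("provider_url", "model", "api_key")):
        config["tier3_enabled"] = False
    return config
```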
 ### Recommended Models (by cost)
 | Model | Cost/Diagnosis | Latency | Quality | Recommended For |
 |-------|---------------|---------|---------|----------------|
 | **Qwen3-32B** (OpenCode) | ~$0.0005 | 2-4s | Good | Default choice |
 | **DeepSeek V4 Flash** | ~$0.0003 | 2-3s | Good | Low-cost alternative |
 | **GPT-4o-mini** | ~$0.001 | 1-2s | Excellent | Best quality/latency |
 | **Gemini 2.0 Flash** | ~$0.0002 | 1-2s | Good | Cheapest + fastest |
 | **Claude Haiku 4.5** | ~$0.001 | 2-3s | Excellent | Best reasoning quality |
 | **Local Ollama** (if running) | $0 | 5-15s | Varies | Zero-cost offline option |
 ### Cost Estimate
 - Average diagnostic prompt: ~800 tokens input, ~100 tokens output
 - Expected frequency: ~1-5 incidents per day that reach Tier 3
 - **Monthly cost**: $0.10 - $1.50 depending on model and usage
 ---
 ## 7. Watchdog Response Flow
 ```
 Failure Detected
      │
      ▼
 ┌─────────────┐    YES    ┌──────────────────┐
 │ Tier 1 Rule? ├─────────►│ Execute Action    │
 │ (known)      │           │ Log incident      │
 └──────┬───────┘           └──────────────────┘
       │ NO
       ▼
 ┌─────────────┐    YES    ┌──────────────────┐
 │ Tier 2 Match?├─────────►│ Apply Known Fix   │
 │ (incident DB)│           │ Update success    │
 └──────┬───────┘           └──────────────────┘
       │ NO
       ▼
 ┌─────────────┐   YES     ┌──────────────────┐
 │ AI Enabled?  ├─────────►│ Collect Context   │
 │ (Tier 3)     │           │ Build Prompt      │
 └──────┬───────┘           │ Call AI Model     │
       │ NO                │ Parse Response    │
       ▼                   │ Execute if auto   │
 ┌─────────────┐           │ Store incident    │
 │ Alert User   │           └──────────────────┘
 │ (can't fix)  │
 └─────────────┘
 ```
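The flow above reduces to a short dispatch function. In this sketch each tier is passed in as a callable that returns an action or `None`; `handle_failure` and its parameters are hypothetical names.

```python
def handle_failure(signal, tier1, tier2, tier3=None):
    """Return (action, tier) following the flow above.

    tier1, tier2: callables mapping a signal to an action or None.
    tier3: optional callable (the AI agent); None means AI monitoring is off.
    """
    action = tier1(signal)
    if action:
        return action, "tier1"
    action = tier2(signal)
    if action:
        return action, "tier2"
    if tier3 is not None:
        action = tier3(signal)
        if action:
            return action, "tier3"
    # Nothing matched, or AI is disabled or gave no answer: hand over to the user.
    return "alert_user", "none"
```

Returning the tier alongside the action gives the incident log what it needs to label each entry.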
 ---
 ## 8. Safety Guards
 1. **Rate limit AI calls** — max 1 Tier 3 call per 60 seconds, max 10 per day
 2. **Never auto-execute destructive actions** — `alert_user` for: delete files, change API keys, modify source code
 3. **Auto-restart cap** — max 5 proxy restarts per 10 minutes, then alert user
 4. **Cost cap** — monthly AI diagnostic budget (configurable, default $2/month)
 5. **Cooldown per pattern** — same failure pattern has escalating cooldown (30s → 60s → 300s → alert)
 6. **User override** — any auto-action can be cancelled within 3 seconds via GUI
 7. **Incident store max size** — 500 entries, LRU eviction
 8. **Health check bypass** — if user manually stopped proxy, don't alert
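Guards 1 and 3 are both "at most N events per time window" limits. A sliding-window limiter such as the one sketched here could enforce them; the class name and the injectable `clock` are assumptions made for illustration.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Allows at most `max_events` within any `window_s`-second window."""

    def __init__(self, max_events, window_s, clock=time.monotonic):
        self.max_events = max_events
        self.window_s = window_s
        self.clock = clock
        self.events = deque()  # timestamps of allowed events, oldest first

    def allow(self):
        """Record and permit the event if the window has room, else refuse."""
        now = self.clock()
        # Drop timestamps that have aged out of the window.
        while self.events and now - self.events[0] >= self.window_s:
            self.events.popleft()
        if len(self.events) >= self.max_events:
            return False
        self.events.append(now)
        return True
```

For example, `SlidingWindowLimiter(5, 600)` would express the restart cap of 5 restarts per 10 minutes; when `allow()` returns `False`, the watchdog alerts the user instead of restarting again.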
 ---
 ## 9. Implementation Plan
 ### Phase 1: Core Watchdog (v3.8.0)
 - `HealthWatcher` thread in `codex-launcher-gui`
 - `LogAnalyzer` thread tailing `cc-debug.log` and `proxy.log`
 - Tier 1 rule engine covering the rules listed in Section 3
 - Incident store (JSON file)
 - GUI toggle (ON/OFF) in settings
 - Auto-restart proxy on crash
 ### Phase 2: Pattern Learning (v3.8.1)
 - Tier 2 incident store lookup
 - Auto-learn from Intelligence Routing outcomes
 - Success rate tracking per pattern
 - Incident log viewer in GUI
 ### Phase 3: AI Diagnostic Agent (v3.9.0)
 - Tier 3 AI model integration
 - Provider/model selector in GUI
 - Diagnostic prompt template
 - Cost tracking
 - Full incident report export
 ### Phase 4: Advanced Recovery (v4.0.0)
 - Auto-switch to backup provider on repeated failure
 - BGP route health monitoring
 - Predictive failure detection (memory growth, latency trends)
 - Codex process memory monitoring
 - WebSocket reconnect assistance
 ---
 ## 10. File Changes Summary
 | File | Changes |
 |------|---------|
 | `codex-launcher-gui` | +HealthWatcher thread, +LogAnalyzer thread, +AI Monitoring panel, +incident log viewer |
 | `translate-proxy.py` | +`/monitoring` endpoint (returns health + metrics), enhanced `/health` with memory/uptime |
 | `~/.cache/codex-proxy/incident-store.json` | New file — incident pattern database |
 | `~/.cache/codex-proxy/monitoring.log` | New file — watchdog activity log |
 | `~/.codex/endpoints.json` | +`ai_monitoring` config section |
--- a/README.md
+++ b/README.md
@@ -33,6 +33,7 @@
  <img src="https://img.shields.io/badge/Streaming_SSE-✓-success" />
  <img src="https://img.shields.io/badge/Tool_Calls-✓-success" />
  <img src="https://img.shields.io/badge/AI_Assist-✓-success" />
  <img src="https://img.shields.io/badge/Intelligence_Routing-✓-success" />
  <img src="https://img.shields.io/badge/Self_Revive_Watchdog-✓-success" />
 </p>
@@ -130,6 +131,19 @@ A three-component system:
 - **ErrorAnalyzer** — learns from 4xx errors, retries with adjusted parameters (max 2 retries)
 - **Schema cache** with 24h staleness TTL for provider capabilities
 ### Intelligence Routing (v3.7.0)
 - **Three-layer self-healing system** — the agent loop never stalls, even when the model speaks gibberish
 - **Layer 1 — Deep URL Extraction**: When `<explore_agent>` hides URLs inside nested JSON (`messages: [{"content": "https://..."}]`), the parser drills into the JSON structure to find them. Module-level `_build_explore_cmd()` is reused across parser + stream path.
 - **Layer 2 — Escalation Auto-Proceed**: `<require_escalation>` and `<request_escalation_permission>` blocks are detected and auto-resolved — the model doesn't get stuck waiting for permissions that don't exist.
 - **Layer 3 — Intent-Based Command Synthesis**: When ALL parsers fail, 5 heuristics analyze the model's plain-text output and synthesize a working command:
  1. URL detected → `curl` it
  2. File path mentioned → `cat` or `ls` it
  3. Shell command in quotes → extract and run it
  4. "explore"/"fetch" intent → use the last URL the user mentioned
  5. "I need to"/"let me" intent → echo a diagnostic so the loop continues
 - **Session URL memory** — `_last_user_urls` deque (20 entries) tracks URLs from user messages across the session, giving the synthesizer context to work with
 - **54 self-test patterns** — comprehensive coverage of all three layers
 ### GTK Launcher (`codex-launcher-gui`)
 - **Endpoint manager** — add, edit, delete, set default providers
 - **Provider presets** — one-click setup for 15+ providers with pre-filled URLs and model lists
@@ -324,6 +338,83 @@ Built a cascading parser chain (`DSML → bash → explore → tool_call → XML
 **Verification:** `--self-test` flag runs 19 automated tests covering all edge cases. Debug logging to `~/.cache/codex-proxy/cc-debug.log` captures every parser decision for troubleshooting.
 ### Phase 8: Intelligence Routing — When the Model Refuses to Speak Machine
 **Problem:** The 17-fix parser chain from Phase 7 was powerful — it could handle DSML, XML, JSON, bash blocks, explore tags, you name it. But there was one edge case it couldn't crack: **when the model doesn't produce a parseable tool-call format at all**.
 In production, `deepseek/deepseek-v4-flash` via Command Code kept doing things like:
 ```
 <explore_agent>
 messages: [{"content": "Understand the Z.AI-Chat-for-Android repo at https://..."}]
 </explore_agent>
 ```
 or:
 ```
 <require_escalation>
 I need elevated permissions to access the repository.
 </require_escalation>
 ```
 or just plain English: *"I need to fetch the README from the repository to understand the app structure."*
 In every case, `parsed_tool_calls=0`. No tool to execute. The Codex agent loop ground to a halt. The user saw "thinking..." forever.
 **The insight:** The model is trying to communicate *intent*, just not in a format we can parse. Instead of adding more regex patterns, what if we could **read the model's mind** — understand what it *wants* to do, and synthesize the command for it?
 **Intelligence Routing — Three Layers of Escalation:**
 ```
 Layer 1: "Fix the input"     — Can we extract more from what the model gave us?
 Layer 2: "Handle the intent" — Is the model asking for something we can auto-resolve?
 Layer 3: "Read the mind"     — What is the model trying to do? Just do it for it.
 ```
 **Layer 1 — Deep URL Extraction (FIX 23):**
 The `<explore_agent>` handler had a URL regex, but the URL was trapped inside `{"content": "https://..."}` — the trailing `"` broke matching. The fix: after the initial regex fails, `json.loads()` the entire block, walk the JSON tree, and pull URLs out of `content` fields. The `_build_explore_cmd()` function was extracted to module level so both the parser and the stream handler could use it.
 ```python
 # Before: regex fails, URL lost
 # After: json.loads -> iterate items -> extract content -> find URL
 ```
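The JSON fallback described above can be sketched as follows. This is a minimal illustration, not the proxy's actual code: `extract_urls_from_block` and `URL_RE` are hypothetical names, and the sketch assumes the tag body contains one JSON array or object somewhere after an optional `messages:` prefix.

```python
import json
import re

# Assumption: a simple URL pattern; the real handler's regex is not shown in the README.
URL_RE = re.compile(r'https?://[^\s"<>]+')


def extract_urls_from_block(block):
    """Hypothetical helper: pull URLs out of JSON 'content' fields in a tag body."""
    # Find where the JSON payload starts (the body may begin with "messages: ").
    starts = [i for i in (block.find("["), block.find("{")) if i != -1]
    if not starts:
        return []
    try:
        # raw_decode parses one JSON value and ignores any trailing text.
        data, _ = json.JSONDecoder().raw_decode(block[min(starts):])
    except json.JSONDecodeError:
        return []

    urls = []

    def walk(node, key=None):
        # Recurse through dicts and lists, remembering the enclosing key.
        if isinstance(node, dict):
            for k, v in node.items():
                walk(v, k)
        elif isinstance(node, list):
            for item in node:
                walk(item, key)
        elif isinstance(node, str) and key == "content":
            urls.extend(u.rstrip(".,)") for u in URL_RE.findall(node))

    walk(data)
    return urls


block = 'messages: [{"content": "Understand the repo at https://example.com/org/repo."}]'
print(extract_urls_from_block(block))
```

Because the string is decoded by the JSON parser before the URL regex runs, the closing quote of the `content` value is no longer part of the text being matched.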
 **Layer 2 — Escalation Auto-Proceed (FIX 24):**
 `<require_escalation>` blocks are the model's way of saying "I need more permissions." The CC adapter doesn't have an escalation mechanism — these blocks were silently dropped. The fix: detect them (both closed `<tag>...</tag>` and bare `<tag />` forms), extract any URL inside them, and auto-proceed with an explore command or a diagnostic echo.
 ```python
 # Model: <require_escalation>Please let me run curl</require_escalation>
 # Proxy: Okay, here's your curl command → exec_command synthesized
 ```
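A rough sketch of the detection step, under stated assumptions: `synthesize_from_escalation` is a hypothetical helper, and the plain `curl -sL` stands in for whatever explore command the proxy really builds.

```python
import re
import shlex

# Matches both the bare <require_escalation /> form and the closed form.
ESCALATION_RE = re.compile(
    r"<require_escalation\s*/>|<require_escalation>(.*?)</require_escalation>",
    re.DOTALL,
)
URL_RE = re.compile(r'https?://[^\s"<>]+')


def synthesize_from_escalation(text):
    """Hypothetical helper: return a shell command for an escalation block, or None."""
    m = ESCALATION_RE.search(text)
    if m is None:
        return None
    body = m.group(1) or ""  # group(1) is None for the bare form
    url = URL_RE.search(body)
    if url:
        # Assumption: a plain curl stands in for the real explore command.
        return "curl -sL " + shlex.quote(url.group(0))
    # No URL to act on: fall back to a diagnostic echo so the loop keeps moving.
    return "echo " + shlex.quote("escalation requested: " + body.strip()[:80])


print(synthesize_from_escalation(
    "<require_escalation>Please let me run curl on https://example.com/readme</require_escalation>"
))
```

Quoting with `shlex.quote` keeps text taken from model output from being interpreted by the shell.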
 **Layer 3 — Intent-Based Command Synthesis (FIX 25):**
 The crown jewel. When ALL parsers return empty — no DSML, no XML, no JSON, no fallback regex matches — the system doesn't give up. It analyzes the model's raw text through **5 heuristic lenses** in priority order:
 | Priority | Signal | Synthesized Command |
 |:--------:|--------|---------------------|
 | 1 | URL in text | `curl` to fetch it |
 | 2 | File path reference | `cat` or `ls` the file |
 | 3 | Shell command in backticks/quotes | Extract and run it |
 | 4 | "explore"/"fetch" + last user URL | Full explore command |
 | 5 | "I need to"/"let me" intent | Echo diagnostic |
 The system also maintains a **session URL memory** (`_last_user_urls`, a deque of the last 20 URLs from user messages), so heuristic 4 has a URL to work with even when the model's text doesn't contain one, provided the user has mentioned at least one URL earlier in the session.
 ```python
 # Model: "I should explore the repository to understand its structure."
 # Parser: empty (no parseable format)
 # Layer 3 heuristic 4: "explore" detected, pulling URL from session memory...
 # Result: exec_command with full curl pipeline
 ```
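The priority order in the table, together with the session URL memory, could look roughly like this. The sketch is illustrative only: the function names, the regexes, and the synthesized commands are assumptions. Heuristic 2 always uses `cat`, and heuristic 3 handles backticks but not quotes.

```python
import re
import shlex
from collections import deque

URL_RE = re.compile(r'https?://[^\s"<>]+')
PATH_RE = re.compile(r"(?:^|\s)((?:~|\.{1,2})?/[\w.\-/]+)")
BACKTICK_RE = re.compile(r"`([^`\n]+)`")
EXPLORE_RE = re.compile(r"\b(?:explore|fetch)\b", re.IGNORECASE)
INTENT_RE = re.compile(r"\b(?:i need to|let me)\b", re.IGNORECASE)

# Session URL memory: the last 20 URLs seen in user messages.
last_user_urls = deque(maxlen=20)


def remember_user_urls(message):
    """Record URLs from a user message; old entries fall off the left of the deque."""
    last_user_urls.extend(URL_RE.findall(message))


def synthesize_command(text):
    """Hypothetical Layer 3 sketch: return a command string, or None if nothing matches."""
    url = URL_RE.search(text)
    if url:  # 1. URL in text
        return "curl -sL " + shlex.quote(url.group(0).rstrip(".,)"))
    path = PATH_RE.search(text)
    if path:  # 2. file path reference
        return "cat " + shlex.quote(path.group(1).rstrip(".,"))
    cmd = BACKTICK_RE.search(text)
    if cmd:  # 3. shell command in backticks
        return cmd.group(1)
    if EXPLORE_RE.search(text) and last_user_urls:  # 4. intent word + remembered URL
        return "curl -sL " + shlex.quote(last_user_urls[-1])
    if INTENT_RE.search(text):  # 5. generic intent: diagnostic echo
        return "echo " + shlex.quote("no parseable tool call; model said: " + text[:80])
    return None


remember_user_urls("Please review https://example.com/org/repo")
print(synthesize_command("I should explore the repository to understand its structure."))
```

Checking the heuristics in a fixed order means the most specific signal wins: a URL in the model's own text is preferred over a remembered one, and the diagnostic echo is only a last resort.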
 **The result:** Before Intelligence Routing, `parsed_tool_calls=0` meant **game over** — the agent loop stalled permanently. After Intelligence Routing, `parsed_tool_calls=0` triggers the self-healing chain and the loop **always** gets a tool call to execute. The model can speak in tongues and the system still works.
 **Test coverage:** 54 self-test patterns (up from 41), with 13 new tests specifically for Intelligence Routing layers.
 ---
 ## Architecture Deep Dive
@@ -454,6 +545,9 @@ README.md                         # This file
 | CC tool calls have wrong args | Double-wrapped arguments | V3.5 three-tier parser + recursive unwrapping |
 | Proxy crashes mid-session | Unhandled streaming error | V3.5 self-revive watchdog auto-restarts |
 | CC 403 upgrade_required | Missing version header | V3.5 always sends `x-command-code-version` |
 | CC explore_agent can't find URL | URL hidden inside JSON messages | V3.7 Layer 1 drills into JSON to extract URLs |
 | CC agent stalls on escalation blocks | `<require_escalation>` not handled | V3.7 Layer 2 auto-proceeds past escalation requests |
 | CC agent stalls — no tool calls at all | Model output format unrecognized | V3.7 Layer 3 synthesizes command from text intent |
 ---
--- a/codex-launcher_3.8.0_all.deb
+++ b/codex-launcher_3.8.0_all.deb
--- a/install.sh
+++ b/install.sh
@@ -3,11 +3,11 @@ set -e
 SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
-if [ -f "$SCRIPT_DIR/codex-launcher_3.7.0_all.deb" ]; then
+if [ -f "$SCRIPT_DIR/codex-launcher_3.8.0_all.deb" ]; then
-    echo "Installing codex-launcher_3.7.0_all.deb ..."
+    echo "Installing codex-launcher_3.8.0_all.deb ..."
-    sudo dpkg -i "$SCRIPT_DIR/codex-launcher_3.7.0_all.deb"
+    sudo dpkg -i "$SCRIPT_DIR/codex-launcher_3.8.0_all.deb"
    echo ""
-    echo "Installed v3.7.0 via .deb package."
+    echo "Installed v3.8.0 via .deb package."
    echo "  translate-proxy.py   -> /usr/bin/translate-proxy.py"
    echo "  codex-launcher-gui   -> /usr/bin/codex-launcher-gui"
    echo "  cleanup-codex-stale  -> /usr/bin/cleanup-codex-stale.sh"
--- a/src/codex-launcher-gui
+++ b/src/codex-launcher-gui
@@ -5,7 +5,7 @@ import gi
 gi.require_version("Gtk", "3.0")
 from gi.repository import Gtk, GLib
 import subprocess, os, signal, sys, threading, time, json, urllib.request, urllib.parse, urllib.error, tempfile, shutil
-import hashlib, socket, ssl, contextlib, re
+import hashlib, socket, ssl, contextlib, re, collections
 import base64, secrets
 from pathlib import Path
@@ -1123,6 +1123,524 @@ def _check_codex_auth():
    except Exception as e:
        return ("error", str(e))
 # ═══════════════════════════════════════════════════════════════════
 # AI Monitoring — Self-Healing Watchdog
 # ═══════════════════════════════════════════════════════════════════
 MONITORING_FILE = Path.home() / ".cache/codex-proxy/monitoring-config.json"
 INCIDENT_STORE_FILE = Path.home() / ".cache/codex-proxy/incident-store.json"
 MONITORING_LOG = Path.home() / ".cache/codex-proxy/monitoring.log"
 _TIER1_RULES = [
    ("proxy_health_fail",      "restart_proxy",         30),
    ("proxy_port_conflict",    "kill_stale_restart",    60),
    ("upstream_429",           "wait_retry",             0),
    ("upstream_502_503",       "retry_backoff",         30),
    ("upstream_500_repeat",    "switch_provider",       60),
    ("upstream_timeout",       "retry_increase_timeout",30),
    ("upstream_401_403",       "alert_bad_key",          0),
    ("stream_broken_pipe",     "restart_proxy",         30),
    ("stream_reset",           "restart_proxy",         30),
    ("parsed_tool_calls_0_x3", "clear_schema_cache",   300),
    ("sanitizer_suspicious_5x","alert_model_issue",      0),
    ("stuck_recovery_x5",      "suggest_switch_model",   0),
    ("codex_process_dead",     "alert_restart",           0),
    ("schema_corrupt",         "delete_provider_caps",    0),
 ]
 _FAILURE_SIGNALS = {
    "parsed_tool_calls=0":      ("C1", "parser_empty"),
    r"\[STUCK-RECOVERY\]":      ("C3", "stuck_recovery"),  # escaped: keys are used with re.search()
    "suspicious cmd":           ("C4", "sanitizer_flag"),
    "empty cmd recovered":      ("C6", "empty_cmd"),
    "HTTP 429":                 ("B1", "rate_limited"),
    "HTTP 500":                 ("B2", "server_error"),
    "HTTP 502":                 ("B2", "server_error"),
    "HTTP 503":                 ("B2", "server_error"),
    "HTTP 401":                 ("B3", "auth_failure"),
    "HTTP 403":                 ("B4", "forbidden"),
    "Connection refused":       ("A1", "proxy_dead"),
    "Address already in use":   ("A2", "port_conflict"),
    "Broken pipe":              ("B7", "broken_pipe"),
    "Connection reset":         ("B6", "connection_reset"),
    "timed out":                ("B5", "timeout"),
    "SELF-REVIVE CRASH":        ("A5", "proxy_crash"),
    "stream error":             ("B6", "stream_error"),
    "content_type.*array":      ("E1", "schema_corrupt"),
 }
 _DIAGNOSTIC_SYSTEM_PROMPT = (
    'You are a diagnostic agent for "Codex Launcher" — a desktop app that runs a local '
    'translation proxy between OpenAI Codex CLI/Desktop and AI providers.\n\n'
    'Analyze the incident and respond with ONLY a JSON object:\n'
    '{"action": "...", "reason": "...", "confidence": 0.0-1.0}\n\n'
    'Available actions: restart_proxy, kill_stale_processes, clear_schema_cache, '
    'switch_provider, increase_timeout, regenerate_config, cleanup_stale, '
    'alert_user, ignore, retry_now\n\n'
    'Rules:\n'
    '- upstream 401/403 with auth error -> alert_user\n'
    '- proxy dead -> restart_proxy\n'
    '- same error 5+ times -> switch_provider or alert_user\n'
    '- schema/content_type error -> clear_schema_cache\n'
    '- "Address already in use" -> kill_stale_processes then restart_proxy\n'
    '- timeout on slow upstream -> increase_timeout\n'
    '- single transient 429/502/503 -> ignore\n'
    '- "stream disconnected" + proxy healthy -> ignore\n'
    '- no extra text, no markdown, just the JSON object'
 )
 def _load_monitoring_config():
    if MONITORING_FILE.exists():
        try:
            return json.loads(MONITORING_FILE.read_text())
        except Exception:
            pass
    return {
        "enabled": False,
        "provider_url": "",
        "model": "",
        "api_key": "",
        "health_check_interval_s": 5,
        "auto_restart_proxy": True,
        "auto_switch_provider": False,
    }
 def _save_monitoring_config(cfg):
    MONITORING_FILE.parent.mkdir(parents=True, exist_ok=True)
    MONITORING_FILE.write_text(json.dumps(cfg, indent=2))
 def _load_incident_store():
    if INCIDENT_STORE_FILE.exists():
        try:
            return json.loads(INCIDENT_STORE_FILE.read_text())
        except Exception:
            pass
    return {"version": 1, "incidents": {}, "stats": {"ai_calls": 0, "tokens_used": 0}}
 def _save_incident_store(store):
    INCIDENT_STORE_FILE.parent.mkdir(parents=True, exist_ok=True)
    INCIDENT_STORE_FILE.write_text(json.dumps(store, indent=2))
 def _monitoring_log(msg):
    try:
        with open(str(MONITORING_LOG), "a") as f:
            f.write(f"[{time.strftime('%H:%M:%S')}] {msg}\n")
    except Exception:
        pass
 class IncidentStore:
    def __init__(self):
        self._store = _load_incident_store()
        self._dirty = False
    def lookup(self, pattern):
        inc = self._store.get("incidents", {}).get(pattern)
        if inc and inc.get("success_count", 0) > 0:
            rate = inc["success_count"] / max(inc["success_count"] + inc.get("fail_count", 0), 1)
            if rate > 0.5:
                return inc
        return None
    def record(self, pattern, fix, success=True):
        incs = self._store.setdefault("incidents", {})
        inc = incs.setdefault(pattern, {
            "fix": fix, "success_count": 0, "fail_count": 0,
            "last_seen": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
            "occurrences": 0,
        })
        inc["last_seen"] = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())
        inc["occurrences"] = inc.get("occurrences", 0) + 1
        if success:
            inc["success_count"] = inc.get("success_count", 0) + 1
        else:
            inc["fail_count"] = inc.get("fail_count", 0) + 1
        self._dirty = True
    def record_ai_call(self, tokens=0):
        stats = self._store.setdefault("stats", {"ai_calls": 0, "tokens_used": 0})
        stats["ai_calls"] = stats.get("ai_calls", 0) + 1
        stats["tokens_used"] = stats.get("tokens_used", 0) + tokens
        self._dirty = True
    def flush(self):
        if self._dirty:
            _save_incident_store(self._store)
            self._dirty = False
    @property
    def stats(self):
        return self._store.get("stats", {"ai_calls": 0, "tokens_used": 0})
 class AIDiagnosticAgent:
    def __init__(self, provider_url, model, api_key):
        self.provider_url = provider_url
        self.model = model
        self.api_key = api_key
        self.incident_store = IncidentStore()
    def diagnose(self, context):
        pattern = self._extract_pattern(context)
        known = self.incident_store.lookup(pattern)
        if known:
            _monitoring_log(f"Tier 2 HIT: pattern={pattern} fix={known['fix']}")
            return {"action": known["fix"], "reason": "known_pattern", "confidence": 0.9, "tier": 2}
        action = self._call_model(context)
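        # Skip caching the fallback dict that _call_model() returns when the AI call itself failed.
        if action and not str(action.get("reason", "")).startswith("ai_diag_failed"):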
            self.incident_store.record(pattern, action.get("action", "unknown"))
            self.incident_store.flush()
        return action
    def _extract_pattern(self, context):
        parts = []
        for k in sorted(context.get("signals", [])):
            parts.append(k)
        if context.get("http_code"):
            parts.append(f"http_{context['http_code']}")
        return "+".join(parts[:3]) or "unknown"
    def _call_model(self, context):
        prompt = (
            f"INCIDENT REPORT:\n"
            f"Time: {time.strftime('%Y-%m-%dT%H:%M:%SZ', time.gmtime())}\n"
            f"Proxy health: {context.get('proxy_alive', 'unknown')}\n"
            f"Upstream: {context.get('upstream_url', 'unknown')}\n"
            f"Model: {context.get('model', 'unknown')}\n"
            f"Last HTTP code: {context.get('http_code', 'n/a')}\n"
            f"Recent signals: {context.get('signals', [])}\n"
            f"Recent log tail:\n{context.get('log_tail', '')[:1500]}\n"
        )
        body = {
            "model": self.model,
            "messages": [
                {"role": "system", "content": _DIAGNOSTIC_SYSTEM_PROMPT},
                {"role": "user", "content": prompt},
            ],
            "max_tokens": 200,
            "temperature": 0.1,
        }
        try:
            req = urllib.request.Request(
                self.provider_url,
                data=json.dumps(body).encode(),
                headers={
                    "Content-Type": "application/json",
                    "Authorization": f"Bearer {self.api_key}",
                },
            )
            resp = urllib.request.urlopen(req, timeout=15)
            result = json.loads(resp.read())
            text = result["choices"][0]["message"]["content"].strip()
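            # Prefer the provider-reported usage; fall back to the original 800-token estimate.
            tokens = (result.get("usage") or {}).get("total_tokens", 800)
            self.incident_store.record_ai_call(tokens=tokens)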
            action = json.loads(text)
            action["tier"] = 3
            _monitoring_log(f"Tier 3 AI: action={action.get('action')} reason={action.get('reason')}")
            return action
        except Exception as e:
            _monitoring_log(f"Tier 3 AI FAILED: {e}")
            return {"action": "alert_user", "reason": f"ai_diag_failed: {e}", "confidence": 0.0, "tier": 3}
 class HealthWatcher(threading.Thread):
    def __init__(self, on_failure, on_recovery, on_signal, on_action):
        super().__init__(daemon=True)
        self.cfg = _load_monitoring_config()
        self.on_failure = on_failure
        self.on_recovery = on_recovery
        self.on_signal = on_signal
        self.on_action = on_action
        self.failures = 0
        self.running = False
        self._signal_counts = collections.defaultdict(int)
        self._last_actions = {}
        self._restart_count = 0
        self._last_restart_time = 0
    def run(self):
        self.running = True
        self.incident_store = IncidentStore()
        self._log_analyzer = _LogAnalyzerThread(self._on_log_signal)
        self._log_analyzer.start()
        while self.running:
            self.cfg = _load_monitoring_config()
            if not self.cfg.get("enabled"):
                time.sleep(5)
                continue
            port = self._get_proxy_port()
            if port:
                healthy = self._check_health(port)
                if healthy:
                    if self.failures > 0:
                        self.failures = 0
                        self.on_recovery()
                else:
                    self.failures += 1
                    self.on_failure(self.failures)  # was never invoked; pairs with on_recovery()
                    if self.failures >= 3:
                        self._handle_failure("proxy_health_fail")
            self.incident_store.flush()
            interval = self.cfg.get("health_check_interval_s", 5)
            time.sleep(interval)
    def stop(self):
        self.running = False
        if hasattr(self, '_log_analyzer'):
            self._log_analyzer.running = False
    def _get_proxy_port(self):
        try:
            cfg_path = Path.home() / ".cache/codex-proxy/proxy-config.json"
            if cfg_path.exists():
                d = json.loads(cfg_path.read_text())
                return d.get("port")
        except Exception:
            pass
        return None
    def _check_health(self, port):
        try:
            req = urllib.request.Request(f"http://localhost:{port}/health")
            resp = urllib.request.urlopen(req, timeout=5)
            return resp.status == 200
        except Exception:
            return False
    def _on_log_signal(self, fault_id, category, line):
        self._signal_counts[category] += 1
        self.on_signal(fault_id, category, line[:200])
        count = self._signal_counts[category]
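        # Triggers must match a name in _TIER1_RULES; otherwise Tier 1 is skipped entirely.
        if category == "proxy_dead" and count >= 2:
            self._handle_failure("proxy_health_fail")
        elif category == "port_conflict" and count >= 2:
            self._handle_failure("proxy_port_conflict")
        elif category == "server_error" and count >= 3:
            self._handle_failure("upstream_500_repeat")
        elif category == "timeout" and count >= 3:
            self._handle_failure("upstream_timeout")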
        elif category in ("sanitizer_flag",) and count >= 5:
            self._handle_failure("sanitizer_suspicious_5x")
        elif category in ("stuck_recovery",) and count >= 5:
            self._handle_failure("stuck_recovery_x5")
        elif category in ("parser_empty",) and count >= 3:
            self._handle_failure("parsed_tool_calls_0_x3")
        elif category in ("schema_corrupt",):
            self._handle_failure("schema_corrupt")
    def _handle_failure(self, trigger):
        now = time.time()
        for rule_trigger, action, cooldown in _TIER1_RULES:
            if rule_trigger == trigger:
                last_t = self._last_actions.get(action, 0)
                if now - last_t < cooldown:
                    return
                self._last_actions[action] = now
                _monitoring_log(f"Tier 1: trigger={trigger} action={action}")
                self.on_action(action, trigger)
                self.incident_store.record(trigger, action, success=True)
                return
        self._try_tier2_3(trigger)
    def _try_tier2_3(self, trigger):
        cfg = self.cfg
        if not cfg.get("provider_url") or not cfg.get("model") or not cfg.get("api_key"):
            _monitoring_log(f"No AI configured for Tier 2/3 — alerting user for trigger={trigger}")
            self.on_action("alert_user", trigger)
            return
        agent = AIDiagnosticAgent(cfg["provider_url"], cfg["model"], cfg["api_key"])
        context = {
            "signals": [trigger],
            "proxy_alive": self.failures == 0,
            "log_tail": self._get_recent_log(),
        }
        result = agent.diagnose(context)
        if result:
            action = result.get("action", "alert_user")
            _monitoring_log(f"Tier {result.get('tier', '?')}: action={action}")
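            self.on_action(action, trigger)
    def _get_recent_log(self, max_bytes=4000):
        # Tail of cc-debug.log for the Tier 3 prompt; _try_tier2_3() calls this but it was undefined.
        try:
            log_path = Path.home() / ".cache/codex-proxy/cc-debug.log"
            with open(log_path, "rb") as f:
                f.seek(0, 2)
                f.seek(max(f.tell() - max_bytes, 0))
                return f.read().decode("utf-8", "replace")
        except Exception:
            return ""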
 class _LogAnalyzerThread(threading.Thread):
    def __init__(self, on_signal):
        super().__init__(daemon=True)
        self.on_signal = on_signal
        self.running = False
    def run(self):
        self.running = True
        log_paths = [
            str(Path.home() / ".cache/codex-proxy/cc-debug.log"),
            str(Path.home() / ".cache/codex-proxy/proxy.log"),
        ]
        fhs = {}
        for p in log_paths:
            try:
                f = open(p, "r")
                f.seek(0, 2)
                fhs[p] = f
            except Exception:
                pass
        while self.running:
            activity = False
            for p, fh in list(fhs.items()):
                try:
                    line = fh.readline()
                    if line:
                        activity = True
                        for pattern, (fault_id, category) in _FAILURE_SIGNALS.items():
                            if re.search(pattern, line):
                                self.on_signal(fault_id, category, line.strip())
                                break
                except Exception:
                    pass
            if not activity:
                time.sleep(0.5)
 class AIMonitoringWindow(Gtk.Window):
    def __init__(self, parent=None):
        super().__init__(title="AI Monitoring")
        self.set_transient_for(parent)
        self.set_default_size(580, 520)
        self.set_border_width(12)
        self._cfg = _load_monitoring_config()
        self._store = _load_incident_store()
        vbox = Gtk.Box(orientation=Gtk.Orientation.VERTICAL, spacing=8)
        self.add(vbox)
        hdr = Gtk.Box(spacing=8)
        vbox.pack_start(hdr, False, False, 0)
        lbl = Gtk.Label()
        lbl.set_markup("<b>AI Monitoring</b>")
        lbl.set_use_markup(True)
        hdr.pack_start(lbl, False, False, 0)
        self._toggle = Gtk.Switch()
        self._toggle.set_active(self._cfg.get("enabled", False))
        self._toggle.connect("state-set", self._on_toggle)
        hdr.pack_end(self._toggle, False, False, 0)
        lbl2 = Gtk.Label(label="Enabled")
        hdr.pack_end(lbl2, False, False, 0)
        frame = Gtk.Frame(label="Diagnostic Agent")
        vbox.pack_start(frame, False, False, 0)
        grid = Gtk.Grid(column_spacing=8, row_spacing=6, margin=8)
        frame.add(grid)
        grid.attach(Gtk.Label(label="Provider URL:", halign=Gtk.Align.END), 0, 0, 1, 1)
        self._url_entry = Gtk.Entry(hexpand=True)
        self._url_entry.set_text(self._cfg.get("provider_url", ""))
        self._url_entry.set_placeholder_text("https://api.openai.com/v1/chat/completions")
        grid.attach(self._url_entry, 1, 0, 2, 1)
        grid.attach(Gtk.Label(label="Model:", halign=Gtk.Align.END), 0, 1, 1, 1)
        self._model_entry = Gtk.Entry(hexpand=True)
        self._model_entry.set_text(self._cfg.get("model", ""))
        self._model_entry.set_placeholder_text("gpt-4o-mini or Qwen/Qwen3-32B")
        grid.attach(self._model_entry, 1, 1, 2, 1)
        grid.attach(Gtk.Label(label="API Key:", halign=Gtk.Align.END), 0, 2, 1, 1)
        self._key_entry = Gtk.Entry(hexpand=True, visibility=False)
        self._key_entry.set_text(self._cfg.get("api_key", ""))
        self._key_entry.set_placeholder_text("sk-...")
        grid.attach(self._key_entry, 1, 2, 1, 1)
        self._reveal_btn = Gtk.ToggleButton(label="Show")
        self._reveal_btn.connect("toggled", lambda b: self._key_entry.set_visibility(b.get_active()))
        grid.attach(self._reveal_btn, 2, 2, 1, 1)
        grid.attach(Gtk.Label(label="Health Check:", halign=Gtk.Align.END), 0, 3, 1, 1)
        adj = Gtk.Adjustment(value=self._cfg.get("health_check_interval_s", 5), lower=2, upper=30, step_increment=1)
        self._interval_spin = Gtk.SpinButton(adjustment=adj)
        self._interval_spin.set_numeric(True)
        grid.attach(self._interval_spin, 1, 3, 1, 1)
        grid.attach(Gtk.Label(label="seconds"), 2, 3, 1, 1)
        opts_box = Gtk.Box(spacing=12, margin_top=4)
        grid.attach(opts_box, 0, 4, 3, 1)
        self._auto_restart_cb = Gtk.CheckButton(label="Auto-restart proxy on crash")
        self._auto_restart_cb.set_active(self._cfg.get("auto_restart_proxy", True))
        opts_box.pack_start(self._auto_restart_cb, False, False, 0)
        self._auto_switch_cb = Gtk.CheckButton(label="Auto-switch provider on repeated failure")
        self._auto_switch_cb.set_active(self._cfg.get("auto_switch_provider", False))
        opts_box.pack_start(self._auto_switch_cb, False, False, 0)
        save_btn = Gtk.Button(label="Save Configuration")
        save_btn.get_style_context().add_class("suggested-action")
        save_btn.connect("clicked", self._on_save)
        grid.attach(save_btn, 0, 5, 3, 1)
        stats_box = Gtk.Box(spacing=16)
        vbox.pack_start(stats_box, False, False, 0)
        stats = self._store.get("stats", {"ai_calls": 0, "tokens_used": 0})
        self._stats_lbl = Gtk.Label()
        self._stats_lbl.set_markup(
            f"<small>AI diagnostic calls: <b>{stats.get('ai_calls', 0)}</b>  |  "
            f"Tokens used: <b>{stats.get('tokens_used', 0):,}</b>  |  "
            f"Known patterns: <b>{len(self._store.get('incidents', {}))}</b></small>"
        )
        self._stats_lbl.set_use_markup(True)
        stats_box.pack_start(self._stats_lbl, False, False, 0)
        frame2 = Gtk.Frame(label="Recent Incidents")
        vbox.pack_start(frame2, True, True, 0)
        sw = Gtk.ScrolledWindow()
        sw.set_policy(Gtk.PolicyType.AUTOMATIC, Gtk.PolicyType.AUTOMATIC)
        frame2.add(sw)
        self._inc_buf = Gtk.TextBuffer()
        tv = Gtk.TextView(buffer=self._inc_buf)
        tv.set_editable(False)
        tv.set_cursor_visible(False)
        tv.set_wrap_mode(Gtk.WrapMode.WORD_CHAR)
        sw.add(tv)
        self._refresh_incidents()
        bb = Gtk.Box(spacing=8)
        vbox.pack_start(bb, False, False, 0)
        view_btn = Gtk.Button(label="View Monitoring Log")
        view_btn.connect("clicked", lambda b: subprocess.Popen(["xdg-open", str(MONITORING_LOG)]))
        bb.pack_start(view_btn, False, False, 0)
        clear_btn = Gtk.Button(label="Clear Incident Store")
        clear_btn.connect("clicked", self._on_clear_store)
        bb.pack_start(clear_btn, False, False, 0)
        close_btn = Gtk.Button(label="Close")
        close_btn.connect("clicked", lambda b: self.destroy())
        bb.pack_end(close_btn, False, False, 0)
        self.show_all()
    def _on_toggle(self, switch, state):
        self._cfg["enabled"] = state
        _save_monitoring_config(self._cfg)
    def _on_save(self, btn):
        self._cfg["provider_url"] = self._url_entry.get_text().strip()
        self._cfg["model"] = self._model_entry.get_text().strip()
        self._cfg["api_key"] = self._key_entry.get_text().strip()
        self._cfg["health_check_interval_s"] = int(self._interval_spin.get_value())
        self._cfg["auto_restart_proxy"] = self._auto_restart_cb.get_active()
        self._cfg["auto_switch_provider"] = self._auto_switch_cb.get_active()
        _save_monitoring_config(self._cfg)
        self._inc_buf.set_text("Configuration saved.\n")
    def _on_clear_store(self, btn):
        _save_incident_store({"version": 1, "incidents": {}, "stats": {"ai_calls": 0, "tokens_used": 0}})
        self._store = {"version": 1, "incidents": {}, "stats": {"ai_calls": 0, "tokens_used": 0}}
        self._refresh_incidents()
    def _refresh_incidents(self):
        lines = []
        for pattern, inc in sorted(self._store.get("incidents", {}).items(),
                                    key=lambda x: x[1].get("last_seen", ""), reverse=True):
            sc = inc.get("success_count", 0)
            fc = inc.get("fail_count", 0)
            rate = sc / max(sc + fc, 1)
            bar = "+" * min(int(rate * 10), 10) + "-" * (10 - min(int(rate * 10), 10))
            lines.append(
                f"[{inc.get('last_seen', '?')[:16]}] {pattern}\n"
                f"  fix={inc.get('fix', '?')}  success_rate={rate:.0%} [{bar}]  "
                f"seen={inc.get('occurrences', 0)}x\n"
            )
        if not lines:
            lines.append("No incidents recorded yet.\n")
            lines.append("\nEnable AI Monitoring and use Codex to populate the store.\n")
        self._inc_buf.set_text("\n".join(lines))
 # ═══════════════════════════════════════════════════════════════════
 # Main window
 # ═══════════════════════════════════════════════════════════════════
@@ -1143,7 +1661,7 @@ class LauncherWin(Gtk.Window):
        # header row
        hdr = Gtk.Box(spacing=8)
        vbox.pack_start(hdr, False, False, 0)
-        lbl = Gtk.Label(label="<b>Codex Launcher v3.7.0</b>")
+        lbl = Gtk.Label(label="<b>Codex Launcher v3.8.0</b>")
        lbl.set_use_markup(True)
        hdr.pack_start(lbl, False, False, 0)
        changelog_btn = Gtk.Button(label="Changelog")
@@ -1161,6 +1679,9 @@ class LauncherWin(Gtk.Window):
        bgp_btn = Gtk.Button(label="AI BGP")
        bgp_btn.connect("clicked", lambda b: self._open_bgp())
        hdr.pack_end(bgp_btn, False, False, 0)
        mon_btn = Gtk.Button(label="AI Monitor")
        mon_btn.connect("clicked", lambda b: self._open_monitoring())
        hdr.pack_end(mon_btn, False, False, 0)
        mgr_btn = Gtk.Button(label="Manage Endpoints")
        mgr_btn.connect("clicked", lambda b: self._open_mgr())
        hdr.pack_end(mgr_btn, False, False, 0)
@@ -1310,6 +1831,7 @@ class LauncherWin(Gtk.Window):
        self.show_all()
        self._rebuild_combo()
        self._log_dependency_status()
        self._start_watcher()
    # ── helpers ──────────────────────────────────────────────────
@@ -1456,13 +1978,84 @@ class LauncherWin(Gtk.Window):
            d.run(); d.destroy()
    def _open_bgp(self):
-        try:
+         try:
-            self._bgp_window = BGPPoolMgr(self)
+             self._bgp_window = BGPPoolMgr(self)
-            self._bgp_window.connect("destroy", lambda *_: setattr(self, "_bgp_window", None))
+             self._bgp_window.connect("destroy", lambda *_: setattr(self, "_bgp_window", None))
-        except Exception as e:
+         except Exception as e:
-            import traceback; traceback.print_exc()
+             import traceback; traceback.print_exc()
-            d = Gtk.MessageDialog(self, 0, Gtk.MessageType.ERROR, Gtk.ButtonsType.OK, f"Error: {e}")
+             d = Gtk.MessageDialog(self, 0, Gtk.MessageType.ERROR, Gtk.ButtonsType.OK, f"Error: {e}")
-            d.run(); d.destroy()
+             d.run(); d.destroy()
    def _open_monitoring(self):
         try:
             self._monitoring_window = AIMonitoringWindow(self)
             self._monitoring_window.connect("destroy", lambda *_: setattr(self, "_monitoring_window", None))
         except Exception as e:
             import traceback; traceback.print_exc()
             d = Gtk.MessageDialog(self, 0, Gtk.MessageType.ERROR, Gtk.ButtonsType.OK, f"Error: {e}")
             d.run(); d.destroy()
    def _start_watcher(self):
         cfg = _load_monitoring_config()
         if not cfg.get("enabled"):
             return
         self._watcher = HealthWatcher(
             on_failure=self._on_watcher_failure,
             on_recovery=self._on_watcher_recovery,
             on_signal=self._on_watcher_signal,
             on_action=self._on_watcher_action,
         )
         self._watcher.start()
         self.log("AI Monitoring: watchdog started")
    def _on_watcher_failure(self, count):
         GLib.idle_add(self.log, f"[AI Monitor] Proxy unresponsive (failures={count})")
    def _on_watcher_recovery(self):
         GLib.idle_add(self.log, "[AI Monitor] Proxy recovered")
    def _on_watcher_signal(self, fault_id, category, line):
         pass
    def _on_watcher_action(self, action, trigger):
         cfg = _load_monitoring_config()
         if action == "restart_proxy" and cfg.get("auto_restart_proxy"):
             GLib.idle_add(self.log, f"[AI Monitor] Auto-restarting proxy (trigger: {trigger})")
             GLib.idle_add(self._restart_proxy_from_watcher)
         elif action == "clear_schema_cache":
             try:
                 cap_file = Path.home() / ".cache/codex-proxy/provider-caps.json"
                 if cap_file.exists():
                     cap_file.unlink()
                     GLib.idle_add(self.log, "[AI Monitor] Cleared corrupt schema cache")
             except Exception as e:
                 GLib.idle_add(self.log, f"[AI Monitor] Failed to clear cache: {e}")
         elif action == "delete_provider_caps":
             try:
                 cap_file = Path.home() / ".cache/codex-proxy/provider-caps.json"
                 if cap_file.exists():
                     cap_file.unlink()
                     GLib.idle_add(self.log, "[AI Monitor] Deleted corrupted provider-caps.json")
             except Exception as e:
                 GLib.idle_add(self.log, f"[AI Monitor] Failed: {e}")
         elif action == "kill_stale_restart":
             GLib.idle_add(self.log, f"[AI Monitor] Killing stale processes + restarting (trigger: {trigger})")
             self._kill()
             GLib.idle_add(self._restart_proxy_from_watcher)
         else:
             GLib.idle_add(self.log, f"[AI Monitor] Alert: {action} (trigger: {trigger})")
    def _restart_proxy_from_watcher(self):
         try:
             ep_name = load_endpoints().get("default")
             if not ep_name:
                 return
             for ep in load_endpoints().get("endpoints", []):
                 if ep.get("name") == ep_name:
                     self._start_proxy(ep)
                     break
         except Exception as e:
             self.log(f"[AI Monitor] Proxy restart failed: {e}")
    def _open_usage(self):
        try:
--- a/src/translate-proxy.py
+++ b/src/translate-proxy.py
@@ -3410,10 +3410,20 @@ class Handler(http.server.BaseHTTPRequestHandler):
        if self.path in ("/v1/models", "/models"):
            self.send_json(200, {"object": "list", "data": MODELS})
        elif self.path in ("/health", "/v1/health"):
            import resource as _res
            _mem_mb = 0
            try:
                _mem_mb = _res.getrusage(_res.RUSAGE_SELF).ru_maxrss / 1024
            except Exception:
                pass
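            # dir() inside a method lists local names only, so check globals() for the module-level _START_TIME.
            _uptime = time.time() - _START_TIME if '_START_TIME' in globals() else 0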
            self.send_json(200, {"ok": True, "backend": BACKEND,
                                 "target_url": TARGET_URL,
                                 "models": [m.get("id") for m in MODELS],
-                                 "bgp_routes": len(BGP_ROUTES)})
+                                 "bgp_routes": len(BGP_ROUTES),
                                 "uptime_s": round(_uptime, 1),
                                 "memory_mb": round(_mem_mb, 1),
                                 "requests_total": _STATS.get("requests", 0)})
        else:
            self.send_error(404)
@@ -4750,10 +4760,11 @@ def _handle_shutdown_signal(sig, frame):
    _SHUTDOWN_REQUESTED = True
    print(f"[SELF-REVIVE] Signal {sig} received, shutting down cleanly", flush=True)
    if 'SERVER' in globals() and SERVER:
-        SERVER.shutdown()
+         SERVER.shutdown()
 def main():
-    global SERVER
+    global SERVER, _START_TIME
    _START_TIME = time.time()
    _init_runtime()
    signal.signal(signal.SIGTERM, _handle_shutdown_signal)
    signal.signal(signal.SIGINT, _handle_shutdown_signal)