# AI Monitoring: Design Specification

> **Codex Launcher v3.8.0 Feature Design**
>
> Self-healing nano agent that monitors proxy health, diagnoses failures, and auto-recovers sessions.

---

## 1. Problem Statement

Over 42 sessions in production, we observed these failure categories:

| # | Failure Category | Count | Example |
|---|-----------------|-------|---------|
| F1 | **parsed_tool_calls=0**: model produces unparseable output | 42 | Bare ``, `` without cmd, plain English intent |
| F2 | **Stuck recovery triggered**: Intelligence Routing Layer 3 | 13 | "I need to fetch the README", "let me write the script" |
| F3 | **Sanitizer flagged suspicious cmd**: cmd still JSON after unwrap | 11 | `{/'cmd/': /'sshpass -p .../'}` (double-escaped quoting) |
| F4 | **Upstream 500**: provider internal error | ~5 | `"An internal error occurred. Please try again later."` |
| F5 | **Connection timeout**: upstream unreachable | ~3 | `Connection timed out after 15002 milliseconds` |
| F6 | **Upstream 401/403**: auth failure | ~2 | Wrong API key, expired token, `upgrade_required` |
| F7 | **Stream crash**: exception mid-stream | ~2 | `BrokenPipeError`, `ConnectionResetError` during SSE |
| F8 | **Proxy port conflict**: Address already in use | ~1 | Stale process holding port |
| F9 | **Schema cache corruption**: stale content_type=array | ~1 | `ErrorAnalyzer` learned wrong schema |
| F10 | **Codex Desktop crash**: SIGKILL at ~27GB | ~1 | Issue #24048: unbounded tool output memory |
| F11 | **Codex 300s stall**: turn state machine race | ~1 | Issue #23807: `stream disconnected` after 300s |

### The Gap

Intelligence Routing (v3.7.0) handles F1/F2/F3 **inside a single request**.
But it can't:

- **Detect a dead proxy process** (F7/F8): the proxy has already crashed
- **Reconnect Codex to a restarted proxy** (F5/F7/F8): Codex doesn't auto-reconnect
- **Switch to a backup provider** when the primary is down (F4/F5)
- **Clear corrupt caches** (F9): requires out-of-band action
- **Restart Codex Desktop** after a crash (F10/F11)
- **Learn from failure patterns** across sessions: each failure is handled independently

### What We Need

A **separate lightweight watchdog process** that:

1. Monitors proxy health continuously
2. Detects failures the proxy can't detect itself
3. Uses a cheap AI model to diagnose novel failures
4. Takes corrective action automatically
5. Learns from past incidents to prevent repeats

---

## 2. Architecture

```
┌─────────────────────────────────────────────────────────────────────┐
│                         Codex Launcher GUI                          │
│  ┌──────────┐  ┌──────────────┐  ┌───────────────────────────────┐  │
│  │ Proxy    │  │ Codex        │  │ AI Monitoring Panel           │  │
│  │ Manager  │  │ Launcher     │  │ ┌─────────────────────┐       │  │
│  │          │  │              │  │ │ ON/OFF Toggle       │       │  │
│  └────┬─────┘  └──────┬───────┘  │ │ Provider Selector   │       │  │
│       │               │          │ │ Model Selector      │       │  │
│       │               │          │ │ Incident Log        │       │  │
│       │               │          │ │ [View Diagnostics]  │       │  │
│       │               │          │ └─────────────────────┘       │  │
│       │               │          └───────────────────────────────┘  │
└───────┼───────────────┼─────────────────────────────────────────────┘
        │               │
        ▼               ▼
┌───────────────┐  ┌────────────────┐
│ translate-    │  │ Codex Desktop  │
│ proxy.py      │  │ / CLI          │
│ (port 8080)   │  │                │
│               │  │                │
│ /health ──────┼──┼─► health check │
│ /responses ───┼──┼─► main API     │
└───────────────┘  └────────────────┘
        ▲
        │ health probes + log analysis + corrective actions
        │
┌───────┴─────────────────────────────────────────────────────────────┐
│                         AI Monitor Watchdog                         │
│                   (thread in codex-launcher-gui)                    │
│                                                                     │
│  ┌─────────────────┐   ┌─────────────────┐   ┌──────────────────┐   │
│  │ Health Watcher  │   │ Log Analyzer    │   │ AI Diagnostic    │   │
│  │ (every 5s)      │   │ (continuous)    │   │ Agent (on-call)  │   │
│  │                 │   │                 │   │                  │   │
│  │ - /health probe │   │ - tail cc-debug │   │ - Classify err   │   │
│  │ - process alive │   │ - tail proxy.log│   │ - Root cause     │   │
│  │ - port check    │   │ - pattern match │   │ - Suggest fix    │   │
│  │ - memory watch  │   │ - incident DB   │   │ - Execute fix    │   │
│  └────────┬────────┘   └────────┬────────┘   └────────┬─────────┘   │
│           │                     │                     │             │
│           └─────────────────────┼─────────────────────┘             │
│                                 ▼                                   │
│                      ┌──────────────────────┐                       │
│                      │ Incident Store       │                       │
│                      │ (JSON file)          │                       │
│                      │ - Known patterns     │                       │
│                      │ - Past resolutions   │                       │
│                      │ - Success rates      │                       │
│                      └──────────────────────┘                       │
└─────────────────────────────────────────────────────────────────────┘
```

---

## 3. Three-Tier Response System

### Tier 1: Fast Path, Rule-Based Auto-Recovery (< 1 second)

Immediate reactions to **known failure patterns**. No AI needed.

```python
TIER1_RULES = [
    # (trigger_pattern, action, cooldown)

    # --- Proxy Health ---
    ("proxy_health_fail", "restart_proxy", 30),
    ("proxy_port_conflict", "kill_stale + restart", 60),
    ("proxy_memory_over_1gb", "restart_proxy", 120),

    # --- Upstream Errors ---
    ("upstream_429", "wait_retry_after", 0),
    ("upstream_502_503", "retry_with_backoff", 30),
    ("upstream_500_repeat_3x", "switch_provider", 60),
    ("upstream_timeout", "retry + increase_timeout", 30),
    ("upstream_401_403", "alert_user_bad_key", 0),

    # --- Stream Errors ---
    ("stream_broken_pipe", "restart_proxy", 30),
    ("stream_reset", "restart_proxy", 30),
    ("stream_idle_300s", "restart_proxy", 60),

    # --- Parser Failures ---
    ("parsed_tool_calls_0_x3", "clear_schema_cache", 300),
    ("sanitizer_suspicious_5x", "alert_user_model_issue", 0),
    ("stuck_recovery_x5", "suggest_switch_model", 0),

    # --- Codex Process ---
    ("codex_process_dead", "alert_user_restart", 0),
    ("codex_memory_over_4gb", "alert_user_memory", 0),

    # --- Cache Corruption ---
    ("schema_content_type_array", "delete_provider_caps", 0),
]
```

### Tier 2: Pattern Matching, Incident Store Lookup (< 100ms)

For failures we've **seen before and resolved**, look up the fix:

```json
{
  "incidents": [
    {
      "pattern": "cc_stream_ended_empty + explore_agent + no_url",
      "fix": "synth_explore_from_last_user_urls",
      "source": "FIX-23",
      "success_rate": 0.85,
      "last_seen": "2026-05-22T16:00:00Z",
      "occurrences": 5
    },
    {
      "pattern": "require_escalation + no_cmd",
      "fix": "auto_proceed_echo",
      "source": "FIX-24",
      "success_rate": 1.0,
      "last_seen": "2026-05-22T15:30:00Z",
      "occurrences": 3
    }
  ]
}
```

### Tier 3: AI Diagnostic, Nano Agent (2-5 seconds)

For **novel failures** that don't match any rule or pattern, invoke a cheap AI model:

```
Prompt Template (system):
─────────────────────
You are a diagnostic agent for a translation proxy that sits between
OpenAI Codex CLI/Desktop and AI providers (Command Code, OpenAI-compat,
Anthropic, etc.). You analyze error context and suggest ONE corrective action.

Available actions: restart_proxy, kill_stale_processes, clear_schema_cache,
switch_provider, increase_timeout, alert_user, ignore, retry_now,
regenerate_config, cleanup_codex_stale

Respond with ONLY a JSON object:
{"action": "...", "reason": "...", "confidence": 0.0-1.0}

Prompt Template (user):
─────────────────────
INCIDENT REPORT:
Time: {timestamp}
Session: {session_id}
Proxy health: {alive/dead, port, uptime, memory_mb}
Upstream: {url, model, last_http_code, last_error}
Recent errors (last 60s):
{log_lines}
Parser state: {parsed_tool_calls, stuck_recovery_count, sanitizer_flags}
Provider: {backend_type, model}
History: {last_5_incidents_for_this_pattern}

What corrective action should be taken?
```

---

## 4. Complete Failure Catalog

### Category A: Proxy-Level Failures (watchdog detects, auto-recovers)

| ID | Failure | Symptoms | Tier 1 Action | Log Signature |
|----|---------|----------|---------------|---------------|
| A1 | Proxy process crashed | `/health` returns connection refused | `restart_proxy` | `urllib.error.URLError: [Errno 111] Connection refused` |
| A2 | Port conflict | `Address already in use` on startup | `kill_stale + restart` | `OSError: [Errno 98] Address already in use` |
| A3 | Memory leak | Process RSS > 1GB | `restart_proxy` | `/proc/{pid}/status` VmRSS check |
| A4 | Deadlock | Health check hangs > 15s | `restart_proxy` | health probe timeout |
| A5 | Unhandled exception | Process exits with non-zero | `restart_proxy` | `SELF-REVIVE CRASH #{n}` |
| A6 | SSL/TLS error | `CERTIFICATE_VERIFY_FAILED` upstream | `alert_user` | `urllib.error.URLError: certificate verify failed` |
| A7 | DNS resolution failure | `getaddrinfo failed` | `retry_with_backoff` | `socket.gaierror: Name or service not known` |

### Category B: Upstream Provider Failures (proxy detects, watchdog analyzes)

| ID | Failure | Symptoms | Tier 1 Action | Log Signature |
|----|---------|----------|---------------|---------------|
| B1 | Rate limit (429) | Too many requests | `wait_retry_after` | `HTTP 429` + `Retry-After` header |
| B2 | Server error (5xx) | Provider down | `retry_with_backoff` | `HTTP 500/502/503` |
| B3 | Auth failure (401/403) | Bad/expired key | `alert_user_bad_key` | `HTTP 401 {"error":"invalid_api_key"}` |
| B4 | CC upgrade required (403) | Version mismatch | `update_cc_version` | `HTTP 403 upgrade_required` |
| B5 | Connection timeout | Upstream silent | `retry + increase_timeout` | `urllib.error.URLError: timed out` |
| B6 | Connection reset | Upstream dropped mid-stream | `restart_proxy` | `ConnectionResetError: Connection reset by peer` |
| B7 | Broken pipe | Client disconnected | `ignore` | `BrokenPipeError: Broken pipe` |
| B8 | Upstream 400 bad request | Malformed request | `clear_schema_cache` | `HTTP 400 {"error":"...expected string..."}` |
| B9 | Provider capacity (503) | Overloaded | `switch_provider` | `HTTP 503` after 3 retries |
| B10 | Cloudflare block (403/1010) | Bot detection | `check_browser_ua` | `HTTP 403 error 1010` |

### Category C: Parser/Format Failures (Intelligence Routing handles, watchdog tracks)

| ID | Failure | Symptoms | Auto-Fix (IR Layer) | Watchdog Escalation |
|----|---------|----------|--------------------|--------------------|
| C1 | Bare `` | `parsed_tool_calls=0` | Layer 1: URL extraction | If 3x in a row → suggest model switch |
| C2 | `` block | Model wants permissions | Layer 2: Auto-proceed | If 5x → suggest different provider |
| C3 | Unrecognized format | No parser matches | Layer 3: Intent synthesis | If 5x → log for AI diagnosis |
| C4 | Double-wrapped cmd | `cmd = "{\"cmd\": ...}"` | Sanitizer: unwrap | If cmd still JSON → alert |
| C5 | Suspicious cmd (JSON) | `cmd starts with {` | Sanitizer: flag | If 3x → clear cache + restart |
| C6 | Empty cmd | `cmd = ""` or `cmd = "{}"` | Sanitizer: diagnostic echo | If 3x → suggest model switch |
| C7 | Bare `{` token | Model outputs incomplete JSON | Layer 3: heuristic 5 | If persistent → AI diagnosis |
| C8 | `` without cmd | Block has sandbox but no command | Layer 3: heuristic | If 3x → AI diagnosis |
| C9 | DSML name mismatch | `name="cmd"` vs `name="command"` | DSML parser handles both | Self-test catches regression |
| C10 | Stuck model loop | Same recovery 5+ times | Layer 3 max 3x then alert | Switch model or provider |

### Category D: Codex Process Failures (watchdog detects, alerts user)

| ID | Failure | Symptoms | Action | Log Signature |
|----|---------|----------|--------|---------------|
| D1 | Codex process killed | PID gone from pids.json | `alert_user_restart` | Process not in `/proc/{pid}` |
| D2 | Codex memory explosion | RSS > 4GB | `alert_user_memory` | `/proc/{pid}/status` check |
| D3 | Codex 300s stall | `stream disconnected` loop | `restart_proxy` | Codex stderr: `stream disconnected` |
| D4 | Config corruption | `database disk image is malformed` | `regenerate_config` | Codex stderr: `malformed` |
| D5 | Session context overflow | `context_length_exceeded` | `alert_user_context` | Codex stderr: `context_length_exceeded` |
| D6 | WebSocket reconnect loop | `Reconnecting... N/5` | `check_proxy_health` | Codex stderr: `Reconnecting` |

### Category E: Config/State Failures (watchdog detects, auto-fixes)

| ID | Failure | Symptoms | Action | Detection |
|----|---------|----------|--------|-----------|
| E1 | Schema cache corruption | `content_type: "array"` in provider-caps.json | `delete_provider_caps` | Read file, check for known-bad values |
| E2 | Stale PID file | pids.json has dead PIDs | `cleanup_pids` | Check `/proc/{pid}` existence |
| E3 | Port from old session | config.toml has stale port | `regenerate_config` | Port in config != running port |
| E4 | OAuth token expired | Google/Gemini token refresh fails | `alert_user_reauth` | Token file `expiry_ts < now` |
| E5 | BGP all routes down | Every route returned error | `alert_user_no_provider` | All routes in cooldown |

---

## 5. Component Design

### 5.1 Health Watcher Thread

Runs in the GUI process as a background thread. Pings proxy `/health` endpoint every 5 seconds.

```python
import threading
import time
import urllib.request


class HealthWatcher(threading.Thread):
    def __init__(self, proxy_port, on_failure, on_recovery):
        super().__init__(daemon=True)
        self.proxy_port = proxy_port
        self.on_failure = on_failure
        self.on_recovery = on_recovery
        self.check_interval = 5  # seconds
        self.failures = 0
        self.running = True

    def run(self):
        while self.running:
            healthy = self._check_health()
            if healthy:
                if self.failures > 0:
                    self.failures = 0
                    self.on_recovery()
            else:
                self.failures += 1
                # Fires on the third consecutive failure (about 15s) and on
                # every failed probe after that until the proxy recovers.
                if self.failures >= 3:
                    self.on_failure(self.failures)
            time.sleep(self.check_interval)

    def _check_health(self):
        try:
            req = urllib.request.Request(f"http://localhost:{self.proxy_port}/health")
            # The context manager closes the response even on early return.
            with urllib.request.urlopen(req, timeout=5) as resp:
                return resp.status == 200
        except Exception:
            return False
```

### 5.2 Log Analyzer Thread

Tails the debug log and extracts failure signals in real time.

```python
import threading
import time

# Substring -> (fault ID from the failure catalog, category name).
# The first matching entry wins, so order matters.
FAILURE_SIGNALS = {
    "parsed_tool_calls=0": ("C1", "parser_empty"),
    "[STUCK-RECOVERY]": ("C3", "stuck_recovery"),
    "suspicious cmd": ("C4", "sanitizer_flag"),
    "empty cmd recovered": ("C6", "empty_cmd"),
    "HTTP 429": ("B1", "rate_limited"),
    "HTTP 500": ("B2", "server_error"),
    "HTTP 401": ("B3", "auth_failure"),
    "HTTP 403": ("B4", "forbidden"),
    "Connection refused": ("A1", "proxy_dead"),
    "Address already in use": ("A2", "port_conflict"),
    "Broken pipe": ("B7", "broken_pipe"),
    "Connection reset": ("B6", "connection_reset"),
    "timed out": ("B5", "timeout"),
    "SELF-REVIVE CRASH": ("A5", "proxy_crash"),
    "stream error": ("B6", "stream_error"),
}


class LogAnalyzer(threading.Thread):
    def __init__(self, log_path, on_signal):
        super().__init__(daemon=True)
        self.log_path = log_path
        self.on_signal = on_signal
        self.running = True

    def run(self):
        # The with-block closes the file handle when the thread stops.
        with open(self.log_path, "r") as fh:
            fh.seek(0, 2)  # seek to end: only new lines are analyzed
            while self.running:
                line = fh.readline()
                if not line:
                    time.sleep(0.5)
                    continue
                for pattern, (fault_id, category) in FAILURE_SIGNALS.items():
                    if pattern in line:
                        self.on_signal(fault_id, category, line.strip())
                        break
```

### 5.3 AI Diagnostic Agent

Invoked by the watchdog when a failure doesn't match Tier 1 rules or Tier 2 patterns.

```python
import json
import urllib.request


class AIDiagnosticAgent:
    # _extract_pattern, _build_prompt and _parse_response are not shown here.
    def __init__(self, provider_url, model, api_key):
        self.provider_url = provider_url
        self.model = model
        self.api_key = api_key
        self.system_prompt = DIAGNOSTIC_SYSTEM_PROMPT  # see Section 5.5
        self.incident_store = IncidentStore()  # see Section 5.4

    def diagnose(self, context):
        # Tier 2: Check incident store first
        pattern = self._extract_pattern(context)
        known_fix = self.incident_store.lookup(pattern)
        if known_fix and known_fix["success_rate"] > 0.7:
            return known_fix["fix"], "tier2_pattern", known_fix["success_rate"]

        # Tier 3: Ask AI
        prompt = self._build_prompt(context)
        response = self._call_model(prompt)
        action = self._parse_response(response)

        # Learn from this incident
        if action:
            self.incident_store.record(pattern, action)

        return action, "tier3_ai", None

    def _call_model(self, prompt):
        # OpenAI-style chat completions request body.
        body = {
            "model": self.model,
            "messages": [
                {"role": "system", "content": self.system_prompt},
                {"role": "user", "content": prompt},
            ],
            "max_tokens": 200,
            "temperature": 0.1,
        }
        req = urllib.request.Request(
            self.provider_url,
            data=json.dumps(body).encode(),
            headers={
                "Content-Type": "application/json",
                "Authorization": f"Bearer {self.api_key}",
            },
        )
        with urllib.request.urlopen(req, timeout=15) as resp:
            return json.loads(resp.read())["choices"][0]["message"]["content"]
```

### 5.4 Incident Store

JSON file that accumulates failure patterns and their resolutions.

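Section 5.3 calls `IncidentStore.lookup()` and `IncidentStore.record()` without defining them. The following is a minimal sketch of what such a class might look like, not the launcher's actual implementation. The constructor arguments, the `success` and `source` parameters, and the atomic-write helper are assumptions made for illustration; the default path and the 500-entry cap are taken from Sections 10 and 8. It derives `success_rate` from the stored success and failure counts, which is the field `diagnose()` reads.

```python
import json
import os
import tempfile
from datetime import datetime, timezone

# Path taken from the File Changes Summary; everything else is illustrative.
DEFAULT_PATH = os.path.expanduser("~/.cache/codex-proxy/incident-store.json")


class IncidentStore:
    """Hypothetical sketch: maps a failure pattern to fix statistics."""

    def __init__(self, path=DEFAULT_PATH, max_entries=500):
        self.path = path
        self.max_entries = max_entries
        self.data = {"version": 1, "incidents": {}}
        if os.path.exists(path):
            with open(path, "r", encoding="utf-8") as fh:
                self.data = json.load(fh)

    def lookup(self, pattern):
        """Return {'fix', 'success_rate'} for a known pattern, else None."""
        entry = self.data["incidents"].get(pattern)
        if entry is None:
            return None
        total = entry["success_count"] + entry["fail_count"]
        rate = entry["success_count"] / total if total else 0.0
        return {"fix": entry["fix"], "success_rate": rate}

    def record(self, pattern, fix, success=True, source="tier3_ai"):
        """Create or update an entry, evict the stalest if over the cap, save."""
        incidents = self.data["incidents"]
        entry = incidents.setdefault(pattern, {
            "fault_ids": [], "fix": fix, "source": source,
            "success_count": 0, "fail_count": 0, "auto_applied": False,
        })
        entry["success_count" if success else "fail_count"] += 1
        entry["last_seen"] = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
        # ISO-8601 UTC strings in one format sort chronologically as text.
        while len(incidents) > self.max_entries:
            oldest = min(incidents, key=lambda k: incidents[k]["last_seen"])
            del incidents[oldest]
        self._save()

    def _save(self):
        # Write to a temp file, then rename, so readers never see a partial file.
        directory = os.path.dirname(self.path) or "."
        os.makedirs(directory, exist_ok=True)
        fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(self.data, fh, indent=2)
        os.replace(tmp, self.path)
```

Eviction here drops the entry with the oldest `last_seen`, which approximates the LRU policy named in the safety guards. The JSON that follows shows the on-disk shape this sketch assumes.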
```json
{
  "version": 1,
  "incidents": {
    "parser_empty+explore_agent": {
      "fault_ids": ["C1"],
      "fix": "synth_explore_from_urls",
      "source": "intelligent_routing",
      "success_count": 8,
      "fail_count": 1,
      "last_seen": "2026-05-22T16:00:00Z",
      "auto_applied": true
    },
    "server_error+repeat_3x": {
      "fault_ids": ["B2"],
      "fix": "switch_provider",
      "source": "tier1_rule",
      "success_count": 2,
      "fail_count": 0,
      "last_seen": "2026-05-22T14:00:00Z",
      "auto_applied": true
    }
  },
  "ai_diagnostic_calls": 0,
  "tokens_used": 0,
  "cost_usd": 0.0
}
```

### 5.5 Diagnostic Agent System Prompt

```
You are a diagnostic agent for "Codex Launcher", a desktop app that runs
a local translation proxy between OpenAI Codex CLI/Desktop and various AI providers.

## Your Job
Analyze the incident report and recommend ONE corrective action.

## Available Actions
- restart_proxy: Kill and restart translate-proxy.py
- kill_stale_processes: Kill orphaned proxy/codex processes
- clear_schema_cache: Delete ~/.cache/codex-proxy/provider-caps.json
- switch_provider: Switch to a different configured endpoint
- increase_timeout: Increase upstream timeout for slow providers
- regenerate_config: Regenerate Codex config.toml
- cleanup_codex_stale: Run cleanup-codex-stale.sh
- alert_user: Show notification to user (can't auto-fix)
- ignore: Transient error, no action needed
- retry_now: Immediate retry without changes

## Decision Rules
- If upstream returns 401/403 with auth error → alert_user (can't fix bad keys)
- If proxy process is dead → restart_proxy
- If same error repeated 5+ times → switch_provider or alert_user
- If error is about content_type/schema → clear_schema_cache
- If "Address already in use" → kill_stale_processes then restart_proxy
- If timeout and upstream is slow → increase_timeout
- If single transient 429/502/503 → ignore (retry handles it)
- If "stream disconnected" and proxy is healthy → ignore (Codex retries)

## Response Format
Reply with ONLY a JSON object:
{"action": "...", "reason": "...", "confidence": 0.0-1.0}

No explanation, no markdown, no extra text.
```

---

## 6. GUI Integration

### AI Monitoring Panel (in Settings tab)

```
┌─────────────────────────────────────────────────────────┐
│ AI Monitoring                                      [ON] │
│                                                         │
│ ┌─ Diagnostic Agent ──────────────────────────────────┐ │
│ │ Provider: [OpenCode Zen ▼]                          │ │
│ │ Model:    [Qwen3-32B ▼]                             │ │
│ │ API Key:  [sk-••••••••••••••••••••]                 │ │
│ │                                                     │ │
│ │ Cost this month: $0.12 (3 diagnostic calls)         │ │
│ │ Tokens used: 1,847 input / 423 output               │ │
│ └─────────────────────────────────────────────────────┘ │
│                                                         │
│ ┌─ Incident Log (last 7 days) ────────────────────────┐ │
│ │ ✅ 16:00 C1 parser_empty → synth_explore (Tier 2)   │ │
│ │ ⚠️ 15:30 B2 server_error → retry (Tier 1)           │ │
│ │ ✅ 15:00 A1 proxy_dead → restart_proxy (Tier 1)     │ │
│ │ 🤖 14:30 C3 novel_format → clear_cache (Tier 3)     │ │
│ │ ...                                                 │ │
│ └─────────────────────────────────────────────────────┘ │
│                                                         │
│ [View Full Diagnostics]  [Export Incident Report]       │
└─────────────────────────────────────────────────────────┘
```

### Config Storage (in endpoints.json)

```json
{
  "ai_monitoring": {
    "enabled": true,
    "provider_url": "https://opencode.ai/zen/v1/chat/completions",
    "model": "Qwen/Qwen3-32B",
    "api_key": "sk-...",
    "tier1_enabled": true,
    "tier2_enabled": true,
    "tier3_enabled": true,
    "auto_restart_proxy": true,
    "auto_switch_provider": false,
    "health_check_interval_s": 5,
    "max_memory_mb": 1024,
    "notification_level": "important_only"
  }
}
```

### Recommended Models (by cost)

| Model | Cost/Diagnosis | Latency | Quality | Recommended For |
|-------|---------------|---------|---------|----------------|
| **Qwen3-32B** (OpenCode) | ~$0.0005 | 2-4s | Good | Default: low cost, decent quality |
| **DeepSeek V4 Flash** | ~$0.0003 | 2-3s | Good | Lower-cost alternative |
| **GPT-4o-mini** | ~$0.001 | 1-2s | Excellent | Best quality/latency |
| **Gemini 2.0 Flash** | ~$0.0002 | 1-2s | Good | Cheapest + fastest |
| **Claude Haiku 4.5** | ~$0.001 | 2-3s | Excellent | Best reasoning quality |
| **Local Ollama** (if running) | $0 | 5-15s | Varies | Zero-cost offline option |

### Cost Estimate

- Average diagnostic prompt: ~800 tokens input, ~100 tokens output
- Expected frequency: ~1-5 incidents per day that reach Tier 3
- **Monthly cost**: $0.10 - $1.50 depending on model and usage. This range is conservative: 30 to 150 calls per month at the per-diagnosis prices in the table above works out to about $0.01 to $0.15.

---

## 7. Watchdog Response Flow

```
Failure Detected
       │
       ▼
┌──────────────┐   YES    ┌──────────────────┐
│ Tier 1 Rule? ├─────────►│ Execute Action   │
│ (known)      │          │ Log incident     │
└──────┬───────┘          └──────────────────┘
       │ NO
       ▼
┌──────────────┐   YES    ┌──────────────────┐
│ Tier 2 Match?├─────────►│ Apply Known Fix  │
│ (incident DB)│          │ Update success   │
└──────┬───────┘          └──────────────────┘
       │ NO
       ▼
┌──────────────┐   YES    ┌──────────────────┐
│ AI Enabled?  ├─────────►│ Collect Context  │
│ (Tier 3)     │          │ Build Prompt     │
└──────┬───────┘          │ Call AI Model    │
       │ NO               │ Parse Response   │
       ▼                  │ Execute if auto  │
┌──────────────┐          │ Store incident   │
│ Alert User   │          └──────────────────┘
│ (can't fix)  │
└──────────────┘
```

---

## 8. Safety Guards

1. **Rate limit AI calls**: max 1 Tier 3 call per 60 seconds, max 10 per day
2. **Never auto-execute destructive actions**: use `alert_user` instead for deleting files, changing API keys, or modifying source code
3. **Auto-restart cap**: max 5 proxy restarts per 10 minutes, then alert user
4. **Cost cap**: monthly AI diagnostic budget (configurable, default $2/month)
5. **Cooldown per pattern**: the same failure pattern gets an escalating cooldown (30s → 60s → 300s → alert)
6. **User override**: any auto-action can be cancelled within 3 seconds via GUI
7. **Incident store max size**: 500 entries, LRU eviction
8. **Health check bypass**: if the user manually stopped the proxy, don't alert

---

## 9. Implementation Plan

### Phase 1: Core Watchdog (v3.8.0)

- `HealthWatcher` thread in `codex-launcher-gui`
- `LogAnalyzer` thread tailing `cc-debug.log` and `proxy.log`
- Tier 1 rule engine covering the full rule table in Section 3
- Incident store (JSON file)
- GUI toggle (ON/OFF) in settings
- Auto-restart proxy on crash

### Phase 2: Pattern Learning (v3.8.1)

- Tier 2 incident store lookup
- Auto-learn from Intelligence Routing outcomes
- Success rate tracking per pattern
- Incident log viewer in GUI

### Phase 3: AI Diagnostic Agent (v3.9.0)

- Tier 3 AI model integration
- Provider/model selector in GUI
- Diagnostic prompt template
- Cost tracking
- Full incident report export

### Phase 4: Advanced Recovery (v4.0.0)

- Auto-switch to backup provider on repeated failure
- BGP route health monitoring
- Predictive failure detection (memory growth, latency trends)
- Codex process memory monitoring
- WebSocket reconnect assistance

---

## 10. File Changes Summary

| File | Changes |
|------|---------|
| `codex-launcher-gui` | +HealthWatcher thread, +LogAnalyzer thread, +AI Monitoring panel, +incident log viewer |
| `translate-proxy.py` | +`/monitoring` endpoint (returns health + metrics), enhanced `/health` with memory/uptime |
| `~/.cache/codex-proxy/incident-store.json` | New file: incident pattern database |
| `~/.cache/codex-proxy/monitoring.log` | New file: watchdog activity log |
| `~/.codex/endpoints.json` | +`ai_monitoring` config section |
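
Both memory rules in this spec (proxy RSS over 1 GB in A3, Codex RSS over 4 GB in D2) rely on a `/proc/{pid}/status` VmRSS check, and the enhanced `/health` endpoint is meant to report memory. The sketch below shows one way that check could be written on Linux. The function names and the `max_memory_mb` default (borrowed from the `ai_monitoring` config) are illustrative assumptions, not existing launcher code.

```python
import re

# Matches a line such as "VmRSS:\t  123456 kB" in /proc/<pid>/status.
_VMRSS_RE = re.compile(r"^VmRSS:\s+(\d+)\s+kB", re.MULTILINE)


def parse_vmrss_mb(status_text):
    """Extract resident set size in MB from /proc/<pid>/status text, or None."""
    match = _VMRSS_RE.search(status_text)
    if match is None:
        return None  # some processes (for example kernel threads) have no VmRSS line
    return int(match.group(1)) / 1024.0


def process_rss_mb(pid):
    """Read RSS for a live process; None if it is gone or unreadable (Linux only)."""
    try:
        with open(f"/proc/{pid}/status", "r", encoding="utf-8") as fh:
            return parse_vmrss_mb(fh.read())
    except OSError:
        return None


def over_limit(pid, max_memory_mb=1024):
    """True only when RSS could be read and exceeds the threshold."""
    rss = process_rss_mb(pid)
    return rss is not None and rss > max_memory_mb
```

Splitting parsing from file access keeps the threshold logic testable without a live process, and a missing process returns `None` rather than raising, so the caller can treat "process gone" (D1) separately from "process too large".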