Files
Codex-Launcher---Any-AI-Por…/AI-MONITORING-DESIGN.md

30 KiB

AI Monitoring — Design Specification

Codex Launcher v3.8.0 Feature Design Self-healing nano agent that monitors proxy health, diagnoses failures, and auto-recovers sessions.


1. Problem Statement

Over 42 sessions in production, we observed these failure categories:

# Failure Category Count Example
F1 parsed_tool_calls=0 — model produces unparseable output 42 Bare <explore_agent>, <bash> without cmd, plain English intent
F2 Stuck recovery triggered — Intelligence Routing Layer 3 13 "I need to fetch the README", "let me write the script"
F3 Sanitizer flagged suspicious cmd — cmd still JSON after unwrap 11 {/'cmd/': /'sshpass -p .../'} — double-escaped quoting
F4 Upstream 500 — provider internal error ~5 "An internal error occurred. Please try again later."
F5 Connection timeout — upstream unreachable ~3 Connection timed out after 15002 milliseconds
F6 Upstream 401/403 — auth failure ~2 Wrong API key, expired token, upgrade_required
F7 Stream crash — exception mid-stream ~2 BrokenPipeError, ConnectionResetError during SSE
F8 Proxy port conflict — Address already in use ~1 Stale process holding port
F9 Schema cache corruption — stale content_type=array ~1 ErrorAnalyzer learned wrong schema
F10 Codex Desktop crash — SIGKILL at ~27GB ~1 Issue #24048 — unbounded tool output memory
F11 Codex 300s stall — turn state machine race ~1 Issue #23807 — stream disconnected after 300s

The Gap

Intelligence Routing (v3.7.0) handles F1/F2/F3 inside a single request. But it can't:

  • Detect a dead proxy process (F7/F8) — the proxy already crashed
  • Reconnect Codex to a restarted proxy (F5/F7/F8) — Codex doesn't auto-reconnect
  • Switch to a backup provider when the primary is down (F4/F5)
  • Clear corrupt caches (F9) — requires out-of-band action
  • Restart Codex Desktop after a crash (F10/F11)
  • Learn from failure patterns across sessions — each failure is handled independently

What We Need

A separate lightweight watchdog process that:

  1. Monitors proxy health continuously
  2. Detects failures the proxy can't detect itself
  3. Uses a cheap AI model to diagnose novel failures
  4. Takes corrective action automatically
  5. Learns from past incidents to prevent repeats

2. Architecture

┌─────────────────────────────────────────────────────────────────────┐
│                        Codex Launcher GUI                            │
│  ┌──────────┐  ┌──────────────┐  ┌───────────────────────────────┐ │
│  │  Proxy   │  │   Codex      │  │   AI Monitoring Panel         │ │
│  │  Manager │  │   Launcher   │  │   ┌─────────────────────┐     │ │
│  │          │  │              │  │   │ ON/OFF Toggle        │     │ │
│  └────┬─────┘  └──────┬───────┘  │   │ Provider Selector    │     │ │
│       │               │          │   │ Model Selector        │     │ │
│       │               │          │   │ Incident Log          │     │ │
│       │               │          │   │ [View Diagnostics]    │     │ │
│       │               │          │   └─────────────────────┘     │ │
│       │               │          └───────────────────────────────┘ │
└───────┼───────────────┼────────────────────────────────────────────┘
        │               │
        ▼               ▼
┌───────────────┐  ┌────────────────┐
│ translate-    │  │  Codex Desktop  │
│ proxy.py      │  │  / CLI          │
│ (port 8080)   │  │                 │
│               │  │                 │
│ /health ──────┼──┼─► health check  │
│ /responses ───┼──┼─► main API      │
└───────────────┘  └────────────────┘
        ▲
        │ health probes + log analysis + corrective actions
        │
┌───────┴────────────────────────────────────────────────────────────┐
│                     AI Monitor Watchdog                             │
│                    (thread in codex-launcher-gui)                   │
│                                                                     │
│  ┌─────────────────┐  ┌─────────────────┐  ┌──────────────────┐  │
│  │  Health Watcher  │  │  Log Analyzer   │  │  AI Diagnostic   │  │
│  │  (every 5s)      │  │  (continuous)    │  │  Agent (on-call) │  │
│  │                  │  │                  │  │                  │  │
│  │  - /health probe │  │  - tail cc-debug │  │  - Classify err  │  │
│  │  - process alive │  │  - tail proxy.log│  │  - Root cause    │  │
│  │  - port check    │  │  - pattern match │  │  - Suggest fix   │  │
│  │  - memory watch  │  │  - incident DB   │  │  - Execute fix   │  │
│  └────────┬────────┘  └────────┬────────┘  └────────┬─────────┘  │
│           │                    │                     │             │
│           └────────────────────┼─────────────────────┘             │
│                                ▼                                   │
│                    ┌──────────────────────┐                        │
│                    │  Incident Store      │                        │
│                    │  (JSON file)         │                        │
│                    │  - Known patterns    │                        │
│                    │  - Past resolutions  │                        │
│                    │  - Success rates     │                        │
│                    └──────────────────────┘                        │
└─────────────────────────────────────────────────────────────────────┘

3. Three-Tier Response System

Tier 1: Fast Path — Rule-Based Auto-Recovery (< 1 second)

Immediate reactions to known failure patterns. No AI needed.

TIER1_RULES = [
    # (trigger_pattern, action, cooldown)
    
    # --- Proxy Health ---
    ("proxy_health_fail",      "restart_proxy",           30),
    ("proxy_port_conflict",    "kill_stale + restart",     60),
    ("proxy_memory_over_1gb",  "restart_proxy",           120),
    
    # --- Upstream Errors ---
    ("upstream_429",           "wait_retry_after",          0),
    ("upstream_502_503",       "retry_with_backoff",       30),
    ("upstream_500_repeat_3x", "switch_provider",          60),
    ("upstream_timeout",       "retry + increase_timeout", 30),
    ("upstream_401_403",       "alert_user_bad_key",        0),
    
    # --- Stream Errors ---
    ("stream_broken_pipe",     "restart_proxy",            30),
    ("stream_reset",           "restart_proxy",            30),
    ("stream_idle_300s",       "restart_proxy",            60),
    
    # --- Parser Failures ---
    ("parsed_tool_calls_0_x3", "clear_schema_cache",      300),
    ("sanitizer_suspicious_5x","alert_user_model_issue",    0),
    ("stuck_recovery_x5",      "suggest_switch_model",      0),
    
    # --- Codex Process ---
    ("codex_process_dead",     "alert_user_restart",         0),
    ("codex_memory_over_4gb",  "alert_user_memory",          0),
    
    # --- Cache Corruption ---
    ("schema_content_type_array", "delete_provider_caps",     0),
]

Tier 2: Pattern Matching — Incident Store Lookup (< 100ms)

For failures we've seen before and resolved, look up the fix:

{
  "incidents": [
    {
      "pattern": "cc_stream_ended_empty + explore_agent + no_url",
      "fix": "synth_explore_from_last_user_urls",
      "source": "FIX-23",
      "success_rate": 0.85,
      "last_seen": "2026-05-22T16:00:00Z",
      "occurrences": 5
    },
    {
      "pattern": "require_escalation + no_cmd",
      "fix": "auto_proceed_echo",
      "source": "FIX-24",
      "success_rate": 1.0,
      "last_seen": "2026-05-22T15:30:00Z",
      "occurrences": 3
    }
  ]
}

Tier 3: AI Diagnostic — Nano Agent (2-5 seconds)

For novel failures that don't match any rule or pattern, invoke a cheap AI model:

Prompt Template (system):
─────────────────────
You are a diagnostic agent for a translation proxy that sits between
OpenAI Codex CLI/Desktop and AI providers (Command Code, OpenAI-compat,
Anthropic, etc.). You analyze error context and suggest ONE corrective action.

Available actions: restart_proxy, kill_stale_processes, clear_schema_cache,
switch_provider, increase_timeout, alert_user, ignore, retry_now,
regenerate_config, cleanup_codex_stale

Respond with ONLY a JSON object: {"action": "...", "reason": "...", "confidence": 0.0-1.0}

Prompt Template (user):
─────────────────────
INCIDENT REPORT:
Time: {timestamp}
Session: {session_id}
Proxy health: {alive/dead, port, uptime, memory_mb}
Upstream: {url, model, last_http_code, last_error}
Recent errors (last 60s):
{log_lines}
Parser state: {parsed_tool_calls, stuck_recovery_count, sanitizer_flags}
Provider: {backend_type, model}
History: {last_5_incidents_for_this_pattern}

What corrective action should be taken?

4. Complete Failure Catalog

Category A: Proxy-Level Failures (watchdog detects, auto-recovers)

ID Failure Symptoms Tier 1 Action Log Signature
A1 Proxy process crashed /health returns connection refused restart_proxy urllib.error.URLError: [Errno 111] Connection refused
A2 Port conflict Address already in use on startup kill_stale + restart OSError: [Errno 98] Address already in use
A3 Memory leak Process RSS > 1GB restart_proxy /proc/{pid}/status VmRSS check
A4 Deadlock Health check hangs > 15s restart_proxy health probe timeout
A5 Unhandled exception Process exits with non-zero restart_proxy SELF-REVIVE CRASH #{n}
A6 SSL/TLS error CERTIFICATE_VERIFY_FAILED upstream alert_user urllib.error.URLError: certificate verify failed
A7 DNS resolution failure getaddrinfo failed retry_with_backoff socket.gaierror: Name or service not known

Category B: Upstream Provider Failures (proxy detects, watchdog analyzes)

ID Failure Symptoms Tier 1 Action Log Signature
B1 Rate limit (429) Too many requests wait_retry_after HTTP 429 + Retry-After header
B2 Server error (5xx) Provider down retry_with_backoff HTTP 500/502/503
B3 Auth failure (401/403) Bad/expired key alert_user_bad_key HTTP 401 {"error":"invalid_api_key"}
B4 CC upgrade required (403) Version mismatch update_cc_version HTTP 403 upgrade_required
B5 Connection timeout Upstream silent retry + increase_timeout urllib.error.URLError: timed out
B6 Connection reset Upstream dropped mid-stream restart_proxy ConnectionResetError: Connection reset by peer
B7 Broken pipe Client disconnected ignore BrokenPipeError: Broken pipe
B8 Upstream 400 bad request Malformed request clear_schema_cache HTTP 400 {"error":"...expected string..."}
B9 Provider capacity (503) Overloaded switch_provider HTTP 503 after 3 retries
B10 Cloudflare block (403/1010) Bot detection check_browser_ua HTTP 403 error 1010

Category C: Parser/Format Failures (Intelligence Routing handles, watchdog tracks)

ID Failure Symptoms Auto-Fix (IR Layer) Watchdog Escalation
C1 Bare <explore_agent> parsed_tool_calls=0 Layer 1: URL extraction If 3x in a row → suggest model switch
C2 <require_escalation> block Model wants permissions Layer 2: Auto-proceed If 5x → suggest different provider
C3 Unrecognized format No parser matches Layer 3: Intent synthesis If 5x → log for AI diagnosis
C4 Double-wrapped cmd cmd = "{\"cmd\": ...}" Sanitizer: unwrap If cmd still JSON → alert
C5 Suspicious cmd (JSON) cmd starts with { Sanitizer: flag If 3x → clear cache + restart
C6 Empty cmd cmd = "" or cmd = "{}" Sanitizer: diagnostic echo If 3x → suggest model switch
C7 Bare { token Model outputs incomplete JSON Layer 3: heuristic 5 If persistent → AI diagnosis
C8 <bash> without cmd Block has sandbox but no command Layer 3: heuristic If 3x → AI diagnosis
C9 DSML name mismatch name="cmd" vs name="command" DSML parser handles both Self-test catches regression
C10 Stuck model loop Same recovery 5+ times Layer 3 max 3x then alert Switch model or provider

Category D: Codex Process Failures (watchdog detects, alerts user)

ID Failure Symptoms Action Log Signature
D1 Codex process killed PID gone from pids.json alert_user_restart Process not in /proc/{pid}
D2 Codex memory explosion RSS > 4GB alert_user_memory /proc/{pid}/status check
D3 Codex 300s stall stream disconnected loop restart_proxy Codex stderr: stream disconnected
D4 Config corruption database disk image is malformed regenerate_config Codex stderr: malformed
D5 Session context overflow context_length_exceeded alert_user_context Codex stderr: context_length_exceeded
D6 WebSocket reconnect loop Reconnecting... N/5 check_proxy_health Codex stderr: Reconnecting

Category E: Config/State Failures (watchdog detects, auto-fixes)

ID Failure Symptoms Action Detection
E1 Schema cache corruption content_type: "array" in provider-caps.json delete_provider_caps Read file, check for known-bad values
E2 Stale PID file pids.json has dead PIDs cleanup_pids Check /proc/{pid} existence
E3 Port from old session config.toml has stale port regenerate_config Port in config != running port
E4 OAuth token expired Google/Gemini token refresh fails alert_user_reauth Token file expiry_ts < now
E5 BGP all routes down Every route returned error alert_user_no_provider All routes in cooldown

5. Component Design

5.1 Health Watcher Thread

Runs in the GUI process as a background thread. Pings proxy /health endpoint every 5 seconds.

class HealthWatcher(threading.Thread):
    def __init__(self, proxy_port, on_failure, on_recovery):
        super().__init__(daemon=True)
        self.proxy_port = proxy_port
        self.on_failure = on_failure
        self.on_recovery = on_recovery
        self.check_interval = 5  # seconds
        self.failures = 0
        self.running = True
    
    def run(self):
        while self.running:
            healthy = self._check_health()
            if healthy:
                if self.failures > 0:
                    self.failures = 0
                    self.on_recovery()
            else:
                self.failures += 1
                if self.failures >= 3:  # 15s of consecutive failures
                    self.on_failure(self.failures)
            time.sleep(self.check_interval)
    
    def _check_health(self):
        try:
            req = urllib.request.Request(f"http://localhost:{self.proxy_port}/health")
            resp = urllib.request.urlopen(req, timeout=5)
            return resp.status == 200
        except Exception:
            return False

5.2 Log Analyzer Thread

Tails the debug log and extracts failure signals in real-time.

FAILURE_SIGNALS = {
    "parsed_tool_calls=0":      ("C1", "parser_empty"),
    "[STUCK-RECOVERY]":         ("C3", "stuck_recovery"),
    "suspicious cmd":           ("C4", "sanitizer_flag"),
    "empty cmd recovered":      ("C6", "empty_cmd"),
    "HTTP 429":                 ("B1", "rate_limited"),
    "HTTP 500":                 ("B2", "server_error"),
    "HTTP 401":                 ("B3", "auth_failure"),
    "HTTP 403":                 ("B4", "forbidden"),
    "Connection refused":       ("A1", "proxy_dead"),
    "Address already in use":   ("A2", "port_conflict"),
    "Broken pipe":              ("B7", "broken_pipe"),
    "Connection reset":         ("B6", "connection_reset"),
    "timed out":                ("B5", "timeout"),
    "SELF-REVIVE CRASH":        ("A5", "proxy_crash"),
    "stream error":             ("B6", "stream_error"),
}

class LogAnalyzer(threading.Thread):
    def __init__(self, log_path, on_signal):
        super().__init__(daemon=True)
        self.log_path = log_path
        self.on_signal = on_signal
        self.running = True
    
    def run(self):
        fh = open(self.log_path, "r")
        fh.seek(0, 2)  # seek to end
        while self.running:
            line = fh.readline()
            if not line:
                time.sleep(0.5)
                continue
            for pattern, (fault_id, category) in FAILURE_SIGNALS.items():
                if pattern in line:
                    self.on_signal(fault_id, category, line.strip())
                    break

5.3 AI Diagnostic Agent

Invoked by the watchdog when a failure doesn't match Tier 1 rules or Tier 2 patterns.

class AIDiagnosticAgent:
    def __init__(self, provider_url, model, api_key):
        self.provider_url = provider_url
        self.model = model
        self.api_key = api_key
        self.system_prompt = DIAGNOSTIC_SYSTEM_PROMPT  # defined below
        self.incident_store = IncidentStore()
    
    def diagnose(self, context):
        # Tier 2: Check incident store first
        pattern = self._extract_pattern(context)
        known_fix = self.incident_store.lookup(pattern)
        if known_fix and known_fix["success_rate"] > 0.7:
            return known_fix["fix"], "tier2_pattern", known_fix["success_rate"]
        
        # Tier 3: Ask AI
        prompt = self._build_prompt(context)
        response = self._call_model(prompt)
        action = self._parse_response(response)
        
        # Learn from this incident
        if action:
            self.incident_store.record(pattern, action)
        
        return action, "tier3_ai", None
    
    def _call_model(self, prompt):
        body = {
            "model": self.model,
            "messages": [
                {"role": "system", "content": self.system_prompt},
                {"role": "user", "content": prompt}
            ],
            "max_tokens": 200,
            "temperature": 0.1,
        }
        req = urllib.request.Request(
            self.provider_url,
            data=json.dumps(body).encode(),
            headers={
                "Content-Type": "application/json",
                "Authorization": f"Bearer {self.api_key}",
            }
        )
        resp = urllib.request.urlopen(req, timeout=15)
        return json.loads(resp.read())["choices"][0]["message"]["content"]

5.4 Incident Store

JSON file that accumulates failure patterns and their resolutions.

{
  "version": 1,
  "incidents": {
    "parser_empty+explore_agent": {
      "fault_ids": ["C1"],
      "fix": "synth_explore_from_urls",
      "source": "intelligent_routing",
      "success_count": 8,
      "fail_count": 1,
      "last_seen": "2026-05-22T16:00:00Z",
      "auto_applied": true
    },
    "server_error+repeat_3x": {
      "fault_ids": ["B2"],
      "fix": "switch_provider",
      "source": "tier1_rule",
      "success_count": 2,
      "fail_count": 0,
      "last_seen": "2026-05-22T14:00:00Z",
      "auto_applied": true
    }
  },
  "ai_diagnostic_calls": 0,
  "tokens_used": 0,
  "cost_usd": 0.0
}

5.5 Diagnostic Agent System Prompt

You are a diagnostic agent for "Codex Launcher" — a desktop app that runs a local
translation proxy between OpenAI Codex CLI/Desktop and various AI providers.

## Your Job
Analyze the incident report and recommend ONE corrective action.

## Available Actions
- restart_proxy: Kill and restart translate-proxy.py
- kill_stale_processes: Kill orphaned proxy/codex processes
- clear_schema_cache: Delete ~/.cache/codex-proxy/provider-caps.json
- switch_provider: Switch to a different configured endpoint
- increase_timeout: Increase upstream timeout for slow providers
- regenerate_config: Regenerate Codex config.toml
- cleanup_codex_stale: Run cleanup-codex-stale.sh
- alert_user: Show notification to user (can't auto-fix)
- ignore: Transient error, no action needed
- retry_now: Immediate retry without changes

## Decision Rules
- If upstream returns 401/403 with auth error → alert_user (can't fix bad keys)
- If proxy process is dead → restart_proxy
- If same error repeated 5+ times → switch_provider or alert_user
- If error is about content_type/schema → clear_schema_cache
- If "Address already in use" → kill_stale_processes then restart_proxy
- If timeout and upstream is slow → increase_timeout
- If single transient 429/502/503 → ignore (retry handles it)
- If "stream disconnected" and proxy is healthy → ignore (Codex retries)

## Response Format
Reply with ONLY a JSON object:
{"action": "...", "reason": "...", "confidence": 0.0-1.0}

No explanation, no markdown, no extra text.

6. GUI Integration

AI Monitoring Panel (in Settings tab)

┌─────────────────────────────────────────────────────────┐
│  AI Monitoring                                    [ON]  │
│                                                          │
│  ┌─ Diagnostic Agent ─────────────────────────────────┐ │
│  │ Provider: [OpenCode Zen          ▼]                │ │
│  │ Model:    [Qwen3-32B              ▼]                │ │
│  │ API Key:  [sk-•••••••••••••••••••• ]                │ │
│  │                                                     │ │
│  │ Cost this month: $0.12 (3 diagnostic calls)         │ │
│  │ Tokens used: 1,847 input / 423 output               │ │
│  └─────────────────────────────────────────────────────┘ │
│                                                          │
│  ┌─ Incident Log (last 7 days) ──────────────────────┐  │
│  │ ✅ 16:00 F1 parser_empty → synth_explore (Tier 2) │  │
│  │ ⚠️ 15:30 B2 server_error → retry (Tier 1)         │  │
│  │ ✅ 15:00 A1 proxy_dead → restart_proxy (Tier 1)    │  │
│  │ 🤖 14:30 C3 novel_format → clear_cache (Tier 3)   │  │
│  │ ...                                               │  │
│  └────────────────────────────────────────────────────┘  │
│                                                          │
│  [View Full Diagnostics]  [Export Incident Report]       │
└─────────────────────────────────────────────────────────┘

Config Storage (in endpoints.json)

{
  "ai_monitoring": {
    "enabled": true,
    "provider_url": "https://opencode.ai/zen/v1/chat/completions",
    "model": "Qwen/Qwen3-32B",
    "api_key": "sk-...",
    "tier1_enabled": true,
    "tier2_enabled": true,
    "tier3_enabled": true,
    "auto_restart_proxy": true,
    "auto_switch_provider": false,
    "health_check_interval_s": 5,
    "max_memory_mb": 1024,
    "notification_level": "important_only"
  }
}
Model Cost/Diagnosis Latency Quality Recommended For
Qwen3-32B (OpenCode) ~$0.0005 2-4s Good Default — cheapest decent model
DeepSeek V4 Flash ~$0.0003 2-3s Good Cheapest option
GPT-4o-mini ~$0.001 1-2s Excellent Best quality/latency
Gemini 2.0 Flash ~$0.0002 1-2s Good Cheapest + fastest
Claude Haiku 4.5 ~$0.001 2-3s Excellent Best reasoning quality
Local Ollama (if running) $0 5-15s Varies Zero-cost offline option

Cost Estimate

  • Average diagnostic prompt: ~800 tokens input, ~100 tokens output
  • Expected frequency: ~1-5 incidents per day that reach Tier 3
  • Monthly cost: $0.10 - $1.50 depending on model and usage

7. Watchdog Response Flow

Failure Detected
      │
      ▼
┌─────────────┐    YES    ┌──────────────────┐
│ Tier 1 Rule? ├─────────►│ Execute Action    │
│ (known)      │           │ Log incident      │
└──────┬───────┘           └──────────────────┘
       │ NO
       ▼
┌─────────────┐    YES    ┌──────────────────┐
│ Tier 2 Match?├─────────►│ Apply Known Fix   │
│ (incident DB)│           │ Update success    │
└──────┬───────┘           └──────────────────┘
       │ NO
       ▼
┌─────────────┐   YES     ┌──────────────────┐
│ AI Enabled?  ├─────────►│ Collect Context   │
│ (Tier 3)     │           │ Build Prompt      │
└──────┬───────┘           │ Call AI Model     │
       │ NO                │ Parse Response    │
       ▼                   │ Execute if auto   │
┌─────────────┐           │ Store incident    │
│ Alert User   │           └──────────────────┘
│ (can't fix)  │
└─────────────┘

8. Safety Guards

  1. Rate limit AI calls — max 1 Tier 3 call per 60 seconds, max 10 per day
  2. Never auto-execute destructive actionsalert_user for: delete files, change API keys, modify source code
  3. Auto-restart cap — max 5 proxy restarts per 10 minutes, then alert user
  4. Cost cap — monthly AI diagnostic budget (configurable, default $2/month)
  5. Cooldown per pattern — same failure pattern has escalating cooldown (30s → 60s → 300s → alert)
  6. User override — any auto-action can be cancelled within 3 seconds via GUI
  7. Incident store max size — 500 entries, LRU eviction
  8. Health check bypass — if user manually stopped proxy, don't alert

9. Implementation Plan

Phase 1: Core Watchdog (v3.8.0)

  • HealthWatcher thread in codex-launcher-gui
  • LogAnalyzer thread tailing cc-debug.log and proxy.log
  • Tier 1 rule engine with all 20+ rules
  • Incident store (JSON file)
  • GUI toggle (ON/OFF) in settings
  • Auto-restart proxy on crash

Phase 2: Pattern Learning (v3.8.1)

  • Tier 2 incident store lookup
  • Auto-learn from Intelligence Routing outcomes
  • Success rate tracking per pattern
  • Incident log viewer in GUI

Phase 3: AI Diagnostic Agent (v3.9.0)

  • Tier 3 AI model integration
  • Provider/model selector in GUI
  • Diagnostic prompt template
  • Cost tracking
  • Full incident report export

Phase 4: Advanced Recovery (v4.0.0)

  • Auto-switch to backup provider on repeated failure
  • BGP route health monitoring
  • Predictive failure detection (memory growth, latency trends)
  • Codex process memory monitoring
  • WebSocket reconnect assistance

10. File Changes Summary

File Changes
codex-launcher-gui +HealthWatcher thread, +LogAnalyzer thread, +AI Monitoring panel, +incident log viewer
translate-proxy.py +/monitoring endpoint (returns health + metrics), enhanced /health with memory/uptime
~/.cache/codex-proxy/incident-store.json New file — incident pattern database
~/.cache/codex-proxy/monitoring.log New file — watchdog activity log
~/.codex/endpoints.json +ai_monitoring config section