Files

admin 4334540f33 docs: AI Monitoring design spec v3.8.0 — self-healing watchdog with 3-tier response system

2026-05-22 22:22:30 +04:00

30 KiB

Raw Permalink Blame History

AI Monitoring — Design Specification

Codex Launcher v3.8.0 Feature Design Self-healing nano agent that monitors proxy health, diagnoses failures, and auto-recovers sessions.

1. Problem Statement

Over 42 sessions in production, we observed these failure categories:

#	Failure Category	Count	Example
F1	parsed_tool_calls=0 — model produces unparseable output	42	Bare `<explore_agent>`, `<bash>` without cmd, plain English intent
F2	Stuck recovery triggered — Intelligence Routing Layer 3	13	"I need to fetch the README", "let me write the script"
F3	Sanitizer flagged suspicious cmd — cmd still JSON after unwrap	11	`{/'cmd/': /'sshpass -p .../'}` — double-escaped quoting
F4	Upstream 500 — provider internal error	~5	`"An internal error occurred. Please try again later."`
F5	Connection timeout — upstream unreachable	~3	`Connection timed out after 15002 milliseconds`
F6	Upstream 401/403 — auth failure	~2	Wrong API key, expired token, `upgrade_required`
F7	Stream crash — exception mid-stream	~2	`BrokenPipeError`, `ConnectionResetError` during SSE
F8	Proxy port conflict — Address already in use	~1	Stale process holding port
F9	Schema cache corruption — stale content_type=array	~1	`ErrorAnalyzer` learned wrong schema
F10	Codex Desktop crash — SIGKILL at ~27GB	~1	Issue #24048 — unbounded tool output memory
F11	Codex 300s stall — turn state machine race	~1	Issue #23807 — `stream disconnected` after 300s

The Gap

Intelligence Routing (v3.7.0) handles F1/F2/F3 inside a single request. But it can't:

Detect a dead proxy process (F7/F8) — the proxy already crashed
Reconnect Codex to a restarted proxy (F5/F7/F8) — Codex doesn't auto-reconnect
Switch to a backup provider when the primary is down (F4/F5)
Clear corrupt caches (F9) — requires out-of-band action
Restart Codex Desktop after a crash (F10/F11)
Learn from failure patterns across sessions — each failure is handled independently

What We Need

A separate lightweight watchdog process that:

Monitors proxy health continuously
Detects failures the proxy can't detect itself
Uses a cheap AI model to diagnose novel failures
Takes corrective action automatically
Learns from past incidents to prevent repeats

2. Architecture

┌─────────────────────────────────────────────────────────────────────┐
│                        Codex Launcher GUI                            │
│  ┌──────────┐  ┌──────────────┐  ┌───────────────────────────────┐ │
│  │  Proxy   │  │   Codex      │  │   AI Monitoring Panel         │ │
│  │  Manager │  │   Launcher   │  │   ┌─────────────────────┐     │ │
│  │          │  │              │  │   │ ON/OFF Toggle        │     │ │
│  └────┬─────┘  └──────┬───────┘  │   │ Provider Selector    │     │ │
│       │               │          │   │ Model Selector        │     │ │
│       │               │          │   │ Incident Log          │     │ │
│       │               │          │   │ [View Diagnostics]    │     │ │
│       │               │          │   └─────────────────────┘     │ │
│       │               │          └───────────────────────────────┘ │
└───────┼───────────────┼────────────────────────────────────────────┘
        │               │
        ▼               ▼
┌───────────────┐  ┌────────────────┐
│ translate-    │  │  Codex Desktop  │
│ proxy.py      │  │  / CLI          │
│ (port 8080)   │  │                 │
│               │  │                 │
│ /health ──────┼──┼─► health check  │
│ /responses ───┼──┼─► main API      │
└───────────────┘  └────────────────┘
        ▲
        │ health probes + log analysis + corrective actions
        │
┌───────┴────────────────────────────────────────────────────────────┐
│                     AI Monitor Watchdog                             │
│                    (thread in codex-launcher-gui)                   │
│                                                                     │
│  ┌─────────────────┐  ┌─────────────────┐  ┌──────────────────┐  │
│  │  Health Watcher  │  │  Log Analyzer   │  │  AI Diagnostic   │  │
│  │  (every 5s)      │  │  (continuous)    │  │  Agent (on-call) │  │
│  │                  │  │                  │  │                  │  │
│  │  - /health probe │  │  - tail cc-debug │  │  - Classify err  │  │
│  │  - process alive │  │  - tail proxy.log│  │  - Root cause    │  │
│  │  - port check    │  │  - pattern match │  │  - Suggest fix   │  │
│  │  - memory watch  │  │  - incident DB   │  │  - Execute fix   │  │
│  └────────┬────────┘  └────────┬────────┘  └────────┬─────────┘  │
│           │                    │                     │             │
│           └────────────────────┼─────────────────────┘             │
│                                ▼                                   │
│                    ┌──────────────────────┐                        │
│                    │  Incident Store      │                        │
│                    │  (JSON file)         │                        │
│                    │  - Known patterns    │                        │
│                    │  - Past resolutions  │                        │
│                    │  - Success rates     │                        │
│                    └──────────────────────┘                        │
└─────────────────────────────────────────────────────────────────────┘

3. Three-Tier Response System

Tier 1: Fast Path — Rule-Based Auto-Recovery (< 1 second)

Immediate reactions to known failure patterns. No AI needed.

TIER1_RULES = [
    # (trigger_pattern, action, cooldown)
    
    # --- Proxy Health ---
    ("proxy_health_fail",      "restart_proxy",           30),
    ("proxy_port_conflict",    "kill_stale + restart",     60),
    ("proxy_memory_over_1gb",  "restart_proxy",           120),
    
    # --- Upstream Errors ---
    ("upstream_429",           "wait_retry_after",          0),
    ("upstream_502_503",       "retry_with_backoff",       30),
    ("upstream_500_repeat_3x", "switch_provider",          60),
    ("upstream_timeout",       "retry + increase_timeout", 30),
    ("upstream_401_403",       "alert_user_bad_key",        0),
    
    # --- Stream Errors ---
    ("stream_broken_pipe",     "restart_proxy",            30),
    ("stream_reset",           "restart_proxy",            30),
    ("stream_idle_300s",       "restart_proxy",            60),
    
    # --- Parser Failures ---
    ("parsed_tool_calls_0_x3", "clear_schema_cache",      300),
    ("sanitizer_suspicious_5x","alert_user_model_issue",    0),
    ("stuck_recovery_x5",      "suggest_switch_model",      0),
    
    # --- Codex Process ---
    ("codex_process_dead",     "alert_user_restart",         0),
    ("codex_memory_over_4gb",  "alert_user_memory",          0),
    
    # --- Cache Corruption ---
    ("schema_content_type_array", "delete_provider_caps",     0),
]

Tier 2: Pattern Matching — Incident Store Lookup (< 100ms)

For failures we've seen before and resolved, look up the fix:

{
  "incidents": [
    {
      "pattern": "cc_stream_ended_empty + explore_agent + no_url",
      "fix": "synth_explore_from_last_user_urls",
      "source": "FIX-23",
      "success_rate": 0.85,
      "last_seen": "2026-05-22T16:00:00Z",
      "occurrences": 5
    },
    {
      "pattern": "require_escalation + no_cmd",
      "fix": "auto_proceed_echo",
      "source": "FIX-24",
      "success_rate": 1.0,
      "last_seen": "2026-05-22T15:30:00Z",
      "occurrences": 3
    }
  ]
}

Tier 3: AI Diagnostic — Nano Agent (2-5 seconds)

For novel failures that don't match any rule or pattern, invoke a cheap AI model:

Prompt Template (system):
─────────────────────
You are a diagnostic agent for a translation proxy that sits between
OpenAI Codex CLI/Desktop and AI providers (Command Code, OpenAI-compat,
Anthropic, etc.). You analyze error context and suggest ONE corrective action.

Available actions: restart_proxy, kill_stale_processes, clear_schema_cache,
switch_provider, increase_timeout, alert_user, ignore, retry_now,
regenerate_config, cleanup_codex_stale

Respond with ONLY a JSON object: {"action": "...", "reason": "...", "confidence": 0.0-1.0}

Prompt Template (user):
─────────────────────
INCIDENT REPORT:
Time: {timestamp}
Session: {session_id}
Proxy health: {alive/dead, port, uptime, memory_mb}
Upstream: {url, model, last_http_code, last_error}
Recent errors (last 60s):
{log_lines}
Parser state: {parsed_tool_calls, stuck_recovery_count, sanitizer_flags}
Provider: {backend_type, model}
History: {last_5_incidents_for_this_pattern}

What corrective action should be taken?

4. Complete Failure Catalog

Category A: Proxy-Level Failures (watchdog detects, auto-recovers)

ID	Failure	Symptoms	Tier 1 Action	Log Signature
A1	Proxy process crashed	`/health` returns connection refused	`restart_proxy`	`urllib.error.URLError: [Errno 111] Connection refused`
A2	Port conflict	`Address already in use` on startup	`kill_stale + restart`	`OSError: [Errno 98] Address already in use`
A3	Memory leak	Process RSS > 1GB	`restart_proxy`	`/proc/{pid}/status` VmRSS check
A4	Deadlock	Health check hangs > 15s	`restart_proxy`	health probe timeout
A5	Unhandled exception	Process exits with non-zero	`restart_proxy`	`SELF-REVIVE CRASH #{n}`
A6	SSL/TLS error	`CERTIFICATE_VERIFY_FAILED` upstream	`alert_user`	`urllib.error.URLError: certificate verify failed`
A7	DNS resolution failure	`getaddrinfo failed`	`retry_with_backoff`	`socket.gaierror: Name or service not known`

Category B: Upstream Provider Failures (proxy detects, watchdog analyzes)

ID	Failure	Symptoms	Tier 1 Action	Log Signature
B1	Rate limit (429)	Too many requests	`wait_retry_after`	`HTTP 429` + `Retry-After` header
B2	Server error (5xx)	Provider down	`retry_with_backoff`	`HTTP 500/502/503`
B3	Auth failure (401/403)	Bad/expired key	`alert_user_bad_key`	`HTTP 401 {"error":"invalid_api_key"}`
B4	CC upgrade required (403)	Version mismatch	`update_cc_version`	`HTTP 403 upgrade_required`
B5	Connection timeout	Upstream silent	`retry + increase_timeout`	`urllib.error.URLError: timed out`
B6	Connection reset	Upstream dropped mid-stream	`restart_proxy`	`ConnectionResetError: Connection reset by peer`
B7	Broken pipe	Client disconnected	`ignore`	`BrokenPipeError: Broken pipe`
B8	Upstream 400 bad request	Malformed request	`clear_schema_cache`	`HTTP 400 {"error":"...expected string..."}`
B9	Provider capacity (503)	Overloaded	`switch_provider`	`HTTP 503` after 3 retries
B10	Cloudflare block (403/1010)	Bot detection	`check_browser_ua`	`HTTP 403 error 1010`

Category C: Parser/Format Failures (Intelligence Routing handles, watchdog tracks)

ID	Failure	Symptoms	Auto-Fix (IR Layer)	Watchdog Escalation
C1	Bare `<explore_agent>`	`parsed_tool_calls=0`	Layer 1: URL extraction	If 3x in a row → suggest model switch
C2	`<require_escalation>` block	Model wants permissions	Layer 2: Auto-proceed	If 5x → suggest different provider
C3	Unrecognized format	No parser matches	Layer 3: Intent synthesis	If 5x → log for AI diagnosis
C4	Double-wrapped cmd	`cmd = "{\"cmd\": ...}"`	Sanitizer: unwrap	If cmd still JSON → alert
C5	Suspicious cmd (JSON)	`cmd starts with {`	Sanitizer: flag	If 3x → clear cache + restart
C6	Empty cmd	`cmd = ""` or `cmd = "{}"`	Sanitizer: diagnostic echo	If 3x → suggest model switch
C7	Bare `{` token	Model outputs incomplete JSON	Layer 3: heuristic 5	If persistent → AI diagnosis
C8	`<bash>` without cmd	Block has sandbox but no command	Layer 3: heuristic	If 3x → AI diagnosis
C9	DSML name mismatch	`name="cmd"` vs `name="command"`	DSML parser handles both	Self-test catches regression
C10	Stuck model loop	Same recovery 5+ times	Layer 3 max 3x then alert	Switch model or provider

Category D: Codex Process Failures (watchdog detects, alerts user)

ID	Failure	Symptoms	Action	Log Signature
D1	Codex process killed	PID gone from pids.json	`alert_user_restart`	Process not in `/proc/{pid}`
D2	Codex memory explosion	RSS > 4GB	`alert_user_memory`	`/proc/{pid}/status` check
D3	Codex 300s stall	`stream disconnected` loop	`restart_proxy`	Codex stderr: `stream disconnected`
D4	Config corruption	`database disk image is malformed`	`regenerate_config`	Codex stderr: `malformed`
D5	Session context overflow	`context_length_exceeded`	`alert_user_context`	Codex stderr: `context_length_exceeded`
D6	WebSocket reconnect loop	`Reconnecting... N/5`	`check_proxy_health`	Codex stderr: `Reconnecting`

Category E: Config/State Failures (watchdog detects, auto-fixes)

ID	Failure	Symptoms	Action	Detection
E1	Schema cache corruption	`content_type: "array"` in provider-caps.json	`delete_provider_caps`	Read file, check for known-bad values
E2	Stale PID file	pids.json has dead PIDs	`cleanup_pids`	Check `/proc/{pid}` existence
E3	Port from old session	config.toml has stale port	`regenerate_config`	Port in config != running port
E4	OAuth token expired	Google/Gemini token refresh fails	`alert_user_reauth`	Token file `expiry_ts < now`
E5	BGP all routes down	Every route returned error	`alert_user_no_provider`	All routes in cooldown

5. Component Design

5.1 Health Watcher Thread

Runs in the GUI process as a background thread. Pings proxy /health endpoint every 5 seconds.

class HealthWatcher(threading.Thread):
    def __init__(self, proxy_port, on_failure, on_recovery):
        super().__init__(daemon=True)
        self.proxy_port = proxy_port
        self.on_failure = on_failure
        self.on_recovery = on_recovery
        self.check_interval = 5  # seconds
        self.failures = 0
        self.running = True
    
    def run(self):
        while self.running:
            healthy = self._check_health()
            if healthy:
                if self.failures > 0:
                    self.failures = 0
                    self.on_recovery()
            else:
                self.failures += 1
                if self.failures >= 3:  # 15s of consecutive failures
                    self.on_failure(self.failures)
            time.sleep(self.check_interval)
    
    def _check_health(self):
        try:
            req = urllib.request.Request(f"http://localhost:{self.proxy_port}/health")
            resp = urllib.request.urlopen(req, timeout=5)
            return resp.status == 200
        except Exception:
            return False

5.2 Log Analyzer Thread

Tails the debug log and extracts failure signals in real-time.

FAILURE_SIGNALS = {
    "parsed_tool_calls=0":      ("C1", "parser_empty"),
    "[STUCK-RECOVERY]":         ("C3", "stuck_recovery"),
    "suspicious cmd":           ("C4", "sanitizer_flag"),
    "empty cmd recovered":      ("C6", "empty_cmd"),
    "HTTP 429":                 ("B1", "rate_limited"),
    "HTTP 500":                 ("B2", "server_error"),
    "HTTP 401":                 ("B3", "auth_failure"),
    "HTTP 403":                 ("B4", "forbidden"),
    "Connection refused":       ("A1", "proxy_dead"),
    "Address already in use":   ("A2", "port_conflict"),
    "Broken pipe":              ("B7", "broken_pipe"),
    "Connection reset":         ("B6", "connection_reset"),
    "timed out":                ("B5", "timeout"),
    "SELF-REVIVE CRASH":        ("A5", "proxy_crash"),
    "stream error":             ("B6", "stream_error"),
}

class LogAnalyzer(threading.Thread):
    def __init__(self, log_path, on_signal):
        super().__init__(daemon=True)
        self.log_path = log_path
        self.on_signal = on_signal
        self.running = True
    
    def run(self):
        fh = open(self.log_path, "r")
        fh.seek(0, 2)  # seek to end
        while self.running:
            line = fh.readline()
            if not line:
                time.sleep(0.5)
                continue
            for pattern, (fault_id, category) in FAILURE_SIGNALS.items():
                if pattern in line:
                    self.on_signal(fault_id, category, line.strip())
                    break

5.3 AI Diagnostic Agent

Invoked by the watchdog when a failure doesn't match Tier 1 rules or Tier 2 patterns.

class AIDiagnosticAgent:
    def __init__(self, provider_url, model, api_key):
        self.provider_url = provider_url
        self.model = model
        self.api_key = api_key
        self.system_prompt = DIAGNOSTIC_SYSTEM_PROMPT  # defined below
        self.incident_store = IncidentStore()
    
    def diagnose(self, context):
        # Tier 2: Check incident store first
        pattern = self._extract_pattern(context)
        known_fix = self.incident_store.lookup(pattern)
        if known_fix and known_fix["success_rate"] > 0.7:
            return known_fix["fix"], "tier2_pattern", known_fix["success_rate"]
        
        # Tier 3: Ask AI
        prompt = self._build_prompt(context)
        response = self._call_model(prompt)
        action = self._parse_response(response)
        
        # Learn from this incident
        if action:
            self.incident_store.record(pattern, action)
        
        return action, "tier3_ai", None
    
    def _call_model(self, prompt):
        body = {
            "model": self.model,
            "messages": [
                {"role": "system", "content": self.system_prompt},
                {"role": "user", "content": prompt}
            ],
            "max_tokens": 200,
            "temperature": 0.1,
        }
        req = urllib.request.Request(
            self.provider_url,
            data=json.dumps(body).encode(),
            headers={
                "Content-Type": "application/json",
                "Authorization": f"Bearer {self.api_key}",
            }
        )
        resp = urllib.request.urlopen(req, timeout=15)
        return json.loads(resp.read())["choices"][0]["message"]["content"]

5.4 Incident Store

JSON file that accumulates failure patterns and their resolutions.

{
  "version": 1,
  "incidents": {
    "parser_empty+explore_agent": {
      "fault_ids": ["C1"],
      "fix": "synth_explore_from_urls",
      "source": "intelligent_routing",
      "success_count": 8,
      "fail_count": 1,
      "last_seen": "2026-05-22T16:00:00Z",
      "auto_applied": true
    },
    "server_error+repeat_3x": {
      "fault_ids": ["B2"],
      "fix": "switch_provider",
      "source": "tier1_rule",
      "success_count": 2,
      "fail_count": 0,
      "last_seen": "2026-05-22T14:00:00Z",
      "auto_applied": true
    }
  },
  "ai_diagnostic_calls": 0,
  "tokens_used": 0,
  "cost_usd": 0.0
}

5.5 Diagnostic Agent System Prompt

You are a diagnostic agent for "Codex Launcher" — a desktop app that runs a local
translation proxy between OpenAI Codex CLI/Desktop and various AI providers.

## Your Job
Analyze the incident report and recommend ONE corrective action.

## Available Actions
- restart_proxy: Kill and restart translate-proxy.py
- kill_stale_processes: Kill orphaned proxy/codex processes
- clear_schema_cache: Delete ~/.cache/codex-proxy/provider-caps.json
- switch_provider: Switch to a different configured endpoint
- increase_timeout: Increase upstream timeout for slow providers
- regenerate_config: Regenerate Codex config.toml
- cleanup_codex_stale: Run cleanup-codex-stale.sh
- alert_user: Show notification to user (can't auto-fix)
- ignore: Transient error, no action needed
- retry_now: Immediate retry without changes

## Decision Rules
- If upstream returns 401/403 with auth error → alert_user (can't fix bad keys)
- If proxy process is dead → restart_proxy
- If same error repeated 5+ times → switch_provider or alert_user
- If error is about content_type/schema → clear_schema_cache
- If "Address already in use" → kill_stale_processes then restart_proxy
- If timeout and upstream is slow → increase_timeout
- If single transient 429/502/503 → ignore (retry handles it)
- If "stream disconnected" and proxy is healthy → ignore (Codex retries)

## Response Format
Reply with ONLY a JSON object:
{"action": "...", "reason": "...", "confidence": 0.0-1.0}

No explanation, no markdown, no extra text.

6. GUI Integration

AI Monitoring Panel (in Settings tab)

┌─────────────────────────────────────────────────────────┐
│  AI Monitoring                                    [ON]  │
│                                                          │
│  ┌─ Diagnostic Agent ─────────────────────────────────┐ │
│  │ Provider: [OpenCode Zen          ▼]                │ │
│  │ Model:    [Qwen3-32B              ▼]                │ │
│  │ API Key:  [sk-•••••••••••••••••••• ]                │ │
│  │                                                     │ │
│  │ Cost this month: $0.12 (3 diagnostic calls)         │ │
│  │ Tokens used: 1,847 input / 423 output               │ │
│  └─────────────────────────────────────────────────────┘ │
│                                                          │
│  ┌─ Incident Log (last 7 days) ──────────────────────┐  │
│  │ ✅ 16:00 F1 parser_empty → synth_explore (Tier 2) │  │
│  │ ⚠️ 15:30 B2 server_error → retry (Tier 1)         │  │
│  │ ✅ 15:00 A1 proxy_dead → restart_proxy (Tier 1)    │  │
│  │ 🤖 14:30 C3 novel_format → clear_cache (Tier 3)   │  │
│  │ ...                                               │  │
│  └────────────────────────────────────────────────────┘  │
│                                                          │
│  [View Full Diagnostics]  [Export Incident Report]       │
└─────────────────────────────────────────────────────────┘

Config Storage (in endpoints.json)

{
  "ai_monitoring": {
    "enabled": true,
    "provider_url": "https://opencode.ai/zen/v1/chat/completions",
    "model": "Qwen/Qwen3-32B",
    "api_key": "sk-...",
    "tier1_enabled": true,
    "tier2_enabled": true,
    "tier3_enabled": true,
    "auto_restart_proxy": true,
    "auto_switch_provider": false,
    "health_check_interval_s": 5,
    "max_memory_mb": 1024,
    "notification_level": "important_only"
  }
}

Recommended Models (by cost)

Model	Cost/Diagnosis	Latency	Quality	Recommended For
Qwen3-32B (OpenCode)	~$0.0005	2-4s	Good	Default — cheapest decent model
DeepSeek V4 Flash	~$0.0003	2-3s	Good	Cheapest option
GPT-4o-mini	~$0.001	1-2s	Excellent	Best quality/latency
Gemini 2.0 Flash	~$0.0002	1-2s	Good	Cheapest + fastest
Claude Haiku 4.5	~$0.001	2-3s	Excellent	Best reasoning quality
Local Ollama (if running)	$0	5-15s	Varies	Zero-cost offline option

Cost Estimate

Average diagnostic prompt: ~800 tokens input, ~100 tokens output
Expected frequency: ~1-5 incidents per day that reach Tier 3
Monthly cost: $0.10 - $1.50 depending on model and usage

7. Watchdog Response Flow

Failure Detected
      │
      ▼
┌─────────────┐    YES    ┌──────────────────┐
│ Tier 1 Rule? ├─────────►│ Execute Action    │
│ (known)      │           │ Log incident      │
└──────┬───────┘           └──────────────────┘
       │ NO
       ▼
┌─────────────┐    YES    ┌──────────────────┐
│ Tier 2 Match?├─────────►│ Apply Known Fix   │
│ (incident DB)│           │ Update success    │
└──────┬───────┘           └──────────────────┘
       │ NO
       ▼
┌─────────────┐   YES     ┌──────────────────┐
│ AI Enabled?  ├─────────►│ Collect Context   │
│ (Tier 3)     │           │ Build Prompt      │
└──────┬───────┘           │ Call AI Model     │
       │ NO                │ Parse Response    │
       ▼                   │ Execute if auto   │
┌─────────────┐           │ Store incident    │
│ Alert User   │           └──────────────────┘
│ (can't fix)  │
└─────────────┘

8. Safety Guards

Rate limit AI calls — max 1 Tier 3 call per 60 seconds, max 10 per day
Never auto-execute destructive actions — alert_user for: delete files, change API keys, modify source code
Auto-restart cap — max 5 proxy restarts per 10 minutes, then alert user
Cost cap — monthly AI diagnostic budget (configurable, default $2/month)
Cooldown per pattern — same failure pattern has escalating cooldown (30s → 60s → 300s → alert)
User override — any auto-action can be cancelled within 3 seconds via GUI
Incident store max size — 500 entries, LRU eviction
Health check bypass — if user manually stopped proxy, don't alert

9. Implementation Plan

Phase 1: Core Watchdog (v3.8.0)

HealthWatcher thread in codex-launcher-gui
LogAnalyzer thread tailing cc-debug.log and proxy.log
Tier 1 rule engine with all 20+ rules
Incident store (JSON file)
GUI toggle (ON/OFF) in settings
Auto-restart proxy on crash

Phase 2: Pattern Learning (v3.8.1)

Tier 2 incident store lookup
Auto-learn from Intelligence Routing outcomes
Success rate tracking per pattern
Incident log viewer in GUI

Phase 3: AI Diagnostic Agent (v3.9.0)

Tier 3 AI model integration
Provider/model selector in GUI
Diagnostic prompt template
Cost tracking
Full incident report export

Phase 4: Advanced Recovery (v4.0.0)

Auto-switch to backup provider on repeated failure
BGP route health monitoring
Predictive failure detection (memory growth, latency trends)
Codex process memory monitoring
WebSocket reconnect assistance

10. File Changes Summary

File	Changes
`codex-launcher-gui`	+HealthWatcher thread, +LogAnalyzer thread, +AI Monitoring panel, +incident log viewer
`translate-proxy.py`	+`/monitoring` endpoint (returns health + metrics), enhanced `/health` with memory/uptime
`~/.cache/codex-proxy/incident-store.json`	New file — incident pattern database
`~/.cache/codex-proxy/monitoring.log`	New file — watchdog activity log
`~/.codex/endpoints.json`	+`ai_monitoring` config section

30 KiB Raw Permalink Blame History