docs: AI Monitoring design spec v3.8.0 — self-healing watchdog with 3-tier response system
AI-MONITORING-DESIGN.md
# AI Monitoring: Design Specification

> **Codex Launcher v3.8.0 Feature Design**
> Self-healing nano agent that monitors proxy health, diagnoses failures, and auto-recovers sessions.

---

## 1. Problem Statement

Over 42 sessions in production, we observed these failure categories:

| # | Failure Category | Count | Example |
|---|------------------|-------|---------|
| F1 | **parsed_tool_calls=0**: model produces unparseable output | 42 | Bare `<explore_agent>`, `<bash>` without cmd, plain English intent |
| F2 | **Stuck recovery triggered**: Intelligence Routing Layer 3 | 13 | "I need to fetch the README", "let me write the script" |
| F3 | **Sanitizer flagged suspicious cmd**: cmd still JSON after unwrap | 11 | `{/'cmd/': /'sshpass -p .../'}` (double-escaped quoting) |
| F4 | **Upstream 500**: provider internal error | ~5 | `"An internal error occurred. Please try again later."` |
| F5 | **Connection timeout**: upstream unreachable | ~3 | `Connection timed out after 15002 milliseconds` |
| F6 | **Upstream 401/403**: auth failure | ~2 | Wrong API key, expired token, `upgrade_required` |
| F7 | **Stream crash**: exception mid-stream | ~2 | `BrokenPipeError`, `ConnectionResetError` during SSE |
| F8 | **Proxy port conflict**: Address already in use | ~1 | Stale process holding port |
| F9 | **Schema cache corruption**: stale content_type=array | ~1 | `ErrorAnalyzer` learned wrong schema |
| F10 | **Codex Desktop crash**: SIGKILL at ~27GB | ~1 | Issue #24048, unbounded tool output memory |
| F11 | **Codex 300s stall**: turn state machine race | ~1 | Issue #23807, `stream disconnected` after 300s |

### The Gap

Intelligence Routing (v3.7.0) handles F1/F2/F3 **inside a single request**, but it cannot:

- **Detect a dead proxy process** (F7/F8): the proxy has already crashed
- **Reconnect Codex to a restarted proxy** (F5/F7/F8): Codex does not auto-reconnect
- **Switch to a backup provider** when the primary is down (F4/F5)
- **Clear corrupt caches** (F9): this requires out-of-band action
- **Restart Codex Desktop** after a crash (F10/F11)
- **Learn from failure patterns** across sessions: each failure is handled independently

### What We Need

A **lightweight watchdog that runs outside the proxy process** (as a thread in the launcher GUI, see Section 2) and that:

1. Monitors proxy health continuously
2. Detects failures the proxy cannot detect itself
3. Uses a cheap AI model to diagnose novel failures
4. Takes corrective action automatically
5. Learns from past incidents to prevent repeats

---

## 2. Architecture

```
┌─────────────────────────────────────────────────────────────────────┐
│                         Codex Launcher GUI                          │
│  ┌──────────┐  ┌──────────────┐  ┌───────────────────────────────┐  │
│  │  Proxy   │  │    Codex     │  │     AI Monitoring Panel       │  │
│  │ Manager  │  │   Launcher   │  │  ┌─────────────────────┐      │  │
│  │          │  │              │  │  │ ON/OFF Toggle       │      │  │
│  └────┬─────┘  └──────┬───────┘  │  │ Provider Selector   │      │  │
│       │               │          │  │ Model Selector      │      │  │
│       │               │          │  │ Incident Log        │      │  │
│       │               │          │  │ [View Diagnostics]  │      │  │
│       │               │          │  └─────────────────────┘      │  │
│       │               │          └───────────────────────────────┘  │
└───────┼───────────────┼─────────────────────────────────────────────┘
        │               │
        ▼               ▼
┌───────────────┐  ┌────────────────┐
│ translate-    │  │ Codex Desktop  │
│ proxy.py      │  │ / CLI          │
│ (port 8080)   │  │                │
│               │  │                │
│ /health ──────┼──┼─► health check │
│ /responses ───┼──┼─► main API     │
└───────────────┘  └────────────────┘
        ▲
        │ health probes + log analysis + corrective actions
        │
┌───────┴─────────────────────────────────────────────────────────────┐
│                         AI Monitor Watchdog                         │
│                   (thread in codex-launcher-gui)                    │
│                                                                     │
│  ┌─────────────────┐   ┌─────────────────┐   ┌──────────────────┐   │
│  │ Health Watcher  │   │ Log Analyzer    │   │ AI Diagnostic    │   │
│  │ (every 5s)      │   │ (continuous)    │   │ Agent (on-call)  │   │
│  │                 │   │                 │   │                  │   │
│  │ - /health probe │   │ - tail cc-debug │   │ - Classify err   │   │
│  │ - process alive │   │ - tail proxy.log│   │ - Root cause     │   │
│  │ - port check    │   │ - pattern match │   │ - Suggest fix    │   │
│  │ - memory watch  │   │ - incident DB   │   │ - Execute fix    │   │
│  └────────┬────────┘   └────────┬────────┘   └────────┬─────────┘   │
│           │                     │                     │             │
│           └─────────────────────┼─────────────────────┘             │
│                                 ▼                                   │
│                      ┌──────────────────────┐                       │
│                      │  Incident Store      │                       │
│                      │  (JSON file)         │                       │
│                      │  - Known patterns    │                       │
│                      │  - Past resolutions  │                       │
│                      │  - Success rates     │                       │
│                      └──────────────────────┘                       │
└─────────────────────────────────────────────────────────────────────┘
```

---

## 3. Three-Tier Response System

### Tier 1: Fast Path - Rule-Based Auto-Recovery (< 1 second)

Immediate reactions to **known failure patterns**. No AI needed.

```python
TIER1_RULES = [
    # (trigger_pattern, action, cooldown in seconds)

    # --- Proxy Health ---
    ("proxy_health_fail",         "restart_proxy",            30),
    ("proxy_port_conflict",       "kill_stale + restart",     60),
    ("proxy_memory_over_1gb",     "restart_proxy",           120),

    # --- Upstream Errors ---
    ("upstream_429",              "wait_retry_after",          0),
    ("upstream_502_503",          "retry_with_backoff",       30),
    ("upstream_500_repeat_3x",    "switch_provider",          60),
    ("upstream_timeout",          "retry + increase_timeout", 30),
    ("upstream_401_403",          "alert_user_bad_key",        0),

    # --- Stream Errors ---
    ("stream_broken_pipe",        "restart_proxy",            30),
    ("stream_reset",              "restart_proxy",            30),
    ("stream_idle_300s",          "restart_proxy",            60),

    # --- Parser Failures ---
    ("parsed_tool_calls_0_x3",    "clear_schema_cache",      300),
    ("sanitizer_suspicious_5x",   "alert_user_model_issue",    0),
    ("stuck_recovery_x5",         "suggest_switch_model",      0),

    # --- Codex Process ---
    ("codex_process_dead",        "alert_user_restart",        0),
    ("codex_memory_over_4gb",     "alert_user_memory",         0),

    # --- Cache Corruption ---
    ("schema_content_type_array", "delete_provider_caps",      0),
]
```

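The rule table can be driven by a small lookup that also enforces the per-rule cooldown. The following is a minimal sketch, not the launcher's actual engine: the class name `Tier1Engine`, the `SUPPRESSED` marker, and the injectable `clock` are assumptions made for illustration, and only two rules from the table are repeated.

```python
import time

# Two rows copied from the rule table: (trigger, action, cooldown in seconds).
RULES = [
    ("proxy_health_fail", "restart_proxy", 30),
    ("upstream_429", "wait_retry_after", 0),
]

SUPPRESSED = "suppressed"  # hypothetical marker: rule exists but is cooling down


class Tier1Engine:
    """Maps a trigger to its action and suppresses repeats inside the cooldown."""

    def __init__(self, rules, clock=time.monotonic):
        self.rules = {trig: (action, cooldown) for trig, action, cooldown in rules}
        self.clock = clock        # injectable so the logic can be tested
        self.last_fired = {}      # trigger -> time the action last ran

    def match(self, trigger):
        """Return the action, SUPPRESSED, or None when no Tier 1 rule applies."""
        rule = self.rules.get(trigger)
        if rule is None:
            return None           # unknown trigger: fall through to Tier 2
        action, cooldown = rule
        now = self.clock()
        last = self.last_fired.get(trigger)
        if last is not None and now - last < cooldown:
            return SUPPRESSED     # same trigger fired too recently
        self.last_fired[trigger] = now
        return action
```

Returning a distinct marker for a cooling-down rule keeps a known failure from being escalated to the slower tiers just because its action ran a moment ago.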
### Tier 2: Pattern Matching - Incident Store Lookup (< 100ms)

For failures we've **seen before and resolved**, look up the fix:

```json
{
  "incidents": [
    {
      "pattern": "cc_stream_ended_empty + explore_agent + no_url",
      "fix": "synth_explore_from_last_user_urls",
      "source": "FIX-23",
      "success_rate": 0.85,
      "last_seen": "2026-05-22T16:00:00Z",
      "occurrences": 5
    },
    {
      "pattern": "require_escalation + no_cmd",
      "fix": "auto_proceed_echo",
      "source": "FIX-24",
      "success_rate": 1.0,
      "last_seen": "2026-05-22T15:30:00Z",
      "occurrences": 3
    }
  ]
}
```

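A lookup over this structure might look like the sketch below. The function name `lookup_fix` and the `flaky_pattern` entry are hypothetical; the 0.7 threshold is borrowed from the `diagnose()` sketch in Section 5.3 and is an assumption here, not a confirmed setting.

```python
import json

MIN_SUCCESS_RATE = 0.7  # assumed threshold, mirroring the check in Section 5.3


def lookup_fix(store, pattern, min_rate=MIN_SUCCESS_RATE):
    """Return the recorded fix for `pattern`, or None if absent or unreliable."""
    for incident in store.get("incidents", []):
        if incident["pattern"] == pattern and incident["success_rate"] > min_rate:
            return incident["fix"]
    return None


# Example store: the first entry comes from the JSON above, the second is made up.
store = json.loads("""
{"incidents": [
  {"pattern": "require_escalation + no_cmd", "fix": "auto_proceed_echo", "success_rate": 1.0},
  {"pattern": "flaky_pattern", "fix": "retry_now", "success_rate": 0.4}
]}
""")
```

A fix with a poor track record is treated the same as an unknown pattern, so the failure moves on to Tier 3 instead of repeating an unreliable remedy.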
### Tier 3: AI Diagnostic - Nano Agent (2-5 seconds)

For **novel failures** that don't match any rule or pattern, invoke a cheap AI model:

```
Prompt Template (system):
─────────────────────
You are a diagnostic agent for a translation proxy that sits between
OpenAI Codex CLI/Desktop and AI providers (Command Code, OpenAI-compat,
Anthropic, etc.). You analyze error context and suggest ONE corrective action.

Available actions: restart_proxy, kill_stale_processes, clear_schema_cache,
switch_provider, increase_timeout, alert_user, ignore, retry_now,
regenerate_config, cleanup_codex_stale

Respond with ONLY a JSON object: {"action": "...", "reason": "...", "confidence": 0.0-1.0}

Prompt Template (user):
─────────────────────
INCIDENT REPORT:
Time: {timestamp}
Session: {session_id}
Proxy health: {alive/dead, port, uptime, memory_mb}
Upstream: {url, model, last_http_code, last_error}
Recent errors (last 60s):
{log_lines}
Parser state: {parsed_tool_calls, stuck_recovery_count, sanitizer_flags}
Provider: {backend_type, model}
History: {last_5_incidents_for_this_pattern}

What corrective action should be taken?
```

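Because the model's reply is free text, it is worth validating before anything is executed. The sketch below checks the reply against the action list from the prompt and falls back to `alert_user` on anything unexpected. The function name `parse_diagnosis` and the 0.5 confidence floor are assumptions for illustration; the spec does not define either.

```python
import json

# Action names copied from the prompt template above.
ALLOWED_ACTIONS = {
    "restart_proxy", "kill_stale_processes", "clear_schema_cache",
    "switch_provider", "increase_timeout", "alert_user", "ignore",
    "retry_now", "regenerate_config", "cleanup_codex_stale",
}


def parse_diagnosis(text, min_confidence=0.5):
    """Validate the model reply; fall back to alert_user on anything unexpected."""
    fallback = {"action": "alert_user", "reason": "unparseable diagnosis",
                "confidence": 0.0}
    try:
        data = json.loads(text)
    except (TypeError, ValueError):      # JSONDecodeError is a ValueError
        return fallback
    action = data.get("action") if isinstance(data, dict) else None
    if not isinstance(action, str) or action not in ALLOWED_ACTIONS:
        return fallback                  # never run an action we did not offer
    confidence = data.get("confidence")
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        return fallback
    reason = str(data.get("reason", ""))
    if not (min_confidence <= confidence <= 1.0):
        # Low or out-of-range confidence: hand the decision to the user.
        return {"action": "alert_user", "reason": reason,
                "confidence": float(confidence)}
    return {"action": action, "reason": reason, "confidence": float(confidence)}
```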

---

## 4. Complete Failure Catalog

### Category A: Proxy-Level Failures (watchdog detects, auto-recovers)

| ID | Failure | Symptoms | Tier 1 Action | Log Signature |
|----|---------|----------|---------------|---------------|
| A1 | Proxy process crashed | `/health` returns connection refused | `restart_proxy` | `urllib.error.URLError: [Errno 111] Connection refused` |
| A2 | Port conflict | `Address already in use` on startup | `kill_stale + restart` | `OSError: [Errno 98] Address already in use` |
| A3 | Memory leak | Process RSS > 1GB | `restart_proxy` | `/proc/{pid}/status` VmRSS check |
| A4 | Deadlock | Health check hangs > 15s | `restart_proxy` | health probe timeout |
| A5 | Unhandled exception | Process exits with non-zero | `restart_proxy` | `SELF-REVIVE CRASH #{n}` |
| A6 | SSL/TLS error | `CERTIFICATE_VERIFY_FAILED` upstream | `alert_user` | `urllib.error.URLError: certificate verify failed` |
| A7 | DNS resolution failure | `getaddrinfo failed` | `retry_with_backoff` | `socket.gaierror: Name or service not known` |

### Category B: Upstream Provider Failures (proxy detects, watchdog analyzes)

| ID | Failure | Symptoms | Tier 1 Action | Log Signature |
|----|---------|----------|---------------|---------------|
| B1 | Rate limit (429) | Too many requests | `wait_retry_after` | `HTTP 429` + `Retry-After` header |
| B2 | Server error (5xx) | Provider down | `retry_with_backoff` | `HTTP 500/502/503` |
| B3 | Auth failure (401/403) | Bad/expired key | `alert_user_bad_key` | `HTTP 401 {"error":"invalid_api_key"}` |
| B4 | CC upgrade required (403) | Version mismatch | `update_cc_version` | `HTTP 403 upgrade_required` |
| B5 | Connection timeout | Upstream silent | `retry + increase_timeout` | `urllib.error.URLError: timed out` |
| B6 | Connection reset | Upstream dropped mid-stream | `restart_proxy` | `ConnectionResetError: Connection reset by peer` |
| B7 | Broken pipe | Client disconnected | `ignore` | `BrokenPipeError: Broken pipe` |
| B8 | Upstream 400 bad request | Malformed request | `clear_schema_cache` | `HTTP 400 {"error":"...expected string..."}` |
| B9 | Provider capacity (503) | Overloaded | `switch_provider` | `HTTP 503` after 3 retries |
| B10 | Cloudflare block (403/1010) | Bot detection | `check_browser_ua` | `HTTP 403 error 1010` |

### Category C: Parser/Format Failures (Intelligence Routing handles, watchdog tracks)

| ID | Failure | Symptoms | Auto-Fix (IR Layer) | Watchdog Escalation |
|----|---------|----------|---------------------|---------------------|
| C1 | Bare `<explore_agent>` | `parsed_tool_calls=0` | Layer 1: URL extraction | If 3x in a row → suggest model switch |
| C2 | `<require_escalation>` block | Model wants permissions | Layer 2: Auto-proceed | If 5x → suggest different provider |
| C3 | Unrecognized format | No parser matches | Layer 3: Intent synthesis | If 5x → log for AI diagnosis |
| C4 | Double-wrapped cmd | `cmd = "{\"cmd\": ...}"` | Sanitizer: unwrap | If cmd still JSON → alert |
| C5 | Suspicious cmd (JSON) | `cmd starts with {` | Sanitizer: flag | If 3x → clear cache + restart |
| C6 | Empty cmd | `cmd = ""` or `cmd = "{}"` | Sanitizer: diagnostic echo | If 3x → suggest model switch |
| C7 | Bare `{` token | Model outputs incomplete JSON | Layer 3: heuristic 5 | If persistent → AI diagnosis |
| C8 | `<bash>` without cmd | Block has sandbox but no command | Layer 3: heuristic | If 3x → AI diagnosis |
| C9 | DSML name mismatch | `name="cmd"` vs `name="command"` | DSML parser handles both | Self-test catches regression |
| C10 | Stuck model loop | Same recovery 5+ times | Layer 3 max 3x then alert | Switch model or provider |

### Category D: Codex Process Failures (watchdog detects, alerts user)

| ID | Failure | Symptoms | Action | Log Signature |
|----|---------|----------|--------|---------------|
| D1 | Codex process killed | PID gone from pids.json | `alert_user_restart` | Process not in `/proc/{pid}` |
| D2 | Codex memory explosion | RSS > 4GB | `alert_user_memory` | `/proc/{pid}/status` check |
| D3 | Codex 300s stall | `stream disconnected` loop | `restart_proxy` | Codex stderr: `stream disconnected` |
| D4 | Config corruption | `database disk image is malformed` | `regenerate_config` | Codex stderr: `malformed` |
| D5 | Session context overflow | `context_length_exceeded` | `alert_user_context` | Codex stderr: `context_length_exceeded` |
| D6 | WebSocket reconnect loop | `Reconnecting... N/5` | `check_proxy_health` | Codex stderr: `Reconnecting` |

### Category E: Config/State Failures (watchdog detects, auto-fixes)

| ID | Failure | Symptoms | Action | Detection |
|----|---------|----------|--------|-----------|
| E1 | Schema cache corruption | `content_type: "array"` in provider-caps.json | `delete_provider_caps` | Read file, check for known-bad values |
| E2 | Stale PID file | pids.json has dead PIDs | `cleanup_pids` | Check `/proc/{pid}` existence |
| E3 | Port from old session | config.toml has stale port | `regenerate_config` | Port in config != running port |
| E4 | OAuth token expired | Google/Gemini token refresh fails | `alert_user_reauth` | Token file `expiry_ts < now` |
| E5 | BGP all routes down | Every route returned error | `alert_user_no_provider` | All routes in cooldown |


---

## 5. Component Design

### 5.1 Health Watcher Thread

Runs as a background thread in the GUI process and probes the proxy's `/health` endpoint every 5 seconds.

```python
import threading
import time
import urllib.request


class HealthWatcher(threading.Thread):
    def __init__(self, proxy_port, on_failure, on_recovery):
        super().__init__(daemon=True)
        self.proxy_port = proxy_port
        self.on_failure = on_failure
        self.on_recovery = on_recovery
        self.check_interval = 5  # seconds
        self.failures = 0
        self.running = True

    def run(self):
        while self.running:
            healthy = self._check_health()
            if healthy:
                # Announce recovery only if a failure was reported earlier.
                if self.failures >= 3:
                    self.on_recovery()
                self.failures = 0
            else:
                self.failures += 1
                if self.failures >= 3:  # three consecutive failed probes
                    self.on_failure(self.failures)
            time.sleep(self.check_interval)

    def _check_health(self):
        try:
            url = f"http://localhost:{self.proxy_port}/health"
            with urllib.request.urlopen(url, timeout=5) as resp:
                return resp.status == 200
        except Exception:
            # Connection refused, timeout, or non-2xx all count as unhealthy.
            return False
```

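The "memory watch" duty from the architecture diagram (failure A3, process RSS above 1 GB) can be covered by reading `VmRSS` from `/proc/{pid}/status`, as the catalog suggests. This is a Linux-only sketch; the helper names `parse_vmrss_kb`, `rss_mb`, and `over_memory_limit` are assumptions, not existing launcher functions.

```python
def parse_vmrss_kb(status_text):
    """Extract VmRSS in kB from the text of /proc/<pid>/status, or None."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            fields = line.split()
            # Expected shape: "VmRSS:   123456 kB"
            if len(fields) >= 2 and fields[1].isdigit():
                return int(fields[1])
    return None


def rss_mb(pid):
    """Return the resident set size of `pid` in MB, or None if unavailable."""
    try:
        with open(f"/proc/{pid}/status", "r") as fh:
            kb = parse_vmrss_kb(fh.read())
    except OSError:
        return None  # process gone, or no /proc filesystem on this platform
    return None if kb is None else kb / 1024


def over_memory_limit(pid, limit_mb=1024):
    """True only when the RSS is known and exceeds the limit."""
    mb = rss_mb(pid)
    return mb is not None and mb > limit_mb
```

Treating an unreadable status file as "not over the limit" keeps a vanished process from being reported as a memory leak; the dead-process case is already covered by the `/health` probe.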
### 5.2 Log Analyzer Thread

Tails the debug log and extracts failure signals in real time.

```python
import threading
import time

# Log substring -> (fault ID from Section 4, signal name).
# The first matching entry wins, so order matters.
FAILURE_SIGNALS = {
    "parsed_tool_calls=0":    ("C1", "parser_empty"),
    "[STUCK-RECOVERY]":       ("C3", "stuck_recovery"),
    "suspicious cmd":         ("C5", "sanitizer_flag"),
    "empty cmd recovered":    ("C6", "empty_cmd"),
    "HTTP 429":               ("B1", "rate_limited"),
    "HTTP 500":               ("B2", "server_error"),
    "HTTP 401":               ("B3", "auth_failure"),
    "HTTP 403":               ("B4", "forbidden"),
    "Connection refused":     ("A1", "proxy_dead"),
    "Address already in use": ("A2", "port_conflict"),
    "Broken pipe":            ("B7", "broken_pipe"),
    "Connection reset":       ("B6", "connection_reset"),
    "timed out":              ("B5", "timeout"),
    "SELF-REVIVE CRASH":      ("A5", "proxy_crash"),
    "stream error":           ("B6", "stream_error"),
}


class LogAnalyzer(threading.Thread):
    def __init__(self, log_path, on_signal):
        super().__init__(daemon=True)
        self.log_path = log_path
        self.on_signal = on_signal
        self.running = True

    def run(self):
        with open(self.log_path, "r") as fh:  # closed when the loop exits
            fh.seek(0, 2)  # seek to end: only new lines are of interest
            while self.running:
                line = fh.readline()
                if not line:
                    time.sleep(0.5)
                    continue
                for pattern, (fault_id, category) in FAILURE_SIGNALS.items():
                    if pattern in line:
                        self.on_signal(fault_id, category, line.strip())
                        break
```

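Several escalations in the failure catalog depend on repetition ("If 3x", "If 5x"), so the `on_signal` callback needs some way to count recent signals per fault ID. One possible approach is a sliding time window, sketched below. Note the assumption: the catalog speaks of repeats "in a row", while this sketch counts repeats within a time window, which is a simplification. `SignalCounter` and its 60-second default are hypothetical.

```python
import time
from collections import defaultdict, deque


class SignalCounter:
    """Counts signals per fault ID inside a sliding time window."""

    def __init__(self, window_s=60.0, clock=time.monotonic):
        self.window_s = window_s
        self.clock = clock                   # injectable for testing
        self.events = defaultdict(deque)     # fault_id -> recent timestamps

    def record(self, fault_id):
        """Register one occurrence; return how many are still inside the window."""
        now = self.clock()
        stamps = self.events[fault_id]
        stamps.append(now)
        # Drop occurrences that have aged out of the window.
        while stamps and now - stamps[0] > self.window_s:
            stamps.popleft()
        return len(stamps)
```

A callback could then escalate with a check such as `if counter.record(fault_id) >= 3`, leaving isolated one-off signals to the in-request handling.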
### 5.3 AI Diagnostic Agent

Invoked by the watchdog when a failure matches neither a Tier 1 rule nor a Tier 2 pattern.

```python
import json
import urllib.request


class AIDiagnosticAgent:
    # _extract_pattern, _build_prompt and _parse_response are not shown here.

    def __init__(self, provider_url, model, api_key):
        self.provider_url = provider_url
        self.model = model
        self.api_key = api_key
        self.system_prompt = DIAGNOSTIC_SYSTEM_PROMPT  # see Section 5.5
        self.incident_store = IncidentStore()          # see Section 5.4

    def diagnose(self, context):
        """Return (action, source_tier, success_rate or None)."""
        # Tier 2: check the incident store first
        pattern = self._extract_pattern(context)
        known_fix = self.incident_store.lookup(pattern)
        if known_fix and known_fix["success_rate"] > 0.7:
            return known_fix["fix"], "tier2_pattern", known_fix["success_rate"]

        # Tier 3: ask the AI model
        prompt = self._build_prompt(context)
        response = self._call_model(prompt)
        action = self._parse_response(response)

        # Learn from this incident
        if action:
            self.incident_store.record(pattern, action)

        return action, "tier3_ai", None

    def _call_model(self, prompt):
        # OpenAI-style chat completions request
        body = {
            "model": self.model,
            "messages": [
                {"role": "system", "content": self.system_prompt},
                {"role": "user", "content": prompt},
            ],
            "max_tokens": 200,
            "temperature": 0.1,
        }
        req = urllib.request.Request(
            self.provider_url,
            data=json.dumps(body).encode(),
            headers={
                "Content-Type": "application/json",
                "Authorization": f"Bearer {self.api_key}",
            },
        )
        with urllib.request.urlopen(req, timeout=15) as resp:
            return json.loads(resp.read())["choices"][0]["message"]["content"]
```

### 5.4 Incident Store

A JSON file that accumulates failure patterns and their resolutions.

```json
{
  "version": 1,
  "incidents": {
    "parser_empty+explore_agent": {
      "fault_ids": ["C1"],
      "fix": "synth_explore_from_urls",
      "source": "intelligent_routing",
      "success_count": 8,
      "fail_count": 1,
      "last_seen": "2026-05-22T16:00:00Z",
      "auto_applied": true
    },
    "server_error+repeat_3x": {
      "fault_ids": ["B2"],
      "fix": "switch_provider",
      "source": "tier1_rule",
      "success_count": 2,
      "fail_count": 0,
      "last_seen": "2026-05-22T14:00:00Z",
      "auto_applied": true
    }
  },
  "ai_diagnostic_calls": 0,
  "tokens_used": 0,
  "cost_usd": 0.0
}
```

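An in-memory sketch of the store is shown below to illustrate two points: a success rate can be derived from `success_count` and `fail_count`, and the 500-entry LRU cap from Section 8 fits naturally on an `OrderedDict`. The `succeeded` argument to `record()` is an assumption (the `diagnose()` sketch in Section 5.3 passes only a pattern and an action), and loading or saving the JSON file is left out.

```python
from collections import OrderedDict


class IncidentStore:
    """In-memory sketch of the incident store with LRU eviction."""

    def __init__(self, max_entries=500):
        self.max_entries = max_entries
        self.incidents = OrderedDict()   # pattern -> entry, least recent first

    def record(self, pattern, fix, succeeded):
        """Count one outcome of applying `fix` to `pattern`."""
        entry = self.incidents.get(pattern)
        if entry is None:
            entry = {"fix": fix, "success_count": 0, "fail_count": 0}
            self.incidents[pattern] = entry
        entry["fix"] = fix
        entry["success_count" if succeeded else "fail_count"] += 1
        self.incidents.move_to_end(pattern)         # mark as most recently used
        while len(self.incidents) > self.max_entries:
            self.incidents.popitem(last=False)      # evict least recently used

    def lookup(self, pattern):
        """Return {"fix", "success_rate"} for a known pattern, else None."""
        entry = self.incidents.get(pattern)
        if entry is None:
            return None
        self.incidents.move_to_end(pattern)
        total = entry["success_count"] + entry["fail_count"]
        rate = entry["success_count"] / total if total else 0.0
        return {"fix": entry["fix"], "success_rate": rate}
```

With the counts from the first example entry (8 successes, 1 failure) the derived rate is 8/9, which clears the 0.7 threshold used in Section 5.3.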
### 5.5 Diagnostic Agent System Prompt

```
You are a diagnostic agent for "Codex Launcher" — a desktop app that runs a local
translation proxy between OpenAI Codex CLI/Desktop and various AI providers.

## Your Job
Analyze the incident report and recommend ONE corrective action.

## Available Actions
- restart_proxy: Kill and restart translate-proxy.py
- kill_stale_processes: Kill orphaned proxy/codex processes
- clear_schema_cache: Delete ~/.cache/codex-proxy/provider-caps.json
- switch_provider: Switch to a different configured endpoint
- increase_timeout: Increase upstream timeout for slow providers
- regenerate_config: Regenerate Codex config.toml
- cleanup_codex_stale: Run cleanup-codex-stale.sh
- alert_user: Show notification to user (can't auto-fix)
- ignore: Transient error, no action needed
- retry_now: Immediate retry without changes

## Decision Rules
- If upstream returns 401/403 with auth error → alert_user (can't fix bad keys)
- If proxy process is dead → restart_proxy
- If same error repeated 5+ times → switch_provider or alert_user
- If error is about content_type/schema → clear_schema_cache
- If "Address already in use" → kill_stale_processes then restart_proxy
- If timeout and upstream is slow → increase_timeout
- If single transient 429/502/503 → ignore (retry handles it)
- If "stream disconnected" and proxy is healthy → ignore (Codex retries)

## Response Format
Reply with ONLY a JSON object:
{"action": "...", "reason": "...", "confidence": 0.0-1.0}

No explanation, no markdown, no extra text.
```

---

## 6. GUI Integration

### AI Monitoring Panel (in Settings tab)

```
┌─────────────────────────────────────────────────────────┐
│ AI Monitoring                                      [ON] │
│                                                         │
│ ┌─ Diagnostic Agent ──────────────────────────────────┐ │
│ │ Provider: [OpenCode Zen ▼]                          │ │
│ │ Model:    [Qwen3-32B ▼]                             │ │
│ │ API Key:  [sk-•••••••••••••••••••• ]                │ │
│ │                                                     │ │
│ │ Cost this month: $0.12 (3 diagnostic calls)         │ │
│ │ Tokens used: 1,847 input / 423 output               │ │
│ └─────────────────────────────────────────────────────┘ │
│                                                         │
│ ┌─ Incident Log (last 7 days) ────────────────────────┐ │
│ │ ✅ 16:00 C1 parser_empty → synth_explore (Tier 2)   │ │
│ │ ⚠️ 15:30 B2 server_error → retry (Tier 1)           │ │
│ │ ✅ 15:00 A1 proxy_dead → restart_proxy (Tier 1)     │ │
│ │ 🤖 14:30 C3 novel_format → clear_cache (Tier 3)     │ │
│ │ ...                                                 │ │
│ └─────────────────────────────────────────────────────┘ │
│                                                         │
│ [View Full Diagnostics]  [Export Incident Report]       │
└─────────────────────────────────────────────────────────┘
```

### Config Storage (in endpoints.json)

```json
{
  "ai_monitoring": {
    "enabled": true,
    "provider_url": "https://opencode.ai/zen/v1/chat/completions",
    "model": "Qwen/Qwen3-32B",
    "api_key": "sk-...",
    "tier1_enabled": true,
    "tier2_enabled": true,
    "tier3_enabled": true,
    "auto_restart_proxy": true,
    "auto_switch_provider": false,
    "health_check_interval_s": 5,
    "max_memory_mb": 1024,
    "notification_level": "important_only"
  }
}
```

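Existing `endpoints.json` files will not contain the `ai_monitoring` section, so a loader has to tolerate missing keys. The sketch below merges the section over fallback values. The fallbacks mirror the example above, except `"enabled": False`, which is an assumed conservative default; the function name `load_monitoring_config` and the rule that Tier 3 is disabled without a provider URL and model are likewise assumptions.

```python
import json

# Fallbacks mirror the example config; "enabled": False is an assumed safe default.
DEFAULTS = {
    "enabled": False,
    "tier1_enabled": True,
    "tier2_enabled": True,
    "tier3_enabled": True,
    "auto_restart_proxy": True,
    "auto_switch_provider": False,
    "health_check_interval_s": 5,
    "max_memory_mb": 1024,
    "notification_level": "important_only",
}


def load_monitoring_config(endpoints_json):
    """Merge the ai_monitoring section of endpoints.json over the defaults."""
    section = json.loads(endpoints_json).get("ai_monitoring", {})
    config = {**DEFAULTS, **section}
    if config["health_check_interval_s"] <= 0:
        raise ValueError("health_check_interval_s must be positive")
    # Tier 3 needs a model endpoint; without one it is switched off.
    if not config.get("provider_url") or not config.get("model"):
        config["tier3_enabled"] = False
    return config
```

Calling `load_monitoring_config("{}")` yields a configuration with monitoring off and Tier 3 disabled, so an upgrade from an older config file changes no behavior until the user opts in.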
### Recommended Models (by cost)

| Model | Cost/Diagnosis | Latency | Quality | Recommended For |
|-------|----------------|---------|---------|-----------------|
| **Qwen3-32B** (OpenCode) | ~$0.0005 | 2-4s | Good | Default: low-cost general choice |
| **DeepSeek V4 Flash** | ~$0.0003 | 2-3s | Good | Low-cost alternative |
| **GPT-4o-mini** | ~$0.001 | 1-2s | Excellent | Best quality/latency |
| **Gemini 2.0 Flash** | ~$0.0002 | 1-2s | Good | Cheapest hosted option, also among the fastest |
| **Claude Haiku 4.5** | ~$0.001 | 2-3s | Excellent | Best reasoning quality |
| **Local Ollama** (if running) | $0 | 5-15s | Varies | Zero-cost offline option |

### Cost Estimate

- Average diagnostic prompt: ~800 tokens input, ~100 tokens output
- Expected frequency: ~1-5 incidents per day that reach Tier 3
- **Monthly cost**: about $0.01 to $0.15 at the per-diagnosis costs in the table above (30 to 150 Tier 3 calls per month), well under the default $2/month cap in Section 8

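The monthly figure follows from the per-diagnosis costs in the model table and the expected Tier 3 frequency. The short calculation below makes the arithmetic explicit, assuming a 30-day month; the function name `monthly_cost` is illustrative.

```python
def monthly_cost(cost_per_diagnosis, calls_per_day, days=30):
    """Monthly spend in USD for a given per-call cost and daily call count."""
    return cost_per_diagnosis * calls_per_day * days


# Cheapest hosted model in the table at 1 call/day: about $0.006 per month.
low = monthly_cost(0.0002, 1)

# Most expensive models in the table at 5 calls/day: about $0.15 per month.
high = monthly_cost(0.001, 5)

# Worst case under the 10-calls-per-day rate limit: about $0.30 per month.
capped = monthly_cost(0.001, 10)
```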

---

## 7. Watchdog Response Flow

```
Failure Detected
       │
       ▼
┌──────────────┐   YES    ┌──────────────────┐
│ Tier 1 Rule? ├─────────►│ Execute Action   │
│ (known)      │          │ Log incident     │
└──────┬───────┘          └──────────────────┘
       │ NO
       ▼
┌──────────────┐   YES    ┌──────────────────┐
│ Tier 2 Match?├─────────►│ Apply Known Fix  │
│ (incident DB)│          │ Update success   │
└──────┬───────┘          └──────────────────┘
       │ NO
       ▼
┌──────────────┐   YES    ┌──────────────────┐
│ AI Enabled?  ├─────────►│ Collect Context  │
│ (Tier 3)     │          │ Build Prompt     │
└──────┬───────┘          │ Call AI Model    │
       │ NO               │ Parse Response   │
       ▼                  │ Execute if auto  │
┌──────────────┐          │ Store incident   │
│ Alert User   │          └──────────────────┘
│ (can't fix)  │
└──────────────┘
```

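The flow reduces to a fall-through over the three tiers. In the sketch below, plain dictionaries stand in for the Tier 1 rule table and the incident store, and `ai_diagnose` is a hypothetical callable that is `None` when Tier 3 is disabled; none of these names come from the launcher code.

```python
def handle_failure(signal, tier1_rules, incident_store, ai_diagnose=None):
    """Walk the tiers in order and return (action, tier)."""
    # Tier 1: known rule
    action = tier1_rules.get(signal)
    if action is not None:
        return action, "tier1"

    # Tier 2: fix remembered from an earlier incident
    known = incident_store.get(signal)
    if known is not None:
        return known, "tier2"

    # Tier 3: ask the model, if enabled, and remember its answer
    if ai_diagnose is not None:
        action = ai_diagnose(signal)
        if action is not None:
            incident_store[signal] = action
            return action, "tier3"

    # Nothing matched: hand over to the user
    return "alert_user", "none"
```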

---

## 8. Safety Guards

1. **Rate limit AI calls**: at most 1 Tier 3 call per 60 seconds and 10 per day
2. **Never auto-execute destructive actions**: fall back to `alert_user` for deleting files (regenerable caches such as `provider-caps.json` excepted), changing API keys, or modifying source code
3. **Auto-restart cap**: at most 5 proxy restarts per 10 minutes, then alert the user
4. **Cost cap**: monthly AI diagnostic budget (configurable, default $2/month)
5. **Cooldown per pattern**: a repeating failure pattern gets an escalating cooldown (30s → 60s → 300s → alert)
6. **User override**: any auto-action can be cancelled within 3 seconds via the GUI
7. **Incident store max size**: 500 entries, LRU eviction
8. **Health check bypass**: if the user manually stopped the proxy, do not alert

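Guard 5 describes a ladder rather than a fixed delay. One way to express it is a per-pattern step counter, sketched below under the assumption that the ladder restarts after a confirmed recovery; `EscalatingCooldown` and its method names are hypothetical.

```python
COOLDOWN_STEPS_S = (30, 60, 300)  # from guard 5; after the last step, alert


class EscalatingCooldown:
    """Hands out a longer cooldown each time the same pattern recurs."""

    def __init__(self, steps=COOLDOWN_STEPS_S):
        self.steps = steps
        self.level = {}   # pattern -> number of times it has fired so far

    def next_cooldown(self, pattern):
        """Return the cooldown in seconds, or None when the user should be alerted."""
        n = self.level.get(pattern, 0)
        self.level[pattern] = n + 1
        return self.steps[n] if n < len(self.steps) else None

    def reset(self, pattern):
        """Call after a confirmed recovery so the ladder starts over."""
        self.level.pop(pattern, None)
```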

---

## 9. Implementation Plan

### Phase 1: Core Watchdog (v3.8.0)
- `HealthWatcher` thread in `codex-launcher-gui`
- `LogAnalyzer` thread tailing `cc-debug.log` and `proxy.log`
- Tier 1 rule engine covering the rules listed in Section 3
- Incident store (JSON file)
- GUI toggle (ON/OFF) in settings
- Auto-restart proxy on crash

### Phase 2: Pattern Learning (v3.8.1)
- Tier 2 incident store lookup
- Auto-learn from Intelligence Routing outcomes
- Success rate tracking per pattern
- Incident log viewer in GUI

### Phase 3: AI Diagnostic Agent (v3.9.0)
- Tier 3 AI model integration
- Provider/model selector in GUI
- Diagnostic prompt template
- Cost tracking
- Full incident report export

### Phase 4: Advanced Recovery (v4.0.0)
- Auto-switch to backup provider on repeated failure
- BGP route health monitoring
- Predictive failure detection (memory growth, latency trends)
- Codex process memory monitoring
- WebSocket reconnect assistance

---

## 10. File Changes Summary

| File | Changes |
|------|---------|
| `codex-launcher-gui` | +HealthWatcher thread, +LogAnalyzer thread, +AI Monitoring panel, +incident log viewer |
| `translate-proxy.py` | +`/monitoring` endpoint (returns health + metrics), enhanced `/health` with memory/uptime |
| `~/.cache/codex-proxy/incident-store.json` | New file: incident pattern database |
| `~/.cache/codex-proxy/monitoring.log` | New file: watchdog activity log |
| `~/.codex/endpoints.json` | +`ai_monitoring` config section |