AI-MONITORING-DESIGN.md
Normal file
@@ -0,0 +1,638 @@
# AI Monitoring — Design Specification

> **Codex Launcher v3.8.0 Feature Design**
> Self-healing nano agent that monitors proxy health, diagnoses failures, and auto-recovers sessions.

---

## 1. Problem Statement

Over 42 sessions in production, we observed these failure categories:

| # | Failure Category | Count | Example |
|---|-----------------|-------|---------|
| F1 | **parsed_tool_calls=0** — model produces unparseable output | 42 | Bare `<explore_agent>`, `<bash>` without cmd, plain English intent |
| F2 | **Stuck recovery triggered** — Intelligence Routing Layer 3 | 13 | "I need to fetch the README", "let me write the script" |
| F3 | **Sanitizer flagged suspicious cmd** — cmd still JSON after unwrap | 11 | `{/'cmd/': /'sshpass -p .../'}` — double-escaped quoting |
| F4 | **Upstream 500** — provider internal error | ~5 | `"An internal error occurred. Please try again later."` |
| F5 | **Connection timeout** — upstream unreachable | ~3 | `Connection timed out after 15002 milliseconds` |
| F6 | **Upstream 401/403** — auth failure | ~2 | Wrong API key, expired token, `upgrade_required` |
| F7 | **Stream crash** — exception mid-stream | ~2 | `BrokenPipeError`, `ConnectionResetError` during SSE |
| F8 | **Proxy port conflict** — Address already in use | ~1 | Stale process holding port |
| F9 | **Schema cache corruption** — stale content_type=array | ~1 | `ErrorAnalyzer` learned wrong schema |
| F10 | **Codex Desktop crash** — SIGKILL at ~27GB | ~1 | Issue #24048 — unbounded tool output memory |
| F11 | **Codex 300s stall** — turn state machine race | ~1 | Issue #23807 — `stream disconnected` after 300s |

### The Gap

Intelligence Routing (v3.7.0) handles F1/F2/F3 **inside a single request**. But it can't:

- **Detect a dead proxy process** (F7/F8) — the proxy already crashed
- **Reconnect Codex to a restarted proxy** (F5/F7/F8) — Codex doesn't auto-reconnect
- **Switch to a backup provider** when the primary is down (F4/F5)
- **Clear corrupt caches** (F9) — requires out-of-band action
- **Restart Codex Desktop** after a crash (F10/F11)
- **Learn from failure patterns** across sessions — each failure is handled independently

### What We Need

A **separate lightweight watchdog process** that:

1. Monitors proxy health continuously
2. Detects failures the proxy can't detect itself
3. Uses a cheap AI model to diagnose novel failures
4. Takes corrective action automatically
5. Learns from past incidents to prevent repeats

---

## 2. Architecture

```
┌─────────────────────────────────────────────────────────────────────┐
│                         Codex Launcher GUI                          │
│  ┌──────────┐  ┌──────────────┐  ┌───────────────────────────────┐  │
│  │ Proxy    │  │ Codex        │  │ AI Monitoring Panel           │  │
│  │ Manager  │  │ Launcher     │  │  ┌─────────────────────┐      │  │
│  │          │  │              │  │  │ ON/OFF Toggle       │      │  │
│  └────┬─────┘  └──────┬───────┘  │  │ Provider Selector   │      │  │
│       │               │          │  │ Model Selector      │      │  │
│       │               │          │  │ Incident Log        │      │  │
│       │               │          │  │ [View Diagnostics]  │      │  │
│       │               │          │  └─────────────────────┘      │  │
│       │               │          └───────────────────────────────┘  │
└───────┼───────────────┼─────────────────────────────────────────────┘
        │               │
        ▼               ▼
┌───────────────┐  ┌────────────────┐
│ translate-    │  │ Codex Desktop  │
│ proxy.py      │  │ / CLI          │
│ (port 8080)   │  │                │
│               │  │                │
│ /health ──────┼──┼─► health check │
│ /responses ───┼──┼─► main API     │
└───────────────┘  └────────────────┘
        ▲
        │ health probes + log analysis + corrective actions
        │
┌───────┴─────────────────────────────────────────────────────────────┐
│                         AI Monitor Watchdog                         │
│                   (thread in codex-launcher-gui)                    │
│                                                                     │
│  ┌─────────────────┐   ┌─────────────────┐   ┌──────────────────┐   │
│  │ Health Watcher  │   │ Log Analyzer    │   │ AI Diagnostic    │   │
│  │ (every 5s)      │   │ (continuous)    │   │ Agent (on-call)  │   │
│  │                 │   │                 │   │                  │   │
│  │ - /health probe │   │ - tail cc-debug │   │ - Classify err   │   │
│  │ - process alive │   │ - tail proxy.log│   │ - Root cause     │   │
│  │ - port check    │   │ - pattern match │   │ - Suggest fix    │   │
│  │ - memory watch  │   │ - incident DB   │   │ - Execute fix    │   │
│  └────────┬────────┘   └────────┬────────┘   └────────┬─────────┘   │
│           │                     │                     │             │
│           └─────────────────────┼─────────────────────┘             │
│                                 ▼                                   │
│                      ┌──────────────────────┐                       │
│                      │ Incident Store       │                       │
│                      │ (JSON file)          │                       │
│                      │ - Known patterns     │                       │
│                      │ - Past resolutions   │                       │
│                      │ - Success rates      │                       │
│                      └──────────────────────┘                       │
└─────────────────────────────────────────────────────────────────────┘
```

---

## 3. Three-Tier Response System

### Tier 1: Fast Path — Rule-Based Auto-Recovery (< 1 second)

Immediate reactions to **known failure patterns**. No AI needed.

```python
TIER1_RULES = [
    # (trigger_pattern, action, cooldown)

    # --- Proxy Health ---
    ("proxy_health_fail",         "restart_proxy",            30),
    ("proxy_port_conflict",       "kill_stale + restart",     60),
    ("proxy_memory_over_1gb",     "restart_proxy",            120),

    # --- Upstream Errors ---
    ("upstream_429",              "wait_retry_after",         0),
    ("upstream_502_503",          "retry_with_backoff",       30),
    ("upstream_500_repeat_3x",    "switch_provider",          60),
    ("upstream_timeout",          "retry + increase_timeout", 30),
    ("upstream_401_403",          "alert_user_bad_key",       0),

    # --- Stream Errors ---
    ("stream_broken_pipe",        "restart_proxy",            30),
    ("stream_reset",              "restart_proxy",            30),
    ("stream_idle_300s",          "restart_proxy",            60),

    # --- Parser Failures ---
    ("parsed_tool_calls_0_x3",    "clear_schema_cache",       300),
    ("sanitizer_suspicious_5x",   "alert_user_model_issue",   0),
    ("stuck_recovery_x5",         "suggest_switch_model",     0),

    # --- Codex Process ---
    ("codex_process_dead",        "alert_user_restart",       0),
    ("codex_memory_over_4gb",     "alert_user_memory",        0),

    # --- Cache Corruption ---
    ("schema_content_type_array", "delete_provider_caps",     0),
]
```
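
A minimal sketch of how the rule table might be evaluated, assuming the third tuple element is a per-trigger cooldown in seconds. `Tier1Engine`, `match`, and the injectable `clock` are hypothetical names used for illustration; they are not taken from the launcher code.

```python
import time

# Hypothetical subset of TIER1_RULES: (trigger, action, cooldown_seconds)
RULES = [
    ("proxy_health_fail", "restart_proxy", 30),
    ("upstream_429", "wait_retry_after", 0),
]


class Tier1Engine:
    """Maps a trigger name to its action, suppressing repeats inside the cooldown."""

    def __init__(self, rules, clock=time.monotonic):
        self._rules = {trigger: (action, cooldown) for trigger, action, cooldown in rules}
        self._last_fired = {}  # trigger -> clock() value of the last firing
        self._clock = clock    # injectable so the cooldown can be tested without sleeping

    def match(self, trigger):
        """Return the action to run, or None if the trigger is unknown or cooling down."""
        rule = self._rules.get(trigger)
        if rule is None:
            return None
        action, cooldown = rule
        now = self._clock()
        last = self._last_fired.get(trigger)
        if last is not None and now - last < cooldown:
            return None  # same trigger fired too recently
        self._last_fired[trigger] = now
        return action
```

Returning `None` for both unknown and cooling-down triggers keeps the sketch short. A real engine would probably distinguish the two cases so that only unknown triggers fall through to Tier 2.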

### Tier 2: Pattern Matching — Incident Store Lookup (< 100ms)

For failures we've **seen before and resolved**, look up the fix:

```json
{
  "incidents": [
    {
      "pattern": "cc_stream_ended_empty + explore_agent + no_url",
      "fix": "synth_explore_from_last_user_urls",
      "source": "FIX-23",
      "success_rate": 0.85,
      "last_seen": "2026-05-22T16:00:00Z",
      "occurrences": 5
    },
    {
      "pattern": "require_escalation + no_cmd",
      "fix": "auto_proceed_echo",
      "source": "FIX-24",
      "success_rate": 1.0,
      "last_seen": "2026-05-22T15:30:00Z",
      "occurrences": 3
    }
  ]
}
```
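
The lookup itself could be as small as the sketch below. It assumes exact string matching on `pattern` and reuses the `0.7` success-rate threshold that appears in Section 5.3; `lookup_fix` is a hypothetical helper name.

```python
def lookup_fix(store, pattern, min_success_rate=0.7):
    """Return the stored fix for `pattern`, or None if it is unknown or unreliable."""
    for incident in store.get("incidents", []):
        if incident["pattern"] == pattern and incident["success_rate"] > min_success_rate:
            return incident["fix"]
    return None


# Same shape as the JSON example above, trimmed to the fields the lookup reads.
store = {
    "incidents": [
        {"pattern": "cc_stream_ended_empty + explore_agent + no_url",
         "fix": "synth_explore_from_last_user_urls", "success_rate": 0.85},
        {"pattern": "require_escalation + no_cmd",
         "fix": "auto_proceed_echo", "success_rate": 1.0},
    ]
}
```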

### Tier 3: AI Diagnostic — Nano Agent (2-5 seconds)

For **novel failures** that don't match any rule or pattern, invoke a cheap AI model:

```
Prompt Template (system):
─────────────────────
You are a diagnostic agent for a translation proxy that sits between
OpenAI Codex CLI/Desktop and AI providers (Command Code, OpenAI-compat,
Anthropic, etc.). You analyze error context and suggest ONE corrective action.

Available actions: restart_proxy, kill_stale_processes, clear_schema_cache,
switch_provider, increase_timeout, alert_user, ignore, retry_now,
regenerate_config, cleanup_codex_stale

Respond with ONLY a JSON object: {"action": "...", "reason": "...", "confidence": 0.0-1.0}

Prompt Template (user):
─────────────────────
INCIDENT REPORT:
Time: {timestamp}
Session: {session_id}
Proxy health: {alive/dead, port, uptime, memory_mb}
Upstream: {url, model, last_http_code, last_error}
Recent errors (last 60s):
{log_lines}
Parser state: {parsed_tool_calls, stuck_recovery_count, sanitizer_flags}
Provider: {backend_type, model}
History: {last_5_incidents_for_this_pattern}

What corrective action should be taken?
```
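
As an illustration, the user template could be filled from a context dict as sketched below. The field names (`proxy_health`, `upstream`, `parser_state`, and so on) and the `max_log_lines` truncation are assumptions made for this example; the template above only describes what each slot should contain.

```python
USER_TEMPLATE = """\
INCIDENT REPORT:
Time: {timestamp}
Session: {session_id}
Proxy health: {proxy_health}
Upstream: {upstream}
Recent errors (last 60s):
{log_lines}
Parser state: {parser_state}
Provider: {provider}
History: {history}

What corrective action should be taken?"""


def build_user_prompt(ctx, max_log_lines=20):
    """Fill the user template, keeping only the newest log lines to bound prompt size."""
    fields = dict(ctx)
    fields["log_lines"] = "\n".join(ctx["log_lines"][-max_log_lines:])
    return USER_TEMPLATE.format(**fields)
```

Capping the number of log lines is one way to keep the prompt small during an error storm.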

---

## 4. Complete Failure Catalog

### Category A: Proxy-Level Failures (watchdog detects, auto-recovers)

| ID | Failure | Symptoms | Tier 1 Action | Log Signature |
|----|---------|----------|---------------|---------------|
| A1 | Proxy process crashed | `/health` returns connection refused | `restart_proxy` | `urllib.error.URLError: [Errno 111] Connection refused` |
| A2 | Port conflict | `Address already in use` on startup | `kill_stale + restart` | `OSError: [Errno 98] Address already in use` |
| A3 | Memory leak | Process RSS > 1GB | `restart_proxy` | `/proc/{pid}/status` VmRSS check |
| A4 | Deadlock | Health check hangs > 15s | `restart_proxy` | health probe timeout |
| A5 | Unhandled exception | Process exits with non-zero | `restart_proxy` | `SELF-REVIVE CRASH #{n}` |
| A6 | SSL/TLS error | `CERTIFICATE_VERIFY_FAILED` upstream | `alert_user` | `urllib.error.URLError: certificate verify failed` |
| A7 | DNS resolution failure | `getaddrinfo failed` | `retry_with_backoff` | `socket.gaierror: Name or service not known` |

### Category B: Upstream Provider Failures (proxy detects, watchdog analyzes)

| ID | Failure | Symptoms | Tier 1 Action | Log Signature |
|----|---------|----------|---------------|---------------|
| B1 | Rate limit (429) | Too many requests | `wait_retry_after` | `HTTP 429` + `Retry-After` header |
| B2 | Server error (5xx) | Provider down | `retry_with_backoff` | `HTTP 500/502/503` |
| B3 | Auth failure (401/403) | Bad/expired key | `alert_user_bad_key` | `HTTP 401 {"error":"invalid_api_key"}` |
| B4 | CC upgrade required (403) | Version mismatch | `update_cc_version` | `HTTP 403 upgrade_required` |
| B5 | Connection timeout | Upstream silent | `retry + increase_timeout` | `urllib.error.URLError: timed out` |
| B6 | Connection reset | Upstream dropped mid-stream | `restart_proxy` | `ConnectionResetError: Connection reset by peer` |
| B7 | Broken pipe | Client disconnected | `ignore` | `BrokenPipeError: Broken pipe` |
| B8 | Upstream 400 bad request | Malformed request | `clear_schema_cache` | `HTTP 400 {"error":"...expected string..."}` |
| B9 | Provider capacity (503) | Overloaded | `switch_provider` | `HTTP 503` after 3 retries |
| B10 | Cloudflare block (403/1010) | Bot detection | `check_browser_ua` | `HTTP 403 error 1010` |

### Category C: Parser/Format Failures (Intelligence Routing handles, watchdog tracks)

| ID | Failure | Symptoms | Auto-Fix (IR Layer) | Watchdog Escalation |
|----|---------|----------|--------------------|--------------------|
| C1 | Bare `<explore_agent>` | `parsed_tool_calls=0` | Layer 1: URL extraction | If 3x in a row → suggest model switch |
| C2 | `<require_escalation>` block | Model wants permissions | Layer 2: Auto-proceed | If 5x → suggest different provider |
| C3 | Unrecognized format | No parser matches | Layer 3: Intent synthesis | If 5x → log for AI diagnosis |
| C4 | Double-wrapped cmd | `cmd = "{\"cmd\": ...}"` | Sanitizer: unwrap | If cmd still JSON → alert |
| C5 | Suspicious cmd (JSON) | `cmd starts with {` | Sanitizer: flag | If 3x → clear cache + restart |
| C6 | Empty cmd | `cmd = ""` or `cmd = "{}"` | Sanitizer: diagnostic echo | If 3x → suggest model switch |
| C7 | Bare `{` token | Model outputs incomplete JSON | Layer 3: heuristic 5 | If persistent → AI diagnosis |
| C8 | `<bash>` without cmd | Block has sandbox but no command | Layer 3: heuristic | If 3x → AI diagnosis |
| C9 | DSML name mismatch | `name="cmd"` vs `name="command"` | DSML parser handles both | Self-test catches regression |
| C10 | Stuck model loop | Same recovery 5+ times | Layer 3 max 3x then alert | Switch model or provider |
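
The "Nx in a row" escalations in the last column could be tracked with a small streak counter. The sketch below assumes that any cleanly parsed request resets every streak, which the table does not spell out; `EscalationTracker` and its method names are hypothetical.

```python
from collections import defaultdict

# Thresholds copied from the Watchdog Escalation column for three of the rows.
ESCALATE_AFTER = {"C1": 3, "C2": 5, "C6": 3}


class EscalationTracker:
    """Counts consecutive occurrences of each fault and signals when to escalate."""

    def __init__(self, thresholds):
        self._thresholds = thresholds
        self._streak = defaultdict(int)

    def record_failure(self, fault_id):
        """Count one occurrence; return True once the streak reaches its threshold."""
        self._streak[fault_id] += 1
        threshold = self._thresholds.get(fault_id)
        return threshold is not None and self._streak[fault_id] >= threshold

    def record_success(self):
        """Assumption: a request that parsed cleanly breaks every streak."""
        self._streak.clear()
```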

### Category D: Codex Process Failures (watchdog detects, alerts user)

| ID | Failure | Symptoms | Action | Log Signature |
|----|---------|----------|--------|---------------|
| D1 | Codex process killed | PID gone from pids.json | `alert_user_restart` | Process not in `/proc/{pid}` |
| D2 | Codex memory explosion | RSS > 4GB | `alert_user_memory` | `/proc/{pid}/status` check |
| D3 | Codex 300s stall | `stream disconnected` loop | `restart_proxy` | Codex stderr: `stream disconnected` |
| D4 | Config corruption | `database disk image is malformed` | `regenerate_config` | Codex stderr: `malformed` |
| D5 | Session context overflow | `context_length_exceeded` | `alert_user_context` | Codex stderr: `context_length_exceeded` |
| D6 | WebSocket reconnect loop | `Reconnecting... N/5` | `check_proxy_health` | Codex stderr: `Reconnecting` |

### Category E: Config/State Failures (watchdog detects, auto-fixes)

| ID | Failure | Symptoms | Action | Detection |
|----|---------|----------|--------|-----------|
| E1 | Schema cache corruption | `content_type: "array"` in provider-caps.json | `delete_provider_caps` | Read file, check for known-bad values |
| E2 | Stale PID file | pids.json has dead PIDs | `cleanup_pids` | Check `/proc/{pid}` existence |
| E3 | Port from old session | config.toml has stale port | `regenerate_config` | Port in config != running port |
| E4 | OAuth token expired | Google/Gemini token refresh fails | `alert_user_reauth` | Token file `expiry_ts < now` |
| E5 | BGP all routes down | Every route returned error | `alert_user_no_provider` | All routes in cooldown |

---

## 5. Component Design

### 5.1 Health Watcher Thread

Runs in the GUI process as a background thread. Pings the proxy's `/health` endpoint every 5 seconds.

```python
import threading
import time
import urllib.request


class HealthWatcher(threading.Thread):
    def __init__(self, proxy_port, on_failure, on_recovery):
        super().__init__(daemon=True)
        self.proxy_port = proxy_port
        self.on_failure = on_failure
        self.on_recovery = on_recovery
        self.check_interval = 5  # seconds between probes
        self.failures = 0        # consecutive failed probes
        self.running = True

    def run(self):
        while self.running:
            healthy = self._check_health()
            if healthy:
                if self.failures > 0:
                    self.failures = 0
                    self.on_recovery()
            else:
                self.failures += 1
                # Three consecutive failed probes. Each probe can itself take
                # up to 5s to time out, so this is at least 10s of downtime.
                if self.failures >= 3:
                    self.on_failure(self.failures)
            time.sleep(self.check_interval)

    def _check_health(self):
        try:
            req = urllib.request.Request(f"http://localhost:{self.proxy_port}/health")
            with urllib.request.urlopen(req, timeout=5) as resp:
                return resp.status == 200
        except Exception:
            # Connection refused, timeout, non-2xx status: all count as unhealthy.
            return False
```
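
The "memory watch" item from the architecture diagram (failure A3) relies on the `VmRSS` field of `/proc/{pid}/status`. A minimal Linux-only sketch of that check follows; `parse_vmrss_mb` and `process_rss_mb` are hypothetical helper names.

```python
def parse_vmrss_mb(status_text):
    """Extract VmRSS from /proc/<pid>/status text and convert kB to MB."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            # The line looks like "VmRSS:     2048 kB".
            return int(line.split()[1]) / 1024
    return None  # kernel threads, for example, have no VmRSS line


def process_rss_mb(pid):
    """Return the resident set size of `pid` in MB, or None if it cannot be read."""
    try:
        with open(f"/proc/{pid}/status") as fh:
            return parse_vmrss_mb(fh.read())
    except OSError:
        return None  # process is gone, or this system has no /proc filesystem
```

A watcher could compare the result against the `max_memory_mb` setting and treat `None` as "process not running".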

### 5.2 Log Analyzer Thread

Tails the debug log and extracts failure signals in real time.

```python
import threading
import time

# Substring -> (fault ID from the failure catalog, signal category).
# Entries are checked in insertion order; the first match wins.
FAILURE_SIGNALS = {
    "parsed_tool_calls=0":    ("C1", "parser_empty"),
    "[STUCK-RECOVERY]":       ("C3", "stuck_recovery"),
    "suspicious cmd":         ("C4", "sanitizer_flag"),
    "empty cmd recovered":    ("C6", "empty_cmd"),
    "HTTP 429":               ("B1", "rate_limited"),
    "HTTP 500":               ("B2", "server_error"),
    "HTTP 401":               ("B3", "auth_failure"),
    "HTTP 403":               ("B4", "forbidden"),
    "Connection refused":     ("A1", "proxy_dead"),
    "Address already in use": ("A2", "port_conflict"),
    "Broken pipe":            ("B7", "broken_pipe"),
    "Connection reset":       ("B6", "connection_reset"),
    "timed out":              ("B5", "timeout"),
    "SELF-REVIVE CRASH":      ("A5", "proxy_crash"),
    "stream error":           ("B6", "stream_error"),
}


class LogAnalyzer(threading.Thread):
    def __init__(self, log_path, on_signal):
        super().__init__(daemon=True)
        self.log_path = log_path
        self.on_signal = on_signal
        self.running = True

    def run(self):
        with open(self.log_path, "r") as fh:
            fh.seek(0, 2)  # seek to end so only new lines are analyzed
            while self.running:
                line = fh.readline()
                if not line:
                    time.sleep(0.5)
                    continue
                for pattern, (fault_id, category) in FAILURE_SIGNALS.items():
                    if pattern in line:
                        self.on_signal(fault_id, category, line.strip())
                        break
```

### 5.3 AI Diagnostic Agent

Invoked by the watchdog when a failure doesn't match Tier 1 rules or Tier 2 patterns.

```python
import json
import urllib.request


class AIDiagnosticAgent:
    # Helper methods (_extract_pattern, _build_prompt, _parse_response) and the
    # IncidentStore class are omitted from this listing.

    def __init__(self, provider_url, model, api_key):
        self.provider_url = provider_url
        self.model = model
        self.api_key = api_key
        self.system_prompt = DIAGNOSTIC_SYSTEM_PROMPT  # defined in Section 5.5
        self.incident_store = IncidentStore()

    def diagnose(self, context):
        # Tier 2: Check incident store first
        pattern = self._extract_pattern(context)
        known_fix = self.incident_store.lookup(pattern)
        if known_fix and known_fix["success_rate"] > 0.7:
            return known_fix["fix"], "tier2_pattern", known_fix["success_rate"]

        # Tier 3: Ask AI
        prompt = self._build_prompt(context)
        response = self._call_model(prompt)
        action = self._parse_response(response)

        # Learn from this incident
        if action:
            self.incident_store.record(pattern, action)

        return action, "tier3_ai", None

    def _call_model(self, prompt):
        # OpenAI-compatible chat completions request
        body = {
            "model": self.model,
            "messages": [
                {"role": "system", "content": self.system_prompt},
                {"role": "user", "content": prompt}
            ],
            "max_tokens": 200,
            "temperature": 0.1,
        }
        req = urllib.request.Request(
            self.provider_url,
            data=json.dumps(body).encode(),
            headers={
                "Content-Type": "application/json",
                "Authorization": f"Bearer {self.api_key}",
            }
        )
        with urllib.request.urlopen(req, timeout=15) as resp:
            return json.loads(resp.read())["choices"][0]["message"]["content"]
```
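
`_parse_response` is not shown above. One possible shape, assuming the model may wrap the JSON in extra text despite the instructions, is sketched here as a standalone function. The allow-list mirrors the ten actions named in the prompt; rejecting anything outside it and clamping `confidence` are choices made for this sketch.

```python
import json

ALLOWED_ACTIONS = {
    "restart_proxy", "kill_stale_processes", "clear_schema_cache",
    "switch_provider", "increase_timeout", "alert_user", "ignore",
    "retry_now", "regenerate_config", "cleanup_codex_stale",
}


def parse_response(text):
    """Return a validated {"action", "reason", "confidence"} dict, or None."""
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        data = json.loads(text[start:end + 1])
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    action = data.get("action")
    if not isinstance(action, str) or action not in ALLOWED_ACTIONS:
        return None  # never execute an action the prompt did not offer
    try:
        confidence = float(data.get("confidence", 0.0))
    except (TypeError, ValueError):
        return None
    return {
        "action": action,
        "reason": str(data.get("reason", "")),
        "confidence": min(max(confidence, 0.0), 1.0),
    }
```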

### 5.4 Incident Store

JSON file that accumulates failure patterns and their resolutions.

```json
{
  "version": 1,
  "incidents": {
    "parser_empty+explore_agent": {
      "fault_ids": ["C1"],
      "fix": "synth_explore_from_urls",
      "source": "intelligent_routing",
      "success_count": 8,
      "fail_count": 1,
      "last_seen": "2026-05-22T16:00:00Z",
      "auto_applied": true
    },
    "server_error+repeat_3x": {
      "fault_ids": ["B2"],
      "fix": "switch_provider",
      "source": "tier1_rule",
      "success_count": 2,
      "fail_count": 0,
      "last_seen": "2026-05-22T14:00:00Z",
      "auto_applied": true
    }
  },
  "ai_diagnostic_calls": 0,
  "tokens_used": 0,
  "cost_usd": 0.0
}
```
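
This schema stores `success_count` and `fail_count`, while the agent code in 5.3 reads a `success_rate`. A sketch of how the rate might be derived and the counters updated follows; both function names are hypothetical, and treating an entry with no recorded outcomes as rate `0.0` is an assumption.

```python
def success_rate(entry):
    """Fraction of recorded outcomes that succeeded; 0.0 when nothing is recorded."""
    total = entry["success_count"] + entry["fail_count"]
    return entry["success_count"] / total if total else 0.0


def record_outcome(store, pattern, succeeded, now_iso):
    """Update counters for a known pattern; return False if the pattern is unknown."""
    entry = store["incidents"].get(pattern)
    if entry is None:
        return False
    entry["success_count" if succeeded else "fail_count"] += 1
    entry["last_seen"] = now_iso
    return True
```

With the first entry above (8 successes, 1 failure) the rate is 8/9, which clears the `0.7` threshold used in `diagnose()`.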

### 5.5 Diagnostic Agent System Prompt

```
You are a diagnostic agent for "Codex Launcher" — a desktop app that runs a local
translation proxy between OpenAI Codex CLI/Desktop and various AI providers.

## Your Job
Analyze the incident report and recommend ONE corrective action.

## Available Actions
- restart_proxy: Kill and restart translate-proxy.py
- kill_stale_processes: Kill orphaned proxy/codex processes
- clear_schema_cache: Delete ~/.cache/codex-proxy/provider-caps.json
- switch_provider: Switch to a different configured endpoint
- increase_timeout: Increase upstream timeout for slow providers
- regenerate_config: Regenerate Codex config.toml
- cleanup_codex_stale: Run cleanup-codex-stale.sh
- alert_user: Show notification to user (can't auto-fix)
- ignore: Transient error, no action needed
- retry_now: Immediate retry without changes

## Decision Rules
- If upstream returns 401/403 with auth error → alert_user (can't fix bad keys)
- If proxy process is dead → restart_proxy
- If same error repeated 5+ times → switch_provider or alert_user
- If error is about content_type/schema → clear_schema_cache
- If "Address already in use" → kill_stale_processes then restart_proxy
- If timeout and upstream is slow → increase_timeout
- If single transient 429/502/503 → ignore (retry handles it)
- If "stream disconnected" and proxy is healthy → ignore (Codex retries)

## Response Format
Reply with ONLY a JSON object:
{"action": "...", "reason": "...", "confidence": 0.0-1.0}

No explanation, no markdown, no extra text.
```

---

## 6. GUI Integration

### AI Monitoring Panel (in Settings tab)

```
┌─────────────────────────────────────────────────────────┐
│ AI Monitoring                                      [ON] │
│                                                         │
│ ┌─ Diagnostic Agent ──────────────────────────────────┐ │
│ │ Provider: [OpenCode Zen ▼]                          │ │
│ │ Model:    [Qwen3-32B ▼]                             │ │
│ │ API Key:  [sk-•••••••••••••••••••• ]                │ │
│ │                                                     │ │
│ │ Cost this month: $0.12 (3 diagnostic calls)         │ │
│ │ Tokens used: 1,847 input / 423 output               │ │
│ └─────────────────────────────────────────────────────┘ │
│                                                         │
│ ┌─ Incident Log (last 7 days) ────────────────────────┐ │
│ │ ✅ 16:00 C1 parser_empty → synth_explore (Tier 2)  │ │
│ │ ⚠️ 15:30 B2 server_error → retry (Tier 1)           │ │
│ │ ✅ 15:00 A1 proxy_dead → restart_proxy (Tier 1)     │ │
│ │ 🤖 14:30 C3 novel_format → clear_cache (Tier 3)     │ │
│ │ ...                                                 │ │
│ └─────────────────────────────────────────────────────┘ │
│                                                         │
│ [View Full Diagnostics]  [Export Incident Report]       │
└─────────────────────────────────────────────────────────┘
```

### Config Storage (in endpoints.json)

```json
{
  "ai_monitoring": {
    "enabled": true,
    "provider_url": "https://opencode.ai/zen/v1/chat/completions",
    "model": "Qwen/Qwen3-32B",
    "api_key": "sk-...",
    "tier1_enabled": true,
    "tier2_enabled": true,
    "tier3_enabled": true,
    "auto_restart_proxy": true,
    "auto_switch_provider": false,
    "health_check_interval_s": 5,
    "max_memory_mb": 1024,
    "notification_level": "important_only"
  }
}
```

### Recommended Models (by cost)

| Model | Cost/Diagnosis | Latency | Quality | Recommended For |
|-------|---------------|---------|---------|----------------|
| **Qwen3-32B** (OpenCode) | ~$0.0005 | 2-4s | Good | Default |
| **DeepSeek V4 Flash** | ~$0.0003 | 2-3s | Good | Low-cost alternative |
| **GPT-4o-mini** | ~$0.001 | 1-2s | Excellent | Best quality/latency |
| **Gemini 2.0 Flash** | ~$0.0002 | 1-2s | Good | Cheapest + fastest |
| **Claude Haiku 4.5** | ~$0.001 | 2-3s | Excellent | Best reasoning quality |
| **Local Ollama** (if running) | $0 | 5-15s | Varies | Zero-cost offline option |

### Cost Estimate

- Average diagnostic prompt: ~800 tokens input, ~100 tokens output
- Expected frequency: ~1-5 incidents per day that reach Tier 3 (30-150 per month)
- **Monthly cost**: roughly $0.01 - $0.15 at the per-diagnosis costs listed above, depending on model and usage

---

## 7. Watchdog Response Flow

```
Failure Detected
       │
       ▼
┌──────────────┐   YES    ┌──────────────────┐
│ Tier 1 Rule? ├─────────►│ Execute Action   │
│ (known)      │          │ Log incident     │
└──────┬───────┘          └──────────────────┘
       │ NO
       ▼
┌──────────────┐   YES    ┌──────────────────┐
│ Tier 2 Match?├─────────►│ Apply Known Fix  │
│ (incident DB)│          │ Update success   │
└──────┬───────┘          └──────────────────┘
       │ NO
       ▼
┌──────────────┐   YES    ┌──────────────────┐
│ AI Enabled?  ├─────────►│ Collect Context  │
│ (Tier 3)     │          │ Build Prompt     │
└──────┬───────┘          │ Call AI Model    │
       │ NO               │ Parse Response   │
       ▼                  │ Execute if auto  │
┌──────────────┐          │ Store incident   │
│ Alert User   │          └──────────────────┘
│ (can't fix)  │
└──────────────┘
```
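
The flow above reduces to a short fall-through function. In this sketch Tier 1 and Tier 2 are plain dicts and Tier 3 is an optional callable; `respond` and its parameters are hypothetical names, and falling back to `alert_user` when the AI returns nothing usable is an assumption.

```python
def respond(trigger, pattern, tier1, tier2, tier3=None):
    """Return (action, tier) by trying Tier 1, then Tier 2, then Tier 3."""
    action = tier1.get(trigger)
    if action is not None:
        return action, "tier1"
    action = tier2.get(pattern)
    if action is not None:
        return action, "tier2"
    if tier3 is not None:  # tier3 is None when AI diagnostics are disabled
        action = tier3(pattern)
        if action is not None:
            return action, "tier3"
    return "alert_user", "none"
```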

---

## 8. Safety Guards

1. **Rate limit AI calls** — max 1 Tier 3 call per 60 seconds, max 10 per day
2. **Never auto-execute destructive actions** — `alert_user` for: delete files, change API keys, modify source code
3. **Auto-restart cap** — max 5 proxy restarts per 10 minutes, then alert user
4. **Cost cap** — monthly AI diagnostic budget (configurable, default $2/month)
5. **Cooldown per pattern** — same failure pattern has escalating cooldown (30s → 60s → 300s → alert)
6. **User override** — any auto-action can be cancelled within 3 seconds via GUI
7. **Incident store max size** — 500 entries, LRU eviction
8. **Health check bypass** — if user manually stopped proxy, don't alert
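
Guards 1 and 3 are both "at most N events per time window" limits, so one sliding-window limiter could serve both. The sketch below is illustrative: `SlidingWindowLimiter` is a hypothetical name, and reading "per day" as a rolling 24-hour window is an assumption.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Allows at most `limit` events within any `window_s`-second window."""

    def __init__(self, limit, window_s, clock=time.monotonic):
        self._limit = limit
        self._window_s = window_s
        self._clock = clock      # injectable so tests need not sleep
        self._events = deque()   # timestamps of allowed events, oldest first

    def allow(self):
        """Record and permit the event if under the limit; otherwise refuse it."""
        now = self._clock()
        while self._events and now - self._events[0] >= self._window_s:
            self._events.popleft()  # drop events that left the window
        if len(self._events) >= self._limit:
            return False
        self._events.append(now)
        return True


# Limits taken from the list above.
restart_cap = SlidingWindowLimiter(limit=5, window_s=600)
tier3_per_minute = SlidingWindowLimiter(limit=1, window_s=60)
tier3_per_day = SlidingWindowLimiter(limit=10, window_s=86400)
```

When `restart_cap.allow()` returns `False`, the watchdog would alert the user instead of restarting again. Note that calling `allow()` on two limiters in sequence consumes a slot in the first even if the second refuses, so combining the two Tier 3 limits needs a little extra care.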

---

## 9. Implementation Plan

### Phase 1: Core Watchdog (v3.8.0)
- `HealthWatcher` thread in `codex-launcher-gui`
- `LogAnalyzer` thread tailing `cc-debug.log` and `proxy.log`
- Tier 1 rule engine with all rules from Section 3 (17 listed)
- Incident store (JSON file)
- GUI toggle (ON/OFF) in settings
- Auto-restart proxy on crash

### Phase 2: Pattern Learning (v3.8.1)
- Tier 2 incident store lookup
- Auto-learn from Intelligence Routing outcomes
- Success rate tracking per pattern
- Incident log viewer in GUI

### Phase 3: AI Diagnostic Agent (v3.9.0)
- Tier 3 AI model integration
- Provider/model selector in GUI
- Diagnostic prompt template
- Cost tracking
- Full incident report export

### Phase 4: Advanced Recovery (v4.0.0)
- Auto-switch to backup provider on repeated failure
- BGP route health monitoring
- Predictive failure detection (memory growth, latency trends)
- Codex process memory monitoring
- WebSocket reconnect assistance

---

## 10. File Changes Summary

| File | Changes |
|------|---------|
| `codex-launcher-gui` | +HealthWatcher thread, +LogAnalyzer thread, +AI Monitoring panel, +incident log viewer |
| `translate-proxy.py` | +`/monitoring` endpoint (returns health + metrics), enhanced `/health` with memory/uptime |
| `~/.cache/codex-proxy/incident-store.json` | New file — incident pattern database |
| `~/.cache/codex-proxy/monitoring.log` | New file — watchdog activity log |
| `~/.codex/endpoints.json` | +`ai_monitoring` config section |

CHANGELOG.md
@@ -1,5 +1,76 @@
# Changelog

## v3.7.0 (2026-05-22)

**Intelligence Routing — Self-Healing Parser System**

When the Command Code model produces output in unpredictable or unrecognized formats, the multi-format parser chain (DSML, XML, explore_agent, bash blocks, raw JSON, fallback regex) can return empty. This causes the Codex agent loop to stall — zero tool calls means nothing to execute.

Intelligence Routing is a **three-layer self-healing system** that ensures the agent loop always continues:

### Layer 1: Deep URL Extraction (FIX 23)
- **Problem**: `<explore_agent>` body contained `messages: [{"content": "https://..."}]` — URLs hidden inside JSON values. Regex couldn't match because it excluded the `"` character that terminates JSON strings.
- **Solution**: `_build_explore_cmd()` extracted to module level (was a closure inside `_parse_commandcode_text_tool_calls`). After initial regex fails, tries `json.loads()`, iterates list items, extracts `content` field to find URLs. Added `"` to regex exclusion set.
- **Self-tests**: Pattern M, O, O2 verify URL extraction from nested JSON.
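
The regex-then-`json.loads()` fallback can be illustrated with the standalone sketch below. It is not the project's `_build_explore_cmd()`: `extract_urls` and `URL_RE` are hypothetical, and the sketch assumes the body is valid JSON whenever the plain-text pass finds nothing.

```python
import json
import re

# The character class stops at whitespace, quotes, and angle brackets, so a URL
# inside a JSON string ends at the closing quote.
URL_RE = re.compile(r"https?://[^\s\"'<>]+")


def extract_urls(body):
    """Find URLs in `body`, falling back to a walk over decoded JSON values."""
    urls = URL_RE.findall(body)
    if urls:
        return urls
    try:
        data = json.loads(body)
    except json.JSONDecodeError:
        return []
    found = []
    stack = [data]
    while stack:
        node = stack.pop()
        if isinstance(node, str):
            found.extend(URL_RE.findall(node))
        elif isinstance(node, dict):
            stack.extend(node.values())
        elif isinstance(node, list):
            stack.extend(node)
    return found
```

The JSON pass matters when the raw text hides the URL from the regex, for example when slashes are escaped as `\/` and only become `/` after decoding.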

### Layer 2: Escalation Block Handling (FIX 24)
- **Problem**: Model produces `<require_escalation>` and `<request_escalation_permission>` blocks when it wants elevated permissions. CC adapter doesn't support escalation — blocks silently dropped → `parsed_tool_calls=0` → stall.
- **Solution**: Two handlers:
  - FIX 24a: Closed-tag blocks — extracts URL if present and runs explore command; otherwise echoes auto-proceed.
  - FIX 24b: Bare/unclosed tags (`<require_escalation />`) — auto-proceeds with diagnostic echo.
- **Self-tests**: Pattern N, N2 verify both closed and bare escalation blocks.

### Layer 3: Intent-Based Command Synthesis (FIX 25 — THE CORE)
- **Problem**: After ALL parsers return empty, the agent loop has zero tool calls. Model may have written plain English ("I need to fetch the README"), partial JSON, or completely unrecognized formats.
- **Solution**: 5-heuristic synthesis chain in `cc_stream_to_sse()`, run when `parsed_tool_calls=0` and text has content:
  1. **URL in text** → `curl` to fetch it
  2. **File path reference** ("read the file /path/to/X") → `cat` or `ls` that file
  3. **Shell command in backticks/quotes** → extract and run it
  4. **"explore"/"fetch"/"investigate"/"repository" intent** + last user URL → `_build_explore_cmd()` with `_last_user_urls` deque
  5. **"I need to"/"let me"/"please" intent text** → echo diagnostic with the intent
- The system NEVER returns empty tool calls when there's text to analyze.
- **Self-tests**: Patterns M-O2 cover the full pipeline.
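
A simplified, standalone sketch of the five-heuristic chain follows. It is not the code in `cc_stream_to_sse()`: the regexes, the `synthesize_command` name, and the use of `curl` in place of `_build_explore_cmd()` for heuristic 4 are all assumptions made to keep the example self-contained.

```python
import re
import shlex

URL_RE = re.compile(r"https?://[^\s\"'<>)]+")
PATH_RE = re.compile(r"(?<![\w:/])/(?:[\w.\-]+/)*[\w.\-]+")  # absolute paths only
BACKTICK_RE = re.compile(r"`([^`]+)`")
EXPLORE_WORDS = ("explore", "fetch", "investigate", "repository")
INTENT_PHRASES = ("i need to", "let me", "please")


def synthesize_command(text, last_user_urls=()):
    """Try each heuristic in order; return a shell command string or None."""
    url = URL_RE.search(text)
    if url:                                   # 1. URL in text
        return "curl -sL " + shlex.quote(url.group(0))
    path = PATH_RE.search(text)
    if path:                                  # 2. file path reference
        return "cat " + shlex.quote(path.group(0))
    cmd = BACKTICK_RE.search(text)
    if cmd:                                   # 3. shell command in backticks
        return cmd.group(1).strip()
    lowered = text.lower()
    if last_user_urls and any(w in lowered for w in EXPLORE_WORDS):
        # 4. explore intent plus the most recent URL the user mentioned
        return "curl -sL " + shlex.quote(last_user_urls[-1])
    if any(p in lowered for p in INTENT_PHRASES):
        # 5. plain intent text: echo it back as a diagnostic
        return "echo " + shlex.quote("[intent] " + text.strip())
    return None
```

Because the heuristics run in a fixed order, a backticked command that contains an absolute path is caught by heuristic 2 before heuristic 3 sees it. That ordering effect is worth covering in self-tests.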

### Architecture
```
_parse_commandcode_text_tool_calls()   ← Layer 1 + Layer 2
cc_stream_to_sse()                     ← Layer 3 (after parser chain + fallback)
_last_user_urls deque (maxlen=20)      ← Session-wide URL memory for heuristic 4
```

### Test Coverage
- **54 self-test patterns** (up from 41 in v3.6.0)
- 13 new tests covering all three Intelligence Routing layers
- Tests verify: nested JSON URL extraction, closed/bare escalation blocks, module-level explore command builder
|
||||
|
||||
## v3.6.0 (2026-05-22)

**Performance & Stability Hardening: Connection Pooling, Stream Idle Timeouts, Retry-After**

Inspired by an architectural study of [Codex-Proxy-Server](https://github.com/unluckyjori/Codex-Proxy-Server) (Rust/Axum).

### P0: Connection Pooling & Stream Idle Timeout

- **Connection pooling** (`http.client` reuse): persistent HTTPS connections per host, which eliminates the ~100ms TLS handshake on each request. The pool is keyed by `{scheme}://{host}:{port}` and reused across requests.
- **Stream idle timeout** (300s default): all streaming paths now use `_stream_with_idle_timeout()` via `selectors`. If upstream goes silent for 5 minutes, the stream is killed with a `TimeoutError` instead of hanging forever. Applied to:
  - OpenAI-compat streaming (`oa_stream_to_sse`)
  - Command Code streaming (`_iter_cc_events`)
  - Gemini OAuth streaming (`_handle_gemini_oauth`)
  - Auto-continue streaming (`_auto_continue_gemini`)
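The idle-timeout idea can be sketched as follows. This is a minimal illustration, not the proxy's actual `_stream_with_idle_timeout()`: the function name `stream_with_idle_timeout`, its parameters, and the chunk size are assumptions made for the example, and it works on a plain socket (a TLS socket may hold decrypted data that `select()` cannot see, so a real implementation would need extra care there).

```python
import selectors
import socket


def stream_with_idle_timeout(sock, idle_timeout=300.0, chunk_size=65536):
    """Yield chunks from sock; raise TimeoutError after idle_timeout seconds of silence.

    Hypothetical helper for illustration only.
    """
    sel = selectors.DefaultSelector()
    sel.register(sock, selectors.EVENT_READ)
    try:
        while True:
            # select() returns an empty list when nothing became readable in time.
            if not sel.select(timeout=idle_timeout):
                raise TimeoutError(f"upstream idle for {idle_timeout}s")
            chunk = sock.recv(chunk_size)
            if not chunk:  # peer closed the connection cleanly
                return
            yield chunk
    finally:
        sel.close()
```

The timer restarts on every chunk, so a slow but steady stream is never killed; only a stream that stays silent for the whole window is.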
|
||||
|
||||
### P1: Retry-After Header Support & Preemptive Token Refresh

- **`Retry-After` header**: all retry paths (openai-compat, BGP, auto) now read the upstream `Retry-After` header and respect it (capped at 60s). They fall back to exponential backoff if the header is absent.
- **Preemptive OAuth token refresh**: `_preemptive_refresh_token()` checks whether the token expires within the next 5 minutes and logs a warning. It does not refresh the token yet; it prepares the ground for proactive refresh.
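A rough sketch of the delay calculation described above. The function name `retry_delay`, the 1-second backoff base, and applying the same 60s cap to the backoff path are assumptions for illustration; only the 60s cap on `Retry-After` comes from the changelog. The sketch handles the delta-seconds form of the header and treats anything else (such as the HTTP-date form) as absent.

```python
def retry_delay(retry_after, attempt, cap=60.0, base=1.0):
    """Return seconds to wait before retry number `attempt` (0-based).

    retry_after is the raw header value, or None when the header is absent.
    """
    if retry_after is not None:
        try:
            # Respect the upstream hint, but never wait longer than the cap.
            return max(0.0, min(float(retry_after.strip()), cap))
        except ValueError:
            pass  # HTTP-date form or garbage: fall back to backoff
    # Exponential backoff: base, 2*base, 4*base, ... up to the cap.
    return min(base * (2 ** attempt), cap)
```

For example, `retry_delay("5", 0)` waits 5 seconds, `retry_delay("120", 0)` is clamped to 60, and `retry_delay(None, 3)` falls back to 8 seconds of backoff.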
|
||||
|
||||
### P2: Tool Translation Improvements

- **`oa_convert_tools(strict=)`**: separate tool translation for Responses API (with `strict: true`) vs Chat Completions (without `strict`). Some providers reject the `strict` field in Chat Completions mode.
- **Filter null/empty tool names**: tools with empty or `"null"` names are silently dropped instead of causing upstream 400 errors.
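To make the two behaviours concrete, here is a hedged sketch. It is not the real `oa_convert_tools()`: the name `convert_tools`, the assumed input shape (flat dicts with `name`, `description`, `parameters`), and the choice of output shapes (flat with `strict` for the Responses-style target, nested under `function` without `strict` for Chat Completions) are assumptions for illustration.

```python
def convert_tools(tools, strict=False):
    """Translate flat tool definitions, dropping tools without a usable name."""
    converted = []
    for tool in tools or []:
        name = str(tool.get("name") or "").strip()
        if not name or name.lower() == "null":
            continue  # dropping the tool avoids an upstream 400
        description = tool.get("description", "")
        parameters = tool.get("parameters") or {"type": "object", "properties": {}}
        if strict:
            # Responses-style target: flat definition carrying the strict flag.
            converted.append({
                "type": "function",
                "name": name,
                "description": description,
                "parameters": parameters,
                "strict": True,
            })
        else:
            # Chat Completions target: nested definition, no strict field at all.
            converted.append({
                "type": "function",
                "function": {
                    "name": name,
                    "description": description,
                    "parameters": parameters,
                },
            })
    return converted
```

Omitting the `strict` key entirely, instead of sending `strict: false`, matters because the changelog notes that some providers reject the field itself.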
|
||||
|
||||
### P3: Response Store TTL, Bounded Buffers, Dual Logging

- **Response store TTL** (600s): `_response_store_evict()` removes entries older than 10 minutes, which prevents unbounded memory growth in long sessions.
- **Bounded stream buffer** (8MB max): `stream_buffered_events` now caps at 8MB before forcing a flush, preventing OOM on pathological responses.
- **`response.failed` and error events** added to the urgent flush list, so errors reach the client immediately instead of being buffered.
- **Dual logging**: `proxy.log` in `~/.cache/codex-proxy/` captures all proxy messages alongside stderr. The file survives Codex Desktop's stderr piping.
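The TTL eviction can be pictured with a small sketch. The class `ResponseStore`, its method names, and the injectable clock are hypothetical and exist only to illustrate the 600s rule; the real `_response_store_evict()` may be structured differently, and this version is not thread-safe.

```python
import time


class ResponseStore:
    """Keep responses for at most `ttl` seconds (illustrative sketch)."""

    def __init__(self, ttl=600.0, clock=time.monotonic):
        self._ttl = ttl
        self._clock = clock  # injectable so the behaviour can be tested
        self._items = {}     # key -> (stored_at, value)

    def put(self, key, value):
        self._items[key] = (self._clock(), value)

    def evict(self):
        """Drop entries older than the TTL and return how many were removed."""
        cutoff = self._clock() - self._ttl
        stale = [k for k, (stored_at, _) in self._items.items() if stored_at < cutoff]
        for k in stale:
            del self._items[k]
        return len(stale)

    def get(self, key):
        self.evict()
        entry = self._items.get(key)
        return entry[1] if entry else None
```

Evicting on every read keeps the store bounded without a background thread, at the cost of a linear scan per lookup.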
|
||||
|
||||
## v3.5.0 (2026-05-22)
|
||||
|
||||
**Major Release — Command Code Adapter Overhaul, AI Assist, Self-Revive Watchdog, Debug Infrastructure**
|
||||
|
||||
103
README.md
@@ -33,6 +33,7 @@
|
||||
<img src="https://img.shields.io/badge/Streaming_SSE-✓-success" />
|
||||
<img src="https://img.shields.io/badge/Tool_Calls-✓-success" />
|
||||
<img src="https://img.shields.io/badge/AI_Assist-✓-success" />
|
||||
<img src="https://img.shields.io/badge/Intelligence_Routing-✓-success" />
|
||||
<img src="https://img.shields.io/badge/Self_Revive_Watchdog-✓-success" />
|
||||
</p>
|
||||
|
||||
@@ -107,9 +108,12 @@ A three-component system:
|
||||
- **Browser UA injection** — bypasses Cloudflare bot detection for providers like OpenCode
|
||||
- **Smart URL construction** — prevents double-path bugs (`/v1/chat/completions/chat/completions`)
|
||||
- **Header forwarding** — preserves client identity headers while filtering hop-by-hop headers
|
||||
- **Self-revive watchdog** — auto-restarts proxy on crash (up to 50x, progressive backoff 1→30s)
|
||||
- **Debug-to-file logging** — all events and parser results written to `~/.cache/codex-proxy/cc-debug.log`
|
||||
- **Inline self-test**: the `--self-test` flag runs 54 self-test patterns (as of v3.7.0) covering all parser edge cases
|
||||
- **Connection pooling** — persistent HTTPS connections per host, eliminates TLS handshake overhead per request
|
||||
- **Stream idle timeout** — kills stalled upstream connections after 5 minutes of silence
|
||||
- **Retry-After support** — respects upstream `Retry-After` headers on 429/502/503 responses
|
||||
- **Response store TTL** — evicts stored responses older than 10 minutes, prevents memory leaks
|
||||
- **Bounded stream buffers** — 8MB cap prevents OOM on pathological responses
|
||||
- **Dual logging** — all proxy messages written to both stderr and `~/.cache/codex-proxy/proxy.log`
|
||||
- Zero dependencies — pure Python stdlib
|
||||
|
||||
### Command Code Adapter
|
||||
@@ -127,6 +131,19 @@ A three-component system:
|
||||
- **ErrorAnalyzer** — learns from 4xx errors, retries with adjusted parameters (max 2 retries)
|
||||
- **Schema cache** with 24h staleness TTL for provider capabilities
|
||||
|
||||
### Intelligence Routing (v3.7.0)

- **Three-layer self-healing system**: the agent loop keeps moving, even when the model speaks gibberish
- **Layer 1, Deep URL Extraction**: When `<explore_agent>` hides URLs inside nested JSON (`messages: [{"content": "https://..."}]`), the parser drills into the JSON structure to find them. Module-level `_build_explore_cmd()` is reused across parser + stream path.
- **Layer 2, Escalation Auto-Proceed**: `<require_escalation>` and `<request_escalation_permission>` blocks are detected and auto-resolved, so the model doesn't get stuck waiting for permissions that don't exist.
- **Layer 3, Intent-Based Command Synthesis**: When ALL parsers fail, 5 heuristics analyze the model's plain-text output and synthesize a working command:
  1. URL detected → `curl` it
  2. File path mentioned → `cat` or `ls` it
  3. Shell command in quotes → extract and run it
  4. "explore"/"fetch" intent → use the last URL the user mentioned
  5. "I need to"/"let me" intent → echo a diagnostic so the loop continues
- **Session URL memory**: the `_last_user_urls` deque (20 entries) tracks URLs from user messages across the session, giving the synthesizer context to work with
- **54 self-test patterns**: coverage of all three layers
|
||||
|
||||
### GTK Launcher (`codex-launcher-gui`)
|
||||
- **Endpoint manager** — add, edit, delete, set default providers
|
||||
- **Provider presets** — one-click setup for 15+ providers with pre-filled URLs and model lists
|
||||
@@ -321,6 +338,83 @@ Built a cascading parser chain (`DSML → bash → explore → tool_call → XML
|
||||
|
||||
**Verification:** `--self-test` flag runs 19 automated tests covering all edge cases. Debug logging to `~/.cache/codex-proxy/cc-debug.log` captures every parser decision for troubleshooting.
|
||||
|
||||
### Phase 8: Intelligence Routing, or When the Model Refuses to Speak Machine

**Problem:** The 17-fix parser chain from Phase 7 was powerful. It could handle DSML, XML, JSON, bash blocks, explore tags, you name it. But there was one edge case it couldn't crack: **when the model doesn't produce a parseable tool-call format at all**.

In production, `deepseek/deepseek-v4-flash` via Command Code kept doing things like:

```
<explore_agent>
messages: [{"content": "Understand the Z.AI-Chat-for-Android repo at https://..."}]
</explore_agent>
```

or:

```
<require_escalation>
I need elevated permissions to access the repository.
</require_escalation>
```

or just plain English: *"I need to fetch the README from the repository to understand the app structure."*

In every case, `parsed_tool_calls=0`. No tool to execute. The Codex agent loop ground to a halt. The user saw "thinking..." forever.
|
||||
|
||||
**The insight:** The model is trying to communicate *intent*, just not in a format we can parse. Instead of adding more regex patterns, what if we could **read the model's mind**: understand what it *wants* to do, and synthesize the command for it?

**Intelligence Routing: Three Layers of Escalation**

```
Layer 1: "Fix the input"     - Can we extract more from what the model gave us?
Layer 2: "Handle the intent" - Is the model asking for something we can auto-resolve?
Layer 3: "Read the mind"     - What is the model trying to do? Just do it for it.
```
|
||||
|
||||
**Layer 1: Deep URL Extraction (FIX 23)**

The `<explore_agent>` handler had a URL regex, but the URL was trapped inside `{"content": "https://..."}` and the trailing `"` broke matching. The fix: after the initial regex fails, `json.loads()` the entire block, walk the JSON tree, and pull URLs out of `content` fields. The `_build_explore_cmd()` function was extracted to module level so both the parser and the stream handler could use it.

```python
# Before: regex fails, URL lost
# After:  json.loads -> iterate items -> extract content -> find URL
```
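The tree walk can be sketched like this. It is an illustration under assumptions, not the proxy's code: `extract_urls`, `_walk_strings`, and `URL_RE` are hypothetical names, the sketch decodes the first JSON value it can find inside the block instead of the whole block, and it tries the JSON route first with a plain regex as the fallback (the reverse of the order described above, chosen to keep the example short).

```python
import json
import re

# Stop at whitespace, quotes, angle brackets and closing brackets.
URL_RE = re.compile(r"https?://[^\s\"'<>\]\)}]+")


def _walk_strings(node):
    """Yield every string found anywhere inside a decoded JSON value."""
    if isinstance(node, str):
        yield node
    elif isinstance(node, dict):
        for value in node.values():
            yield from _walk_strings(value)
    elif isinstance(node, list):
        for value in node:
            yield from _walk_strings(value)


def extract_urls(block):
    """Return URLs from an explore block, looking inside embedded JSON first."""
    decoder = json.JSONDecoder()
    found = []
    for match in re.finditer(r"[\[{]", block):
        try:
            data, _end = decoder.raw_decode(block, match.start())
        except json.JSONDecodeError:
            continue  # not valid JSON from this position; try the next bracket
        for text in _walk_strings(data):
            found.extend(URL_RE.findall(text))
        if found:
            break
    if not found:
        found = URL_RE.findall(block)  # plain-text fallback
    # Trim sentence punctuation and de-duplicate while keeping order.
    cleaned = [url.rstrip(".,;") for url in found]
    return list(dict.fromkeys(cleaned))
```

`raw_decode()` is used because the block typically starts with a non-JSON prefix such as `messages:`, so decoding has to begin at the first bracket.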
|
||||
|
||||
**Layer 2: Escalation Auto-Proceed (FIX 24)**

`<require_escalation>` blocks are the model's way of saying "I need more permissions." The CC adapter doesn't have an escalation mechanism, so these blocks were silently dropped. The fix: detect them (both closed `<tag>...</tag>` and bare `<tag />` forms), extract any URL inside them, and auto-proceed with an explore command or a diagnostic echo.

```python
# Model: <require_escalation>Please let me run curl</require_escalation>
# Proxy: Okay, here's your curl command → exec_command synthesized
```
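A possible shape for the detection step, as a sketch only. The names `ESCALATION_RE` and `resolve_escalation` are hypothetical, and a plain `curl -sL` stands in for whatever `_build_explore_cmd()` actually produces.

```python
import re
import shlex

# Matches <tag />, <tag>...</tag>, or a bare unclosed <tag>.
ESCALATION_RE = re.compile(
    r"<(require_escalation|request_escalation_permission)\s*"
    r"(?:/>|>(.*?)</\1\s*>|>)",
    re.DOTALL,
)


def resolve_escalation(text):
    """Return a shell command that lets the loop proceed, or None if no block."""
    match = ESCALATION_RE.search(text)
    if match is None:
        return None
    inner = (match.group(2) or "").strip()
    url = re.search(r"https?://[^\s\"'<>]+", inner)
    if url:
        # Stand-in for the real explore command.
        return "curl -sL " + shlex.quote(url.group(0))
    note = inner[:120] or "no details given"
    return "echo " + shlex.quote("escalation auto-resolved: " + note)
```

Returning an `echo` when no URL is present gives the agent loop a harmless tool call to execute, so the turn completes instead of stalling.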
|
||||
|
||||
**Layer 3: Intent-Based Command Synthesis (FIX 25)**

The crown jewel. When ALL parsers return empty (no DSML, no XML, no JSON, no fallback regex matches), the system doesn't give up. It analyzes the model's raw text through **5 heuristic lenses** in priority order:

| Priority | Signal | Synthesized Command |
|:--------:|--------|---------------------|
| 1 | URL in text | `curl` to fetch it |
| 2 | File path reference | `cat` or `ls` the file |
| 3 | Shell command in backticks/quotes | Extract and run it |
| 4 | "explore"/"fetch" + last user URL | Full explore command |
| 5 | "I need to"/"let me" intent | Echo diagnostic |

The system also maintains a **session URL memory** (`_last_user_urls`, a deque of the last 20 URLs from user messages) so heuristic 4 has a URL to work with whenever the user mentioned one earlier in the session, even when the model's text doesn't contain one.

```python
# Model: "I should explore the repository to understand its structure."
# Parser: empty (no parseable format)
# Layer 3 heuristic 4: "explore" detected, pulling URL from session memory...
# Result: exec_command with full curl pipeline
```
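The priority order in the table can be sketched as a single function. This is a simplified illustration, not the code in `cc_stream_to_sse()`: `synthesize_command` and the regexes are hypothetical, heuristic 2 always uses `cat` (the real chain chooses between `cat` and `ls`), heuristic 3 only looks at backticks, and heuristic 4 uses a bare `curl -sL` in place of `_build_explore_cmd()`.

```python
import collections
import re
import shlex

URL_RE = re.compile(r"https?://[^\s\"'<>\]\)}]+")
PATH_RE = re.compile(r"(?<![\w:/])(/[\w.\-]+(?:/[\w.\-]+)+)")
BACKTICK_RE = re.compile(r"`([^`\n]+)`")
EXPLORE_WORDS = ("explore", "fetch", "investigate", "repository")
INTENT_WORDS = ("i need to", "let me", "please")


def synthesize_command(text, last_user_urls=()):
    """Return a shell command inferred from plain text, or None if text is empty."""
    text = text.strip()
    if not text:
        return None
    match = URL_RE.search(text)                      # 1. URL in text
    if match:
        return "curl -sL " + shlex.quote(match.group(0).rstrip(".,;"))
    match = PATH_RE.search(text)                     # 2. file path reference
    if match:
        return "cat " + shlex.quote(match.group(1).rstrip("."))
    match = BACKTICK_RE.search(text)                 # 3. command in backticks
    if match:
        return match.group(1).strip()
    lowered = text.lower()
    if last_user_urls and any(w in lowered for w in EXPLORE_WORDS):
        return "curl -sL " + shlex.quote(last_user_urls[-1])   # 4. explore intent
    if any(w in lowered for w in INTENT_WORDS):      # 5. generic intent
        return "echo " + shlex.quote("intent: " + text[:200])
    return None


# Session URL memory, as described above: the newest user URL is last.
urls = collections.deque(["https://example.com/repo"], maxlen=20)
print(synthesize_command("I should explore the repository to understand its structure.", urls))
```

Heuristic 3 executes text the model wrote, so a real implementation would want the same sanitizing that ordinary tool calls receive.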
|
||||
|
||||
**The result:** Before Intelligence Routing, `parsed_tool_calls=0` meant **game over**: the agent loop stalled permanently. After Intelligence Routing, `parsed_tool_calls=0` triggers the self-healing chain, and the loop gets a tool call to execute whenever the model produced any text to analyze. The model can speak in tongues and the system still works.

**Test coverage:** 54 self-test patterns (up from 41), with 13 new tests specifically for Intelligence Routing layers.
|
||||
|
||||
---
|
||||
|
||||
## Architecture Deep Dive
|
||||
@@ -451,6 +545,9 @@ README.md # This file
|
||||
| CC tool calls have wrong args | Double-wrapped arguments | V3.5 three-tier parser + recursive unwrapping |
| Proxy crashes mid-session | Unhandled streaming error | V3.5 self-revive watchdog auto-restarts |
| CC 403 upgrade_required | Missing version header | V3.5 always sends `x-command-code-version` |
| CC explore_agent can't find URL | URL hidden inside JSON messages | V3.7 Layer 1 drills into JSON to extract URLs |
| CC agent stalls on escalation blocks | `<require_escalation>` not handled | V3.7 Layer 2 auto-proceeds past escalation requests |
| CC agent stalls: no tool calls at all | Model output format unrecognized | V3.7 Layer 3 synthesizes command from text intent |
|
||||
|
||||
---
|
||||
|
||||
|
||||
BIN
codex-launcher_3.6.0_all.deb
Normal file
Binary file not shown.
BIN
codex-launcher_3.7.0_all.deb
Normal file
Binary file not shown.
BIN
codex-launcher_3.8.0_all.deb
Normal file
Binary file not shown.
@@ -3,11 +3,11 @@ set -e
|
||||
|
||||
SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
|
||||
|
||||
if [ -f "$SCRIPT_DIR/codex-launcher_3.5.0_all.deb" ]; then
|
||||
echo "Installing codex-launcher_3.5.0_all.deb ..."
|
||||
sudo dpkg -i "$SCRIPT_DIR/codex-launcher_3.5.0_all.deb"
|
||||
if [ -f "$SCRIPT_DIR/codex-launcher_3.8.0_all.deb" ]; then
|
||||
echo "Installing codex-launcher_3.8.0_all.deb ..."
|
||||
sudo dpkg -i "$SCRIPT_DIR/codex-launcher_3.8.0_all.deb"
|
||||
echo ""
|
||||
echo "Installed v3.5.0 via .deb package."
|
||||
echo "Installed v3.8.0 via .deb package."
|
||||
echo " translate-proxy.py -> /usr/bin/translate-proxy.py"
|
||||
echo " codex-launcher-gui -> /usr/bin/codex-launcher-gui"
|
||||
echo " cleanup-codex-stale -> /usr/bin/cleanup-codex-stale.sh"
|
||||
|
||||
@@ -5,7 +5,7 @@ import gi
|
||||
gi.require_version("Gtk", "3.0")
|
||||
from gi.repository import Gtk, GLib
|
||||
import subprocess, os, signal, sys, threading, time, json, urllib.request, urllib.parse, urllib.error, tempfile, shutil
|
||||
import hashlib, socket, ssl, contextlib, re
|
||||
import hashlib, socket, ssl, contextlib, re, collections
|
||||
import base64, secrets
|
||||
from pathlib import Path
|
||||
|
||||
@@ -26,6 +26,42 @@ model_catalog_json = ""
|
||||
"""
|
||||
|
||||
CHANGELOG = [
|
||||
("3.7.0", "2026-05-22", [
|
||||
"Intelligence Routing — self-healing parser system for Command Code",
|
||||
"Layer 1: Deep URL extraction from nested JSON in explore_agent blocks",
|
||||
"Layer 2: Auto-proceed on require_escalation / request_escalation_permission blocks",
|
||||
"Layer 3: Intent-based command synthesis when all parsers fail (5 heuristics)",
|
||||
"Module-level _build_explore_cmd() — reuses URL extraction across parser + stream",
|
||||
"54 self-test patterns covering all three Intelligence Routing layers",
|
||||
]),
|
||||
("3.6.0", "2026-05-22", [
|
||||
"Connection pooling — persistent HTTPS connections per host",
|
||||
"Stream idle timeout (300s) — kills silent streams instead of hanging",
|
||||
"Retry-After header support on all retry paths",
|
||||
"Bounded stream buffers (8MB) — prevents OOM",
|
||||
"Dual logging to proxy.log + stderr",
|
||||
]),
|
||||
("3.5.0", "2026-05-22", [
|
||||
"Command Code adapter overhaul — 17 patches for multi-format tool-call parsing",
|
||||
"DSML, XML, explore_agent, bash blocks, raw JSON parser chain",
|
||||
"Self-revive watchdog — auto-restarts proxy on crash",
|
||||
"Debug-to-file logging in cc-debug.log",
|
||||
"Inline self-test (19 patterns)",
|
||||
]),
|
||||
("3.3.0", "2026-05-20", [
|
||||
"Antigravity + Gemini CLI OAuth — full Codex agent loop working",
|
||||
"Auto-continue on MAX_TOKENS for Gemini/Antigravity",
|
||||
"BGP++ route scoring and provider policy layer",
|
||||
]),
|
||||
("3.0.0", "2026-05-20", [
|
||||
"Major overhaul — ThreadingHTTPServer, thread-safe state, graceful shutdown",
|
||||
"Dynamic port allocation, proxy health gating, atomic config",
|
||||
"Usage Dashboard v2 with dark theme",
|
||||
]),
|
||||
("2.7.0", "2026-05-20", [
|
||||
"Usage Dashboard redesigned (OpenUsage-inspired dark theme)",
|
||||
"TCP_NODELAY streaming, Anthropic prompt caching",
|
||||
]),
|
||||
("2.6.1", "2026-05-20", [
|
||||
"Google OAuth rebuilt to emulate Gemini CLI — no client_secret.json needed",
|
||||
"Uses Google's public OAuth client_id (same as gemini-cli)",
|
||||
@@ -1087,6 +1123,524 @@ def _check_codex_auth():
|
||||
except Exception as e:
|
||||
return ("error", str(e))
|
||||
|
||||
# ═══════════════════════════════════════════════════════════════════
|
||||
# AI Monitoring — Self-Healing Watchdog
|
||||
# ═══════════════════════════════════════════════════════════════════
|
||||
|
||||
MONITORING_FILE = Path.home() / ".cache/codex-proxy/monitoring-config.json"
|
||||
INCIDENT_STORE_FILE = Path.home() / ".cache/codex-proxy/incident-store.json"
|
||||
MONITORING_LOG = Path.home() / ".cache/codex-proxy/monitoring.log"
|
||||
|
||||
_TIER1_RULES = [
|
||||
("proxy_health_fail", "restart_proxy", 30),
|
||||
("proxy_port_conflict", "kill_stale_restart", 60),
|
||||
("upstream_429", "wait_retry", 0),
|
||||
("upstream_502_503", "retry_backoff", 30),
|
||||
("upstream_500_repeat", "switch_provider", 60),
|
||||
("upstream_timeout", "retry_increase_timeout",30),
|
||||
("upstream_401_403", "alert_bad_key", 0),
|
||||
("stream_broken_pipe", "restart_proxy", 30),
|
||||
("stream_reset", "restart_proxy", 30),
|
||||
("parsed_tool_calls_0_x3", "clear_schema_cache", 300),
|
||||
("sanitizer_suspicious_5x","alert_model_issue", 0),
|
||||
("stuck_recovery_x5", "suggest_switch_model", 0),
|
||||
("codex_process_dead", "alert_restart", 0),
|
||||
("schema_corrupt", "delete_provider_caps", 0),
|
||||
]
|
||||
|
||||
# Keys are regular expressions: _LogAnalyzerThread matches them with re.search().
_FAILURE_SIGNALS = {
    r"parsed_tool_calls=0":    ("C1", "parser_empty"),
    # Brackets are escaped: unescaped, "[STUCK-RECOVERY]" is a character class
    # that matches any single letter such as S, T or E, so it would shadow
    # every pattern listed after it.
    r"\[STUCK-RECOVERY\]":     ("C3", "stuck_recovery"),
    r"suspicious cmd":         ("C4", "sanitizer_flag"),
    r"empty cmd recovered":    ("C6", "empty_cmd"),
    r"HTTP 429":               ("B1", "rate_limited"),
    r"HTTP 500":               ("B2", "server_error"),
    r"HTTP 502":               ("B2", "server_error"),
    r"HTTP 503":               ("B2", "server_error"),
    r"HTTP 401":               ("B3", "auth_failure"),
    r"HTTP 403":               ("B4", "forbidden"),
    r"Connection refused":     ("A1", "proxy_dead"),
    r"Address already in use": ("A2", "port_conflict"),
    r"Broken pipe":            ("B7", "broken_pipe"),
    r"Connection reset":       ("B6", "connection_reset"),
    r"timed out":              ("B5", "timeout"),
    r"SELF-REVIVE CRASH":      ("A5", "proxy_crash"),
    r"stream error":           ("B6", "stream_error"),
    r"content_type.*array":    ("E1", "schema_corrupt"),
}
|
||||
|
||||
_DIAGNOSTIC_SYSTEM_PROMPT = (
|
||||
'You are a diagnostic agent for "Codex Launcher" — a desktop app that runs a local '
|
||||
'translation proxy between OpenAI Codex CLI/Desktop and AI providers.\n\n'
|
||||
'Analyze the incident and respond with ONLY a JSON object:\n'
|
||||
'{"action": "...", "reason": "...", "confidence": 0.0-1.0}\n\n'
|
||||
'Available actions: restart_proxy, kill_stale_processes, clear_schema_cache, '
|
||||
'switch_provider, increase_timeout, regenerate_config, cleanup_stale, '
|
||||
'alert_user, ignore, retry_now\n\n'
|
||||
'Rules:\n'
|
||||
'- upstream 401/403 with auth error -> alert_user\n'
|
||||
'- proxy dead -> restart_proxy\n'
|
||||
'- same error 5+ times -> switch_provider or alert_user\n'
|
||||
'- schema/content_type error -> clear_schema_cache\n'
|
||||
'- "Address already in use" -> kill_stale_processes then restart_proxy\n'
|
||||
'- timeout on slow upstream -> increase_timeout\n'
|
||||
'- single transient 429/502/503 -> ignore\n'
|
||||
'- "stream disconnected" + proxy healthy -> ignore\n'
|
||||
'- no extra text, no markdown, just the JSON object'
|
||||
)
|
||||
|
||||
def _load_monitoring_config():
|
||||
if MONITORING_FILE.exists():
|
||||
try:
|
||||
return json.loads(MONITORING_FILE.read_text())
|
||||
except Exception:
|
||||
pass
|
||||
return {
|
||||
"enabled": False,
|
||||
"provider_url": "",
|
||||
"model": "",
|
||||
"api_key": "",
|
||||
"health_check_interval_s": 5,
|
||||
"auto_restart_proxy": True,
|
||||
"auto_switch_provider": False,
|
||||
}
|
||||
|
||||
def _save_monitoring_config(cfg):
|
||||
MONITORING_FILE.parent.mkdir(parents=True, exist_ok=True)
|
||||
MONITORING_FILE.write_text(json.dumps(cfg, indent=2))
|
||||
|
||||
def _load_incident_store():
|
||||
if INCIDENT_STORE_FILE.exists():
|
||||
try:
|
||||
return json.loads(INCIDENT_STORE_FILE.read_text())
|
||||
except Exception:
|
||||
pass
|
||||
return {"version": 1, "incidents": {}, "stats": {"ai_calls": 0, "tokens_used": 0}}
|
||||
|
||||
def _save_incident_store(store):
|
||||
INCIDENT_STORE_FILE.parent.mkdir(parents=True, exist_ok=True)
|
||||
INCIDENT_STORE_FILE.write_text(json.dumps(store, indent=2))
|
||||
|
||||
def _monitoring_log(msg):
|
||||
try:
|
||||
with open(str(MONITORING_LOG), "a") as f:
|
||||
f.write(f"[{time.strftime('%H:%M:%S')}] {msg}\n")
|
||||
except Exception:
|
||||
pass
|
||||
|
||||
|
||||
class IncidentStore:
|
||||
def __init__(self):
|
||||
self._store = _load_incident_store()
|
||||
self._dirty = False
|
||||
|
||||
def lookup(self, pattern):
|
||||
inc = self._store.get("incidents", {}).get(pattern)
|
||||
if inc and inc.get("success_count", 0) > 0:
|
||||
rate = inc["success_count"] / max(inc["success_count"] + inc.get("fail_count", 0), 1)
|
||||
if rate > 0.5:
|
||||
return inc
|
||||
return None
|
||||
|
||||
def record(self, pattern, fix, success=True):
|
||||
incs = self._store.setdefault("incidents", {})
|
||||
inc = incs.setdefault(pattern, {
|
||||
"fix": fix, "success_count": 0, "fail_count": 0,
|
||||
"last_seen": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
|
||||
"occurrences": 0,
|
||||
})
|
||||
inc["last_seen"] = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())
|
||||
inc["occurrences"] = inc.get("occurrences", 0) + 1
|
||||
if success:
|
||||
inc["success_count"] = inc.get("success_count", 0) + 1
|
||||
else:
|
||||
inc["fail_count"] = inc.get("fail_count", 0) + 1
|
||||
self._dirty = True
|
||||
|
||||
def record_ai_call(self, tokens=0):
|
||||
stats = self._store.setdefault("stats", {"ai_calls": 0, "tokens_used": 0})
|
||||
stats["ai_calls"] = stats.get("ai_calls", 0) + 1
|
||||
stats["tokens_used"] = stats.get("tokens_used", 0) + tokens
|
||||
self._dirty = True
|
||||
|
||||
def flush(self):
|
||||
if self._dirty:
|
||||
_save_incident_store(self._store)
|
||||
self._dirty = False
|
||||
|
||||
@property
|
||||
def stats(self):
|
||||
return self._store.get("stats", {"ai_calls": 0, "tokens_used": 0})
|
||||
|
||||
|
||||
class AIDiagnosticAgent:
|
||||
def __init__(self, provider_url, model, api_key):
|
||||
self.provider_url = provider_url
|
||||
self.model = model
|
||||
self.api_key = api_key
|
||||
self.incident_store = IncidentStore()
|
||||
|
||||
    def diagnose(self, context):
        pattern = self._extract_pattern(context)
        known = self.incident_store.lookup(pattern)
        if known:
            _monitoring_log(f"Tier 2 HIT: pattern={pattern} fix={known['fix']}")
            return {"action": known["fix"], "reason": "known_pattern", "confidence": 0.9, "tier": 2}
        action = self._call_model(context)
        # Only memorize real model answers. On error _call_model() returns an
        # "ai_diag_failed" alert, and recording that as a successful fix would
        # make Tier 2 replay it for this pattern on later incidents.
        if action and not str(action.get("reason", "")).startswith("ai_diag_failed"):
            self.incident_store.record(pattern, action.get("action", "unknown"))
        self.incident_store.flush()
        return action
|
||||
|
||||
def _extract_pattern(self, context):
|
||||
parts = []
|
||||
for k in sorted(context.get("signals", [])):
|
||||
parts.append(k)
|
||||
if context.get("http_code"):
|
||||
parts.append(f"http_{context['http_code']}")
|
||||
return "+".join(parts[:3]) or "unknown"
|
||||
|
||||
def _call_model(self, context):
|
||||
prompt = (
|
||||
f"INCIDENT REPORT:\n"
|
||||
f"Time: {time.strftime('%Y-%m-%dT%H:%M:%SZ', time.gmtime())}\n"
|
||||
f"Proxy health: {context.get('proxy_alive', 'unknown')}\n"
|
||||
f"Upstream: {context.get('upstream_url', 'unknown')}\n"
|
||||
f"Model: {context.get('model', 'unknown')}\n"
|
||||
f"Last HTTP code: {context.get('http_code', 'n/a')}\n"
|
||||
f"Recent signals: {context.get('signals', [])}\n"
|
||||
f"Recent log tail:\n{context.get('log_tail', '')[:1500]}\n"
|
||||
)
|
||||
body = {
|
||||
"model": self.model,
|
||||
"messages": [
|
||||
{"role": "system", "content": _DIAGNOSTIC_SYSTEM_PROMPT},
|
||||
{"role": "user", "content": prompt},
|
||||
],
|
||||
"max_tokens": 200,
|
||||
"temperature": 0.1,
|
||||
}
|
||||
try:
|
||||
req = urllib.request.Request(
|
||||
self.provider_url,
|
||||
data=json.dumps(body).encode(),
|
||||
headers={
|
||||
"Content-Type": "application/json",
|
||||
"Authorization": f"Bearer {self.api_key}",
|
||||
},
|
||||
)
|
||||
resp = urllib.request.urlopen(req, timeout=15)
|
||||
result = json.loads(resp.read())
|
||||
text = result["choices"][0]["message"]["content"].strip()
|
||||
self.incident_store.record_ai_call(tokens=800)
|
||||
action = json.loads(text)
|
||||
action["tier"] = 3
|
||||
_monitoring_log(f"Tier 3 AI: action={action.get('action')} reason={action.get('reason')}")
|
||||
return action
|
||||
except Exception as e:
|
||||
_monitoring_log(f"Tier 3 AI FAILED: {e}")
|
||||
return {"action": "alert_user", "reason": f"ai_diag_failed: {e}", "confidence": 0.0, "tier": 3}
|
||||
|
||||
|
||||
class HealthWatcher(threading.Thread):
|
||||
def __init__(self, on_failure, on_recovery, on_signal, on_action):
|
||||
super().__init__(daemon=True)
|
||||
self.cfg = _load_monitoring_config()
|
||||
self.on_failure = on_failure
|
||||
self.on_recovery = on_recovery
|
||||
self.on_signal = on_signal
|
||||
self.on_action = on_action
|
||||
self.failures = 0
|
||||
self.running = False
|
||||
self._signal_counts = collections.defaultdict(int)
|
||||
self._last_actions = {}
|
||||
self._restart_count = 0
|
||||
self._last_restart_time = 0
|
||||
|
||||
def run(self):
|
||||
self.running = True
|
||||
self.incident_store = IncidentStore()
|
||||
self._log_analyzer = _LogAnalyzerThread(self._on_log_signal)
|
||||
self._log_analyzer.start()
|
||||
while self.running:
|
||||
self.cfg = _load_monitoring_config()
|
||||
if not self.cfg.get("enabled"):
|
||||
time.sleep(5)
|
||||
continue
|
||||
port = self._get_proxy_port()
|
||||
if port:
|
||||
healthy = self._check_health(port)
|
||||
if healthy:
|
||||
if self.failures > 0:
|
||||
self.failures = 0
|
||||
self.on_recovery()
|
||||
else:
|
||||
self.failures += 1
|
||||
if self.failures >= 3:
|
||||
self._handle_failure("proxy_health_fail")
|
||||
self.incident_store.flush()
|
||||
interval = self.cfg.get("health_check_interval_s", 5)
|
||||
time.sleep(interval)
|
||||
|
||||
def stop(self):
|
||||
self.running = False
|
||||
if hasattr(self, '_log_analyzer'):
|
||||
self._log_analyzer.running = False
|
||||
|
||||
def _get_proxy_port(self):
|
||||
try:
|
||||
cfg_path = Path.home() / ".cache/codex-proxy/proxy-config.json"
|
||||
if cfg_path.exists():
|
||||
d = json.loads(cfg_path.read_text())
|
||||
return d.get("port")
|
||||
except Exception:
|
||||
pass
|
||||
return None
|
||||
|
||||
def _check_health(self, port):
|
||||
try:
|
||||
req = urllib.request.Request(f"http://localhost:{port}/health")
|
||||
resp = urllib.request.urlopen(req, timeout=5)
|
||||
return resp.status == 200
|
||||
except Exception:
|
||||
return False
|
||||
|
||||
    def _on_log_signal(self, fault_id, category, line):
        self._signal_counts[category] += 1
        self.on_signal(fault_id, category, line[:200])
        count = self._signal_counts[category]
        # Translate log categories into the trigger names that _TIER1_RULES
        # defines. A trigger with no matching rule falls through to Tier 2/3
        # in _handle_failure(), so mismatched names silently bypass Tier 1.
        if category == "proxy_dead" and count >= 2:
            self._handle_failure("proxy_health_fail")
        elif category == "port_conflict" and count >= 2:
            self._handle_failure("proxy_port_conflict")
        elif category == "server_error" and count >= 3:
            self._handle_failure("upstream_500_repeat")
        elif category == "timeout" and count >= 3:
            self._handle_failure("upstream_timeout")
        elif category == "sanitizer_flag" and count >= 5:
            self._handle_failure("sanitizer_suspicious_5x")
        elif category == "stuck_recovery" and count >= 5:
            self._handle_failure("stuck_recovery_x5")
        elif category == "parser_empty" and count >= 3:
            self._handle_failure("parsed_tool_calls_0_x3")
        elif category == "schema_corrupt":
            self._handle_failure("schema_corrupt")
|
||||
|
||||
def _handle_failure(self, trigger):
|
||||
now = time.time()
|
||||
for rule_trigger, action, cooldown in _TIER1_RULES:
|
||||
if rule_trigger == trigger:
|
||||
last_t = self._last_actions.get(action, 0)
|
||||
if now - last_t < cooldown:
|
||||
return
|
||||
self._last_actions[action] = now
|
||||
_monitoring_log(f"Tier 1: trigger={trigger} action={action}")
|
||||
self.on_action(action, trigger)
|
||||
self.incident_store.record(trigger, action, success=True)
|
||||
return
|
||||
self._try_tier2_3(trigger)
|
||||
|
||||
def _try_tier2_3(self, trigger):
|
||||
cfg = self.cfg
|
||||
if not cfg.get("provider_url") or not cfg.get("model") or not cfg.get("api_key"):
|
||||
_monitoring_log(f"No AI configured for Tier 2/3 — alerting user for trigger={trigger}")
|
||||
self.on_action("alert_user", trigger)
|
||||
return
|
||||
agent = AIDiagnosticAgent(cfg["provider_url"], cfg["model"], cfg["api_key"])
|
||||
context = {
|
||||
"signals": [trigger],
|
||||
"proxy_alive": self.failures == 0,
|
||||
"log_tail": self._get_recent_log(),
|
||||
}
|
||||
result = agent.diagnose(context)
|
||||
if result:
|
||||
action = result.get("action", "alert_user")
|
||||
_monitoring_log(f"Tier {result.get('tier', '?')}: action={action}")
|
||||
self.on_action(action, trigger)
|
||||
|
||||
|
||||
class _LogAnalyzerThread(threading.Thread):
|
||||
def __init__(self, on_signal):
|
||||
super().__init__(daemon=True)
|
||||
self.on_signal = on_signal
|
||||
self.running = False
|
||||
|
||||
def run(self):
|
||||
self.running = True
|
||||
log_paths = [
|
||||
str(Path.home() / ".cache/codex-proxy/cc-debug.log"),
|
||||
str(Path.home() / ".cache/codex-proxy/proxy.log"),
|
||||
]
|
||||
fhs = {}
|
||||
for p in log_paths:
|
||||
try:
|
||||
f = open(p, "r")
|
||||
f.seek(0, 2)
|
||||
fhs[p] = f
|
||||
except Exception:
|
||||
pass
|
||||
while self.running:
|
||||
activity = False
|
||||
for p, fh in list(fhs.items()):
|
||||
try:
|
||||
line = fh.readline()
|
||||
if line:
|
||||
activity = True
|
||||
for pattern, (fault_id, category) in _FAILURE_SIGNALS.items():
|
||||
if re.search(pattern, line):
|
||||
self.on_signal(fault_id, category, line.strip())
|
||||
break
|
||||
except Exception:
|
||||
pass
|
||||
if not activity:
|
||||
time.sleep(0.5)
|
||||
|
||||
|
||||
class AIMonitoringWindow(Gtk.Window):
|
||||
def __init__(self, parent=None):
|
||||
super().__init__(title="AI Monitoring")
|
||||
self.set_transient_for(parent)
|
||||
self.set_default_size(580, 520)
|
||||
self.set_border_width(12)
|
||||
self._cfg = _load_monitoring_config()
|
||||
self._store = _load_incident_store()
|
||||
|
||||
vbox = Gtk.Box(orientation=Gtk.Orientation.VERTICAL, spacing=8)
|
||||
self.add(vbox)
|
||||
|
||||
hdr = Gtk.Box(spacing=8)
|
||||
vbox.pack_start(hdr, False, False, 0)
|
||||
lbl = Gtk.Label()
|
||||
lbl.set_markup("<b>AI Monitoring</b>")
|
||||
lbl.set_use_markup(True)
|
||||
hdr.pack_start(lbl, False, False, 0)
|
||||
self._toggle = Gtk.Switch()
|
||||
self._toggle.set_active(self._cfg.get("enabled", False))
|
||||
self._toggle.connect("state-set", self._on_toggle)
|
||||
hdr.pack_end(self._toggle, False, False, 0)
|
||||
lbl2 = Gtk.Label(label="Enabled")
|
||||
hdr.pack_end(lbl2, False, False, 0)
|
||||
|
||||
frame = Gtk.Frame(label="Diagnostic Agent")
|
||||
vbox.pack_start(frame, False, False, 0)
|
||||
grid = Gtk.Grid(column_spacing=8, row_spacing=6, margin=8)
|
||||
frame.add(grid)
|
||||
|
||||
grid.attach(Gtk.Label(label="Provider URL:", halign=Gtk.Align.END), 0, 0, 1, 1)
|
||||
self._url_entry = Gtk.Entry(hexpand=True)
|
||||
self._url_entry.set_text(self._cfg.get("provider_url", ""))
|
||||
self._url_entry.set_placeholder_text("https://api.openai.com/v1/chat/completions")
|
||||
grid.attach(self._url_entry, 1, 0, 2, 1)
|
||||
|
||||
grid.attach(Gtk.Label(label="Model:", halign=Gtk.Align.END), 0, 1, 1, 1)
|
||||
self._model_entry = Gtk.Entry(hexpand=True)
|
||||
self._model_entry.set_text(self._cfg.get("model", ""))
|
||||
self._model_entry.set_placeholder_text("gpt-4o-mini or Qwen/Qwen3-32B")
|
||||
grid.attach(self._model_entry, 1, 1, 2, 1)
|
||||
|
||||
grid.attach(Gtk.Label(label="API Key:", halign=Gtk.Align.END), 0, 2, 1, 1)
|
||||
self._key_entry = Gtk.Entry(hexpand=True, visibility=False)
|
||||
self._key_entry.set_text(self._cfg.get("api_key", ""))
|
||||
self._key_entry.set_placeholder_text("sk-...")
|
||||
grid.attach(self._key_entry, 1, 2, 1, 1)
|
||||
self._reveal_btn = Gtk.ToggleButton(label="Show")
|
||||
self._reveal_btn.connect("toggled", lambda b: self._key_entry.set_visibility(b.get_active()))
|
||||
grid.attach(self._reveal_btn, 2, 2, 1, 1)
|
||||
|
||||
+        grid.attach(Gtk.Label(label="Health Check:", halign=Gtk.Align.END), 0, 3, 1, 1)
+        adj = Gtk.Adjustment(value=self._cfg.get("health_check_interval_s", 5), lower=2, upper=30, step_increment=1)
+        self._interval_spin = Gtk.SpinButton(adjustment=adj)
+        self._interval_spin.set_numeric(True)
+        grid.attach(self._interval_spin, 1, 3, 1, 1)
+        grid.attach(Gtk.Label(label="seconds"), 2, 3, 1, 1)
+
+        opts_box = Gtk.Box(spacing=12, margin_top=4)
+        grid.attach(opts_box, 0, 4, 3, 1)
+        self._auto_restart_cb = Gtk.CheckButton(label="Auto-restart proxy on crash")
+        self._auto_restart_cb.set_active(self._cfg.get("auto_restart_proxy", True))
+        opts_box.pack_start(self._auto_restart_cb, False, False, 0)
+        self._auto_switch_cb = Gtk.CheckButton(label="Auto-switch provider on repeated failure")
+        self._auto_switch_cb.set_active(self._cfg.get("auto_switch_provider", False))
+        opts_box.pack_start(self._auto_switch_cb, False, False, 0)
+
+        save_btn = Gtk.Button(label="Save Configuration")
+        save_btn.get_style_context().add_class("suggested-action")
+        save_btn.connect("clicked", self._on_save)
+        grid.attach(save_btn, 0, 5, 3, 1)
+
+        stats_box = Gtk.Box(spacing=16)
+        vbox.pack_start(stats_box, False, False, 0)
+        stats = self._store.get("stats", {"ai_calls": 0, "tokens_used": 0})
+        self._stats_lbl = Gtk.Label()
+        self._stats_lbl.set_markup(
+            f"<small>AI diagnostic calls: <b>{stats.get('ai_calls', 0)}</b> | "
+            f"Tokens used: <b>{stats.get('tokens_used', 0):,}</b> | "
+            f"Known patterns: <b>{len(self._store.get('incidents', {}))}</b></small>"
+        )
+        self._stats_lbl.set_use_markup(True)
+        stats_box.pack_start(self._stats_lbl, False, False, 0)
+
+        frame2 = Gtk.Frame(label="Recent Incidents")
+        vbox.pack_start(frame2, True, True, 0)
+        sw = Gtk.ScrolledWindow()
+        sw.set_policy(Gtk.PolicyType.AUTOMATIC, Gtk.PolicyType.AUTOMATIC)
+        frame2.add(sw)
+        self._inc_buf = Gtk.TextBuffer()
+        tv = Gtk.TextView(buffer=self._inc_buf)
+        tv.set_editable(False)
+        tv.set_cursor_visible(False)
+        tv.set_wrap_mode(Gtk.WrapMode.WORD_CHAR)
+        sw.add(tv)
+        self._refresh_incidents()
+
+        bb = Gtk.Box(spacing=8)
+        vbox.pack_start(bb, False, False, 0)
+        view_btn = Gtk.Button(label="View Monitoring Log")
+        view_btn.connect("clicked", lambda b: subprocess.Popen(["xdg-open", str(MONITORING_LOG)]))
+        bb.pack_start(view_btn, False, False, 0)
+        clear_btn = Gtk.Button(label="Clear Incident Store")
+        clear_btn.connect("clicked", self._on_clear_store)
+        bb.pack_start(clear_btn, False, False, 0)
+        close_btn = Gtk.Button(label="Close")
+        close_btn.connect("clicked", lambda b: self.destroy())
+        bb.pack_end(close_btn, False, False, 0)
+
+        self.show_all()
+
+    def _on_toggle(self, switch, state):
+        self._cfg["enabled"] = state
+        _save_monitoring_config(self._cfg)
+
+    def _on_save(self, btn):
+        self._cfg["provider_url"] = self._url_entry.get_text().strip()
+        self._cfg["model"] = self._model_entry.get_text().strip()
+        self._cfg["api_key"] = self._key_entry.get_text().strip()
+        self._cfg["health_check_interval_s"] = int(self._interval_spin.get_value())
+        self._cfg["auto_restart_proxy"] = self._auto_restart_cb.get_active()
+        self._cfg["auto_switch_provider"] = self._auto_switch_cb.get_active()
+        _save_monitoring_config(self._cfg)
+        self._inc_buf.set_text("Configuration saved.\n")
+
+    def _on_clear_store(self, btn):
+        _save_incident_store({"version": 1, "incidents": {}, "stats": {"ai_calls": 0, "tokens_used": 0}})
+        self._store = {"version": 1, "incidents": {}, "stats": {"ai_calls": 0, "tokens_used": 0}}
+        self._refresh_incidents()
+
+    def _refresh_incidents(self):
+        lines = []
+        for pattern, inc in sorted(self._store.get("incidents", {}).items(),
+                                   key=lambda x: x[1].get("last_seen", ""), reverse=True):
+            sc = inc.get("success_count", 0)
+            fc = inc.get("fail_count", 0)
+            rate = sc / max(sc + fc, 1)
+            bar = "+" * min(int(rate * 10), 10) + "-" * (10 - min(int(rate * 10), 10))
+            lines.append(
+                f"[{inc.get('last_seen', '?')[:16]}] {pattern}\n"
+                f" fix={inc.get('fix', '?')} success_rate={rate:.0%} [{bar}] "
+                f"seen={inc.get('occurrences', 0)}x\n"
+            )
+        if not lines:
+            lines.append("No incidents recorded yet.\n")
+            lines.append("\nEnable AI Monitoring and use Codex to populate the store.\n")
+        self._inc_buf.set_text("\n".join(lines))
+
+
 # ═══════════════════════════════════════════════════════════════════
 # Main window
 # ═══════════════════════════════════════════════════════════════════
@@ -1107,7 +1661,7 @@ class LauncherWin(Gtk.Window):
         # header row
         hdr = Gtk.Box(spacing=8)
         vbox.pack_start(hdr, False, False, 0)
-        lbl = Gtk.Label(label="<b>Codex Launcher v3.3.0</b>")
+        lbl = Gtk.Label(label="<b>Codex Launcher v3.8.0</b>")
         lbl.set_use_markup(True)
         hdr.pack_start(lbl, False, False, 0)
         changelog_btn = Gtk.Button(label="Changelog")
@@ -1125,6 +1679,9 @@ class LauncherWin(Gtk.Window):
         bgp_btn = Gtk.Button(label="AI BGP")
         bgp_btn.connect("clicked", lambda b: self._open_bgp())
         hdr.pack_end(bgp_btn, False, False, 0)
+        mon_btn = Gtk.Button(label="AI Monitor")
+        mon_btn.connect("clicked", lambda b: self._open_monitoring())
+        hdr.pack_end(mon_btn, False, False, 0)
         mgr_btn = Gtk.Button(label="Manage Endpoints")
         mgr_btn.connect("clicked", lambda b: self._open_mgr())
         hdr.pack_end(mgr_btn, False, False, 0)
@@ -1274,6 +1831,7 @@ class LauncherWin(Gtk.Window):
         self.show_all()
         self._rebuild_combo()
         self._log_dependency_status()
+        self._start_watcher()

     # ── helpers ──────────────────────────────────────────────────

@@ -1428,6 +1986,77 @@ class LauncherWin(Gtk.Window):
             d = Gtk.MessageDialog(self, 0, Gtk.MessageType.ERROR, Gtk.ButtonsType.OK, f"Error: {e}")
             d.run(); d.destroy()

+    def _open_monitoring(self):
+        try:
+            self._monitoring_window = AIMonitoringWindow(self)
+            self._monitoring_window.connect("destroy", lambda *_: setattr(self, "_monitoring_window", None))
+        except Exception as e:
+            import traceback; traceback.print_exc()
+            d = Gtk.MessageDialog(self, 0, Gtk.MessageType.ERROR, Gtk.ButtonsType.OK, f"Error: {e}")
+            d.run(); d.destroy()
+
+    def _start_watcher(self):
+        cfg = _load_monitoring_config()
+        if not cfg.get("enabled"):
+            return
+        self._watcher = HealthWatcher(
+            on_failure=self._on_watcher_failure,
+            on_recovery=self._on_watcher_recovery,
+            on_signal=self._on_watcher_signal,
+            on_action=self._on_watcher_action,
+        )
+        self._watcher.start()
+        self.log("AI Monitoring: watchdog started")
+
+    def _on_watcher_failure(self, count):
+        GLib.idle_add(self.log, f"[AI Monitor] Proxy unresponsive (failures={count})")
+
+    def _on_watcher_recovery(self):
+        GLib.idle_add(self.log, "[AI Monitor] Proxy recovered")
+
+    def _on_watcher_signal(self, fault_id, category, line):
+        pass
+
+    def _on_watcher_action(self, action, trigger):
+        cfg = _load_monitoring_config()
+        if action == "restart_proxy" and cfg.get("auto_restart_proxy"):
+            GLib.idle_add(self.log, f"[AI Monitor] Auto-restarting proxy (trigger: {trigger})")
+            GLib.idle_add(self._restart_proxy_from_watcher)
+        elif action == "clear_schema_cache":
+            try:
+                cap_file = Path.home() / ".cache/codex-proxy/provider-caps.json"
+                if cap_file.exists():
+                    cap_file.unlink()
+                    GLib.idle_add(self.log, "[AI Monitor] Cleared corrupt schema cache")
+            except Exception as e:
+                GLib.idle_add(self.log, f"[AI Monitor] Failed to clear cache: {e}")
+        elif action == "delete_provider_caps":
+            try:
+                cap_file = Path.home() / ".cache/codex-proxy/provider-caps.json"
+                if cap_file.exists():
+                    cap_file.unlink()
+                    GLib.idle_add(self.log, "[AI Monitor] Deleted corrupted provider-caps.json")
+            except Exception as e:
+                GLib.idle_add(self.log, f"[AI Monitor] Failed: {e}")
+        elif action == "kill_stale_restart":
+            GLib.idle_add(self.log, f"[AI Monitor] Killing stale processes + restarting (trigger: {trigger})")
+            self._kill()
+            GLib.idle_add(self._restart_proxy_from_watcher)
+        else:
+            GLib.idle_add(self.log, f"[AI Monitor] Alert: {action} (trigger: {trigger})")
+
+    def _restart_proxy_from_watcher(self):
+        try:
+            ep_name = load_endpoints().get("default")
+            if not ep_name:
+                return
+            for ep in load_endpoints().get("endpoints", []):
+                if ep.get("name") == ep_name:
+                    self._start_proxy(ep)
+                    break
+        except Exception as e:
+            self.log(f"[AI Monitor] Proxy restart failed: {e}")
+
     def _open_usage(self):
         try:
             self._usage_window = UsageWindow(self)
File diff suppressed because it is too large