docs: update CHANGELOG, README, GUI changelog for v3.8.0 AI Monitoring

- CHANGELOG.md: full v3.8.0 section with 3-tier system, 30 fault types, safety guards - README.md: AI Monitoring badge, features section, Phase 9 dev journey, troubleshooting rows - GUI CHANGELOG: v3.8.0 entry with 9 bullet points
2026-05-22 23:22:26 +04:00
3 changed files with 115 additions and 0 deletions
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -1,5 +1,51 @@
 # Changelog

+## v3.8.0 (2026-05-22)
+
+**AI Monitoring — Self-Healing Watchdog with 3-Tier Response System**
+
+When the proxy crashes, the upstream dies, or the model gets stuck, Codex stops working. The user has to manually restart everything. AI Monitoring fixes this with an autonomous watchdog that detects, diagnoses, and recovers from failures without user intervention.
+
+### Three-Tier Response System
+
+| Tier | Speed | What | When |
+|------|-------|------|------|
+| **Tier 1** | < 1s | Rule-based auto-recovery | Known failure patterns (14 rules) |
+| **Tier 2** | < 100ms | Incident store lookup | We've seen this exact failure before |
+| **Tier 3** | 2-5s | AI diagnostic agent (configurable model) | Novel failure — no rule or pattern matches |
+
+### Watchdog Components
+- **HealthWatcher thread** — pings proxy `/health` every 5 seconds, detects crashes and hangs
+- **LogAnalyzer thread** — tails `cc-debug.log` for 18 failure signal patterns in real-time
+- **Tier 1 rule engine** — 14 rules covering: proxy crash restart, port conflict resolution, upstream retry with backoff, schema cache clearing, rate limit handling, stream error recovery
+- **Tier 2 incident store** — JSON pattern database (`~/.cache/codex-proxy/incident-store.json`) with success rates, learns from every resolved incident
+- **Tier 3 AI diagnostic agent** — calls a user-configured provider/model (e.g., Gemini Flash, GPT-4o-mini, local Ollama) to diagnose novel failures. Cost: ~$0.10-1.50/month
+
+### Failure Catalog: 30 Fault Types
+- **Category A** (7): Proxy crash, port conflict, memory leak, deadlock, SSL error, DNS failure, unhandled exception
+- **Category B** (10): Rate limit (429), server error (5xx), auth failure (401/403), CC upgrade required, timeout, connection reset, broken pipe, bad request, provider overloaded, Cloudflare block
+- **Category C** (10): Parser empty, stuck recovery, sanitizer flags, double-wrapped cmd, suspicious cmd, empty cmd, bare JSON token, bash without cmd, DSML name mismatch, stuck model loop
+- **Category D** (6): Codex process killed, memory explosion, 300s stall, config corruption, context overflow, WebSocket reconnect loop
+- **Category E** (5): Schema cache corruption, stale PID file, port from old session, OAuth token expired, BGP all routes down
+
+### Safety Guards
+- Rate-limited AI calls: max 1 per 60s, max 10/day
+- Restart cap: max 5 proxy restarts per 10 minutes
+- Cooldown per pattern (30s → 60s → 300s → alert user)
+- Monthly AI budget cap (configurable, default $2/month)
+
+### Enhanced /health Endpoint
+The proxy's `/health` endpoint now returns `uptime_s`, `memory_mb`, and `requests_total` for watchdog monitoring.
+
+### GUI Integration
+- **"AI Monitor" button** in header bar
+- **AIMonitoringWindow**: ON/OFF toggle, provider URL/model/API key selector, health check interval, auto-restart toggle, incident log viewer
+- Watchdog starts automatically when enabled
+- All actions logged to `~/.cache/codex-proxy/monitoring.log`
+
+### AI Monitoring Design Spec
+Full design document at `AI-MONITORING-DESIGN.md` — architecture diagrams, decision flow, safety guards, implementation plan.
+
 ## v3.7.0 (2026-05-22)

 **Intelligence Routing — Self-Healing Parser System**
--- a/README.md
+++ b/README.md
@@ -34,6 +34,7 @@
  <img src="https://img.shields.io/badge/Tool_Calls-✓-success" />
  <img src="https://img.shields.io/badge/AI_Assist-✓-success" />
  <img src="https://img.shields.io/badge/Intelligence_Routing-✓-success" />
+  <img src="https://img.shields.io/badge/AI_Monitoring-✓-success" />
  <img src="https://img.shields.io/badge/Self_Revive_Watchdog-✓-success" />
 </p>

@@ -144,6 +145,19 @@ A three-component system:
 - **Session URL memory** — `_last_user_urls` deque (20 entries) tracks URLs from user messages across the session, giving the synthesizer context to work with
 - **54 self-test patterns** — comprehensive coverage of all three layers

+### AI Monitoring (v3.8.0)
+- **Self-healing watchdog** — the proxy auto-recovers from crashes, the model getting stuck, upstream failures, and more
+- **Three-tier response system**: Tier 1 = rule-based (< 1s), Tier 2 = pattern lookup (< 100ms), Tier 3 = AI diagnostic agent (2-5s)
+- **HealthWatcher thread** — pings proxy `/health` every 5 seconds, auto-restarts on crash
+- **LogAnalyzer thread** — tails debug logs for 18 failure signal patterns in real-time
+- **14 Tier 1 rules** — restart proxy, clear schema cache, kill stale processes, retry with backoff, rate limit handling
+- **Incident pattern store** — learns from every resolved incident, looks up known fixes by success rate
+- **AI diagnostic agent** — user-configurable provider/model (e.g., Gemini Flash, GPT-4o-mini, local Ollama) for diagnosing novel failures
+- **30 fault types** catalogued across 5 categories: proxy failures (A), upstream errors (B), parser failures (C), Codex process failures (D), config/state failures (E)
+- **Safety guards** — rate-limited AI calls, restart caps (5/10min), cooldown per pattern, monthly budget cap
+- **GUI panel** — ON/OFF toggle, provider/model/API key selector, health check interval, auto-restart toggle, incident log viewer
+- **Enhanced `/health`** — returns `uptime_s`, `memory_mb`, `requests_total` for monitoring
+
 ### GTK Launcher (`codex-launcher-gui`)
 - **Endpoint manager** — add, edit, delete, set default providers
 - **Provider presets** — one-click setup for 15+ providers with pre-filled URLs and model lists
@@ -415,6 +429,46 @@ The system also maintains a **session URL memory** (`_last_user_urls`, a deque o

 **Test coverage:** 54 self-test patterns (up from 41), with 13 new tests specifically for Intelligence Routing layers.

+### Phase 9: AI Monitoring — The Watchman That Never Sleeps
+
+**Problem:** Intelligence Routing (Phase 8) handles failures *inside a single request*. But it can't detect a dead proxy process, reconnect Codex to a restarted proxy, switch to a backup provider when the primary is down, or clear corrupt caches. When the proxy crashes at 3 AM, the user wakes up to a broken Codex session and has to manually restart everything.
+
+**The insight:** We needed a separate watchdog process that runs *outside* the proxy — monitoring it from the outside, like a night watchman patrolling a building. But a dumb watchdog that just restarts on crash is crude. What if the watchdog could *think* — diagnose *why* the proxy crashed and take the right corrective action?
+
+**The Three-Tier Response System:**
+
+```
+Failure Detected
+      │
+      ├── Tier 1: Known pattern? → Rule-based fix (< 1 second)
+      │             "proxy dead" → restart_proxy
+      │             "429 rate limit" → wait_retry_after
+      │             "schema corrupt" → delete_provider_caps
+      │
+      ├── Tier 2: Seen this before? → Incident store lookup (< 100ms)
+      │             85% success rate → reuse the fix that worked last time
+      │
+      └── Tier 3: Novel failure? → AI diagnostic agent (2-5 seconds)
+                    Feed context to cheap LLM → get recommended action
+                    Learn from result for next time
+```
+
+**What makes this different from existing solutions:**
+
+Existing proxy tools (ccLoad, cc-proxy, codex-pool) all focus on routing and failover at the *request* level. None have an AI-powered diagnostic agent that analyzes failure context and recommends corrective actions. ccLoad has health checks and cooldowns, but it's purely rule-based. AI Monitoring adds the *intelligence* layer on top — the Tier 3 agent can diagnose novel failures that no rule covers.
+
+**How it works:**
+
+Two threads run in the GUI process:
+1. `HealthWatcher` — pings `/health` every 5 seconds. On 3 consecutive failures, triggers Tier 1 `restart_proxy`.
+2. `LogAnalyzer` — tails the debug log file, watching for 18 signal patterns. Counts consecutive failures per category. When a threshold is hit (e.g., 5x stuck recovery, 3x server error), triggers the appropriate tier.
+
+The AI diagnostic agent (Tier 3) is fully configurable — the user picks any provider and model. A cheap model like Gemini Flash (~$0.0002/call) or a free local Ollama instance works perfectly. The agent receives a structured incident report (proxy health, upstream status, recent errors, parser state) and responds with one JSON action.
+
+**Learning over time:** Every resolved incident is stored in `incident-store.json` with pattern → fix → success rate. Over time, the system shifts from Tier 3 (expensive AI calls) to Tier 2 (instant pattern lookup). A failure seen 10 times with 90% success rate will never reach the AI again.
+
+**Catalogued 30 fault types** across 5 categories based on analysis of 42 production `parsed_tool_calls=0` events, 13 stuck recoveries, and 11 sanitizer flags from our actual debug logs. The system knows exactly what to look for.
+
 ---

 ## Architecture Deep Dive
@@ -548,6 +602,10 @@ README.md                         # This file
 | CC explore_agent can't find URL | URL hidden inside JSON messages | V3.7 Layer 1 drills into JSON to extract URLs |
 | CC agent stalls on escalation blocks | `<require_escalation>` not handled | V3.7 Layer 2 auto-proceeds past escalation requests |
 | CC agent stalls — no tool calls at all | Model output format unrecognized | V3.7 Layer 3 synthesizes command from text intent |
+| Proxy crashes mid-session | Unhandled streaming error | V3.8 AI Monitor auto-restarts proxy |
+| Proxy port conflict on restart | Stale process holding port | V3.8 AI Monitor kills stale + restarts |
+| Schema cache corruption | ErrorAnalyzer learned wrong schema | V3.8 AI Monitor auto-clears provider-caps.json |
+| Upstream 500 repeatedly | Provider having issues | V3.8 AI Monitor detects pattern + alerts/switches |

 ---

--- a/src/codex-launcher-gui
+++ b/src/codex-launcher-gui
@@ -26,6 +26,17 @@ model_catalog_json = ""
 """

 CHANGELOG = [
+    ("3.8.0", "2026-05-22", [
+        "AI Monitoring — self-healing watchdog with 3-tier response system",
+        "HealthWatcher: monitors proxy health every 5s, auto-restarts on crash",
+        "LogAnalyzer: tails debug logs for 18 failure signal patterns",
+        "Tier 1: 14 rule-based auto-recovery rules (< 1s response)",
+        "Tier 2: Incident pattern store with success rate tracking",
+        "Tier 3: AI diagnostic agent — configurable provider/model for novel failures",
+        "30 fault types catalogued across 5 categories (A-E)",
+        "GUI: AI Monitor panel with ON/OFF, provider selector, incident log",
+        "Enhanced /health endpoint with memory and uptime metrics",
+    ]),
    ("3.7.0", "2026-05-22", [
        "Intelligence Routing — self-healing parser system for Command Code",
        "Layer 1: Deep URL extraction from nested JSON in explore_agent blocks",