docs: update CHANGELOG, README, GUI changelog for v3.8.0 AI Monitoring

- CHANGELOG.md: full v3.8.0 section with 3-tier system, 30 fault types, safety guards - README.md: AI Monitoring badge, features section, Phase 9 dev journey, troubleshooting rows - GUI CHANGELOG: v3.8.0 entry with 9 bullet points
2026-05-22 23:22:26 +04:00
parent 080f2bc56e
commit 78f28bd3ac
3 changed files with 115 additions and 0 deletions
--- a/README.md
+++ b/README.md
@@ -34,6 +34,7 @@
  <img src="https://img.shields.io/badge/Tool_Calls-✓-success" />
  <img src="https://img.shields.io/badge/AI_Assist-✓-success" />
  <img src="https://img.shields.io/badge/Intelligence_Routing-✓-success" />
+  <img src="https://img.shields.io/badge/AI_Monitoring-✓-success" />
  <img src="https://img.shields.io/badge/Self_Revive_Watchdog-✓-success" />
 </p>

@@ -144,6 +145,19 @@ A three-component system:
 - **Session URL memory** — `_last_user_urls` deque (20 entries) tracks URLs from user messages across the session, giving the synthesizer context to work with
 - **54 self-test patterns** — comprehensive coverage of all three layers

+### AI Monitoring (v3.8.0)
+- **Self-healing watchdog** — the proxy auto-recovers from crashes, the model getting stuck, upstream failures, and more
+- **Three-tier response system**: Tier 1 = rule-based (< 1s), Tier 2 = pattern lookup (< 100ms), Tier 3 = AI diagnostic agent (2-5s)
+- **HealthWatcher thread** — pings proxy `/health` every 5 seconds, auto-restarts on crash
+- **LogAnalyzer thread** — tails debug logs for 18 failure signal patterns in real-time
+- **14 Tier 1 rules** — restart proxy, clear schema cache, kill stale processes, retry with backoff, rate limit handling
+- **Incident pattern store** — learns from every resolved incident, looks up known fixes by success rate
+- **AI diagnostic agent** — user-configurable provider/model (e.g., Gemini Flash, GPT-4o-mini, local Ollama) for diagnosing novel failures
+- **30 fault types** catalogued across 5 categories: proxy failures (A), upstream errors (B), parser failures (C), Codex process failures (D), config/state failures (E)
+- **Safety guards** — rate-limited AI calls, restart caps (5/10min), cooldown per pattern, monthly budget cap
+- **GUI panel** — ON/OFF toggle, provider/model/API key selector, health check interval, auto-restart toggle, incident log viewer
+- **Enhanced `/health`** — returns `uptime_s`, `memory_mb`, `requests_total` for monitoring
+
 ### GTK Launcher (`codex-launcher-gui`)
 - **Endpoint manager** — add, edit, delete, set default providers
 - **Provider presets** — one-click setup for 15+ providers with pre-filled URLs and model lists
@@ -415,6 +429,46 @@ The system also maintains a **session URL memory** (`_last_user_urls`, a deque o

 **Test coverage:** 54 self-test patterns (up from 41), with 13 new tests specifically for Intelligence Routing layers.

+### Phase 9: AI Monitoring — The Watchman That Never Sleeps
+
+**Problem:** Intelligence Routing (Phase 8) handles failures *inside a single request*. But it can't detect a dead proxy process, reconnect Codex to a restarted proxy, switch to a backup provider when the primary is down, or clear corrupt caches. When the proxy crashes at 3 AM, the user wakes up to a broken Codex session and has to manually restart everything.
+
+**The insight:** We needed a separate watchdog process that runs *outside* the proxy — monitoring it from the outside, like a night watchman patrolling a building. But a dumb watchdog that just restarts on crash is crude. What if the watchdog could *think* — diagnose *why* the proxy crashed and take the right corrective action?
+
+**The Three-Tier Response System:**
+
+```
+Failure Detected
+      │
+      ├── Tier 1: Known pattern? → Rule-based fix (< 1 second)
+      │             "proxy dead" → restart_proxy
+      │             "429 rate limit" → wait_retry_after
+      │             "schema corrupt" → delete_provider_caps
+      │
+      ├── Tier 2: Seen this before? → Incident store lookup (< 100ms)
+      │             85% success rate → reuse the fix that worked last time
+      │
+      └── Tier 3: Novel failure? → AI diagnostic agent (2-5 seconds)
+                    Feed context to cheap LLM → get recommended action
+                    Learn from result for next time
+```
+
+**What makes this different from existing solutions:**
+
+Existing proxy tools (ccLoad, cc-proxy, codex-pool) all focus on routing and failover at the *request* level. None have an AI-powered diagnostic agent that analyzes failure context and recommends corrective actions. ccLoad has health checks and cooldowns, but it's purely rule-based. AI Monitoring adds the *intelligence* layer on top — the Tier 3 agent can diagnose novel failures that no rule covers.
+
+**How it works:**
+
+Two threads run in the GUI process:
+1. `HealthWatcher` — pings `/health` every 5 seconds. On 3 consecutive failures, triggers Tier 1 `restart_proxy`.
+2. `LogAnalyzer` — tails the debug log file, watching for 18 signal patterns. Counts consecutive failures per category. When a threshold is hit (e.g., 5x stuck recovery, 3x server error), triggers the appropriate tier.
+
+The AI diagnostic agent (Tier 3) is fully configurable — the user picks any provider and model. A cheap model like Gemini Flash (~$0.0002/call) or a free local Ollama instance works perfectly. The agent receives a structured incident report (proxy health, upstream status, recent errors, parser state) and responds with one JSON action.
+
+**Learning over time:** Every resolved incident is stored in `incident-store.json` with pattern → fix → success rate. Over time, the system shifts from Tier 3 (expensive AI calls) to Tier 2 (instant pattern lookup). A failure seen 10 times with 90% success rate will never reach the AI again.
+
+**Catalogued 30 fault types** across 5 categories based on analysis of 42 production `parsed_tool_calls=0` events, 13 stuck recoveries, and 11 sanitizer flags from our actual debug logs. The system knows exactly what to look for.
+
 ---

 ## Architecture Deep Dive
@@ -548,6 +602,10 @@ README.md                         # This file
 | CC explore_agent can't find URL | URL hidden inside JSON messages | V3.7 Layer 1 drills into JSON to extract URLs |
 | CC agent stalls on escalation blocks | `<require_escalation>` not handled | V3.7 Layer 2 auto-proceeds past escalation requests |
 | CC agent stalls — no tool calls at all | Model output format unrecognized | V3.7 Layer 3 synthesizes command from text intent |
+| Proxy crashes mid-session | Unhandled streaming error | V3.8 AI Monitor auto-restarts proxy |
+| Proxy port conflict on restart | Stale process holding port | V3.8 AI Monitor kills stale + restarts |
+| Schema cache corruption | ErrorAnalyzer learned wrong schema | V3.8 AI Monitor auto-clears provider-caps.json |
+| Upstream 500 repeatedly | Provider having issues | V3.8 AI Monitor detects pattern + alerts/switches |

 ---