docs: update CHANGELOG, README, GUI changelog for v3.8.0 AI Monitoring

- CHANGELOG.md: full v3.8.0 section with 3-tier system, 30 fault types, safety guards
- README.md: AI Monitoring badge, features section, Phase 9 dev journey, troubleshooting rows
- GUI CHANGELOG: v3.8.0 entry with 9 bullet points
2026-05-22 23:22:26 +04:00
parent 080f2bc56e
commit 78f28bd3ac
3 changed files with 115 additions and 0 deletions
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -1,5 +1,51 @@
 # Changelog

+## v3.8.0 (2026-05-22)
+
+**AI Monitoring — Self-Healing Watchdog with 3-Tier Response System**
+
+When the proxy crashes, the upstream dies, or the model gets stuck, Codex stops working and the user has to restart everything manually. AI Monitoring fixes this with an autonomous watchdog that detects, diagnoses, and recovers from failures without user intervention.
+
+### Three-Tier Response System
+
+| Tier | Speed | What | When |
+|------|-------|------|------|
+| **Tier 1** | < 1s | Rule-based auto-recovery | Known failure patterns (14 rules) |
+| **Tier 2** | < 100ms | Incident store lookup | We've seen this exact failure before |
+| **Tier 3** | 2-5s | AI diagnostic agent (configurable model) | Novel failure — no rule or pattern matches |
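The fallthrough order in the table above could be sketched roughly as follows. This is a minimal illustration, not the project's implementation: `resolve`, `Resolution`, the substring-based rule format, and the dict-based incident lookup are all hypothetical names and shapes chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Resolution:
    tier: int    # which tier produced the answer (1, 2, or 3)
    action: str  # recovery action to take


def resolve(
    signal: str,
    rules: Dict[str, str],
    incidents: Dict[str, str],
    ai_diagnose: Callable[[str], str],
) -> Resolution:
    # Tier 1: substring match against known failure patterns (assumed rule format).
    for pattern, action in rules.items():
        if pattern in signal:
            return Resolution(1, action)
    # Tier 2: exact lookup of a failure that was resolved before.
    if signal in incidents:
        return Resolution(2, incidents[signal])
    # Tier 3: nothing matched, so fall back to the diagnostic model.
    return Resolution(3, ai_diagnose(signal))
```

Trying the cheapest check first means the AI call is only reached when neither a rule nor a stored incident matches, which is consistent with the "novel failure" wording in the Tier 3 row.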
+
+### Watchdog Components
+- **HealthWatcher thread** — pings proxy `/health` every 5 seconds, detects crashes and hangs
+- **LogAnalyzer thread** — tails `cc-debug.log` for 18 failure signal patterns in real time
+- **Tier 1 rule engine** — 14 rules covering: proxy crash restart, port conflict resolution, upstream retry with backoff, schema cache clearing, rate limit handling, stream error recovery
+- **Tier 2 incident store** — JSON pattern database (`~/.cache/codex-proxy/incident-store.json`) with success rates, learns from every resolved incident
+- **Tier 3 AI diagnostic agent** — calls a user-configured provider/model (e.g., Gemini Flash, GPT-4o-mini, local Ollama) to diagnose novel failures. Cost: ~$0.10-1.50/month
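One way a JSON incident store with per-action success rates might look is sketched below. The on-disk schema (`signature -> action -> attempts/successes`) and the names `IncidentStore`, `record`, and `best_action` are assumptions made for illustration; the changelog does not describe the real file layout.

```python
import json
from pathlib import Path
from typing import Optional


class IncidentStore:
    """Tiny JSON-backed store: signature -> action -> attempt/success counts."""

    def __init__(self, path):
        self.path = Path(path)
        # Load existing data if the file is present, otherwise start empty.
        self.data = json.loads(self.path.read_text()) if self.path.exists() else {}

    def record(self, signature: str, action: str, success: bool) -> None:
        # Update the counters for this (signature, action) pair and persist them.
        stats = self.data.setdefault(signature, {}).setdefault(
            action, {"attempts": 0, "successes": 0}
        )
        stats["attempts"] += 1
        stats["successes"] += int(success)
        self.path.parent.mkdir(parents=True, exist_ok=True)
        self.path.write_text(json.dumps(self.data, indent=2))

    def best_action(self, signature: str) -> Optional[str]:
        # Return the action with the highest success rate, or None if unseen.
        actions = self.data.get(signature)
        if not actions:
            return None
        return max(
            actions,
            key=lambda a: actions[a]["successes"] / actions[a]["attempts"],
        )
```

Because every call to `record` rewrites the file, a second process (or a restarted watchdog) that opens the same path sees the accumulated history.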
+
+### Failure Catalog: 30 Fault Types
+- **Category A** (7): Proxy crash, port conflict, memory leak, deadlock, SSL error, DNS failure, unhandled exception
+- **Category B** (10): Rate limit (429), server error (5xx), auth failure (401/403), CC upgrade required, timeout, connection reset, broken pipe, bad request, provider overloaded, Cloudflare block
+- **Category C** (10): Parser empty, stuck recovery, sanitizer flags, double-wrapped cmd, suspicious cmd, empty cmd, bare JSON token, bash without cmd, DSML name mismatch, stuck model loop
+- **Category D** (6): Codex process killed, memory explosion, 300s stall, config corruption, context overflow, WebSocket reconnect loop
+- **Category E** (5): Schema cache corruption, stale PID file, port from old session, OAuth token expired, BGP all routes down
+
+### Safety Guards
+- Rate-limited AI calls: max 1 per 60s, max 10/day
+- Restart cap: max 5 proxy restarts per 10 minutes
+- Cooldown per pattern (30s → 60s → 300s → alert user)
+- Monthly AI budget cap (configurable, default $2/month)
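The restart cap and the escalating cooldown listed above can be illustrated with a sliding window and a small lookup. The sketch uses the numbers from the list (5 restarts per 10 minutes; 30s, 60s, 300s, then alert), but the names `RestartCap`, `allow`, and `next_cooldown` are hypothetical and the real guard logic may differ.

```python
from collections import deque
from typing import Optional


class RestartCap:
    """Allow at most max_restarts within any window_s-second sliding window."""

    def __init__(self, max_restarts: int = 5, window_s: float = 600.0):
        self.max_restarts = max_restarts
        self.window_s = window_s
        self._times = deque()  # timestamps of restarts that were allowed

    def allow(self, now: float) -> bool:
        # Forget restarts that have aged out of the window.
        while self._times and now - self._times[0] >= self.window_s:
            self._times.popleft()
        if len(self._times) >= self.max_restarts:
            return False
        self._times.append(now)
        return True


COOLDOWNS_S = (30, 60, 300)


def next_cooldown(repeats: int) -> Optional[int]:
    """Cooldown in seconds after `repeats` earlier hits of the same pattern.

    Returns None once the ladder is exhausted, meaning: alert the user.
    """
    if repeats < len(COOLDOWNS_S):
        return COOLDOWNS_S[repeats]
    return None
```

Passing `now` in explicitly (for example from `time.monotonic()`) keeps the cap easy to test without sleeping.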
+
+### Enhanced /health Endpoint
+The proxy's `/health` endpoint now returns `uptime_s`, `memory_mb`, and `requests_total` for watchdog monitoring.
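A watchdog-side reader for these fields might look like the sketch below. It assumes the endpoint returns a JSON object, which the changelog does not state explicitly; `parse_health`, `fetch_health`, and the `base_url` parameter are hypothetical names, and only standard-library calls are used.

```python
import json
import urllib.request

REQUIRED_FIELDS = ("uptime_s", "memory_mb", "requests_total")


def parse_health(body: str) -> dict:
    # Decode the response and keep only the fields the watchdog relies on.
    payload = json.loads(body)
    missing = [f for f in REQUIRED_FIELDS if f not in payload]
    if missing:
        raise ValueError(f"missing health fields: {missing}")
    return {f: payload[f] for f in REQUIRED_FIELDS}


def fetch_health(base_url: str, timeout_s: float = 2.0) -> dict:
    # base_url is whatever host:port the proxy listens on (not given in the changelog).
    with urllib.request.urlopen(f"{base_url}/health", timeout=timeout_s) as resp:
        return parse_health(resp.read().decode("utf-8"))
```

A timeout or connection error raised by `fetch_health` would be the signal for a crash or hang, while a successful response lets the caller inspect `memory_mb` or `uptime_s` over time.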
+
+### GUI Integration
+- **"AI Monitor" button** in header bar
+- **AIMonitoringWindow**: ON/OFF toggle, provider URL/model/API key selector, health check interval, auto-restart toggle, incident log viewer
+- Watchdog starts automatically when enabled
+- All actions logged to `~/.cache/codex-proxy/monitoring.log`
+
+### AI Monitoring Design Spec
+Full design document at `AI-MONITORING-DESIGN.md` — architecture diagrams, decision flow, safety guards, implementation plan.
+
 ## v3.7.0 (2026-05-22)

 **Intelligence Routing — Self-Healing Parser System**