docs: update CHANGELOG, README, GUI changelog for v3.8.0 AI Monitoring

- CHANGELOG.md: full v3.8.0 section with 3-tier system, 30 fault types, safety guards
- README.md: AI Monitoring badge, features section, Phase 9 dev journey, troubleshooting rows
- GUI CHANGELOG: v3.8.0 entry with 9 bullet points
This commit is contained in:
admin
2026-05-22 23:22:26 +04:00
Unverified
parent 080f2bc56e
commit 78f28bd3ac
3 changed files with 115 additions and 0 deletions

View File

@@ -34,6 +34,7 @@
<img src="https://img.shields.io/badge/Tool_Calls-✓-success" />
<img src="https://img.shields.io/badge/AI_Assist-✓-success" />
<img src="https://img.shields.io/badge/Intelligence_Routing-✓-success" />
<img src="https://img.shields.io/badge/AI_Monitoring-✓-success" />
<img src="https://img.shields.io/badge/Self_Revive_Watchdog-✓-success" />
</p>
@@ -144,6 +145,19 @@ A three-component system:
- **Session URL memory** — `_last_user_urls` deque (20 entries) tracks URLs from user messages across the session, giving the synthesizer context to work with
- **54 self-test patterns** — comprehensive coverage of all three layers
### AI Monitoring (v3.8.0)
- **Self-healing watchdog** — the proxy auto-recovers from crashes, the model getting stuck, upstream failures, and more
- **Three-tier response system**: Tier 1 = rule-based (< 1s), Tier 2 = pattern lookup (< 100ms), Tier 3 = AI diagnostic agent (2-5s)
- **HealthWatcher thread** — pings proxy `/health` every 5 seconds, auto-restarts on crash
- **LogAnalyzer thread** — tails debug logs for 18 failure signal patterns in real-time
- **14 Tier 1 rules** — restart proxy, clear schema cache, kill stale processes, retry with backoff, rate limit handling
- **Incident pattern store** — learns from every resolved incident, looks up known fixes by success rate
- **AI diagnostic agent** — user-configurable provider/model (e.g., Gemini Flash, GPT-4o-mini, local Ollama) for diagnosing novel failures
- **30 fault types** catalogued across 5 categories: proxy failures (A), upstream errors (B), parser failures (C), Codex process failures (D), config/state failures (E)
- **Safety guards** — rate-limited AI calls, restart caps (5/10min), cooldown per pattern, monthly budget cap
- **GUI panel** — ON/OFF toggle, provider/model/API key selector, health check interval, auto-restart toggle, incident log viewer
- **Enhanced `/health`** — returns `uptime_s`, `memory_mb`, `requests_total` for monitoring
### GTK Launcher (`codex-launcher-gui`)
- **Endpoint manager** — add, edit, delete, set default providers
- **Provider presets** — one-click setup for 15+ providers with pre-filled URLs and model lists
@@ -415,6 +429,46 @@ The system also maintains a **session URL memory** (`_last_user_urls`, a deque o
**Test coverage:** 54 self-test patterns (up from 41), with 13 new tests specifically for Intelligence Routing layers.
### Phase 9: AI Monitoring — The Watchman That Never Sleeps
**Problem:** Intelligence Routing (Phase 8) handles failures *inside a single request*. But it can't detect a dead proxy process, reconnect Codex to a restarted proxy, switch to a backup provider when the primary is down, or clear corrupt caches. When the proxy crashes at 3 AM, the user wakes up to a broken Codex session and has to manually restart everything.
**The insight:** We needed a separate watchdog process that runs *outside* the proxy — monitoring it from the outside, like a night watchman patrolling a building. But a dumb watchdog that just restarts on crash is crude. What if the watchdog could *think* — diagnose *why* the proxy crashed and take the right corrective action?
**The Three-Tier Response System:**
```
Failure Detected
├── Tier 1: Known pattern? → Rule-based fix (< 1 second)
│ "proxy dead" → restart_proxy
│ "429 rate limit" → wait_retry_after
│ "schema corrupt" → delete_provider_caps
├── Tier 2: Seen this before? → Incident store lookup (< 100ms)
│ 85% success rate → reuse the fix that worked last time
└── Tier 3: Novel failure? → AI diagnostic agent (2-5 seconds)
Feed context to cheap LLM → get recommended action
Learn from result for next time
```
**What makes this different from existing solutions:**
Existing proxy tools (ccLoad, cc-proxy, codex-pool) all focus on routing and failover at the *request* level. None have an AI-powered diagnostic agent that analyzes failure context and recommends corrective actions. ccLoad has health checks and cooldowns, but it's purely rule-based. AI Monitoring adds the *intelligence* layer on top — the Tier 3 agent can diagnose novel failures that no rule covers.
**How it works:**
Two threads run in the GUI process:
1. `HealthWatcher` — pings `/health` every 5 seconds. On 3 consecutive failures, triggers Tier 1 `restart_proxy`.
2. `LogAnalyzer` — tails the debug log file, watching for 18 signal patterns. Counts consecutive failures per category. When a threshold is hit (e.g., 5x stuck recovery, 3x server error), triggers the appropriate tier.
The AI diagnostic agent (Tier 3) is fully configurable — the user picks any provider and model. A cheap model like Gemini Flash (~$0.0002/call) or a free local Ollama instance works perfectly. The agent receives a structured incident report (proxy health, upstream status, recent errors, parser state) and responds with one JSON action.
**Learning over time:** Every resolved incident is stored in `incident-store.json` with pattern → fix → success rate. Over time, the system shifts from Tier 3 (expensive AI calls) to Tier 2 (instant pattern lookup). A failure seen 10 times with 90% success rate will never reach the AI again.
**Catalogued 30 fault types** across 5 categories based on analysis of 42 production `parsed_tool_calls=0` events, 13 stuck recoveries, and 11 sanitizer flags from our actual debug logs. The system knows exactly what to look for.
---
## Architecture Deep Dive
@@ -548,6 +602,10 @@ README.md # This file
| CC explore_agent can't find URL | URL hidden inside JSON messages | V3.7 Layer 1 drills into JSON to extract URLs |
| CC agent stalls on escalation blocks | `<require_escalation>` not handled | V3.7 Layer 2 auto-proceeds past escalation requests |
| CC agent stalls — no tool calls at all | Model output format unrecognized | V3.7 Layer 3 synthesizes command from text intent |
| Proxy crashes mid-session | Unhandled streaming error | V3.8 AI Monitor auto-restarts proxy |
| Proxy port conflict on restart | Stale process holding port | V3.8 AI Monitor kills stale + restarts |
| Schema cache corruption | ErrorAnalyzer learned wrong schema | V3.8 AI Monitor auto-clears provider-caps.json |
| Upstream 500 repeatedly | Provider having issues | V3.8 AI Monitor detects pattern + alerts/switches |
---