docs: update CHANGELOG, README, GUI changelog for v3.8.0 AI Monitoring
- CHANGELOG.md: full v3.8.0 section with 3-tier system, 30 fault types, safety guards - README.md: AI Monitoring badge, features section, Phase 9 dev journey, troubleshooting rows - GUI CHANGELOG: v3.8.0 entry with 9 bullet points
This commit is contained in:
58
README.md
58
README.md
@@ -34,6 +34,7 @@
|
||||
<img src="https://img.shields.io/badge/Tool_Calls-✓-success" />
|
||||
<img src="https://img.shields.io/badge/AI_Assist-✓-success" />
|
||||
<img src="https://img.shields.io/badge/Intelligence_Routing-✓-success" />
|
||||
<img src="https://img.shields.io/badge/AI_Monitoring-✓-success" />
|
||||
<img src="https://img.shields.io/badge/Self_Revive_Watchdog-✓-success" />
|
||||
</p>
|
||||
|
||||
@@ -144,6 +145,19 @@ A three-component system:
|
||||
- **Session URL memory** — `_last_user_urls` deque (20 entries) tracks URLs from user messages across the session, giving the synthesizer context to work with
|
||||
- **54 self-test patterns** — comprehensive coverage of all three layers
|
||||
|
||||
### AI Monitoring (v3.8.0)
|
||||
- **Self-healing watchdog** — the proxy auto-recovers from crashes, the model getting stuck, upstream failures, and more
|
||||
- **Three-tier response system**: Tier 1 = rule-based (< 1s), Tier 2 = pattern lookup (< 100ms), Tier 3 = AI diagnostic agent (2-5s)
|
||||
- **HealthWatcher thread** — pings proxy `/health` every 5 seconds, auto-restarts on crash
|
||||
- **LogAnalyzer thread** — tails debug logs for 18 failure signal patterns in real-time
|
||||
- **14 Tier 1 rules** — restart proxy, clear schema cache, kill stale processes, retry with backoff, rate limit handling
|
||||
- **Incident pattern store** — learns from every resolved incident, looks up known fixes by success rate
|
||||
- **AI diagnostic agent** — user-configurable provider/model (e.g., Gemini Flash, GPT-4o-mini, local Ollama) for diagnosing novel failures
|
||||
- **30 fault types** catalogued across 5 categories: proxy failures (A), upstream errors (B), parser failures (C), Codex process failures (D), config/state failures (E)
|
||||
- **Safety guards** — rate-limited AI calls, restart caps (5/10min), cooldown per pattern, monthly budget cap
|
||||
- **GUI panel** — ON/OFF toggle, provider/model/API key selector, health check interval, auto-restart toggle, incident log viewer
|
||||
- **Enhanced `/health`** — returns `uptime_s`, `memory_mb`, `requests_total` for monitoring
|
||||
|
||||
### GTK Launcher (`codex-launcher-gui`)
|
||||
- **Endpoint manager** — add, edit, delete, set default providers
|
||||
- **Provider presets** — one-click setup for 15+ providers with pre-filled URLs and model lists
|
||||
@@ -415,6 +429,46 @@ The system also maintains a **session URL memory** (`_last_user_urls`, a deque o
|
||||
|
||||
**Test coverage:** 54 self-test patterns (up from 41), with 13 new tests specifically for Intelligence Routing layers.
|
||||
|
||||
### Phase 9: AI Monitoring — The Watchman That Never Sleeps
|
||||
|
||||
**Problem:** Intelligence Routing (Phase 8) handles failures *inside a single request*. But it can't detect a dead proxy process, reconnect Codex to a restarted proxy, switch to a backup provider when the primary is down, or clear corrupt caches. When the proxy crashes at 3 AM, the user wakes up to a broken Codex session and has to manually restart everything.
|
||||
|
||||
**The insight:** We needed a separate watchdog process that runs *outside* the proxy — monitoring it from the outside, like a night watchman patrolling a building. But a dumb watchdog that just restarts on crash is crude. What if the watchdog could *think* — diagnose *why* the proxy crashed and take the right corrective action?
|
||||
|
||||
**The Three-Tier Response System:**
|
||||
|
||||
```
|
||||
Failure Detected
|
||||
│
|
||||
├── Tier 1: Known pattern? → Rule-based fix (< 1 second)
|
||||
│ "proxy dead" → restart_proxy
|
||||
│ "429 rate limit" → wait_retry_after
|
||||
│ "schema corrupt" → delete_provider_caps
|
||||
│
|
||||
├── Tier 2: Seen this before? → Incident store lookup (< 100ms)
|
||||
│ 85% success rate → reuse the fix that worked last time
|
||||
│
|
||||
└── Tier 3: Novel failure? → AI diagnostic agent (2-5 seconds)
|
||||
Feed context to cheap LLM → get recommended action
|
||||
Learn from result for next time
|
||||
```
|
||||
|
||||
**What makes this different from existing solutions:**
|
||||
|
||||
Existing proxy tools (ccLoad, cc-proxy, codex-pool) all focus on routing and failover at the *request* level. None have an AI-powered diagnostic agent that analyzes failure context and recommends corrective actions. ccLoad has health checks and cooldowns, but it's purely rule-based. AI Monitoring adds the *intelligence* layer on top — the Tier 3 agent can diagnose novel failures that no rule covers.
|
||||
|
||||
**How it works:**
|
||||
|
||||
Two threads run in the GUI process:
|
||||
1. `HealthWatcher` — pings `/health` every 5 seconds. On 3 consecutive failures, triggers Tier 1 `restart_proxy`.
|
||||
2. `LogAnalyzer` — tails the debug log file, watching for 18 signal patterns. Counts consecutive failures per category. When a threshold is hit (e.g., 5x stuck recovery, 3x server error), triggers the appropriate tier.
|
||||
|
||||
The AI diagnostic agent (Tier 3) is fully configurable — the user picks any provider and model. A cheap model like Gemini Flash (~$0.0002/call) or a free local Ollama instance works perfectly. The agent receives a structured incident report (proxy health, upstream status, recent errors, parser state) and responds with one JSON action.
|
||||
|
||||
**Learning over time:** Every resolved incident is stored in `incident-store.json` with pattern → fix → success rate. Over time, the system shifts from Tier 3 (expensive AI calls) to Tier 2 (instant pattern lookup). A failure seen 10 times with 90% success rate will never reach the AI again.
|
||||
|
||||
**Catalogued 30 fault types** across 5 categories based on analysis of 42 production `parsed_tool_calls=0` events, 13 stuck recoveries, and 11 sanitizer flags from our actual debug logs. The system knows exactly what to look for.
|
||||
|
||||
---
|
||||
|
||||
## Architecture Deep Dive
|
||||
@@ -548,6 +602,10 @@ README.md # This file
|
||||
| CC explore_agent can't find URL | URL hidden inside JSON messages | V3.7 Layer 1 drills into JSON to extract URLs |
|
||||
| CC agent stalls on escalation blocks | `<require_escalation>` not handled | V3.7 Layer 2 auto-proceeds past escalation requests |
|
||||
| CC agent stalls — no tool calls at all | Model output format unrecognized | V3.7 Layer 3 synthesizes command from text intent |
|
||||
| Proxy crashes mid-session | Unhandled streaming error | V3.8 AI Monitor auto-restarts proxy |
|
||||
| Proxy port conflict on restart | Stale process holding port | V3.8 AI Monitor kills stale + restarts |
|
||||
| Schema cache corruption | ErrorAnalyzer learned wrong schema | V3.8 AI Monitor auto-clears provider-caps.json |
|
||||
| Upstream 500 repeatedly | Provider having issues | V3.8 AI Monitor detects pattern + alerts/switches |
|
||||
|
||||
---
|
||||
|
||||
|
||||
Reference in New Issue
Block a user