v3.8.0: AI Monitoring — self-healing watchdog with 3-tier response system
- HealthWatcher thread: monitors proxy /health every 5s
- LogAnalyzer thread: tails cc-debug.log for 18 failure signal patterns
- Tier 1 rule engine: 14 rules for instant auto-recovery (< 1s)
- Tier 2 incident store: JSON pattern database with success rates
- Tier 3 AI diagnostic agent: calls configurable provider/model for novel failures
- AIMonitoringWindow GUI: ON/OFF toggle, provider/model/API key selector, incident log
- 30 fault types catalogued across 5 categories (A-E)
- Enhanced /health endpoint with memory_mb, uptime_s, requests_total
- Auto-restart proxy, auto-clear schema cache, kill stale processes
- Safety: rate-limited AI calls, restart caps, cooldowns per pattern
- AI Monitoring design spec (AI-MONITORING-DESIGN.md)
- 54 self-test patterns passing
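The HealthWatcher described in the message (poll /health every 5s, escalate on failure) could look roughly like the sketch below. This is not the code from this commit: the class layout, the `max_failures` threshold, the `on_unhealthy` callback, and the probe URL are all assumptions made for illustration.

```python
import json
import threading
import urllib.request
from typing import Callable


def http_probe(url: str = "http://127.0.0.1:8080/health", timeout: float = 2.0) -> bool:
    """Return True if the endpoint answers with {"ok": true}. The URL is a placeholder."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp).get("ok") is True


class HealthWatcher(threading.Thread):
    """Polls a health probe on a fixed interval (hypothetical sketch)."""

    def __init__(self, probe: Callable[[], bool],
                 on_unhealthy: Callable[[int], None],
                 interval_s: float = 5.0, max_failures: int = 3):
        super().__init__(daemon=True)
        self.probe = probe
        self.on_unhealthy = on_unhealthy
        self.interval_s = interval_s
        self.max_failures = max_failures  # assumed threshold, not from the commit
        self.failures = 0
        self._stop_event = threading.Event()

    def check_once(self) -> bool:
        """Run one probe; escalate after max_failures consecutive failures."""
        try:
            ok = bool(self.probe())
        except Exception:
            ok = False  # connection errors and timeouts count as failures
        if ok:
            self.failures = 0
        else:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.on_unhealthy(self.failures)
                self.failures = 0  # start counting again after escalating
        return ok

    def run(self) -> None:
        # Event.wait() returns True once stop() is called, which ends the loop.
        while not self._stop_event.wait(self.interval_s):
            self.check_once()

    def stop(self) -> None:
        self._stop_event.set()
```

Passing the probe in as a callable keeps the polling logic testable without a running proxy; in real use something like `HealthWatcher(http_probe, restart_proxy).start()` would wire it up, where `restart_proxy` is again a hypothetical name.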
@@ -3410,10 +3410,20 @@ class Handler(http.server.BaseHTTPRequestHandler):
         if self.path in ("/v1/models", "/models"):
             self.send_json(200, {"object": "list", "data": MODELS})
         elif self.path in ("/health", "/v1/health"):
+            import resource as _res
+            _mem_mb = 0
+            try:
+                _mem_mb = _res.getrusage(_res.RUSAGE_SELF).ru_maxrss / 1024
+            except Exception:
+                pass
+            _uptime = time.time() - _START_TIME if '_START_TIME' in dir() else 0
             self.send_json(200, {"ok": True, "backend": BACKEND,
                                  "target_url": TARGET_URL,
                                  "models": [m.get("id") for m in MODELS],
-                                 "bgp_routes": len(BGP_ROUTES)})
+                                 "bgp_routes": len(BGP_ROUTES),
+                                 "uptime_s": round(_uptime, 1),
+                                 "memory_mb": round(_mem_mb, 1),
+                                 "requests_total": _STATS.get("requests", 0)})
         else:
             self.send_error(404)
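A few details in the /health hunk may be worth a second look. `dir()` called without arguments inside a method lists local names, so the `'_START_TIME' in dir()` check would not see a module-level `_START_TIME`. `ru_maxrss` is a peak value, reported in kilobytes on Linux but in bytes on macOS. The `resource` import sits outside the `try` and is unavailable on Windows. The following is a minimal sketch of how the three new fields could be computed more defensively; `peak_memory_mb` and `health_fields` are hypothetical helper names, and the `_START_TIME` and `_STATS` globals here are stand-ins for the proxy's own.

```python
import sys
import time

try:
    import resource  # POSIX only; not available on Windows
except ImportError:
    resource = None

# Stand-ins for the proxy's module-level state (assumed shapes).
_START_TIME = time.time()
_STATS = {"requests": 0}


def peak_memory_mb() -> float:
    """Peak resident set size in MB, or 0.0 where it cannot be read."""
    if resource is None:
        return 0.0
    rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # Linux reports kilobytes, macOS reports bytes.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return rss / divisor


def health_fields() -> dict:
    """Build the extra /health fields added in this commit."""
    # Look the start time up in module globals rather than local scope.
    start = globals().get("_START_TIME")
    uptime = time.time() - start if start is not None else 0.0
    return {
        "uptime_s": round(uptime, 1),
        "memory_mb": round(peak_memory_mb(), 1),
        "requests_total": _STATS.get("requests", 0),
    }
```

A handler could then merge the result into the existing payload, for example `{"ok": True, **health_fields()}`.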
@@ -4750,10 +4760,11 @@ def _handle_shutdown_signal(sig, frame):
     _SHUTDOWN_REQUESTED = True
     print(f"[SELF-REVIVE] Signal {sig} received, shutting down cleanly", flush=True)
     if 'SERVER' in globals() and SERVER:
-        SERVER.shutdown()
-
+        SERVER.shutdown()
+
 def main():
-    global SERVER
+    global SERVER, _START_TIME
+    _START_TIME = time.time()
     _init_runtime()
     signal.signal(signal.SIGTERM, _handle_shutdown_signal)
     signal.signal(signal.SIGINT, _handle_shutdown_signal)
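One caveat about the shutdown handler: `socketserver`'s `shutdown()` blocks until `serve_forever()` returns, and Python runs signal handlers in the main thread. If the proxy also runs `serve_forever()` in the main thread (the diff does not show whether it does), calling `SERVER.shutdown()` directly from the handler could deadlock. A minimal sketch of one way around that, using a hypothetical `make_shutdown_handler` factory and a plain dict in place of the `_SHUTDOWN_REQUESTED` global:

```python
import http.server
import threading


def make_shutdown_handler(server, state):
    """Return a signal handler that stops `server` without blocking the caller."""
    def handler(sig, frame):
        state["shutdown_requested"] = True
        print(f"[SELF-REVIVE] Signal {sig} received, shutting down cleanly", flush=True)
        # shutdown() waits for serve_forever() to exit, so run it in a helper
        # thread instead of the thread that received the signal.
        threading.Thread(target=server.shutdown, daemon=True).start()
    return handler


def demo() -> bool:
    """Start a throwaway server, trigger the handler, and report whether it stopped."""
    server = http.server.HTTPServer(("127.0.0.1", 0), http.server.BaseHTTPRequestHandler)
    state = {"shutdown_requested": False}
    worker = threading.Thread(target=server.serve_forever, daemon=True)
    worker.start()
    make_shutdown_handler(server, state)(15, None)  # simulate SIGTERM
    worker.join(timeout=5)
    server.server_close()
    return state["shutdown_requested"] and not worker.is_alive()
```

In the proxy itself the handler would be registered with `signal.signal(signal.SIGTERM, make_shutdown_handler(SERVER, state))`, assuming `SERVER` has been created by then.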