v3.8.0: AI Monitoring — self-healing watchdog with 3-tier response system

- HealthWatcher thread: monitors proxy /health every 5s
- LogAnalyzer thread: tails cc-debug.log for 18 failure signal patterns
- Tier 1 rule engine: 14 rules for instant auto-recovery (< 1s)
- Tier 2 incident store: JSON pattern database with success rates
- Tier 3 AI diagnostic agent: calls configurable provider/model for novel failures
- AIMonitoringWindow GUI: ON/OFF toggle, provider/model/API key selector, incident log
- 30 fault types catalogued across 5 categories (A-E)
- Enhanced /health endpoint with memory_mb, uptime_s, requests_total
- Auto-restart proxy, auto-clear schema cache, kill stale processes
- Safety: rate-limited AI calls, restart caps, cooldowns per pattern
- AI Monitoring design spec (AI-MONITORING-DESIGN.md)
- 54 self-test patterns passing
docs: AI Monitoring design spec v3.8.0 — self-healing watchdog with 3-tier response system
2026-05-22 22:36:16 +04:00 · 2026-05-22 22:22:30 +04:00 · 2026-05-22 16:35:08 +04:00 · 2026-05-22 16:29:45 +04:00 · 2026-05-22 16:09:51 +04:00 · 2026-05-22 13:26:03 +04:00
9 changed files with 1932 additions and 74 deletions
--- a/AI-MONITORING-DESIGN.md
+++ b/AI-MONITORING-DESIGN.md
@@ -0,0 +1,638 @@
+# AI Monitoring — Design Specification
+
+> **Codex Launcher v3.8.0 Feature Design**
+> Self-healing nano agent that monitors proxy health, diagnoses failures, and auto-recovers sessions.
+
+---
+
+## 1. Problem Statement
+
+Over 42 sessions in production, we observed these failure categories:
+
+| # | Failure Category | Count | Example |
+|---|-----------------|-------|---------|
+| F1 | **parsed_tool_calls=0** — model produces unparseable output | 42 | Bare `<explore_agent>`, `<bash>` without cmd, plain English intent |
+| F2 | **Stuck recovery triggered** — Intelligence Routing Layer 3 | 13 | "I need to fetch the README", "let me write the script" |
+| F3 | **Sanitizer flagged suspicious cmd** — cmd still JSON after unwrap | 11 | `{/'cmd/': /'sshpass -p .../'}` — double-escaped quoting |
+| F4 | **Upstream 500** — provider internal error | ~5 | `"An internal error occurred. Please try again later."` |
+| F5 | **Connection timeout** — upstream unreachable | ~3 | `Connection timed out after 15002 milliseconds` |
+| F6 | **Upstream 401/403** — auth failure | ~2 | Wrong API key, expired token, `upgrade_required` |
+| F7 | **Stream crash** — exception mid-stream | ~2 | `BrokenPipeError`, `ConnectionResetError` during SSE |
+| F8 | **Proxy port conflict** — Address already in use | ~1 | Stale process holding port |
+| F9 | **Schema cache corruption** — stale content_type=array | ~1 | `ErrorAnalyzer` learned wrong schema |
+| F10 | **Codex Desktop crash** — SIGKILL at ~27GB | ~1 | Issue #24048 — unbounded tool output memory |
+| F11 | **Codex 300s stall** — turn state machine race | ~1 | Issue #23807 — `stream disconnected` after 300s |
+
+### The Gap
+
+Intelligence Routing (v3.7.0) handles F1/F2/F3 **inside a single request**. But it can't:
+
+- **Detect a dead proxy process** (F7/F8) — the proxy already crashed
+- **Reconnect Codex to a restarted proxy** (F5/F7/F8) — Codex doesn't auto-reconnect
+- **Switch to a backup provider** when the primary is down (F4/F5)
+- **Clear corrupt caches** (F9) — requires out-of-band action
+- **Restart Codex Desktop** after a crash (F10/F11)
+- **Learn from failure patterns** across sessions — each failure is handled independently
+
+### What We Need
+
+A **separate lightweight watchdog** (a thread in the launcher GUI, outside the proxy process) that:
+1. Monitors proxy health continuously
+2. Detects failures the proxy can't detect itself
+3. Uses a cheap AI model to diagnose novel failures
+4. Takes corrective action automatically
+5. Learns from past incidents to prevent repeats
+
+---
+
+## 2. Architecture
+
+```
+┌─────────────────────────────────────────────────────────────────────┐
+│                        Codex Launcher GUI                            │
+│  ┌──────────┐  ┌──────────────┐  ┌───────────────────────────────┐ │
+│  │  Proxy   │  │   Codex      │  │   AI Monitoring Panel         │ │
+│  │  Manager │  │   Launcher   │  │   ┌─────────────────────┐     │ │
+│  │          │  │              │  │   │ ON/OFF Toggle        │     │ │
+│  └────┬─────┘  └──────┬───────┘  │   │ Provider Selector    │     │ │
+│       │               │          │   │ Model Selector        │     │ │
+│       │               │          │   │ Incident Log          │     │ │
+│       │               │          │   │ [View Diagnostics]    │     │ │
+│       │               │          │   └─────────────────────┘     │ │
+│       │               │          └───────────────────────────────┘ │
+└───────┼───────────────┼────────────────────────────────────────────┘
+        │               │
+        ▼               ▼
+┌───────────────┐  ┌────────────────┐
+│ translate-    │  │  Codex Desktop  │
+│ proxy.py      │  │  / CLI          │
+│ (port 8080)   │  │                 │
+│               │  │                 │
+│ /health ──────┼──┼─► health check  │
+│ /responses ───┼──┼─► main API      │
+└───────────────┘  └────────────────┘
+        ▲
+        │ health probes + log analysis + corrective actions
+        │
+┌───────┴────────────────────────────────────────────────────────────┐
+│                     AI Monitor Watchdog                             │
+│                    (thread in codex-launcher-gui)                   │
+│                                                                     │
+│  ┌─────────────────┐  ┌─────────────────┐  ┌──────────────────┐  │
+│  │  Health Watcher  │  │  Log Analyzer   │  │  AI Diagnostic   │  │
+│  │  (every 5s)      │  │  (continuous)    │  │  Agent (on-call) │  │
+│  │                  │  │                  │  │                  │  │
+│  │  - /health probe │  │  - tail cc-debug │  │  - Classify err  │  │
+│  │  - process alive │  │  - tail proxy.log│  │  - Root cause    │  │
+│  │  - port check    │  │  - pattern match │  │  - Suggest fix   │  │
+│  │  - memory watch  │  │  - incident DB   │  │  - Execute fix   │  │
+│  └────────┬────────┘  └────────┬────────┘  └────────┬─────────┘  │
+│           │                    │                     │             │
+│           └────────────────────┼─────────────────────┘             │
+│                                ▼                                   │
+│                    ┌──────────────────────┐                        │
+│                    │  Incident Store      │                        │
+│                    │  (JSON file)         │                        │
+│                    │  - Known patterns    │                        │
+│                    │  - Past resolutions  │                        │
+│                    │  - Success rates     │                        │
+│                    └──────────────────────┘                        │
+└─────────────────────────────────────────────────────────────────────┘
+```
+
+---
+
+## 3. Three-Tier Response System
+
+### Tier 1: Fast Path — Rule-Based Auto-Recovery (< 1 second)
+
+Immediate reactions to **known failure patterns**. No AI needed.
+
+```python
+TIER1_RULES = [
+    # (trigger_pattern, action, cooldown_seconds)
+    
+    # --- Proxy Health ---
+    ("proxy_health_fail",      "restart_proxy",           30),
+    ("proxy_port_conflict",    "kill_stale + restart",     60),
+    ("proxy_memory_over_1gb",  "restart_proxy",           120),
+    
+    # --- Upstream Errors ---
+    ("upstream_429",           "wait_retry_after",          0),
+    ("upstream_502_503",       "retry_with_backoff",       30),
+    ("upstream_500_repeat_3x", "switch_provider",          60),
+    ("upstream_timeout",       "retry + increase_timeout", 30),
+    ("upstream_401_403",       "alert_user_bad_key",        0),
+    
+    # --- Stream Errors ---
+    ("stream_broken_pipe",     "restart_proxy",            30),
+    ("stream_reset",           "restart_proxy",            30),
+    ("stream_idle_300s",       "restart_proxy",            60),
+    
+    # --- Parser Failures ---
+    ("parsed_tool_calls_0_x3", "clear_schema_cache",      300),
+    ("sanitizer_suspicious_5x","alert_user_model_issue",    0),
+    ("stuck_recovery_x5",      "suggest_switch_model",      0),
+    
+    # --- Codex Process ---
+    ("codex_process_dead",     "alert_user_restart",         0),
+    ("codex_memory_over_4gb",  "alert_user_memory",          0),
+    
+    # --- Cache Corruption ---
+    ("schema_content_type_array", "delete_provider_caps",     0),
+]
+```
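A minimal sketch of how a rule table like the one above could be dispatched with per-trigger cooldowns. The `RuleEngine` class, its `fire` method, and the two-entry `RULES` dict are hypothetical names for illustration; the spec does not show the actual engine.

```python
import time

# Hypothetical subset of the rule table: trigger -> (action, cooldown_seconds)
RULES = {
    "proxy_health_fail": ("restart_proxy", 30),
    "upstream_429": ("wait_retry_after", 0),
}


class RuleEngine:
    """Maps a trigger to its action, suppressing repeats inside the cooldown."""

    def __init__(self, rules, clock=time.monotonic):
        self.rules = rules
        self.clock = clock        # injectable so the cooldown can be tested
        self.last_fired = {}      # trigger -> timestamp of last accepted fire

    def fire(self, trigger):
        """Return the action for `trigger`, or None if unknown or cooling down."""
        rule = self.rules.get(trigger)
        if rule is None:
            return None           # not a Tier 1 failure; caller falls through
        action, cooldown = rule
        now = self.clock()
        last = self.last_fired.get(trigger)
        if last is not None and now - last < cooldown:
            return None           # same trigger fired too recently
        self.last_fired[trigger] = now
        return action
```

Returning `None` for both "unknown trigger" and "cooling down" keeps the caller simple; a real engine would probably distinguish the two so that only unknown triggers fall through to Tier 2.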
+
+### Tier 2: Pattern Matching — Incident Store Lookup (< 100ms)
+
+For failures we've **seen before and resolved**, look up the fix:
+
+```json
+{
+  "incidents": [
+    {
+      "pattern": "cc_stream_ended_empty + explore_agent + no_url",
+      "fix": "synth_explore_from_last_user_urls",
+      "source": "FIX-23",
+      "success_rate": 0.85,
+      "last_seen": "2026-05-22T16:00:00Z",
+      "occurrences": 5
+    },
+    {
+      "pattern": "require_escalation + no_cmd",
+      "fix": "auto_proceed_echo",
+      "source": "FIX-24",
+      "success_rate": 1.0,
+      "last_seen": "2026-05-22T15:30:00Z",
+      "occurrences": 3
+    }
+  ]
+}
+```
+
+### Tier 3: AI Diagnostic — Nano Agent (2-5 seconds)
+
+For **novel failures** that don't match any rule or pattern, invoke a cheap AI model:
+
+```
+Prompt Template (system):
+─────────────────────
+You are a diagnostic agent for a translation proxy that sits between
+OpenAI Codex CLI/Desktop and AI providers (Command Code, OpenAI-compat,
+Anthropic, etc.). You analyze error context and suggest ONE corrective action.
+
+Available actions: restart_proxy, kill_stale_processes, clear_schema_cache,
+switch_provider, increase_timeout, alert_user, ignore, retry_now,
+regenerate_config, cleanup_codex_stale
+
+Respond with ONLY a JSON object: {"action": "...", "reason": "...", "confidence": 0.0-1.0}
+
+Prompt Template (user):
+─────────────────────
+INCIDENT REPORT:
+Time: {timestamp}
+Session: {session_id}
+Proxy health: {alive/dead, port, uptime, memory_mb}
+Upstream: {url, model, last_http_code, last_error}
+Recent errors (last 60s):
+{log_lines}
+Parser state: {parsed_tool_calls, stuck_recovery_count, sanitizer_flags}
+Provider: {backend_type, model}
+History: {last_5_incidents_for_this_pattern}
+
+What corrective action should be taken?
+```
+
+---
+
+## 4. Complete Failure Catalog
+
+### Category A: Proxy-Level Failures (watchdog detects, auto-recovers)
+
+| ID | Failure | Symptoms | Tier 1 Action | Log Signature |
+|----|---------|----------|---------------|---------------|
+| A1 | Proxy process crashed | `/health` returns connection refused | `restart_proxy` | `urllib.error.URLError: [Errno 111] Connection refused` |
+| A2 | Port conflict | `Address already in use` on startup | `kill_stale + restart` | `OSError: [Errno 98] Address already in use` |
+| A3 | Memory leak | Process RSS > 1GB | `restart_proxy` | `/proc/{pid}/status` VmRSS check |
+| A4 | Deadlock | Health check hangs > 15s | `restart_proxy` | health probe timeout |
+| A5 | Unhandled exception | Process exits with non-zero | `restart_proxy` | `SELF-REVIVE CRASH #{n}` |
+| A6 | SSL/TLS error | `CERTIFICATE_VERIFY_FAILED` upstream | `alert_user` | `urllib.error.URLError: certificate verify failed` |
+| A7 | DNS resolution failure | `getaddrinfo failed` | `retry_with_backoff` | `socket.gaierror: Name or service not known` |
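The A3 memory check reads `VmRSS` from `/proc/{pid}/status`. A small sketch of that read, assuming a Linux host; `parse_vmrss_mb` and `read_rss_mb` are hypothetical helper names, not functions from the launcher.

```python
def parse_vmrss_mb(status_text):
    """Return resident set size in MB from /proc/<pid>/status text, or None."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            # The line looks like "VmRSS:    123456 kB"
            parts = line.split()
            return int(parts[1]) / 1024
    return None  # some processes (kernel threads, zombies) have no VmRSS line


def read_rss_mb(pid):
    """Read RSS for a live process; None if it is gone or unreadable."""
    try:
        with open(f"/proc/{pid}/status", "r") as fh:
            return parse_vmrss_mb(fh.read())
    except (FileNotFoundError, ProcessLookupError, PermissionError):
        return None
```

With this shape, the A3 rule reduces to comparing `read_rss_mb(pid)` against the 1024 MB limit, and a `None` result can double as a hint that the process has exited.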
+
+### Category B: Upstream Provider Failures (proxy detects, watchdog analyzes)
+
+| ID | Failure | Symptoms | Tier 1 Action | Log Signature |
+|----|---------|----------|---------------|---------------|
+| B1 | Rate limit (429) | Too many requests | `wait_retry_after` | `HTTP 429` + `Retry-After` header |
+| B2 | Server error (5xx) | Provider down | `retry_with_backoff` | `HTTP 500/502/503` |
+| B3 | Auth failure (401/403) | Bad/expired key | `alert_user_bad_key` | `HTTP 401 {"error":"invalid_api_key"}` |
+| B4 | CC upgrade required (403) | Version mismatch | `update_cc_version` | `HTTP 403 upgrade_required` |
+| B5 | Connection timeout | Upstream silent | `retry + increase_timeout` | `urllib.error.URLError: timed out` |
+| B6 | Connection reset | Upstream dropped mid-stream | `restart_proxy` | `ConnectionResetError: Connection reset by peer` |
+| B7 | Broken pipe | Client disconnected | `ignore` | `BrokenPipeError: Broken pipe` |
+| B8 | Upstream 400 bad request | Malformed request | `clear_schema_cache` | `HTTP 400 {"error":"...expected string..."}` |
+| B9 | Provider capacity (503) | Overloaded | `switch_provider` | `HTTP 503` after 3 retries |
+| B10 | Cloudflare block (403/1010) | Bot detection | `check_browser_ua` | `HTTP 403 error 1010` |
+
+### Category C: Parser/Format Failures (Intelligence Routing handles, watchdog tracks)
+
+| ID | Failure | Symptoms | Auto-Fix (IR Layer) | Watchdog Escalation |
+|----|---------|----------|--------------------|--------------------|
+| C1 | Bare `<explore_agent>` | `parsed_tool_calls=0` | Layer 1: URL extraction | If 3x in a row → suggest model switch |
+| C2 | `<require_escalation>` block | Model wants permissions | Layer 2: Auto-proceed | If 5x → suggest different provider |
+| C3 | Unrecognized format | No parser matches | Layer 3: Intent synthesis | If 5x → log for AI diagnosis |
+| C4 | Double-wrapped cmd | `cmd = "{\"cmd\": ...}"` | Sanitizer: unwrap | If cmd still JSON → alert |
+| C5 | Suspicious cmd (JSON) | `cmd starts with {` | Sanitizer: flag | If 3x → clear cache + restart |
+| C6 | Empty cmd | `cmd = ""` or `cmd = "{}"` | Sanitizer: diagnostic echo | If 3x → suggest model switch |
+| C7 | Bare `{` token | Model outputs incomplete JSON | Layer 3: heuristic 5 | If persistent → AI diagnosis |
+| C8 | `<bash>` without cmd | Block has sandbox but no command | Layer 3: heuristic | If 3x → AI diagnosis |
+| C9 | DSML name mismatch | `name="cmd"` vs `name="command"` | DSML parser handles both | Self-test catches regression |
+| C10 | Stuck model loop | Same recovery 5+ times | Layer 3 max 3x then alert | Switch model or provider |
+
+### Category D: Codex Process Failures (watchdog detects, alerts user)
+
+| ID | Failure | Symptoms | Action | Log Signature |
+|----|---------|----------|--------|---------------|
+| D1 | Codex process killed | PID gone from pids.json | `alert_user_restart` | Process not in `/proc/{pid}` |
+| D2 | Codex memory explosion | RSS > 4GB | `alert_user_memory` | `/proc/{pid}/status` check |
+| D3 | Codex 300s stall | `stream disconnected` loop | `restart_proxy` | Codex stderr: `stream disconnected` |
+| D4 | Config corruption | `database disk image is malformed` | `regenerate_config` | Codex stderr: `malformed` |
+| D5 | Session context overflow | `context_length_exceeded` | `alert_user_context` | Codex stderr: `context_length_exceeded` |
+| D6 | WebSocket reconnect loop | `Reconnecting... N/5` | `check_proxy_health` | Codex stderr: `Reconnecting` |
+
+### Category E: Config/State Failures (watchdog detects, auto-fixes)
+
+| ID | Failure | Symptoms | Action | Detection |
+|----|---------|----------|--------|-----------|
+| E1 | Schema cache corruption | `content_type: "array"` in provider-caps.json | `delete_provider_caps` | Read file, check for known-bad values |
+| E2 | Stale PID file | pids.json has dead PIDs | `cleanup_pids` | Check `/proc/{pid}` existence |
+| E3 | Port from old session | config.toml has stale port | `regenerate_config` | Port in config != running port |
+| E4 | OAuth token expired | Google/Gemini token refresh fails | `alert_user_reauth` | Token file `expiry_ts < now` |
+| E5 | BGP all routes down | Every route returned error | `alert_user_no_provider` | All routes in cooldown |
+
+---
+
+## 5. Component Design
+
+### 5.1 Health Watcher Thread
+
+Runs in the GUI process as a background thread. Pings proxy `/health` endpoint every 5 seconds.
+
+```python
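+import threading
+import time
+import urllib.request
+
+
+class HealthWatcher(threading.Thread):
+    """Polls the proxy /health endpoint and reports failure and recovery."""
+
+    def __init__(self, proxy_port, on_failure, on_recovery):
+        super().__init__(daemon=True)
+        self.proxy_port = proxy_port
+        self.on_failure = on_failure
+        self.on_recovery = on_recovery
+        self.check_interval = 5  # seconds between probes
+        self.failures = 0        # consecutive failed probes
+        self.running = True      # set to False to stop the loop
+
+    def run(self):
+        while self.running:
+            healthy = self._check_health()
+            if healthy:
+                # Signal recovery only if a failure was actually reported.
+                if self.failures >= 3:
+                    self.on_recovery()
+                self.failures = 0
+            else:
+                self.failures += 1
+                # Report only after 3 consecutive failed probes (about 15s
+                # at the default interval) to ride out transient blips.
+                if self.failures >= 3:
+                    self.on_failure(self.failures)
+            time.sleep(self.check_interval)
+
+    def _check_health(self):
+        """Return True only if /health answers 200 within the timeout."""
+        url = f"http://localhost:{self.proxy_port}/health"
+        try:
+            # The context manager closes the connection after each probe.
+            with urllib.request.urlopen(url, timeout=5) as resp:
+                return resp.status == 200
+        except Exception:
+            return False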
+```
+
+### 5.2 Log Analyzer Thread
+
+Tails the debug log and extracts failure signals in real-time.
+
+```python
+import threading
+import time
+
+FAILURE_SIGNALS = {
+    "parsed_tool_calls=0":      ("C1", "parser_empty"),
+    "[STUCK-RECOVERY]":         ("C3", "stuck_recovery"),
+    "suspicious cmd":           ("C4", "sanitizer_flag"),
+    "empty cmd recovered":      ("C6", "empty_cmd"),
+    "HTTP 429":                 ("B1", "rate_limited"),
+    "HTTP 500":                 ("B2", "server_error"),
+    "HTTP 401":                 ("B3", "auth_failure"),
+    "HTTP 403":                 ("B4", "forbidden"),
+    "Connection refused":       ("A1", "proxy_dead"),
+    "Address already in use":   ("A2", "port_conflict"),
+    "Broken pipe":              ("B7", "broken_pipe"),
+    "Connection reset":         ("B6", "connection_reset"),
+    "timed out":                ("B5", "timeout"),
+    "SELF-REVIVE CRASH":        ("A5", "proxy_crash"),
+    "stream error":             ("B6", "stream_error"),
+}
+
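+class LogAnalyzer(threading.Thread):
+    """Tails a log file and reports the first matching failure signal per line."""
+
+    def __init__(self, log_path, on_signal):
+        super().__init__(daemon=True)
+        self.log_path = log_path
+        self.on_signal = on_signal
+        self.running = True  # set to False to stop the loop
+
+    def run(self):
+        # The context manager closes the file when the loop exits.
+        with open(self.log_path, "r", errors="replace") as fh:
+            fh.seek(0, 2)  # seek to end: only new lines are analyzed
+            while self.running:
+                line = fh.readline()
+                if not line:
+                    time.sleep(0.5)  # no new data yet
+                    continue
+                # First match wins, so dict order decides ties.
+                for pattern, (fault_id, category) in FAILURE_SIGNALS.items():
+                    if pattern in line:
+                        self.on_signal(fault_id, category, line.strip())
+                        break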
+```
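Tier 1 triggers such as `upstream_500_repeat_3x` and escalations like "3x in a row" need repeat counting on top of the single-line signals emitted here. One possible shape, which could be driven from the `on_signal` callback; `ConsecutiveCounter` is a hypothetical helper and the threshold of 3 is just the value used in the examples of this spec.

```python
class ConsecutiveCounter:
    """Counts how many times in a row the same signal category was seen."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.last_category = None
        self.count = 0

    def observe(self, category):
        """Record one signal; return True when the streak reaches the threshold."""
        if category == self.last_category:
            self.count += 1
        else:
            # A different category breaks the streak and starts a new one.
            self.last_category = category
            self.count = 1
        return self.count >= self.threshold

    def reset(self):
        """Call after a successful request so old failures do not accumulate."""
        self.last_category = None
        self.count = 0
```

This counts strictly consecutive signals. If "3x" is meant as "3 within some time window" instead, a timestamped queue would be the better fit.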
+
+### 5.3 AI Diagnostic Agent
+
+Invoked by the watchdog when a failure doesn't match Tier 1 rules or Tier 2 patterns.
+
+```python
+import json
+import urllib.request
+
+
+class AIDiagnosticAgent:
+    def __init__(self, provider_url, model, api_key):
+        self.provider_url = provider_url
+        self.model = model
+        self.api_key = api_key
+        self.system_prompt = DIAGNOSTIC_SYSTEM_PROMPT  # defined in section 5.5
+        self.incident_store = IncidentStore()
+
+    def diagnose(self, context):
+        """Return (action, source_tier, success_rate_or_None)."""
+        # Tier 2: Check incident store first
+        pattern = self._extract_pattern(context)
+        known_fix = self.incident_store.lookup(pattern)
+        if known_fix and known_fix["success_rate"] > 0.7:
+            return known_fix["fix"], "tier2_pattern", known_fix["success_rate"]
+
+        # Tier 3: Ask AI
+        prompt = self._build_prompt(context)
+        response = self._call_model(prompt)
+        action = self._parse_response(response)
+
+        # Learn from this incident
+        if action:
+            self.incident_store.record(pattern, action)
+
+        return action, "tier3_ai", None
+
+    def _call_model(self, prompt):
+        """POST an OpenAI-style chat completion and return the reply text."""
+        body = {
+            "model": self.model,
+            "messages": [
+                {"role": "system", "content": self.system_prompt},
+                {"role": "user", "content": prompt}
+            ],
+            "max_tokens": 200,
+            "temperature": 0.1,
+        }
+        req = urllib.request.Request(
+            self.provider_url,
+            data=json.dumps(body).encode(),
+            headers={
+                "Content-Type": "application/json",
+                "Authorization": f"Bearer {self.api_key}",
+            }
+        )
+        # Close the connection once the body has been read.
+        with urllib.request.urlopen(req, timeout=15) as resp:
+            return json.loads(resp.read())["choices"][0]["message"]["content"]
+```
+
+### 5.4 Incident Store
+
+JSON file that accumulates failure patterns and their resolutions.
+
+```json
+{
+  "version": 1,
+  "incidents": {
+    "parser_empty+explore_agent": {
+      "fault_ids": ["C1"],
+      "fix": "synth_explore_from_urls",
+      "source": "intelligent_routing",
+      "success_count": 8,
+      "fail_count": 1,
+      "last_seen": "2026-05-22T16:00:00Z",
+      "auto_applied": true
+    },
+    "server_error+repeat_3x": {
+      "fault_ids": ["B2"],
+      "fix": "switch_provider",
+      "source": "tier1_rule",
+      "success_count": 2,
+      "fail_count": 0,
+      "last_seen": "2026-05-22T14:00:00Z",
+      "auto_applied": true
+    }
+  },
+  "ai_diagnostic_calls": 0,
+  "tokens_used": 0,
+  "cost_usd": 0.0
+}
+```
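The agent code in 5.3 reads a `success_rate`, while the file format above stores `success_count` and `fail_count`. A sketch of an in-memory store that derives one from the other and applies the 500-entry LRU limit from the Safety Guards section. The class body and the `record(pattern, fix, succeeded)` signature are assumptions made for illustration; persistence to the JSON file is left out.

```python
from collections import OrderedDict


class IncidentStore:
    """In-memory incident table with success tracking and LRU eviction."""

    def __init__(self, max_entries=500):
        self.max_entries = max_entries
        self.incidents = OrderedDict()   # pattern -> record, oldest first

    def record(self, pattern, fix, succeeded):
        """Add or update a pattern and count the outcome of applying `fix`."""
        rec = self.incidents.pop(pattern, None)
        if rec is None or rec["fix"] != fix:
            # New pattern, or a different fix: start counting from zero.
            rec = {"fix": fix, "success_count": 0, "fail_count": 0}
        rec["success_count" if succeeded else "fail_count"] += 1
        self.incidents[pattern] = rec    # re-insert as most recently used
        while len(self.incidents) > self.max_entries:
            self.incidents.popitem(last=False)   # evict least recently used

    def lookup(self, pattern):
        """Return the record plus a derived success_rate, or None if unknown."""
        rec = self.incidents.get(pattern)
        if rec is None:
            return None
        self.incidents.move_to_end(pattern)
        total = rec["success_count"] + rec["fail_count"]
        rate = rec["success_count"] / total if total else 0.0
        return {**rec, "success_rate": rate}
```

For the first example entry above (8 successes, 1 failure) this gives a rate of about 0.89, which clears the 0.7 threshold used in `diagnose`.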
+
+### 5.5 Diagnostic Agent System Prompt
+
+```
+You are a diagnostic agent for "Codex Launcher" — a desktop app that runs a local
+translation proxy between OpenAI Codex CLI/Desktop and various AI providers.
+
+## Your Job
+Analyze the incident report and recommend ONE corrective action.
+
+## Available Actions
+- restart_proxy: Kill and restart translate-proxy.py
+- kill_stale_processes: Kill orphaned proxy/codex processes
+- clear_schema_cache: Delete ~/.cache/codex-proxy/provider-caps.json
+- switch_provider: Switch to a different configured endpoint
+- increase_timeout: Increase upstream timeout for slow providers
+- regenerate_config: Regenerate Codex config.toml
+- cleanup_codex_stale: Run cleanup-codex-stale.sh
+- alert_user: Show notification to user (can't auto-fix)
+- ignore: Transient error, no action needed
+- retry_now: Immediate retry without changes
+
+## Decision Rules
+- If upstream returns 401/403 with auth error → alert_user (can't fix bad keys)
+- If proxy process is dead → restart_proxy
+- If same error repeated 5+ times → switch_provider or alert_user
+- If error is about content_type/schema → clear_schema_cache
+- If "Address already in use" → kill_stale_processes then restart_proxy
+- If timeout and upstream is slow → increase_timeout
+- If single transient 429/502/503 → ignore (retry handles it)
+- If "stream disconnected" and proxy is healthy → ignore (Codex retries)
+
+## Response Format
+Reply with ONLY a JSON object:
+{"action": "...", "reason": "...", "confidence": 0.0-1.0}
+
+No explanation, no markdown, no extra text.
+```
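Since the model is asked for a bare JSON object, the reply still has to be validated before anything is executed. A defensive sketch of that step: `parse_diagnosis` is a hypothetical stand-in for the `_parse_response` helper that 5.3 calls but does not show, and the `min_confidence` default of 0.5 is an assumed value, not one taken from the spec.

```python
import json

# The action names listed in the system prompt above.
ALLOWED_ACTIONS = {
    "restart_proxy", "kill_stale_processes", "clear_schema_cache",
    "switch_provider", "increase_timeout", "regenerate_config",
    "cleanup_codex_stale", "alert_user", "ignore", "retry_now",
}


def parse_diagnosis(text, min_confidence=0.5):
    """Return a validated action name; fall back to alert_user on any doubt."""
    try:
        data = json.loads(text)
    except (TypeError, ValueError):
        return "alert_user"              # reply was not JSON at all
    if not isinstance(data, dict):
        return "alert_user"
    action = data.get("action")
    confidence = data.get("confidence")
    if not isinstance(action, str) or action not in ALLOWED_ACTIONS:
        return "alert_user"              # model invented an action
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        return "alert_user"              # missing or non-numeric confidence
    if confidence < min_confidence:
        return "alert_user"              # too unsure to act automatically
    return action
```

Falling back to `alert_user` in every doubtful case matches the intent of the Safety Guards: an unparseable or low-confidence answer never triggers an automatic action.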
+
+---
+
+## 6. GUI Integration
+
+### AI Monitoring Panel (in Settings tab)
+
+```
+┌─────────────────────────────────────────────────────────┐
+│  AI Monitoring                                    [ON]  │
+│                                                          │
+│  ┌─ Diagnostic Agent ─────────────────────────────────┐ │
+│  │ Provider: [OpenCode Zen          ▼]                │ │
+│  │ Model:    [Qwen3-32B              ▼]                │ │
+│  │ API Key:  [sk-•••••••••••••••••••• ]                │ │
+│  │                                                     │ │
+│  │ Cost this month: $0.12 (3 diagnostic calls)         │ │
+│  │ Tokens used: 1,847 input / 423 output               │ │
+│  └─────────────────────────────────────────────────────┘ │
+│                                                          │
+│  ┌─ Incident Log (last 7 days) ──────────────────────┐  │
+│  │ ✅ 16:00 C1 parser_empty → synth_explore (Tier 2) │  │
+│  │ ⚠️ 15:30 B2 server_error → retry (Tier 1)         │  │
+│  │ ✅ 15:00 A1 proxy_dead → restart_proxy (Tier 1)    │  │
+│  │ 🤖 14:30 C3 novel_format → clear_cache (Tier 3)   │  │
+│  │ ...                                               │  │
+│  └────────────────────────────────────────────────────┘  │
+│                                                          │
+│  [View Full Diagnostics]  [Export Incident Report]       │
+└─────────────────────────────────────────────────────────┘
+```
+
+### Config Storage (in endpoints.json)
+
+```json
+{
+  "ai_monitoring": {
+    "enabled": true,
+    "provider_url": "https://opencode.ai/zen/v1/chat/completions",
+    "model": "Qwen/Qwen3-32B",
+    "api_key": "sk-...",
+    "tier1_enabled": true,
+    "tier2_enabled": true,
+    "tier3_enabled": true,
+    "auto_restart_proxy": true,
+    "auto_switch_provider": false,
+    "health_check_interval_s": 5,
+    "max_memory_mb": 1024,
+    "notification_level": "important_only"
+  }
+}
+```
+
+### Recommended Models (by cost)
+
+| Model | Cost/Diagnosis | Latency | Quality | Recommended For |
+|-------|---------------|---------|---------|----------------|
+| **Qwen3-32B** (OpenCode) | ~$0.0005 | 2-4s | Good | Default: low cost, decent quality |
+| **DeepSeek V4 Flash** | ~$0.0003 | 2-3s | Good | Low-cost alternative |
+| **GPT-4o-mini** | ~$0.001 | 1-2s | Excellent | Best quality/latency |
+| **Gemini 2.0 Flash** | ~$0.0002 | 1-2s | Good | Cheapest + fastest |
+| **Claude Haiku 4.5** | ~$0.001 | 2-3s | Excellent | Best reasoning quality |
+| **Local Ollama** (if running) | $0 | 5-15s | Varies | Zero-cost offline option |
+
+### Cost Estimate
+
+- Average diagnostic prompt: ~800 tokens input, ~100 tokens output
+- Expected frequency: ~1-5 incidents per day that reach Tier 3
+- **Monthly cost**: $0.10 - $1.50 depending on model and usage
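As a quick arithmetic check, the per-diagnosis figures from the table can be multiplied out. This is a sketch under the assumption of a flat cost per call and a 30-day month; `monthly_cost` is a hypothetical helper. With the table's paid models the result lands at or below the low end of the quoted range, so the $0.10 - $1.50 estimate presumably includes headroom for longer prompts or pricier models (an inference, not something the spec states).

```python
def monthly_cost(cost_per_diagnosis, incidents_per_day, days=30):
    """Monthly spend in USD for a flat per-diagnosis cost."""
    return cost_per_diagnosis * incidents_per_day * days


# Cheapest paid table entry at the lowest expected frequency.
low = monthly_cost(0.0002, 1)
# Most expensive table entry at the highest expected frequency.
high = monthly_cost(0.001, 5)
# Same model at the ceiling of 10 Tier 3 calls per day (Safety Guards).
capped = monthly_cost(0.001, 10)

print(f"{low:.3f} {high:.2f} {capped:.2f}")  # prints: 0.006 0.15 0.30
```

Even at the daily call ceiling, the most expensive listed model stays well under the default $2/month cost cap.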
+
+---
+
+## 7. Watchdog Response Flow
+
+```
+Failure Detected
+      │
+      ▼
+┌─────────────┐    YES    ┌──────────────────┐
+│ Tier 1 Rule? ├─────────►│ Execute Action    │
+│ (known)      │           │ Log incident      │
+└──────┬───────┘           └──────────────────┘
+       │ NO
+       ▼
+┌─────────────┐    YES    ┌──────────────────┐
+│ Tier 2 Match?├─────────►│ Apply Known Fix   │
+│ (incident DB)│           │ Update success    │
+└──────┬───────┘           └──────────────────┘
+       │ NO
+       ▼
+┌─────────────┐   YES     ┌──────────────────┐
+│ AI Enabled?  ├─────────►│ Collect Context   │
+│ (Tier 3)     │           │ Build Prompt      │
+└──────┬───────┘           │ Call AI Model     │
+       │ NO                │ Parse Response    │
+       ▼                   │ Execute if auto   │
+┌─────────────┐           │ Store incident    │
+│ Alert User   │           └──────────────────┘
+│ (can't fix)  │
+└─────────────┘
+```
+
+---
+
+## 8. Safety Guards
+
+1. **Rate limit AI calls** — max 1 Tier 3 call per 60 seconds, max 10 per day
+2. **Never auto-execute destructive actions** — `alert_user` for: delete files, change API keys, modify source code
+3. **Auto-restart cap** — max 5 proxy restarts per 10 minutes, then alert user
+4. **Cost cap** — monthly AI diagnostic budget (configurable, default $2/month)
+5. **Cooldown per pattern** — same failure pattern has escalating cooldown (30s → 60s → 300s → alert)
+6. **User override** — any auto-action can be cancelled within 3 seconds via GUI
+7. **Incident store max size** — 500 entries, LRU eviction
+8. **Health check bypass** — if user manually stopped proxy, don't alert
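Guards 1 and 3 are both "at most N events per time window" limits, so one small helper could cover them. A sketch, assuming rolling windows (the spec does not say whether "per day" means a calendar day or a rolling 24 hours); `SlidingWindowLimiter` is a hypothetical name.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Allows at most `max_events` within any `window_s`-second window."""

    def __init__(self, max_events, window_s, clock=time.monotonic):
        self.max_events = max_events
        self.window_s = window_s
        self.clock = clock               # injectable for testing
        self.events = deque()            # timestamps of allowed events

    def allow(self):
        """Return True and record the event if it fits, else False."""
        now = self.clock()
        while self.events and now - self.events[0] >= self.window_s:
            self.events.popleft()        # drop events that left the window
        if len(self.events) >= self.max_events:
            return False
        self.events.append(now)
        return True


# Guard 3: at most 5 proxy restarts per 10 minutes.
restart_cap = SlidingWindowLimiter(max_events=5, window_s=600)
# Guard 1: at most 1 Tier 3 call per 60 seconds and 10 per day.
ai_per_minute = SlidingWindowLimiter(max_events=1, window_s=60)
ai_per_day = SlidingWindowLimiter(max_events=10, window_s=86400)
```

When two limiters guard the same action, as with the Tier 3 calls, the caller has to take care that a denial by one does not still consume a slot in the other.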
+
+---
+
+## 9. Implementation Plan
+
+### Phase 1: Core Watchdog (v3.8.0)
+- `HealthWatcher` thread in `codex-launcher-gui`
+- `LogAnalyzer` thread tailing `cc-debug.log` and `proxy.log`
+- Tier 1 rule engine with the rules from Section 3
+- Incident store (JSON file)
+- GUI toggle (ON/OFF) in settings
+- Auto-restart proxy on crash
+
+### Phase 2: Pattern Learning (v3.8.1)
+- Tier 2 incident store lookup
+- Auto-learn from Intelligence Routing outcomes
+- Success rate tracking per pattern
+- Incident log viewer in GUI
+
+### Phase 3: AI Diagnostic Agent (v3.9.0)
+- Tier 3 AI model integration
+- Provider/model selector in GUI
+- Diagnostic prompt template
+- Cost tracking
+- Full incident report export
+
+### Phase 4: Advanced Recovery (v4.0.0)
+- Auto-switch to backup provider on repeated failure
+- BGP route health monitoring
+- Predictive failure detection (memory growth, latency trends)
+- Codex process memory monitoring
+- WebSocket reconnect assistance
+
+---
+
+## 10. File Changes Summary
+
+| File | Changes |
+|------|---------|
+| `codex-launcher-gui` | +HealthWatcher thread, +LogAnalyzer thread, +AI Monitoring panel, +incident log viewer |
+| `translate-proxy.py` | +`/monitoring` endpoint (returns health + metrics), enhanced `/health` with memory/uptime |
+| `~/.cache/codex-proxy/incident-store.json` | New file — incident pattern database |
+| `~/.cache/codex-proxy/monitoring.log` | New file — watchdog activity log |
+| `~/.codex/endpoints.json` | +`ai_monitoring` config section |
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -1,5 +1,48 @@
 # Changelog

+## v3.7.0 (2026-05-22)
+
+**Intelligence Routing — Self-Healing Parser System**
+
+When the Command Code model produces output in unpredictable or unrecognized formats, the multi-format parser chain (DSML, XML, explore_agent, bash blocks, raw JSON, fallback regex) can return empty. This causes the Codex agent loop to stall — zero tool calls means nothing to execute.
+
+Intelligence Routing is a **three-layer self-healing system** that ensures the agent loop always continues:
+
+### Layer 1: Deep URL Extraction (FIX 23)
+- **Problem**: `<explore_agent>` body contained `messages: [{"content": "https://..."}]`, with URLs hidden inside JSON values. The URL regex could not extract them cleanly because its exclusion set lacked the `"` character that terminates JSON strings.
+- **Solution**: `_build_explore_cmd()` extracted to module level (was a closure inside `_parse_commandcode_text_tool_calls`). After initial regex fails, tries `json.loads()`, iterates list items, extracts `content` field to find URLs. Added `"` to regex exclusion set.
+- **Self-tests**: Pattern M, O, O2 verify URL extraction from nested JSON.
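The Layer 1 fallback described above (regex first, then `json.loads()` over list items and their `content` fields) might look roughly like this. It is a sketch, not the body of `_build_explore_cmd()`: `extract_urls` and `URL_RE` are hypothetical names, and the exact character class is an assumption.

```python
import json
import re

# Stops at whitespace, quotes, and angle brackets so that a URL inside a
# JSON string does not swallow the closing quote. Hypothetical pattern.
URL_RE = re.compile(r"https?://[^\s\"'<>]+")


def extract_urls(body):
    """Find URLs in raw text, else inside `content` fields of parsed JSON."""
    urls = URL_RE.findall(body)
    if urls:
        return urls
    try:
        data = json.loads(body)          # decodes escapes such as "\/"
    except ValueError:
        return []
    items = data if isinstance(data, list) else [data]
    for item in items:
        if isinstance(item, dict) and isinstance(item.get("content"), str):
            urls.extend(URL_RE.findall(item["content"]))
    return urls
```

The JSON pass matters for bodies where the URL is escaped, for example `https:\/\/example.com`, which a regex over the raw text cannot see.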
+
+### Layer 2: Escalation Block Handling (FIX 24)
+- **Problem**: Model produces `<require_escalation>` and `<request_escalation_permission>` blocks when it wants elevated permissions. CC adapter doesn't support escalation — blocks silently dropped → `parsed_tool_calls=0` → stall.
+- **Solution**: Two handlers:
+  - FIX 24a: Closed-tag blocks — extracts URL if present and runs explore command; otherwise echoes auto-proceed.
+  - FIX 24b: Bare/unclosed tags (`<require_escalation />`) — auto-proceeds with diagnostic echo.
+- **Self-tests**: Pattern N, N2 verify both closed and bare escalation blocks.
+
+### Layer 3: Intent-Based Command Synthesis (FIX 25 — THE CORE)
+- **Problem**: After ALL parsers return empty, the agent loop has zero tool calls. Model may have written plain English ("I need to fetch the README"), partial JSON, or completely unrecognized formats.
+- **Solution**: 5-heuristic synthesis chain in `cc_stream_to_sse()`, run when `parsed_tool_calls=0` and text has content:
+  1. **URL in text** → `curl` to fetch it
+  2. **File path reference** ("read the file /path/to/X") → `cat` or `ls` that file
+  3. **Shell command in backticks/quotes** → extract and run it
+  4. **"explore"/"fetch"/"investigate"/"repository" intent** + last user URL → `_build_explore_cmd()` with `_last_user_urls` deque
+  5. **"I need to"/"let me"/"please" intent text** → echo diagnostic with the intent
+- The system NEVER returns empty tool calls when there's text to analyze.
+- **Self-tests**: Patterns M-O2 cover the full pipeline.
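The five heuristics above form a first-match-wins chain. A simplified sketch of that ordering follows; it is not the code in `cc_stream_to_sse()`. `synthesize_command`, the regexes, and the use of plain `curl` in place of `_build_explore_cmd()` are all assumptions made for illustration.

```python
import re
import shlex

URL_RE = re.compile(r"https?://[^\s\"'<>]+")
# An absolute path with at least two segments, not part of a longer word.
PATH_RE = re.compile(r"(?<![\w:/])(/[\w.\-]+(?:/[\w.\-]+)+)")
BACKTICK_RE = re.compile(r"`([^`\n]+)`")
EXPLORE_RE = re.compile(r"\b(explore|fetch|investigate|repository)\b", re.I)


def synthesize_command(text, last_user_urls=()):
    """Return a shell command for unparsed model text, or None if it is empty."""
    text = text.strip()
    if not text:
        return None
    m = URL_RE.search(text)
    if m:                                    # 1. URL in text
        return f"curl -sL {shlex.quote(m.group(0))}"
    m = PATH_RE.search(text)
    if m:                                    # 2. file path reference
        return f"cat {shlex.quote(m.group(1))}"
    m = BACKTICK_RE.search(text)
    if m:                                    # 3. shell command in backticks
        return m.group(1)
    if EXPLORE_RE.search(text) and last_user_urls:
        # 4. explore intent: fall back to the most recent user URL
        return f"curl -sL {shlex.quote(last_user_urls[-1])}"
    # 5. anything else: keep the loop alive with a diagnostic echo
    return "echo " + shlex.quote(f"[intent] {text[:120]}")
```

Because step 5 accepts any non-empty text, the function only returns `None` for empty input, which mirrors the "never returns empty tool calls" guarantee stated above.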
+
+### Architecture
+```
+_parse_commandcode_text_tool_calls()  ←  Layer 1 + Layer 2
+cc_stream_to_sse()                    ←  Layer 3 (after parser chain + fallback)
+_last_user_urls deque (maxlen=20)     ←  Session-wide URL memory for heuristic 4
+```
+
+### Test Coverage
+- **54 self-test patterns** (up from 41 in v3.6.0)
+- 13 new tests covering all three Intelligence Routing layers
+- Tests verify: nested JSON URL extraction, closed/bare escalation blocks, module-level explore command builder
+
 ## v3.6.0 (2026-05-22)

 **Performance & Stability Hardening — Connection Pooling, Stream Idle Timeouts, Retry-After**
--- a/README.md
+++ b/README.md
@@ -33,6 +33,7 @@
  <img src="https://img.shields.io/badge/Streaming_SSE-✓-success" />
  <img src="https://img.shields.io/badge/Tool_Calls-✓-success" />
  <img src="https://img.shields.io/badge/AI_Assist-✓-success" />
+  <img src="https://img.shields.io/badge/Intelligence_Routing-✓-success" />
  <img src="https://img.shields.io/badge/Self_Revive_Watchdog-✓-success" />
 </p>

@@ -130,6 +131,19 @@ A three-component system:
 - **ErrorAnalyzer** — learns from 4xx errors, retries with adjusted parameters (max 2 retries)
 - **Schema cache** with 24h staleness TTL for provider capabilities

+### Intelligence Routing (v3.7.0)
+- **Three-layer self-healing system** — the agent loop never stalls, even when the model speaks gibberish
+- **Layer 1 — Deep URL Extraction**: When `<explore_agent>` hides URLs inside nested JSON (`messages: [{"content": "https://..."}]`), the parser drills into the JSON structure to find them. Module-level `_build_explore_cmd()` is reused across parser + stream path.
+- **Layer 2 — Escalation Auto-Proceed**: `<require_escalation>` and `<request_escalation_permission>` blocks are detected and auto-resolved — the model doesn't get stuck waiting for permissions that don't exist.
+- **Layer 3 — Intent-Based Command Synthesis**: When ALL parsers fail, 5 heuristics analyze the model's plain-text output and synthesize a working command:
+  1. URL detected → `curl` it
+  2. File path mentioned → `cat` or `ls` it
+  3. Shell command in quotes → extract and run it
+  4. "explore"/"fetch" intent → use the last URL the user mentioned
+  5. "I need to"/"let me" intent → echo a diagnostic so the loop continues
+- **Session URL memory** — `_last_user_urls` deque (20 entries) tracks URLs from user messages across the session, giving the synthesizer context to work with
+- **54 self-test patterns** — comprehensive coverage of all three layers
+
 ### GTK Launcher (`codex-launcher-gui`)
 - **Endpoint manager** — add, edit, delete, set default providers
 - **Provider presets** — one-click setup for 15+ providers with pre-filled URLs and model lists
@@ -324,6 +338,83 @@ Built a cascading parser chain (`DSML → bash → explore → tool_call → XML

 **Verification:** `--self-test` flag runs 19 automated tests covering all edge cases. Debug logging to `~/.cache/codex-proxy/cc-debug.log` captures every parser decision for troubleshooting.

+### Phase 8: Intelligence Routing — When the Model Refuses to Speak Machine
+
+**Problem:** The 17-fix parser chain from Phase 7 was powerful — it could handle DSML, XML, JSON, bash blocks, explore tags, you name it. But there was one edge case it couldn't crack: **when the model doesn't produce a parseable tool-call format at all**.
+
+In production, `deepseek/deepseek-v4-flash` via Command Code kept doing things like:
+
+```
+<explore_agent>
+messages: [{"content": "Understand the Z.AI-Chat-for-Android repo at https://..."}]
+</explore_agent>
+```
+
+or:
+
+```
+<require_escalation>
+I need elevated permissions to access the repository.
+</require_escalation>
+```
+
+or just plain English: *"I need to fetch the README from the repository to understand the app structure."*
+
+In every case, `parsed_tool_calls=0`. No tool to execute. The Codex agent loop ground to a halt. The user saw "thinking..." forever.
+
+**The insight:** The model is trying to communicate *intent*, just not in a format we can parse. Instead of adding more regex patterns, what if we could **read the model's mind** — understand what it *wants* to do, and synthesize the command for it?
+
+**Intelligence Routing — Three Layers of Escalation:**
+
+```
+Layer 1: "Fix the input"     — Can we extract more from what the model gave us?
+Layer 2: "Handle the intent" — Is the model asking for something we can auto-resolve?
+Layer 3: "Read the mind"     — What is the model trying to do? Just do it for it.
+```
+
+**Layer 1 — Deep URL Extraction (FIX 23):**
+
+The `<explore_agent>` handler had a URL regex, but the URL was trapped inside `{"content": "https://..."}` — the trailing `"` broke matching. The fix: after the initial regex fails, `json.loads()` the entire block, walk the JSON tree, and pull URLs out of `content` fields. The `_build_explore_cmd()` function was extracted to module level so both the parser and the stream handler could use it.
+
+```python
+# Before: regex fails, URL lost
+# After: json.loads -> iterate items -> extract content -> find URL
+```
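A rough sketch of the JSON walk described above. The helper names (`extract_urls_deep`, `_walk_strings`) and the step that strips a leading `messages:` label are assumptions made for illustration. The sketch also tries JSON first and falls back to a regex, which may differ from the order used in the actual handler.

```python
import json
import re

# Illustrative URL pattern; stops at quotes and closing brackets.
URL_RE = re.compile(r"https?://[^\s\"'<>\]})]+")


def _walk_strings(node):
    """Yield every string value found anywhere in a decoded JSON structure."""
    if isinstance(node, str):
        yield node
    elif isinstance(node, dict):
        for value in node.values():
            yield from _walk_strings(value)
    elif isinstance(node, list):
        for item in node:
            yield from _walk_strings(item)


def extract_urls_deep(block):
    """Find URLs in an explore_agent body, including ones nested in JSON."""
    # Hypothetical pre-processing: drop a leading label such as `messages:`
    # so that the remainder can be decoded as JSON.
    payload = re.sub(r"^\s*\w+\s*:\s*", "", block.strip(), count=1)
    try:
        decoded = json.loads(payload)
    except json.JSONDecodeError:
        # Not JSON after all: fall back to scanning the raw text.
        return URL_RE.findall(block)
    urls = []
    for text in _walk_strings(decoded):
        urls.extend(URL_RE.findall(text))
    return urls
```

Walking every string value, rather than only `content` fields, keeps the sketch short; restricting the walk to `content` keys would match the description more literally.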
+
+**Layer 2 — Escalation Auto-Proceed (FIX 24):**
+
+`<require_escalation>` blocks are the model's way of saying "I need more permissions." The CC adapter doesn't have an escalation mechanism — these blocks were silently dropped. The fix: detect them (both closed `<tag>...</tag>` and bare `<tag />` forms), extract any URL inside them, and auto-proceed with an explore command or a diagnostic echo.
+
+```python
+# Model: <require_escalation>Please let me run curl</require_escalation>
+# Proxy: Okay, here's your curl command → exec_command synthesized
+```
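The detection step might look something like the sketch below. The function name `resolve_escalation`, the regexes, and the echo messages are hypothetical; only the two tag names and the closed/bare distinction come from the description above.

```python
import re
import shlex

ESCALATION_TAGS = ("require_escalation", "request_escalation_permission")
_TAG = "|".join(ESCALATION_TAGS)

# Closed form: <tag>...</tag>, with the body captured in group 2.
CLOSED_RE = re.compile(rf"<({_TAG})>(.*?)</\1>", re.S)
# Bare form: <tag />, <tag/>, or an unclosed <tag>.
BARE_RE = re.compile(rf"<(?:{_TAG})\s*/?>")
URL_RE = re.compile(r"https?://[^\s\"'<>]+")


def resolve_escalation(text):
    """Return a shell command for an escalation block, or None if absent."""
    closed = CLOSED_RE.search(text)
    if closed:
        url = URL_RE.search(closed.group(2))
        if url:
            # A URL inside the block: fetch it instead of waiting for permission.
            return "curl -sL " + shlex.quote(url.group(0))
        return "echo " + shlex.quote("escalation not supported; auto-proceeding")
    if BARE_RE.search(text):
        return "echo " + shlex.quote("escalation not supported; auto-proceeding (bare tag)")
    return None
```

Checking the closed form first means an unclosed opening tag only reaches the bare-tag branch when no matching close tag exists.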
+
+**Layer 3 — Intent-Based Command Synthesis (FIX 25):**
+
+The crown jewel. When ALL parsers return empty — no DSML, no XML, no JSON, no fallback regex matches — the system doesn't give up. It analyzes the model's raw text through **5 heuristic lenses** in priority order:
+
+| Priority | Signal | Synthesized Command |
+|:--------:|--------|---------------------|
+| 1 | URL in text | `curl` to fetch it |
+| 2 | File path reference | `cat` or `ls` the file |
+| 3 | Shell command in backticks/quotes | Extract and run it |
+| 4 | "explore"/"fetch" + last user URL | Full explore command |
+| 5 | "I need to"/"let me" intent | Echo diagnostic |
+
+The system also maintains a **session URL memory** (`_last_user_urls`, a deque of the last 20 URLs from user messages) so heuristic 4 has a URL to work with even when the model's text doesn't contain one, provided the user has mentioned at least one URL earlier in the session.
+
+```python
+# Model: "I should explore the repository to understand its structure."
+# Parser: empty (no parseable format)
+# Layer 3 heuristic 4: "explore" detected, pulling URL from session memory...
+# Result: exec_command with full curl pipeline
+```
+
+**The result:** Before Intelligence Routing, `parsed_tool_calls=0` meant **game over**: the agent loop stalled permanently. After Intelligence Routing, `parsed_tool_calls=0` triggers the self-healing chain, and the loop gets a tool call to execute whenever the model produced any text to analyze. The model can speak in tongues and the system still works.
+
+**Test coverage:** 54 self-test patterns (up from 41), with 13 new tests specifically for Intelligence Routing layers.
+
 ---

 ## Architecture Deep Dive
@@ -454,6 +545,9 @@ README.md                         # This file
 | CC tool calls have wrong args | Double-wrapped arguments | V3.5 three-tier parser + recursive unwrapping |
 | Proxy crashes mid-session | Unhandled streaming error | V3.5 self-revive watchdog auto-restarts |
 | CC 403 upgrade_required | Missing version header | V3.5 always sends `x-command-code-version` |
+| CC explore_agent can't find URL | URL hidden inside JSON messages | V3.7 Layer 1 drills into JSON to extract URLs |
+| CC agent stalls on escalation blocks | `<require_escalation>` not handled | V3.7 Layer 2 auto-proceeds past escalation requests |
+| CC agent stalls — no tool calls at all | Model output format unrecognized | V3.7 Layer 3 synthesizes command from text intent |

 ---

--- a/codex-launcher_3.6.0_all.deb
+++ b/codex-launcher_3.6.0_all.deb
--- a/codex-launcher_3.7.0_all.deb
+++ b/codex-launcher_3.7.0_all.deb
--- a/codex-launcher_3.8.0_all.deb
+++ b/codex-launcher_3.8.0_all.deb
--- a/install.sh
+++ b/install.sh
@@ -3,11 +3,11 @@ set -e

 SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"

-if [ -f "$SCRIPT_DIR/codex-launcher_3.6.0_all.deb" ]; then
-    echo "Installing codex-launcher_3.6.0_all.deb ..."
-    sudo dpkg -i "$SCRIPT_DIR/codex-launcher_3.6.0_all.deb"
+if [ -f "$SCRIPT_DIR/codex-launcher_3.8.0_all.deb" ]; then
+    echo "Installing codex-launcher_3.8.0_all.deb ..."
+    sudo dpkg -i "$SCRIPT_DIR/codex-launcher_3.8.0_all.deb"
    echo ""
-    echo "Installed v3.6.0 via .deb package."
+    echo "Installed v3.8.0 via .deb package."
    echo "  translate-proxy.py   -> /usr/bin/translate-proxy.py"
    echo "  codex-launcher-gui   -> /usr/bin/codex-launcher-gui"
    echo "  cleanup-codex-stale  -> /usr/bin/cleanup-codex-stale.sh"
--- a/src/codex-launcher-gui
+++ b/src/codex-launcher-gui
@@ -5,7 +5,7 @@ import gi
 gi.require_version("Gtk", "3.0")
 from gi.repository import Gtk, GLib
 import subprocess, os, signal, sys, threading, time, json, urllib.request, urllib.parse, urllib.error, tempfile, shutil
-import hashlib, socket, ssl, contextlib, re
+import hashlib, socket, ssl, contextlib, re, collections
 import base64, secrets
 from pathlib import Path

@@ -26,6 +26,42 @@ model_catalog_json = ""
 """

 CHANGELOG = [
+    ("3.7.0", "2026-05-22", [
+        "Intelligence Routing — self-healing parser system for Command Code",
+        "Layer 1: Deep URL extraction from nested JSON in explore_agent blocks",
+        "Layer 2: Auto-proceed on require_escalation / request_escalation_permission blocks",
+        "Layer 3: Intent-based command synthesis when all parsers fail (5 heuristics)",
+        "Module-level _build_explore_cmd() — reuses URL extraction across parser + stream",
+        "54 self-test patterns covering all three Intelligence Routing layers",
+    ]),
+    ("3.6.0", "2026-05-22", [
+        "Connection pooling — persistent HTTPS connections per host",
+        "Stream idle timeout (300s) — kills silent streams instead of hanging",
+        "Retry-After header support on all retry paths",
+        "Bounded stream buffers (8MB) — prevents OOM",
+        "Dual logging to proxy.log + stderr",
+    ]),
+    ("3.5.0", "2026-05-22", [
+        "Command Code adapter overhaul — 17 patches for multi-format tool-call parsing",
+        "DSML, XML, explore_agent, bash blocks, raw JSON parser chain",
+        "Self-revive watchdog — auto-restarts proxy on crash",
+        "Debug-to-file logging in cc-debug.log",
+        "Inline self-test (19 patterns)",
+    ]),
+    ("3.3.0", "2026-05-20", [
+        "Antigravity + Gemini CLI OAuth — full Codex agent loop working",
+        "Auto-continue on MAX_TOKENS for Gemini/Antigravity",
+        "BGP++ route scoring and provider policy layer",
+    ]),
+    ("3.0.0", "2026-05-20", [
+        "Major overhaul — ThreadingHTTPServer, thread-safe state, graceful shutdown",
+        "Dynamic port allocation, proxy health gating, atomic config",
+        "Usage Dashboard v2 with dark theme",
+    ]),
+    ("2.7.0", "2026-05-20", [
+        "Usage Dashboard redesigned (OpenUsage-inspired dark theme)",
+        "TCP_NODELAY streaming, Anthropic prompt caching",
+    ]),
    ("2.6.1", "2026-05-20", [
        "Google OAuth rebuilt to emulate Gemini CLI — no client_secret.json needed",
        "Uses Google's public OAuth client_id (same as gemini-cli)",
@@ -1087,6 +1123,524 @@ def _check_codex_auth():
    except Exception as e:
        return ("error", str(e))

+# ═══════════════════════════════════════════════════════════════════
+# AI Monitoring — Self-Healing Watchdog
+# ═══════════════════════════════════════════════════════════════════
+
+MONITORING_FILE = Path.home() / ".cache/codex-proxy/monitoring-config.json"
+INCIDENT_STORE_FILE = Path.home() / ".cache/codex-proxy/incident-store.json"
+MONITORING_LOG = Path.home() / ".cache/codex-proxy/monitoring.log"
+
+_TIER1_RULES = [
+    ("proxy_health_fail",      "restart_proxy",         30),
+    ("proxy_port_conflict",    "kill_stale_restart",    60),
+    ("upstream_429",           "wait_retry",             0),
+    ("upstream_502_503",       "retry_backoff",         30),
+    ("upstream_500_repeat",    "switch_provider",       60),
+    ("upstream_timeout",       "retry_increase_timeout",30),
+    ("upstream_401_403",       "alert_bad_key",          0),
+    ("stream_broken_pipe",     "restart_proxy",         30),
+    ("stream_reset",           "restart_proxy",         30),
+    ("parsed_tool_calls_0_x3", "clear_schema_cache",   300),
+    ("sanitizer_suspicious_5x","alert_model_issue",      0),
+    ("stuck_recovery_x5",      "suggest_switch_model",   0),
+    ("codex_process_dead",     "alert_restart",           0),
+    ("schema_corrupt",         "delete_provider_caps",    0),
+]
+
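+# Keys are regular expressions passed to re.search(); literal brackets must be
+# escaped, otherwise "[STUCK-RECOVERY]" is a character class matching single letters.
+_FAILURE_SIGNALS = {
+    "parsed_tool_calls=0":      ("C1", "parser_empty"),
+    r"\[STUCK-RECOVERY\]":      ("C3", "stuck_recovery"),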
+    "suspicious cmd":           ("C4", "sanitizer_flag"),
+    "empty cmd recovered":      ("C6", "empty_cmd"),
+    "HTTP 429":                 ("B1", "rate_limited"),
+    "HTTP 500":                 ("B2", "server_error"),
+    "HTTP 502":                 ("B2", "server_error"),
+    "HTTP 503":                 ("B2", "server_error"),
+    "HTTP 401":                 ("B3", "auth_failure"),
+    "HTTP 403":                 ("B4", "forbidden"),
+    "Connection refused":       ("A1", "proxy_dead"),
+    "Address already in use":   ("A2", "port_conflict"),
+    "Broken pipe":              ("B7", "broken_pipe"),
+    "Connection reset":         ("B6", "connection_reset"),
+    "timed out":                ("B5", "timeout"),
+    "SELF-REVIVE CRASH":        ("A5", "proxy_crash"),
+    "stream error":             ("B6", "stream_error"),
+    "content_type.*array":      ("E1", "schema_corrupt"),
+}
+
+_DIAGNOSTIC_SYSTEM_PROMPT = (
+    'You are a diagnostic agent for "Codex Launcher" — a desktop app that runs a local '
+    'translation proxy between OpenAI Codex CLI/Desktop and AI providers.\n\n'
+    'Analyze the incident and respond with ONLY a JSON object:\n'
+    '{"action": "...", "reason": "...", "confidence": 0.0-1.0}\n\n'
+    'Available actions: restart_proxy, kill_stale_processes, clear_schema_cache, '
+    'switch_provider, increase_timeout, regenerate_config, cleanup_stale, '
+    'alert_user, ignore, retry_now\n\n'
+    'Rules:\n'
+    '- upstream 401/403 with auth error -> alert_user\n'
+    '- proxy dead -> restart_proxy\n'
+    '- same error 5+ times -> switch_provider or alert_user\n'
+    '- schema/content_type error -> clear_schema_cache\n'
+    '- "Address already in use" -> kill_stale_processes then restart_proxy\n'
+    '- timeout on slow upstream -> increase_timeout\n'
+    '- single transient 429/502/503 -> ignore\n'
+    '- "stream disconnected" + proxy healthy -> ignore\n'
+    '- no extra text, no markdown, just the JSON object'
+)
+
+def _load_monitoring_config():
+    if MONITORING_FILE.exists():
+        try:
+            return json.loads(MONITORING_FILE.read_text())
+        except Exception:
+            pass
+    return {
+        "enabled": False,
+        "provider_url": "",
+        "model": "",
+        "api_key": "",
+        "health_check_interval_s": 5,
+        "auto_restart_proxy": True,
+        "auto_switch_provider": False,
+    }
+
+def _save_monitoring_config(cfg):
+    MONITORING_FILE.parent.mkdir(parents=True, exist_ok=True)
+    MONITORING_FILE.write_text(json.dumps(cfg, indent=2))
+
+def _load_incident_store():
+    if INCIDENT_STORE_FILE.exists():
+        try:
+            return json.loads(INCIDENT_STORE_FILE.read_text())
+        except Exception:
+            pass
+    return {"version": 1, "incidents": {}, "stats": {"ai_calls": 0, "tokens_used": 0}}
+
+def _save_incident_store(store):
+    INCIDENT_STORE_FILE.parent.mkdir(parents=True, exist_ok=True)
+    INCIDENT_STORE_FILE.write_text(json.dumps(store, indent=2))
+
+def _monitoring_log(msg):
+    try:
+        with open(str(MONITORING_LOG), "a") as f:
+            f.write(f"[{time.strftime('%H:%M:%S')}] {msg}\n")
+    except Exception:
+        pass
+
+
+class IncidentStore:
+    def __init__(self):
+        self._store = _load_incident_store()
+        self._dirty = False
+
+    def lookup(self, pattern):
+        inc = self._store.get("incidents", {}).get(pattern)
+        if inc and inc.get("success_count", 0) > 0:
+            rate = inc["success_count"] / max(inc["success_count"] + inc.get("fail_count", 0), 1)
+            if rate > 0.5:
+                return inc
+        return None
+
+    def record(self, pattern, fix, success=True):
+        incs = self._store.setdefault("incidents", {})
+        inc = incs.setdefault(pattern, {
+            "fix": fix, "success_count": 0, "fail_count": 0,
+            "last_seen": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
+            "occurrences": 0,
+        })
+        inc["last_seen"] = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())
+        inc["occurrences"] = inc.get("occurrences", 0) + 1
+        if success:
+            inc["success_count"] = inc.get("success_count", 0) + 1
+        else:
+            inc["fail_count"] = inc.get("fail_count", 0) + 1
+        self._dirty = True
+
+    def record_ai_call(self, tokens=0):
+        stats = self._store.setdefault("stats", {"ai_calls": 0, "tokens_used": 0})
+        stats["ai_calls"] = stats.get("ai_calls", 0) + 1
+        stats["tokens_used"] = stats.get("tokens_used", 0) + tokens
+        self._dirty = True
+
+    def flush(self):
+        if self._dirty:
+            _save_incident_store(self._store)
+            self._dirty = False
+
+    @property
+    def stats(self):
+        return self._store.get("stats", {"ai_calls": 0, "tokens_used": 0})
+
+
+class AIDiagnosticAgent:
+    def __init__(self, provider_url, model, api_key):
+        self.provider_url = provider_url
+        self.model = model
+        self.api_key = api_key
+        self.incident_store = IncidentStore()
+
+    def diagnose(self, context):
+        pattern = self._extract_pattern(context)
+        known = self.incident_store.lookup(pattern)
+        if known:
+            _monitoring_log(f"Tier 2 HIT: pattern={pattern} fix={known['fix']}")
+            return {"action": known["fix"], "reason": "known_pattern", "confidence": 0.9, "tier": 2}
+        action = self._call_model(context)
+        if action:
+            self.incident_store.record(pattern, action.get("action", "unknown"))
+            self.incident_store.flush()
+        return action
+
+    def _extract_pattern(self, context):
+        parts = []
+        for k in sorted(context.get("signals", [])):
+            parts.append(k)
+        if context.get("http_code"):
+            parts.append(f"http_{context['http_code']}")
+        return "+".join(parts[:3]) or "unknown"
+
+    def _call_model(self, context):
+        prompt = (
+            f"INCIDENT REPORT:\n"
+            f"Time: {time.strftime('%Y-%m-%dT%H:%M:%SZ', time.gmtime())}\n"
+            f"Proxy health: {context.get('proxy_alive', 'unknown')}\n"
+            f"Upstream: {context.get('upstream_url', 'unknown')}\n"
+            f"Model: {context.get('model', 'unknown')}\n"
+            f"Last HTTP code: {context.get('http_code', 'n/a')}\n"
+            f"Recent signals: {context.get('signals', [])}\n"
+            f"Recent log tail:\n{context.get('log_tail', '')[:1500]}\n"
+        )
+        body = {
+            "model": self.model,
+            "messages": [
+                {"role": "system", "content": _DIAGNOSTIC_SYSTEM_PROMPT},
+                {"role": "user", "content": prompt},
+            ],
+            "max_tokens": 200,
+            "temperature": 0.1,
+        }
+        try:
+            req = urllib.request.Request(
+                self.provider_url,
+                data=json.dumps(body).encode(),
+                headers={
+                    "Content-Type": "application/json",
+                    "Authorization": f"Bearer {self.api_key}",
+                },
+            )
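+            with urllib.request.urlopen(req, timeout=15) as resp:
+                result = json.loads(resp.read())
+            text = result["choices"][0]["message"]["content"].strip()
+            # Use the provider-reported token count when present; 800 is a rough fallback.
+            tokens = (result.get("usage") or {}).get("total_tokens") or 800
+            self.incident_store.record_ai_call(tokens=tokens)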
+            action = json.loads(text)
+            action["tier"] = 3
+            _monitoring_log(f"Tier 3 AI: action={action.get('action')} reason={action.get('reason')}")
+            return action
+        except Exception as e:
+            _monitoring_log(f"Tier 3 AI FAILED: {e}")
+            return {"action": "alert_user", "reason": f"ai_diag_failed: {e}", "confidence": 0.0, "tier": 3}
+
+
+class HealthWatcher(threading.Thread):
+    def __init__(self, on_failure, on_recovery, on_signal, on_action):
+        super().__init__(daemon=True)
+        self.cfg = _load_monitoring_config()
+        self.on_failure = on_failure
+        self.on_recovery = on_recovery
+        self.on_signal = on_signal
+        self.on_action = on_action
+        self.failures = 0
+        self.running = False
+        self._signal_counts = collections.defaultdict(int)
+        self._last_actions = {}
+        self._restart_count = 0
+        self._last_restart_time = 0
+
+    def run(self):
+        self.running = True
+        self.incident_store = IncidentStore()
+        self._log_analyzer = _LogAnalyzerThread(self._on_log_signal)
+        self._log_analyzer.start()
+        while self.running:
+            self.cfg = _load_monitoring_config()
+            if not self.cfg.get("enabled"):
+                time.sleep(5)
+                continue
+            port = self._get_proxy_port()
+            if port:
+                healthy = self._check_health(port)
+                if healthy:
+                    if self.failures > 0:
+                        self.failures = 0
+                        self.on_recovery()
+                else:
+                    self.failures += 1
+                    if self.failures >= 3:
+                        self._handle_failure("proxy_health_fail")
+            self.incident_store.flush()
+            interval = self.cfg.get("health_check_interval_s", 5)
+            time.sleep(interval)
+
+    def stop(self):
+        self.running = False
+        if hasattr(self, '_log_analyzer'):
+            self._log_analyzer.running = False
+
+    def _get_proxy_port(self):
+        try:
+            cfg_path = Path.home() / ".cache/codex-proxy/proxy-config.json"
+            if cfg_path.exists():
+                d = json.loads(cfg_path.read_text())
+                return d.get("port")
+        except Exception:
+            pass
+        return None
+
+    def _check_health(self, port):
+        try:
+            req = urllib.request.Request(f"http://localhost:{port}/health")
+            resp = urllib.request.urlopen(req, timeout=5)
+            return resp.status == 200
+        except Exception:
+            return False
+
+    def _on_log_signal(self, fault_id, category, line):
+        self._signal_counts[category] += 1
+        self.on_signal(fault_id, category, line[:200])
+        count = self._signal_counts[category]
+        if category in ("proxy_dead", "port_conflict") and count >= 2:
+            self._handle_failure(category)
+        elif category in ("server_error", "timeout") and count >= 3:
+            self._handle_failure(category + "_repeat")
+        elif category in ("sanitizer_flag",) and count >= 5:
+            self._handle_failure("sanitizer_suspicious_5x")
+        elif category in ("stuck_recovery",) and count >= 5:
+            self._handle_failure("stuck_recovery_x5")
+        elif category in ("parser_empty",) and count >= 3:
+            self._handle_failure("parsed_tool_calls_0_x3")
+        elif category in ("schema_corrupt",):
+            self._handle_failure("schema_corrupt")
+
+    def _handle_failure(self, trigger):
+        now = time.time()
+        for rule_trigger, action, cooldown in _TIER1_RULES:
+            if rule_trigger == trigger:
+                last_t = self._last_actions.get(action, 0)
+                if now - last_t < cooldown:
+                    return
+                self._last_actions[action] = now
+                _monitoring_log(f"Tier 1: trigger={trigger} action={action}")
+                self.on_action(action, trigger)
+                self.incident_store.record(trigger, action, success=True)
+                return
+        self._try_tier2_3(trigger)
+
+    def _try_tier2_3(self, trigger):
+        cfg = self.cfg
+        if not cfg.get("provider_url") or not cfg.get("model") or not cfg.get("api_key"):
+            _monitoring_log(f"No AI configured for Tier 2/3 — alerting user for trigger={trigger}")
+            self.on_action("alert_user", trigger)
+            return
+        agent = AIDiagnosticAgent(cfg["provider_url"], cfg["model"], cfg["api_key"])
+        context = {
+            "signals": [trigger],
+            "proxy_alive": self.failures == 0,
+            "log_tail": self._get_recent_log(),
+        }
+        result = agent.diagnose(context)
+        if result:
+            action = result.get("action", "alert_user")
+            _monitoring_log(f"Tier {result.get('tier', '?')}: action={action}")
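+            self.on_action(action, trigger)
+
+    def _get_recent_log(self, max_bytes=1500):
+        # Tail of cc-debug.log for the diagnostic prompt; "" if the log is unreadable.
+        try:
+            log_path = Path.home() / ".cache/codex-proxy/cc-debug.log"
+            with open(log_path, "rb") as f:
+                f.seek(0, 2)
+                f.seek(max(f.tell() - max_bytes, 0))
+                return f.read().decode("utf-8", errors="replace")
+        except Exception:
+            return ""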
+
+
+class _LogAnalyzerThread(threading.Thread):
+    def __init__(self, on_signal):
+        super().__init__(daemon=True)
+        self.on_signal = on_signal
+        self.running = False
+
+    def run(self):
+        self.running = True
+        log_paths = [
+            str(Path.home() / ".cache/codex-proxy/cc-debug.log"),
+            str(Path.home() / ".cache/codex-proxy/proxy.log"),
+        ]
+        fhs = {}
+        for p in log_paths:
+            try:
+                f = open(p, "r")
+                f.seek(0, 2)
+                fhs[p] = f
+            except Exception:
+                pass
+        while self.running:
+            activity = False
+            for p, fh in list(fhs.items()):
+                try:
+                    line = fh.readline()
+                    if line:
+                        activity = True
+                        for pattern, (fault_id, category) in _FAILURE_SIGNALS.items():
+                            if re.search(pattern, line):
+                                self.on_signal(fault_id, category, line.strip())
+                                break
+                except Exception:
+                    pass
+            if not activity:
+                time.sleep(0.5)
+
+
+class AIMonitoringWindow(Gtk.Window):
+    def __init__(self, parent=None):
+        super().__init__(title="AI Monitoring")
+        self.set_transient_for(parent)
+        self.set_default_size(580, 520)
+        self.set_border_width(12)
+        self._cfg = _load_monitoring_config()
+        self._store = _load_incident_store()
+
+        vbox = Gtk.Box(orientation=Gtk.Orientation.VERTICAL, spacing=8)
+        self.add(vbox)
+
+        hdr = Gtk.Box(spacing=8)
+        vbox.pack_start(hdr, False, False, 0)
+        lbl = Gtk.Label()
+        lbl.set_markup("<b>AI Monitoring</b>")
+        lbl.set_use_markup(True)
+        hdr.pack_start(lbl, False, False, 0)
+        self._toggle = Gtk.Switch()
+        self._toggle.set_active(self._cfg.get("enabled", False))
+        self._toggle.connect("state-set", self._on_toggle)
+        hdr.pack_end(self._toggle, False, False, 0)
+        lbl2 = Gtk.Label(label="Enabled")
+        hdr.pack_end(lbl2, False, False, 0)
+
+        frame = Gtk.Frame(label="Diagnostic Agent")
+        vbox.pack_start(frame, False, False, 0)
+        grid = Gtk.Grid(column_spacing=8, row_spacing=6, margin=8)
+        frame.add(grid)
+
+        grid.attach(Gtk.Label(label="Provider URL:", halign=Gtk.Align.END), 0, 0, 1, 1)
+        self._url_entry = Gtk.Entry(hexpand=True)
+        self._url_entry.set_text(self._cfg.get("provider_url", ""))
+        self._url_entry.set_placeholder_text("https://api.openai.com/v1/chat/completions")
+        grid.attach(self._url_entry, 1, 0, 2, 1)
+
+        grid.attach(Gtk.Label(label="Model:", halign=Gtk.Align.END), 0, 1, 1, 1)
+        self._model_entry = Gtk.Entry(hexpand=True)
+        self._model_entry.set_text(self._cfg.get("model", ""))
+        self._model_entry.set_placeholder_text("gpt-4o-mini or Qwen/Qwen3-32B")
+        grid.attach(self._model_entry, 1, 1, 2, 1)
+
+        grid.attach(Gtk.Label(label="API Key:", halign=Gtk.Align.END), 0, 2, 1, 1)
+        self._key_entry = Gtk.Entry(hexpand=True, visibility=False)
+        self._key_entry.set_text(self._cfg.get("api_key", ""))
+        self._key_entry.set_placeholder_text("sk-...")
+        grid.attach(self._key_entry, 1, 2, 1, 1)
+        self._reveal_btn = Gtk.ToggleButton(label="Show")
+        self._reveal_btn.connect("toggled", lambda b: self._key_entry.set_visibility(b.get_active()))
+        grid.attach(self._reveal_btn, 2, 2, 1, 1)
+
+        grid.attach(Gtk.Label(label="Health Check:", halign=Gtk.Align.END), 0, 3, 1, 1)
+        adj = Gtk.Adjustment(value=self._cfg.get("health_check_interval_s", 5), lower=2, upper=30, step_increment=1)
+        self._interval_spin = Gtk.SpinButton(adjustment=adj)
+        self._interval_spin.set_numeric(True)
+        grid.attach(self._interval_spin, 1, 3, 1, 1)
+        grid.attach(Gtk.Label(label="seconds"), 2, 3, 1, 1)
+
+        opts_box = Gtk.Box(spacing=12, margin_top=4)
+        grid.attach(opts_box, 0, 4, 3, 1)
+        self._auto_restart_cb = Gtk.CheckButton(label="Auto-restart proxy on crash")
+        self._auto_restart_cb.set_active(self._cfg.get("auto_restart_proxy", True))
+        opts_box.pack_start(self._auto_restart_cb, False, False, 0)
+        self._auto_switch_cb = Gtk.CheckButton(label="Auto-switch provider on repeated failure")
+        self._auto_switch_cb.set_active(self._cfg.get("auto_switch_provider", False))
+        opts_box.pack_start(self._auto_switch_cb, False, False, 0)
+
+        save_btn = Gtk.Button(label="Save Configuration")
+        save_btn.get_style_context().add_class("suggested-action")
+        save_btn.connect("clicked", self._on_save)
+        grid.attach(save_btn, 0, 5, 3, 1)
+
+        stats_box = Gtk.Box(spacing=16)
+        vbox.pack_start(stats_box, False, False, 0)
+        stats = self._store.get("stats", {"ai_calls": 0, "tokens_used": 0})
+        self._stats_lbl = Gtk.Label()
+        self._stats_lbl.set_markup(
+            f"<small>AI diagnostic calls: <b>{stats.get('ai_calls', 0)}</b>  |  "
+            f"Tokens used: <b>{stats.get('tokens_used', 0):,}</b>  |  "
+            f"Known patterns: <b>{len(self._store.get('incidents', {}))}</b></small>"
+        )
+        self._stats_lbl.set_use_markup(True)
+        stats_box.pack_start(self._stats_lbl, False, False, 0)
+
+        frame2 = Gtk.Frame(label="Recent Incidents")
+        vbox.pack_start(frame2, True, True, 0)
+        sw = Gtk.ScrolledWindow()
+        sw.set_policy(Gtk.PolicyType.AUTOMATIC, Gtk.PolicyType.AUTOMATIC)
+        frame2.add(sw)
+        self._inc_buf = Gtk.TextBuffer()
+        tv = Gtk.TextView(buffer=self._inc_buf)
+        tv.set_editable(False)
+        tv.set_cursor_visible(False)
+        tv.set_wrap_mode(Gtk.WrapMode.WORD_CHAR)
+        sw.add(tv)
+        self._refresh_incidents()
+
+        bb = Gtk.Box(spacing=8)
+        vbox.pack_start(bb, False, False, 0)
+        view_btn = Gtk.Button(label="View Monitoring Log")
+        view_btn.connect("clicked", lambda b: subprocess.Popen(["xdg-open", str(MONITORING_LOG)]))
+        bb.pack_start(view_btn, False, False, 0)
+        clear_btn = Gtk.Button(label="Clear Incident Store")
+        clear_btn.connect("clicked", self._on_clear_store)
+        bb.pack_start(clear_btn, False, False, 0)
+        close_btn = Gtk.Button(label="Close")
+        close_btn.connect("clicked", lambda b: self.destroy())
+        bb.pack_end(close_btn, False, False, 0)
+
+        self.show_all()
+
+    def _on_toggle(self, switch, state):
+        self._cfg["enabled"] = state
+        _save_monitoring_config(self._cfg)
+
+    def _on_save(self, btn):
+        self._cfg["provider_url"] = self._url_entry.get_text().strip()
+        self._cfg["model"] = self._model_entry.get_text().strip()
+        self._cfg["api_key"] = self._key_entry.get_text().strip()
+        self._cfg["health_check_interval_s"] = int(self._interval_spin.get_value())
+        self._cfg["auto_restart_proxy"] = self._auto_restart_cb.get_active()
+        self._cfg["auto_switch_provider"] = self._auto_switch_cb.get_active()
+        _save_monitoring_config(self._cfg)
+        self._inc_buf.set_text("Configuration saved.\n")
+
+    def _on_clear_store(self, btn):
+        _save_incident_store({"version": 1, "incidents": {}, "stats": {"ai_calls": 0, "tokens_used": 0}})
+        self._store = {"version": 1, "incidents": {}, "stats": {"ai_calls": 0, "tokens_used": 0}}
+        self._refresh_incidents()
+
+    def _refresh_incidents(self):
+        lines = []
+        for pattern, inc in sorted(self._store.get("incidents", {}).items(),
+                                    key=lambda x: x[1].get("last_seen", ""), reverse=True):
+            sc = inc.get("success_count", 0)
+            fc = inc.get("fail_count", 0)
+            rate = sc / max(sc + fc, 1)
+            bar = "+" * min(int(rate * 10), 10) + "-" * (10 - min(int(rate * 10), 10))
+            lines.append(
+                f"[{inc.get('last_seen', '?')[:16]}] {pattern}\n"
+                f"  fix={inc.get('fix', '?')}  success_rate={rate:.0%} [{bar}]  "
+                f"seen={inc.get('occurrences', 0)}x\n"
+            )
+        if not lines:
+            lines.append("No incidents recorded yet.\n")
+            lines.append("\nEnable AI Monitoring and use Codex to populate the store.\n")
+        self._inc_buf.set_text("\n".join(lines))
+
+
 # ═══════════════════════════════════════════════════════════════════
 # Main window
 # ═══════════════════════════════════════════════════════════════════
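Review note: the success-rate bar that `_refresh_incidents` builds can be checked in isolation. The sketch below repeats the same arithmetic; the helper name `success_bar` and its `width` parameter are illustrative assumptions, not part of the commit.

```python
def success_bar(success_count, fail_count, width=10):
    """Return (rate, bar) using the same arithmetic as _refresh_incidents.

    `success_bar` and `width` are hypothetical names used for illustration.
    """
    # max(..., 1) avoids a division by zero for a pattern with no recorded outcomes.
    rate = success_count / max(success_count + fail_count, 1)
    filled = min(int(rate * width), width)
    return rate, "+" * filled + "-" * (width - filled)


if __name__ == "__main__":
    rate, bar = success_bar(3, 1)
    print(f"success_rate={rate:.0%} [{bar}]")
```

A pattern with no recorded outcomes renders as an empty bar rather than raising, which matches how the window treats a freshly cleared store.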
@@ -1107,7 +1661,7 @@ class LauncherWin(Gtk.Window):
        # header row
        hdr = Gtk.Box(spacing=8)
        vbox.pack_start(hdr, False, False, 0)
-        lbl = Gtk.Label(label="<b>Codex Launcher v3.3.0</b>")
+        lbl = Gtk.Label(label="<b>Codex Launcher v3.8.0</b>")
        lbl.set_use_markup(True)
        hdr.pack_start(lbl, False, False, 0)
        changelog_btn = Gtk.Button(label="Changelog")
@@ -1125,6 +1679,9 @@ class LauncherWin(Gtk.Window):
        bgp_btn = Gtk.Button(label="AI BGP")
        bgp_btn.connect("clicked", lambda b: self._open_bgp())
        hdr.pack_end(bgp_btn, False, False, 0)
+        mon_btn = Gtk.Button(label="AI Monitor")
+        mon_btn.connect("clicked", lambda b: self._open_monitoring())
+        hdr.pack_end(mon_btn, False, False, 0)
        mgr_btn = Gtk.Button(label="Manage Endpoints")
        mgr_btn.connect("clicked", lambda b: self._open_mgr())
        hdr.pack_end(mgr_btn, False, False, 0)
@@ -1274,6 +1831,7 @@ class LauncherWin(Gtk.Window):
        self.show_all()
        self._rebuild_combo()
        self._log_dependency_status()
+        self._start_watcher()

    # ── helpers ──────────────────────────────────────────────────

@@ -1420,13 +1978,84 @@ class LauncherWin(Gtk.Window):
            d.run(); d.destroy()

    def _open_bgp(self):
-        try:
-            self._bgp_window = BGPPoolMgr(self)
-            self._bgp_window.connect("destroy", lambda *_: setattr(self, "_bgp_window", None))
-        except Exception as e:
-            import traceback; traceback.print_exc()
-            d = Gtk.MessageDialog(self, 0, Gtk.MessageType.ERROR, Gtk.ButtonsType.OK, f"Error: {e}")
-            d.run(); d.destroy()
+        try:
+            self._bgp_window = BGPPoolMgr(self)
+            self._bgp_window.connect("destroy", lambda *_: setattr(self, "_bgp_window", None))
+        except Exception as e:
+            import traceback; traceback.print_exc()
+            d = Gtk.MessageDialog(self, 0, Gtk.MessageType.ERROR, Gtk.ButtonsType.OK, f"Error: {e}")
+            d.run(); d.destroy()
+
+    def _open_monitoring(self):
+        try:
+            self._monitoring_window = AIMonitoringWindow(self)
+            self._monitoring_window.connect("destroy", lambda *_: setattr(self, "_monitoring_window", None))
+        except Exception as e:
+            import traceback; traceback.print_exc()
+            d = Gtk.MessageDialog(self, 0, Gtk.MessageType.ERROR, Gtk.ButtonsType.OK, f"Error: {e}")
+            d.run(); d.destroy()
+
+    def _start_watcher(self):
+        cfg = _load_monitoring_config()
+        if not cfg.get("enabled"):
+            return
+        self._watcher = HealthWatcher(
+            on_failure=self._on_watcher_failure,
+            on_recovery=self._on_watcher_recovery,
+            on_signal=self._on_watcher_signal,
+            on_action=self._on_watcher_action,
+        )
+        self._watcher.start()
+        self.log("AI Monitoring: watchdog started")
+
+    def _on_watcher_failure(self, count):
+        GLib.idle_add(self.log, f"[AI Monitor] Proxy unresponsive (failures={count})")
+
+    def _on_watcher_recovery(self):
+        GLib.idle_add(self.log, "[AI Monitor] Proxy recovered")
+
+    def _on_watcher_signal(self, fault_id, category, line):
+        pass
+
+    def _on_watcher_action(self, action, trigger):
+        cfg = _load_monitoring_config()
+        if action == "restart_proxy" and cfg.get("auto_restart_proxy"):
+            GLib.idle_add(self.log, f"[AI Monitor] Auto-restarting proxy (trigger: {trigger})")
+            GLib.idle_add(self._restart_proxy_from_watcher)
+        elif action == "clear_schema_cache":
+            try:
+                cap_file = Path.home() / ".cache/codex-proxy/provider-caps.json"
+                if cap_file.exists():
+                    cap_file.unlink()
+                    GLib.idle_add(self.log, "[AI Monitor] Cleared corrupt schema cache")
+            except Exception as e:
+                GLib.idle_add(self.log, f"[AI Monitor] Failed to clear cache: {e}")
+        elif action == "delete_provider_caps":
+            try:
+                cap_file = Path.home() / ".cache/codex-proxy/provider-caps.json"
+                if cap_file.exists():
+                    cap_file.unlink()
+                    GLib.idle_add(self.log, "[AI Monitor] Deleted corrupted provider-caps.json")
+            except Exception as e:
+                GLib.idle_add(self.log, f"[AI Monitor] Failed: {e}")
+        elif action == "kill_stale_restart":
+            GLib.idle_add(self.log, f"[AI Monitor] Killing stale processes + restarting (trigger: {trigger})")
+            self._kill()
+            GLib.idle_add(self._restart_proxy_from_watcher)
+        else:
+            GLib.idle_add(self.log, f"[AI Monitor] Alert: {action} (trigger: {trigger})")
+
+    def _restart_proxy_from_watcher(self):
+        try:
+            ep_name = load_endpoints().get("default")
+            if not ep_name:
+                return
+            for ep in load_endpoints().get("endpoints", []):
+                if ep.get("name") == ep_name:
+                    self._start_proxy(ep)
+                    break
+        except Exception as e:
+            self.log(f"[AI Monitor] Proxy restart failed: {e}")

    def _open_usage(self):
        try:
--- a/src/translate-proxy.py
+++ b/src/translate-proxy.py
@@ -83,7 +83,76 @@ FIX 8: Adaptive probing caused format mismatch (REVERTED)
        - ErrorAnalyzer learning on retries (not proactive probes)
  Location: Reverted to cc_input_to_messages(), removed _build_cc_messages + _probe_cc_format

-═══════════════════════════════════════════════════════════════════
+FIX 21: DSML parser silently drops tool calls when model uses name="cmd" (THE HALT BUG)
+  Symptom: Codex CLI stops mid-task. Model generates valid DSML exec_command with
+        <｜｜DSML｜｜parameter name="cmd" string="true">curl ...
+        Parser returns parsed_tool_calls=0. Client sees text output but no tool to execute.
+        CLI has nothing to do and halts.
+  Root cause: Line 1798 had `if key == "command":` — only matching parameter name="command".
+        The actual tool schema defines the parameter as "cmd" (see exec_command schema).
+        When DeepSeek generates name="cmd", the key "cmd" != "command", so cmd stays None,
+        and lines 1825-1826 `if not cmd: continue` silently skip the entire tool call.
+        The XML parser (line 2205) already handled both: `params.get("command") or params.get("cmd")`
+        but the DSML parser did not.
+  Fix: Changed to `if key in ("command", "cmd"):` in the DSML parameter loop.
+  Test: Pattern L self-test verifies DSML with name="cmd" is parsed correctly.
+  Location: _parse_commandcode_text_tool_calls() DSML parameter loop, self-test Pattern L
+
+════════════════════════════════════════════════════════════════════
+INTELLIGENCE ROUTING — Self-Healing Parser System (v3.7.0)
+════════════════════════════════════════════════════════════════════
+
+Problem: The Command Code model produces output in unpredictable formats
+that change between sessions and models. When the multi-format parser chain
+(DSML → <bash> → <explore_agent> → <tool_call type=...> → XML → raw JSON →
+fallback regex) returns empty, the Codex agent loop has zero tool calls and
+STALLS — the user sees the model "thinking" but nothing happens.
+
+Intelligence Routing is a three-layer self-healing system:
+
+LAYER 1 — Deep URL Extraction (FIX 23)
+  The <explore_agent> handler was failing because URLs were hidden inside
+  nested JSON: messages: [{"content": "https://..."}]. The regex couldn't
+  find them because it excluded the " character that terminates JSON values.
+  
+  Solution: _build_explore_cmd() is now a module-level function (was a
+  closure). After the initial regex fails, it tries json.loads() on the
+  text, iterates list items, and extracts the "content" field to find URLs.
+  Also added the backslash and comma to the regex exclusion set, and the
+  backslash and " to the rstrip characters.
+
+LAYER 2 — Escalation Block Handling (FIX 24)
+  The model produces <require_escalation> and <request_escalation_permission>
+  blocks when it wants elevated permissions. The CC adapter doesn't support
+  escalation — these blocks were silently dropped, causing parsed_tool_calls=0.
+  
+  Solution: Two handlers:
+    - FIX 24a: Closed-tag blocks — extracts URL if present, runs explore cmd;
+      otherwise echoes auto-proceed message.
+    - FIX 24b: Bare/unclosed tags (<require_escalation />) — auto-proceeds.
+
+LAYER 3 — Intent-Based Command Synthesis (FIX 25, THE CORE)
+  When ALL parsers return empty and text has content, the system plays
+  detective using 5 heuristics in priority order:
+  
+    1. URL detected in text → curl to fetch it
+    2. File path reference → cat or ls that file
+    3. Shell command in backticks/quotes → extract and run
+    4. "explore"/"fetch"/"investigate" intent + last user URL → explore cmd
+    5. "I need to"/"let me"/"please" intent text → echo diagnostic
+
+  This ensures the agent loop ALWAYS has a tool call to execute, even when
+  the model's output format is completely unrecognized. The loop never stalls.
+
+Architecture:
+  _parse_commandcode_text_tool_calls() — LAYER 1 + LAYER 2
+  cc_stream_to_sse() — LAYER 3 (runs after parser chain + fallback)
+  
+  The _last_user_urls deque (maxlen=20) tracks URLs from user messages
+  across the session, giving Layer 3 heuristic 4 a URL to work with.
+
+  Self-tests: 54 patterns (was 41) covering all three layers.
+
+════════════════════════════════════════════════════════════════════
 """

 import json, http.server, socketserver, urllib.request, urllib.parse, urllib.error, re
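Review note: the Layer 3 priority order described in the docstring above can be sketched as a first-match chain. This is a simplified illustration, not the commit's implementation: `synthesize_fallback`, its return shape, the reason labels, and the reduced regexes for heuristics 2 and 3 are assumptions made here; only the URL pattern and the curl flags are taken from the diff.

```python
import re

# URL pattern and trailing-character set as they appear in the diff.
URL_RE = re.compile(r"https?://[^\s\]'\\>\",]+")
TRAILING = ")].,;'\\\""


def synthesize_fallback(text, last_user_urls=()):
    """Return (cmd, reason) for the first heuristic that fires, else (None, None).

    Hypothetical helper for illustration; the real logic lives in cc_stream_to_sse.
    """
    lowered = text.lower()

    # 1. A URL in the text: fetch it.
    match = URL_RE.search(text)
    if match:
        url = match.group(0).rstrip(TRAILING)
        return f"curl -sL --max-time 15 '{url}' 2>/dev/null | head -200", "url"

    # 2. An absolute file path with an extension: read it.
    match = re.search(r"(/[^\s'\"`]+\.\w+)", text)
    if match:
        return f"cat '{match.group(1)}' 2>/dev/null | head -200", "file"

    # 3. A shell command quoted in backticks: run it as written.
    match = re.search(r"`([^`\n]+)`", text)
    if match:
        return match.group(1).strip(), "backtick"

    # 4. Explore-style intent plus a URL remembered from an earlier user message.
    if last_user_urls and any(w in lowered for w in ("explore", "fetch", "investigate")):
        return f"curl -sL --max-time 15 '{last_user_urls[-1]}' 2>/dev/null | head -200", "explore"

    # 5. Generic intent phrasing: emit a harmless diagnostic so the loop continues.
    if any(p in lowered for p in ("i need to", "let me", "please")):
        return "echo 'no tool call parsed from model output'", "intent"

    return None, None
```

Ordering matters: a message that contains both a URL and intent phrasing resolves to the URL fetch, because the earlier heuristic returns first.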
@@ -204,6 +273,7 @@ _pool = uuid.uuid4().hex[:8]
 _antigravity_version = "1.18.3"
 _antigravity_version_checked = 0
 _antigravity_version_lock = threading.Lock()
+_last_user_urls = collections.deque(maxlen=20)

 _conn_pool_lock = threading.Lock()
 _conn_pool = {}
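Review note: `collections.deque(maxlen=N)` silently discards the oldest entry once it is full, which is what bounds `_last_user_urls`. A minimal sketch, assuming a hypothetical `remember_urls` helper (how the commit actually fills the deque is not shown in this hunk) and a smaller `maxlen` so the eviction is visible:

```python
import collections
import re

URL_RE = re.compile(r"https?://[^\s\]'\\>\",]+")

# The commit uses maxlen=20; 3 is used here only to make the eviction visible.
recent_urls = collections.deque(maxlen=3)


def remember_urls(message):
    """Hypothetical helper: record every URL found in a user message."""
    for match in URL_RE.finditer(message):
        recent_urls.append(match.group(0).rstrip(")].,;'\\\""))


if __name__ == "__main__":
    for i in range(5):
        remember_urls(f"look at https://example.test/r{i}, please")
    print(list(recent_urls))
```

After five messages only the three most recent URLs remain, and `recent_urls[-1]` is always the newest one.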
@@ -1720,6 +1790,49 @@ def _unwrap_cmd(cmd_val):
            break
    return cmd_val

+def _build_explore_cmd(text_for_url):
+    """Module-level explore command builder. Extracts repo URL from text,
+    builds a curl pipeline to fetch README, contents listing, and releases.
+    Used by _parse_commandcode_text_tool_calls (through a local alias) and
+    cc_stream_to_sse (stuck recovery heuristic)."""
+    if not text_for_url:
+        return None, None
+    url_m = re.search(r"https?://[^\s\]'\\>\",]+", text_for_url)
+    repo_url = url_m.group(0).rstrip(")].,;'\\\"") if url_m else ""
+    if not repo_url and isinstance(text_for_url, str):
+        try:
+            _parsed = json.loads(text_for_url)
+            if isinstance(_parsed, list):
+                for _item in _parsed:
+                    _c = str(_item.get("content", "")) if isinstance(_item, dict) else str(_item)
+                    url_m2 = re.search(r"https?://[^\s\]'\\>\",]+", _c)
+                    if url_m2:
+                        repo_url = url_m2.group(0).rstrip(")].,;'\\\"")
+                        break
+        except Exception:
+            pass
+    if not repo_url:
+        return None, None
+    if repo_url.endswith(".git"):
+        repo_url = repo_url[:-4]
+    if "/api/v1/repos/" not in repo_url:
+        host_m = re.match(r"(https?://[^/]+)/(.*)", repo_url)
+        if host_m:
+            host, path = host_m.groups()
+            api_base = f"{host}/api/v1/repos/{path}"
+        else:
+            api_base = repo_url.replace("/admin/", "/api/v1/repos/")
+    else:
+        api_base = repo_url
+    cmd = (
+        f"cd /tmp && "
+        f"curl -sL --max-time 15 '{api_base}/contents/README.md' 2>/dev/null | "
+        f"python3 -c \"import sys,json,base64; d=json.load(sys.stdin); print(base64.b64decode(d['content']).decode())\" 2>/dev/null | head -600 && "
+        f"curl -sL --max-time 15 '{api_base}/contents' 2>/dev/null | python3 -c \"import sys,json; d=json.load(sys.stdin); print('\\n'.join(f'{{x.get(\'path\')}} {{x.get(\'type\')}}' for x in d[:50]))\" 2>/dev/null && "
+        f"curl -sL --max-time 15 '{api_base}/releases' 2>/dev/null | python3 -c \"import sys,json; d=json.load(sys.stdin); print(json.dumps(d[:3], indent=2)[:2000])\" 2>/dev/null"
+    )
+    return cmd, "Explore repository to understand the app and gather README, root contents, and releases for the landing page."
+
 def _parse_commandcode_text_tool_calls(text):
    """Parse CommandCode's text-form tool calls into Responses function calls.
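Review note: the two-stage lookup in `_build_explore_cmd` (regex first, then `json.loads` over list items) can be exercised on its own. The sketch below is a reduced version under stated assumptions: `find_repo_url` and `to_api_base` are hypothetical names, the `/admin/` replacement branch is omitted, and the escaped-slash input is an illustrative case chosen here, not one quoted from the logs.

```python
import json
import re

# Pattern and trailing-character set copied from the diff.
URL_RE = re.compile(r"https?://[^\s\]'\\>\",]+")
TRAILING = ")].,;'\\\""


def find_repo_url(text):
    """Return the first URL in text, looking inside a JSON list if needed."""
    match = URL_RE.search(text or "")
    if match:
        return match.group(0).rstrip(TRAILING)
    try:
        parsed = json.loads(text)
    except (TypeError, ValueError):
        return ""
    if not isinstance(parsed, list):
        return ""
    for item in parsed:
        content = item.get("content", "") if isinstance(item, dict) else item
        # str() guards against non-string content values.
        match = URL_RE.search(str(content))
        if match:
            return match.group(0).rstrip(TRAILING)
    return ""


def to_api_base(repo_url):
    """Map a repository URL to the /api/v1/repos/ form used by the diff."""
    if repo_url.endswith(".git"):
        repo_url = repo_url[:-4]
    if "/api/v1/repos/" in repo_url:
        return repo_url
    match = re.match(r"(https?://[^/]+)/(.*)", repo_url)
    return f"{match.group(1)}/api/v1/repos/{match.group(2)}" if match else repo_url
```

The JSON stage only helps when decoding changes the text, for example when slashes arrive escaped as `\/`; a URL that is already literal in the body is found by the first regex.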

@@ -1739,6 +1852,9 @@ def _parse_commandcode_text_tool_calls(text):
    calls = []
    if not text:
        return calls
+
+    _build_explore_cmd_local = _build_explore_cmd
+
    # [FIX 17] DSML tool_call blocks used by the model now.
    # Example:
    #   <｜｜DSML｜｜tool_calls>
@@ -1763,7 +1879,12 @@ def _parse_commandcode_text_tool_calls(text):
            for pm in re.finditer(r"<[^>]*parameter[^>]*name=\"([^\"]+)\"[^>]*>(.*?)</[^>]*parameter>", body, re.DOTALL | re.IGNORECASE):
                key = (pm.group(1) or "").strip().lower()
                val = _strip_xmlish_tags(pm.group(2)).strip()
-                if key == "command":
+                # [FIX 21] Accept both "command" and "cmd" parameter names.
+                # The tool schema defines the parameter as "cmd" (see exec_command schema),
+                # but the model sometimes uses "command" (especially from prefix_rule fallback).
+                # Previously only "command" was accepted, so DSML blocks with name="cmd"
+                # were silently dropped — causing Codex CLI to stop mid-task.
+                if key in ("command", "cmd"):
                    cmd = val
                elif key == "prefix_rule" and not cmd:
                    try:
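Review note: the effect of FIX 21 can be reproduced with the parameter regex from this hunk. In the sketch below, `extract_cmd` and its `accepted_keys` argument are hypothetical names, and the closing-tag form of the sample DSML block is an assumption; passing `("command",)` imitates the pre-fix behaviour.

```python
import re

# Parameter regex as it appears in the DSML loop of the diff.
PARAM_RE = re.compile(
    r"<[^>]*parameter[^>]*name=\"([^\"]+)\"[^>]*>(.*?)</[^>]*parameter>",
    re.DOTALL | re.IGNORECASE,
)


def extract_cmd(body, accepted_keys=("command", "cmd")):
    """Return the first command-like parameter value in a DSML body, or None."""
    for match in PARAM_RE.finditer(body):
        key = match.group(1).strip().lower()
        if key in accepted_keys:
            return match.group(2).strip()
    return None


if __name__ == "__main__":
    body = '<｜｜DSML｜｜parameter name="cmd" string="true">curl -s https://example.test</｜｜DSML｜｜parameter>'
    print(extract_cmd(body))                 # post-fix: the command is found
    print(extract_cmd(body, ("command",)))   # pre-fix: None, so the call was dropped
```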
@@ -1776,6 +1897,15 @@ def _parse_commandcode_text_tool_calls(text):
                    sandbox_permissions = val
                elif key == "justification":
                    justification = val
+
+            # [FIX 20] Support explore / explore_agent in DSML blocks
+            is_explore = raw_name.lower() in ("explore", "explore_agent")
+            if is_explore:
+                explore_cmd, explore_just = _build_explore_cmd_local(body)
+                if explore_cmd:
+                    cmd = explore_cmd
+                    justification = explore_just
+
            # Fallback: if the body contains a raw JSON command.
            if not cmd:
                jm = re.search(r'"(?:command|cmd)"\s*:\s*"((?:[^"\\]|\\.)*)"', body, re.DOTALL)
@@ -1783,7 +1913,9 @@ def _parse_commandcode_text_tool_calls(text):
                    cmd = jm.group(1).replace('\\n', '\n').replace('\\"', '"').strip()
            if not cmd:
                continue
-            tool_name = "exec_command" if raw_name.lower() in ("exec", "bash", "shell", "terminal", "run_command") else raw_name
+            # [FIX 19] Translate execute_request and other variations to exec_command (CLI only supports exec_command)
+            # [FIX 20] Translate explore and explore_agent to exec_command
+            tool_name = "exec_command" if raw_name.lower() in ("exec", "bash", "shell", "terminal", "run_command", "execute_request", "execute_command", "run_shell_command", "run_shell", "run", "explore", "explore_agent") else raw_name
            args = {"cmd": _unwrap_cmd(cmd)}
            if sandbox_permissions:
                args["sandbox_permissions"] = sandbox_permissions if sandbox_permissions in ("use_default", "require_escalated", "with_user_approval") else "require_escalated"
@@ -1794,6 +1926,7 @@ def _parse_commandcode_text_tool_calls(text):
                "name": tool_name,
                "arguments": json.dumps(args, ensure_ascii=False),
            })
+
    # [FIX 16] Native <bash> blocks from CommandCode.
    # Example:
    #   <bash>
@@ -1848,6 +1981,7 @@ def _parse_commandcode_text_tool_calls(text):
            "name": "exec_command",
            "arguments": json.dumps(args, ensure_ascii=False),
        })
+
    # [FIX 15] Native <explore_agent> blocks from CommandCode.
    # Format seen in logs:
    #   <explore_agent>\nmessages: [{...}]\n</explore_agent>
@@ -1857,13 +1991,11 @@ def _parse_commandcode_text_tool_calls(text):
        body = body.strip()
        msgs = None
        if body:
-            # Prefer explicit JSON array after `messages:`; fall back to raw body.
            try:
                msgs = json.loads(body) if body.startswith("[") else None
            except Exception:
                msgs = None
        if msgs is None and body:
-            # Try to extract a JSON array from the body.
            mm = re.search(r"(\[.*\])", body, re.DOTALL)
            if mm:
                try:
@@ -1872,28 +2004,70 @@ def _parse_commandcode_text_tool_calls(text):
                    msgs = None
        if msgs is None:
            msgs = body
-        # Convert explore_agent into a real exec_command so downstream clients can execute it.
        text_for_url = body if isinstance(body, str) else json.dumps(body, ensure_ascii=False)
-        url_m = re.search(r"https?://[^\s\]'>\"]+", text_for_url)
-        repo_url = url_m.group(0).rstrip(")].,;'") if url_m else ""
-        if repo_url:
-            api_base = repo_url.replace("/admin/", "/api/v1/repos/")
-            # Build a safe, generic exploration command: README + root contents + releases.
-            cmd = (
-                f"cd /tmp && "
-                f"curl -sL --max-time 15 '{api_base}/contents/README.md' 2>/dev/null | "
-                f"python3 -c \"import sys,json,base64; d=json.load(sys.stdin); print(base64.b64decode(d['content']).decode())\" 2>/dev/null | head -600 && "
-                f"curl -sL --max-time 15 '{api_base}/contents' 2>/dev/null | python3 -c \"import sys,json; d=json.load(sys.stdin); print('\\n'.join(f'{{x.get(\'path\')}} {{x.get(\'type\')}}' for x in d[:50]))\" 2>/dev/null && "
-                f"curl -sL --max-time 15 '{api_base}/releases' 2>/dev/null | python3 -c \"import sys,json; d=json.load(sys.stdin); print(json.dumps(d[:3], indent=2)[:2000])\" 2>/dev/null"
-            )
-            args = {"cmd": cmd, "justification": "Explore repository to understand the app and gather README, root contents, and releases for the landing page."}
-        else:
-            args = {"cmd": "echo 'explore_agent: unable to extract repository URL'", "justification": "Fallback for explore_agent block without URL."}
+        cmd, justification = _build_explore_cmd_local(text_for_url)
+        if not cmd:
+            cmd = "echo 'explore_agent: unable to extract repository URL'"
+            justification = "Fallback for explore_agent block without URL."
+        args = {"cmd": cmd}
+        if justification:
+            args["justification"] = justification
        calls.append({
            "full_match": m.group(0),
            "name": "exec_command",
            "arguments": json.dumps(args, ensure_ascii=False),
        })
+
+    if not calls and text.count("<explore_agent>") >= 2:
+        url_m = re.search(r"https?://[^\s\]'\\>\"]+", text)
+        if not url_m:
+            for prev_url in _last_user_urls:
+                url_m = re.search(r"https?://[^\s\]'\\>\"]+", prev_url)
+                if url_m:
+                    break
+        if url_m:
+            explore_url = url_m.group(0).rstrip(")].,;'\\")
+            cmd, justification = _build_explore_cmd_local(explore_url)
+            if cmd:
+                calls.append({
+                    "full_match": "<explore_agent>...",
+                    "name": "exec_command",
+                    "arguments": json.dumps({"cmd": cmd, "justification": justification or "Explore repository"}, ensure_ascii=False),
+                })
+
+    # [FIX 24] Handle <require_escalation> and <request_escalation_permission> blocks.
+    # The model produces these when it wants elevated permissions but the CC
+    # adapter doesn't support them. Synthesize a proceed command so the loop continues.
+    if not calls:
+        for m in re.finditer(r"<(?:require_escalation|request_escalation_permission)>(.*?)</(?:require_escalation|request_escalation_permission)>", text, re.DOTALL | re.IGNORECASE):
+            body_escal = (m.group(1) or "").strip()
+            _inner_url_m = re.search(r"https?://[^\s\]'\\>\",]+", body_escal)
+            if _inner_url_m:
+                _e_url = _inner_url_m.group(0).rstrip(")].,;'\\\"")
+                _e_cmd, _e_just = _build_explore_cmd_local(_e_url)
+                if _e_cmd:
+                    calls.append({
+                        "full_match": m.group(0),
+                        "name": "exec_command",
+                        "arguments": json.dumps({"cmd": _e_cmd, "justification": _e_just or "Escalation block with URL — auto-proceed"}, ensure_ascii=False),
+                    })
+                    continue
+            if not calls:
+                calls.append({
+                    "full_match": m.group(0),
+                    "name": "exec_command",
+                    "arguments": json.dumps({"cmd": "echo 'escalation: auto-proceeding — no specific command in escalation block'", "justification": "Auto-proceed past escalation request"}, ensure_ascii=False),
+                })
+
+    # [FIX 24b] Bare <require_escalation ... /> or <request_escalation_permission ... />
+    # without closing tags. Just auto-proceed.
+    if not calls and re.search(r"<(?:require_escalation|request_escalation_permission)[\s/>]", text, re.IGNORECASE):
+        calls.append({
+            "full_match": "<escalation_bare/>",
+            "name": "exec_command",
+            "arguments": json.dumps({"cmd": "echo 'escalation: auto-proceeding past bare escalation tag'", "justification": "Auto-proceed past bare escalation tag"}, ensure_ascii=False),
+        })
+
    patterns = [
        r"<tool_call(?:\s+name=['\"]?([^'\">\s]+)['\"]?)?>(.*?)</tool_call[)]?>",
        r"<function=(\w+)>(.*?)</function>",
@@ -2062,16 +2236,33 @@ def _parse_commandcode_text_tool_calls(text):
            if not tc_name:
                continue
            tc_id = _extract_field(snippet, "id")
-            tool_name = "exec_command" if tc_name.lower() in ("bash", "shell", "terminal", "run_command") else tc_name
-            args_raw = _extract_args(snippet) or _extract_field(snippet, "arguments") or _extract_field(snippet, "input") or "{}"
-            try:
-                args = json.loads(args_raw) if args_raw.startswith('{') else {"cmd": args_raw}
-            except Exception:
-                args = {"cmd": args_raw}
-            if "cmd" not in args or not args["cmd"]:
-                args["cmd"] = str(args)
-            # [FIX 11] Self-healing: unwrap double-wrapped cmd values
-            args["cmd"] = _unwrap_cmd(args.get("cmd", ""))
+            
+            # [FIX 20] Support explore / explore_agent in raw JSON tool calls
+            is_explore = tc_name.lower() in ("explore", "explore_agent")
+            
+            if is_explore:
+                # Build explore command from the whole snippet/arguments
+                explore_cmd, explore_just = _build_explore_cmd_local(snippet)
+                if explore_cmd:
+                    args = {"cmd": explore_cmd}
+                    if explore_just:
+                        args["justification"] = explore_just
+                else:
+                    args = {"cmd": "echo 'explore: unable to extract repository URL'", "justification": "Fallback for explore tool call without URL."}
+                tool_name = "exec_command"
+            else:
+                # [FIX 19] Translate execute_request and other variations to exec_command (CLI only supports exec_command)
+                tool_name = "exec_command" if tc_name.lower() in ("exec", "bash", "shell", "terminal", "run_command", "execute_request", "execute_command", "run_shell_command", "run_shell", "run") else tc_name
+                args_raw = _extract_args(snippet) or _extract_field(snippet, "arguments") or _extract_field(snippet, "input") or "{}"
+                try:
+                    args = json.loads(args_raw) if args_raw.startswith('{') else {"cmd": args_raw}
+                except Exception:
+                    args = {"cmd": args_raw}
+                if "cmd" not in args or not args["cmd"]:
+                    args["cmd"] = str(args)
+                # [FIX 11] Self-healing: unwrap double-wrapped cmd values
+                args["cmd"] = _unwrap_cmd(args.get("cmd", ""))
+                
            # Normalize sandbox_permissions to valid values
            _VALID_SP = frozenset({"use_default", "require_escalated", "with_user_approval"})
            if "sandbox_permissions" in args:
@@ -2100,6 +2291,7 @@ def _parse_commandcode_text_tool_calls(text):
                "arguments": json.dumps(args, ensure_ascii=False),
            })
        return results
+
    for pat in patterns:
        for m in re.finditer(pat, text, re.DOTALL | re.IGNORECASE):
            if pat.startswith("<function"):
@@ -2118,7 +2310,8 @@ def _parse_commandcode_text_tool_calls(text):
                    cmd = obj.get("command") or obj.get("cmd") or ""
                    cmd = _unwrap_cmd(cmd)  # [FIX 11]
                    if cmd:
-                        tool_name = "exec_command" if raw_name.lower() in ("bash", "shell", "terminal", "run_command") else raw_name
+                        # [FIX 19] Translate execute_request and other variations to exec_command (CLI only supports exec_command)
+                        tool_name = "exec_command" if raw_name.lower() in ("exec", "bash", "shell", "terminal", "run_command", "execute_request", "execute_command", "run_shell_command", "run_shell", "run") else raw_name
                        args = {"cmd": cmd}
                        sp = obj.get("sandbox_permissions")
                        if isinstance(sp, dict) and sp.get("require_escalated"):
@@ -2134,7 +2327,19 @@ def _parse_commandcode_text_tool_calls(text):
            for pm in re.finditer(r"<parameter(?:\s+name=[\"']?(\w+)[\"']?|=(\w+))>(.*?)</parameter>", body, re.DOTALL | re.IGNORECASE):
                key = pm.group(1) or pm.group(2) or "text"
                params[key] = _strip_xmlish_tags(pm.group(3)).strip()
-            cmd = params.get("command") or params.get("cmd") or ""
+            
+            # [FIX 20] Support explore / explore_agent in XML tool calls
+            is_explore = raw_name.lower() in ("explore", "explore_agent")
+            if is_explore:
+                explore_cmd, explore_just = _build_explore_cmd_local(body)
+                if explore_cmd:
+                    cmd = explore_cmd
+                    params["justification"] = explore_just
+                else:
+                    cmd = ""
+            else:
+                cmd = params.get("command") or params.get("cmd") or ""
+
            if not cmd and body_stripped.startswith("{"):
                cm = re.search(r'"(?:command|cmd)"\s*:\s*"(.*?)"\s*,\s*"(?:sandbox_permissions|justification|prefix_rule)"', body, re.DOTALL)
                if not cm:
@@ -2159,7 +2364,9 @@ def _parse_commandcode_text_tool_calls(text):
                    cmd = "\n".join(lines)
            if not cmd:
                continue
-            tool_name = "exec_command" if raw_name.lower() in ("bash", "shell", "terminal", "run_command") else raw_name
+            # [FIX 19] Translate execute_request and other variations to exec_command (CLI only supports exec_command)
+            # [FIX 20] Translate explore and explore_agent to exec_command
+            tool_name = "exec_command" if raw_name.lower() in ("exec", "bash", "shell", "terminal", "run_command", "execute_request", "execute_command", "run_shell_command", "run_shell", "run", "explore", "explore_agent") else raw_name
            args = {"cmd": _unwrap_cmd(cmd)}  # [FIX 11] all paths must unwrap
            if params.get("sandbox_permissions"):
                args["sandbox_permissions"] = params["sandbox_permissions"]
@@ -2169,6 +2376,42 @@ def _parse_commandcode_text_tool_calls(text):

    # Also extract raw JSON tool-call objects embedded in free text
    calls.extend(_extract_raw_json_tool_calls(text))
+
+    # [FIX 18] Native <todo_write> blocks from the model (used for checklist/task tracking)
+    # The model outputs a task checklist in a custom <todo_write> XML tag block:
+    #   <todo_write>
+    #     <todos>[{"id":"1","status":"in_progress","description":"..."}]</todos>
+    #   </todo_write>
+    # We parse this and map it to a standard 'TodoWrite' tool call so the CLI agent loop continues execution.
+    for m in re.finditer(r"<todo_write>(.*?)</todo_write>", text, re.DOTALL | re.IGNORECASE):
+        body = (m.group(1) or "").strip()
+        if not body:
+            continue
+        todos_match = re.search(r"<todos>(.*?)</todos>", body, re.DOTALL | re.IGNORECASE)
+        if not todos_match:
+            continue
+        raw_todos_json = todos_match.group(1).strip()
+        try:
+            raw_todos = json.loads(raw_todos_json)
+        except Exception as e:
+            print(f"[translate-proxy] [FIX 18] Failed to parse <todos> JSON: {e}", file=sys.stderr)
+            raw_todos = None
+        if isinstance(raw_todos, list):
+            parsed_todos = []
+            for item in raw_todos:
+                if isinstance(item, dict):
+                    desc = item.get("description") or item.get("content") or ""
+                    parsed_todos.append({
+                        "content": desc,
+                        "activeForm": item.get("activeForm") or desc,
+                        "status": item.get("status") or "pending"
+                    })
+            calls.append({
+                "full_match": m.group(0),
+                "name": "TodoWrite",
+                "arguments": json.dumps({"todos": parsed_todos}, ensure_ascii=False)
+            })
+
    # [FIX 11] Self-healing: last-chance sanitization pass on ALL extracted calls
    calls = _sanitize_tool_calls(calls)
    return calls
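Review note: the `<todo_write>` mapping added by FIX 18 can be tried standalone. This is a condensed sketch, not the commit's code: `parse_todo_write` is a hypothetical name, the `full_match` field is omitted, and parse failures are skipped silently instead of being logged to stderr.

```python
import json
import re


def parse_todo_write(text):
    """Map <todo_write> blocks to TodoWrite-style call dicts (illustrative only)."""
    calls = []
    for block in re.finditer(r"<todo_write>(.*?)</todo_write>", text, re.DOTALL | re.IGNORECASE):
        inner = re.search(r"<todos>(.*?)</todos>", block.group(1), re.DOTALL | re.IGNORECASE)
        if not inner:
            continue
        try:
            items = json.loads(inner.group(1).strip())
        except ValueError:
            continue  # malformed JSON: no call is produced for this block
        if not isinstance(items, list):
            continue
        todos = []
        for item in items:
            if not isinstance(item, dict):
                continue
            desc = item.get("description") or item.get("content") or ""
            todos.append({
                "content": desc,
                "activeForm": item.get("activeForm") or desc,
                "status": item.get("status") or "pending",
            })
        calls.append({
            "name": "TodoWrite",
            "arguments": json.dumps({"todos": todos}, ensure_ascii=False),
        })
    return calls
```

The `description` field from the model is renamed to `content`, and `activeForm` falls back to the same text when the model does not supply it.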
@@ -2191,6 +2434,14 @@ def _sanitize_tool_calls(calls):
    """
    cleaned = []
    for i, call in enumerate(calls):
+        # [FIX 18] Skip the sanitization pass for non-shell tool calls (e.g., TodoWrite).
+        # The sanitizer only validates and repairs shell commands (the 'cmd' argument of exec_command).
+        # Running it on tools that have no 'cmd' parameter would flag them as JSON garbage or
+        # empty commands and corrupt their real arguments.
+        if call.get("name") != "exec_command":
+            cleaned.append(call)
+            continue
+
        try:
            args_raw = call.get("arguments", "{}")
            if isinstance(args_raw, str):
@@ -2417,6 +2668,70 @@ def cc_stream_to_sse(cc_stream, model, req_id):
            else:
                _deflog(f"[CC-DEBUG] Fallback also failed. text_buf first 500: {text_buf[:500]!r}")
    
+    # [FIX 25] SELF-HEALING STUCK DETECTOR
+    # When ALL parsers returned empty and text has intent signals, synthesize a
+    # command so the agent loop doesn't stall. This catches:
+    #   - Bare text with no tool call format at all
+    #   - Unrecognized XML-ish blocks
+    #   - Partial JSON (bare "{")
+    #   - Model explaining what it wants to do but not producing a tool call
+    if not parsed_tool_calls and len(text_buf) > 10:
+        _synth_cmd = None
+        _synth_just = None
+        _tl = text_buf.lower()
+
+        # Heuristic 1: URL in text → fetch it
+        _url_in_text = re.search(r"https?://[^\s\]'\\>\",]+", text_buf)
+        if _url_in_text:
+            _synth_url = _url_in_text.group(0).rstrip(")].,;'\\\"")
+            _synth_cmd = f"curl -sL --max-time 15 '{_synth_url}' 2>/dev/null | head -200"
+            _synth_just = "Auto-synthesized: URL detected in text, fetching"
+
+        # Heuristic 2: File path references → list or read
+        if not _synth_cmd:
+            _file_m = re.search(r"(?:read|open|view|check|examine|cat|show)\s+(?:the\s+)?(?:file\s+)?[`'\"]?(/[^\s'\"]+\.\w+)", text_buf, re.IGNORECASE)
+            if _file_m:
+                _fpath = _file_m.group(1)  # matched on text_buf so the path keeps its original case
+                _synth_cmd = f"(cat '{_fpath}' 2>/dev/null || ls -la '{_fpath}') | head -200"
+                _synth_just = f"Auto-synthesized: file reference detected ({_fpath})"
+
+        # Heuristic 3: Shell command in backticks or quotes (captures only the command and its first argument)
+        if not _synth_cmd:
+            _shell_m = re.search(r"[`'\"]((?:curl|wget|git|npm|pip|python|ls|cat|grep|find|mkdir|cd|rm|cp|mv|chmod|docker|make|cargo|go)\s[^\s`'\"]+)", text_buf)
+            if _shell_m:
+                _synth_cmd = _shell_m.group(1)
+                _synth_just = "Auto-synthesized: shell command detected in text"
+
+        # Heuristic 4: "explore" or "fetch" intent + last user URL
+        if not _synth_cmd and ("explore" in _tl or "fetch" in _tl or "investigate" in _tl or "repository" in _tl):
+            for _prev_url in reversed(_last_user_urls):  # newest first, so the most recent user URL wins
+                _url_m2 = re.search(r"https?://[^\s\]'\\>\",]+", _prev_url)
+                if _url_m2:
+                    _pu = _url_m2.group(0).rstrip(")].,;'\\\"")
+                    _ecmd, _ejust = _build_explore_cmd(_pu)
+                    if _ecmd:
+                        _synth_cmd = _ecmd
+                        _synth_just = _ejust or "Auto-synthesized: explore intent with last user URL"
+                    break
+
+        # Heuristic 5: Generic "I need to" / "let me" / "I'll" intent with command-like text
+        if not _synth_cmd:
+            _intent_m = re.search(r"(?:I(?:'ll| will| need to| should)|let me|please)\s+(.+?)(?:\.|!|\n|$)", _tl, re.IGNORECASE)
+            if _intent_m:
+                _intent_text = _intent_m.group(1).strip().replace("'", "")  # a stray quote would break out of the echo below
+                if 10 < len(_intent_text) < 200:
+                    _synth_cmd = f"echo 'Stuck recovery: model intent was: {_intent_text[:100]}'"
+                    _synth_just = f"Auto-synthesized from intent text: {_intent_text[:80]}"
+
+        if _synth_cmd:
+            parsed_tool_calls = [{
+                "full_match": "__synth_stuck_recovery__",
+                "name": "exec_command",
+                "arguments": json.dumps({"cmd": _synth_cmd, "justification": _synth_just or "Auto-synthesized stuck recovery"}, ensure_ascii=False),
+            }]
+            _deflog(f"[CC-DEBUG] [STUCK-RECOVERY] Synthesized: cmd={_synth_cmd[:120]!r}")
+            print("[CC-DEBUG] [STUCK-RECOVERY] Synthesized command from text intent", file=sys.stderr, flush=True)
+
    # Also log to stderr for visibility when not piped
    print(f"[CC-DEBUG] text_buf={len(text_buf)} chars, tool_calls={len(parsed_tool_calls)}", file=sys.stderr, flush=True)
    
@@ -3095,10 +3410,20 @@ class Handler(http.server.BaseHTTPRequestHandler):
        if self.path in ("/v1/models", "/models"):
            self.send_json(200, {"object": "list", "data": MODELS})
        elif self.path in ("/health", "/v1/health"):
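+            _mem_mb = 0
+            try:
+                import resource as _res  # Unix-only; importing inside the try keeps /health working elsewhere
+                _mem_mb = _res.getrusage(_res.RUSAGE_SELF).ru_maxrss / (1024 ** 2 if sys.platform == "darwin" else 1024)  # peak RSS: KiB on Linux, bytes on macOS
+            except Exception:
+                pass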
+            _uptime = time.time() - _START_TIME if '_START_TIME' in globals() else 0  # dir() lists only locals here
            self.send_json(200, {"ok": True, "backend": BACKEND,
                                 "target_url": TARGET_URL,
                                 "models": [m.get("id") for m in MODELS],
-                                 "bgp_routes": len(BGP_ROUTES)})
+                                 "bgp_routes": len(BGP_ROUTES),
+                                 "uptime_s": round(_uptime, 1),
+                                 "memory_mb": round(_mem_mb, 1),
+                                 "requests_total": _STATS.get("requests", 0)})
        else:
            self.send_error(404)

@@ -3126,6 +3451,9 @@ class Handler(http.server.BaseHTTPRequestHandler):
        except Exception as e:
            return self.send_json(400, {"error": {"message": f"Bad request: {e}"}})

+        self._session_id = uuid.uuid4().hex[:8]
+        _sid = self._session_id
+
        import datetime as _dt
        _log_path = os.path.join(_LOG_DIR, "requests.log")
        _ts = _dt.datetime.now().isoformat()
@@ -3139,9 +3467,9 @@ class Handler(http.server.BaseHTTPRequestHandler):
        raw_types = [i.get("type") for i in raw_input] if isinstance(raw_input, list) else "str"
        resolved_types = [i.get("type") for i in input_data] if isinstance(input_data, list) else "str"

-        print(f"[REQUEST] prev_id={prev_id} raw={raw_types} resolved={resolved_types}", file=sys.stderr)
+        print(f"[{_sid}] prev_id={prev_id} raw={raw_types} resolved={resolved_types}", file=sys.stderr)
        with open(_log_path, "a") as _lf:
-            _lf.write(f"\n{'='*60}\n{_ts} REQUEST {self.path}\n")
+            _lf.write(f"\n{'='*60}\n{_ts} [session={_sid}] REQUEST {self.path}\n")
            _lf.write(f"  prev_id={prev_id}\n")
            _lf.write(f"  raw_input_types={raw_types}\n")
            _lf.write(f"  resolved_input_types={resolved_types}\n")
@@ -3163,6 +3491,12 @@ class Handler(http.server.BaseHTTPRequestHandler):
        model = body.get("model", MODELS[0]["id"] if MODELS else "unknown")
        stream = body.get("stream", False)
        request_id = body.get("request_id") or body.get("id") or uid("req")
+        if isinstance(input_data, list):
+            for item in input_data:
+                if isinstance(item, dict) and item.get("type") == "message" and item.get("role") == "user":
+                    content = str(item.get("content", ""))
+                    for url_m in re.finditer(r"https?://[^\s\]'\"<>]+", content):
+                        _last_user_urls.append(url_m.group(0))
        save_request_snapshot(request_id, body)
        _req_t0 = time.time()
        try:
@@ -3229,7 +3563,7 @@ class Handler(http.server.BaseHTTPRequestHandler):
                "Content-Type": "application/json",
                "Authorization": f"Bearer {effective_key}",
            }, browser_ua=True)
-            print(f"[translate-proxy] POST {target} model={model} stream={stream} items={len(input_data) if isinstance(input_data,list) else 1}", file=sys.stderr)
+            print(f"[{self._session_id}] POST {target} model={model} stream={stream} items={len(input_data) if isinstance(input_data,list) else 1}", file=sys.stderr)
            chat_body_b = json.dumps(chat_body).encode()
            max_retries = 3
            for attempt in range(max_retries + 1):
@@ -3247,14 +3581,14 @@ class Handler(http.server.BaseHTTPRequestHandler):
                                wait = min(2 ** (attempt + 1), 15)
                        else:
                            wait = min(2 ** (attempt + 1), 15)
-                        print(f"[translate-proxy] HTTP {e.code} (attempt {attempt+1}/{max_retries}), retrying in {wait}s: {err_body[:150]}", file=sys.stderr)
+                        print(f"[{self._session_id}] HTTP {e.code} (attempt {attempt+1}/{max_retries}), retrying in {wait}s: {err_body[:150]}", file=sys.stderr)
                        time.sleep(wait)
                        continue
                    return self.send_json(e.code, {"error": {"type": "upstream_error", "message": _sanitize_err_body(err_body)}})
                except (ConnectionResetError, ConnectionAbortedError, BrokenPipeError) as e:
                    if attempt < max_retries:
                        wait = min(2 ** (attempt + 1), 10)
-                        print(f"[translate-proxy] connection error (attempt {attempt+1}/{max_retries}), retrying in {wait}s: {e}", file=sys.stderr)
+                        print(f"[{self._session_id}] connection error (attempt {attempt+1}/{max_retries}), retrying in {wait}s: {e}", file=sys.stderr)
                        time.sleep(wait)
                        continue
                    return self.send_json(502, {"error": {"type": "proxy_error", "message": str(e)}})
@@ -3488,7 +3822,7 @@ class Handler(http.server.BaseHTTPRequestHandler):
            headers["X-Goog-Api-Client"] = "gl-node/22.17.0"
            headers["Client-Metadata"] = "ideType=IDE_UNSPECIFIED,platform=PLATFORM_UNSPECIFIED,pluginType=GEMINI"
        body_b = json.dumps(wrapped).encode()
-        print(f"[gemini-oauth] model={model} stream={stream} items={len(input_data) if isinstance(input_data, list) else 1} project={project_id}", file=sys.stderr)
+        print(f"[{self._session_id}] model={model} stream={stream} items={len(input_data) if isinstance(input_data, list) else 1} project={project_id}", file=sys.stderr)

        for ep in endpoints:
            target = f"{ep}/{url_suffix}"
@@ -3503,17 +3837,17 @@ class Handler(http.server.BaseHTTPRequestHandler):
                        debug_path = os.path.join(_LOG_DIR, "gemini-last-400-request.json")
                        with open(debug_path, "w") as dbg:
                            json.dump({"endpoint": ep, "model": model, "wrapped": wrapped, "error": err_body}, dbg, indent=2)
-                        print(f"[gemini-oauth] saved 400 debug request to {debug_path}", file=sys.stderr)
+                        print(f"[{self._session_id}] saved 400 debug request to {debug_path}", file=sys.stderr)
                    except Exception:
                        pass
                if e.code == 429 and ep != endpoints[-1]:
-                    print(f"[gemini-oauth] {ep} HTTP 429, trying next endpoint", file=sys.stderr)
+                    print(f"[{self._session_id}] {ep} HTTP 429, trying next endpoint", file=sys.stderr)
                    continue
                return self.send_json(e.code, {"error": {"type": "upstream_error", "message": _sanitize_err_body(err_body)}})
            except Exception as e:
                if ep == endpoints[-1]:
                    return self.send_json(502, {"error": {"type": "proxy_error", "message": str(e)}})
-                print(f"[gemini-oauth] {ep} failed: {e}, trying next", file=sys.stderr)
+                print(f"[{self._session_id}] {ep} failed: {e}, trying next", file=sys.stderr)
                continue

        if stream:
@@ -3566,10 +3900,10 @@ class Handler(http.server.BaseHTTPRequestHandler):
                candidates = chunk.get("response", chunk).get("candidates", [])
                if not candidates:
                    if chunk.get("error"):
-                        print(f"[gemini-oauth] stream error chunk: {str(chunk.get('error'))[:300]}", file=sys.stderr)
+                        print(f"[{self._session_id}] stream error chunk: {str(chunk.get('error'))[:300]}", file=sys.stderr)
                    continue
                if candidates[0].get("finishReason") and not candidates[0].get("content", {}).get("parts"):
-                    print(f"[gemini-oauth] finish without parts: {candidates[0].get('finishReason')}", file=sys.stderr)
+                    print(f"[{self._session_id}] finish without parts: {candidates[0].get('finishReason')}", file=sys.stderr)
                parts = candidates[0].get("content", {}).get("parts", [])
                for part in parts:
                    if part.get("thought"):
@@ -3598,7 +3932,7 @@ class Handler(http.server.BaseHTTPRequestHandler):
                last_finish = candidates[0].get("finishReason", "")
                if OAUTH_PROVIDER == "google-antigravity" and full_text and last_finish:
                    if last_finish == "MAX_TOKENS" and not current_tool_calls:
-                        print(f"[gemini-oauth] MAX_TOKENS hit ({len(full_text)} chars), auto-continuing...", file=sys.stderr)
+                        print(f"[{self._session_id}] MAX_TOKENS hit ({len(full_text)} chars), auto-continuing...", file=sys.stderr)
                        break
                    stream_finished = True
                    break
@@ -3704,14 +4038,14 @@ class Handler(http.server.BaseHTTPRequestHandler):
                "Content-Type": "application/json",
                "Authorization": f"Bearer {r_key}",
            }, browser_ua=True)
-            print(f"[bgp] trying route '{route.get('name', r_url)}' model={r_model}", file=sys.stderr)
+            print(f"[{self._session_id}] trying route '{route.get('name', r_url)}' model={r_model}", file=sys.stderr)
            req = urllib.request.Request(target, data=json.dumps(chat_body).encode(), headers=fwd)
            t0_route = time.time()
            route_ok = False
            for attempt in range(3):
                try:
                    upstream = urllib.request.urlopen(req, timeout=_upstream_timeout(body, stream))
-                    print(f"[bgp] route '{route.get('name', r_url)}' connected OK", file=sys.stderr)
+                    print(f"[{self._session_id}] route '{route.get('name', r_url)}' connected OK", file=sys.stderr)
                    _update_route_stats(route, True, time.time() - t0_route)
                    self._forward_oa_compat(upstream, stream, r_model, chat_body, body, input_data, fwd, target)
                    return
@@ -3720,18 +4054,18 @@ class Handler(http.server.BaseHTTPRequestHandler):
                    if e.code in (429, 502, 503) and attempt < 2:
                        retry_after = e.headers.get("Retry-After")
                        wait = min(int(retry_after), 60) if retry_after and retry_after.isdigit() else min(2 ** (attempt + 1), 10)
-                        print(f"[bgp] route '{route.get('name', r_url)}' HTTP {e.code}, retry {attempt+1}/2 in {wait}s", file=sys.stderr)
+                        print(f"[{self._session_id}] route '{route.get('name', r_url)}' HTTP {e.code}, retry {attempt+1}/2 in {wait}s", file=sys.stderr)
                        time.sleep(wait)
                        req = urllib.request.Request(target, data=json.dumps(chat_body).encode(), headers=fwd)
                        continue
-                    print(f"[bgp] route '{route.get('name', r_url)}' FAILED: HTTP {e.code}: {err[:200]}", file=sys.stderr)
+                    print(f"[{self._session_id}] route '{route.get('name', r_url)}' FAILED: HTTP {e.code}: {err[:200]}", file=sys.stderr)
                    _update_route_stats(route, False, time.time() - t0_route, http_code=e.code)
                    errors.append(f"{route.get('name','?')}: HTTP {e.code}")
                    break
                except (ConnectionResetError, ConnectionAbortedError, BrokenPipeError) as e:
                    if attempt < 2:
                        wait = min(2 ** (attempt + 1), 8)
-                        print(f"[bgp] route '{route.get('name', r_url)}' conn error, retry {attempt+1}/2 in {wait}s: {e}", file=sys.stderr)
+                        print(f"[{self._session_id}] route '{route.get('name', r_url)}' conn error, retry {attempt+1}/2 in {wait}s: {e}", file=sys.stderr)
                        time.sleep(wait)
                        req = urllib.request.Request(target, data=json.dumps(chat_body).encode(), headers=fwd)
                        continue
@@ -3739,12 +4073,12 @@ class Handler(http.server.BaseHTTPRequestHandler):
                    errors.append(f"{route.get('name','?')}: {e}")
                    break
                except Exception as e:
-                    print(f"[bgp] route '{route.get('name', r_url)}' FAILED: {e}", file=sys.stderr)
+                    print(f"[{self._session_id}] route '{route.get('name', r_url)}' FAILED: {e}", file=sys.stderr)
                    _update_route_stats(route, False, time.time() - t0_route, error_type=str(e))
                    errors.append(f"{route.get('name','?')}: {e}")
                    break

-        print(f"[bgp] ALL ROUTES FAILED: {errors}", file=sys.stderr)
+        print(f"[{self._session_id}] ALL ROUTES FAILED: {errors}", file=sys.stderr)
        self.send_json(502, {"error": {"type": "bgp_all_routes_failed", "message": f"All BGP routes failed: {'; '.join(errors)}"}})

    def _forward_oa_compat(self, upstream, stream, model, chat_body, body, input_data, fwd, target, tracker=None):
@@ -4022,7 +4356,7 @@ class Handler(http.server.BaseHTTPRequestHandler):
            }

            fwd = forwarded_headers(self.headers, headers_extra, browser_ua=True)
-            print(f"[translate-proxy] POST {target} model={model} stream={stream} attempt={attempt} [command-code]", file=sys.stderr)
+            print(f"[{self._session_id}] POST {target} model={model} stream={stream} attempt={attempt} [command-code]", file=sys.stderr)
            req = urllib.request.Request(
                target,
                data=json.dumps(cc_body).encode(),
@@ -4037,7 +4371,7 @@ class Handler(http.server.BaseHTTPRequestHandler):
                if attempt < max_retries:
                    hints = ErrorAnalyzer.analyze(err, schema)
                    if hints:
-                        print(f"[command-code] error analysis: {hints}", file=sys.stderr)
+                        print(f"[{self._session_id}] error analysis: {hints}", file=sys.stderr)
                        ErrorAnalyzer.merge_into_schema(hints, schema)
                        _save_schema(schema, model=model)
                        continue
@@ -4083,7 +4417,7 @@ class Handler(http.server.BaseHTTPRequestHandler):
            try:
                self.stream_buffered_events(cc_stream_to_sse(upstream, model, body.get("request_id") or body.get("id")), on_event=on_event)
            except Exception as e:
-                print(f"[command-code] stream error: {e}", file=sys.stderr)
+                print(f"[{self._session_id}] stream error: {e}", file=sys.stderr)
                try:
                    err_event = 'data: ' + json.dumps({"type": "response.completed",
                        "response": {"id": body.get("request_id") or body.get("id") or uid("resp"),
@@ -4416,7 +4750,8 @@ class Handler(http.server.BaseHTTPRequestHandler):

    def log_message(self, fmt, *args):
        msg = fmt % args if args else fmt
-        print(f"[translate-proxy] {BACKEND} {msg}", file=sys.stderr)
+        _sid = getattr(self, '_session_id', None) or 'proxy'
+        print(f"[{_sid}] {BACKEND} {msg}", file=sys.stderr)

 _SHUTDOWN_REQUESTED = False

@@ -4425,10 +4760,11 @@ def _handle_shutdown_signal(sig, frame):
    _SHUTDOWN_REQUESTED = True
    print(f"[SELF-REVIVE] Signal {sig} received, shutting down cleanly", flush=True)
    if 'SERVER' in globals() and SERVER:
        SERVER.shutdown()
 
 def main():
-    global SERVER
+    global SERVER, _START_TIME
+    _START_TIME = time.time()
    _init_runtime()
    signal.signal(signal.SIGTERM, _handle_shutdown_signal)
    signal.signal(signal.SIGINT, _handle_shutdown_signal)
@@ -4539,6 +4875,124 @@ if __name__ == "__main__":
            except Exception as e:
                _check(f"sanitizer: output valid JSON, got {e}", False)
        
+        # Pattern H: Native <todo_write> XML block parsing and sanitization bypass (FIX 18)
+        _todo_xml = """Some preamble text.
+<todo_write>
+<todos>[{"id":"1","status":"in_progress","description":"Create landing page directory and HTML structure"},{"id":"2","status":"pending","description":"Write the full landing page"}]</todos>
+</todo_write>
+Postamble text."""
+        _calls_h = _parse_commandcode_text_tool_calls(_todo_xml)
+        _check("todo_write: extracted call exists", len(_calls_h) == 1, f"got {len(_calls_h)} calls")
+        if _calls_h:
+            _call_h = _calls_h[0]
+            _check("todo_write: name is TodoWrite", _call_h.get("name") == "TodoWrite")
+            try:
+                _args_h = json.loads(_call_h.get("arguments", "{}"))
+                _todos_h = _args_h.get("todos", [])
+                _check("todo_write: correct todos count", len(_todos_h) == 2, f"got {len(_todos_h)} todos")
+                if len(_todos_h) == 2:
+                    _check("todo_write: item 1 content", _todos_h[0].get("content") == "Create landing page directory and HTML structure")
+                    _check("todo_write: item 1 activeForm", _todos_h[0].get("activeForm") == "Create landing page directory and HTML structure")
+                    _check("todo_write: item 1 status", _todos_h[0].get("status") == "in_progress")
+                    _check("todo_write: item 2 status", _todos_h[1].get("status") == "pending")
+                # Confirm that the arguments contain no 'cmd' or sanitization comment
+                _check("todo_write: no cmd injected", "cmd" not in _args_h)
+            except Exception as e:
+                _check(f"todo_write: parsed JSON error: {e}", False)
+        
+        # Pattern I: Translate execute_request to exec_command (FIX 19)
+        _exec_req_raw = '<｜｜DSML｜｜tool_calls>\n<｜｜DSML｜｜invoke name="execute_request">\n<｜｜DSML｜｜parameter name="command" string="true">ls -la</｜｜DSML｜｜parameter>\n</｜｜DSML｜｜invoke>\n</｜｜DSML｜｜tool_calls>'
+        _calls_i = _parse_commandcode_text_tool_calls(_exec_req_raw)
+        _check("execute_request: mapped successfully", len(_calls_i) == 1, f"got {len(_calls_i)} calls")
+        if _calls_i:
+            _call_i = _calls_i[0]
+            _check("execute_request: name translated to exec_command", _call_i.get("name") == "exec_command", f"got {_call_i.get('name')}")
+            try:
+                _args_i = json.loads(_call_i.get("arguments", "{}"))
+                _check("execute_request: correct command extracted", _args_i.get("cmd") == "ls -la", f"got {_args_i.get('cmd')}")
+            except Exception as e:
+                _check(f"execute_request: arguments parsing error: {e}", False)
+
+        # Pattern J: Translate DSML-style explore/explore_agent block (FIX 20)
+        _explore_dsml = '<｜｜DSML｜｜tool_calls>\n  <｜｜DSML｜｜invoke name="explore">\n  <｜｜DSML｜｜parameter name="messages" string="true">[{"content": "Understand what the Z.AI-Chat-for-Android project is about... URL: https://github.rommark.dev/admin/Z.AI-Chat-for-Android", "role": "user"}]</｜｜DSML｜｜parameter>\n  </｜｜DSML｜｜invoke>\n  </｜｜DSML｜｜tool_calls>'
+        _calls_j = _parse_commandcode_text_tool_calls(_explore_dsml)
+        _check("explore DSML: mapped successfully", len(_calls_j) == 1, f"got {len(_calls_j)} calls")
+        if _calls_j:
+            _call_j = _calls_j[0]
+            _check("explore DSML: name translated to exec_command", _call_j.get("name") == "exec_command", f"got {_call_j.get('name')}")
+            try:
+                _args_j = json.loads(_call_j.get("arguments", "{}"))
+                _check("explore DSML: built a curl explore script targeting api base", "api/v1/repos/admin/Z.AI-Chat-for-Android" in _args_j.get("cmd", ""), f"got {_args_j.get('cmd')!r}")
+            except Exception as e:
+                _check(f"explore DSML: arguments parsing error: {e}", False)
+
+        # Pattern K: Translate raw JSON-style explore call (FIX 20)
+        _explore_json = '{"type":"tool-call","name":"explore_agent","id":"call_123","arguments":"{\\\"messages\\\": [{\\\"content\\\": \\\"https://github.rommark.dev/admin/Z.AI-Chat-for-Android\\\"}]}"}'
+        _calls_k = _parse_commandcode_text_tool_calls(_explore_json)
+        _check("explore JSON: mapped successfully", len(_calls_k) == 1, f"got {len(_calls_k)} calls")
+        if _calls_k:
+            _call_k = _calls_k[0]
+            _check("explore JSON: name translated to exec_command", _call_k.get("name") == "exec_command")
+            try:
+                _args_k = json.loads(_call_k.get("arguments", "{}"))
+                _check("explore JSON: built a curl explore script targeting api base", "api/v1/repos/admin/Z.AI-Chat-for-Android" in _args_k.get("cmd", ""), f"got {_args_k.get('cmd')!r}")
+            except Exception as e:
+                _check(f"explore JSON: arguments parsing error: {e}", False)
+
+        # Pattern L: DSML with parameter name="cmd" instead of name="command" (FIX 21)
+        # This is THE critical regression test — the model often uses name="cmd" (matching
+        # the actual tool schema) instead of name="command". Previously the DSML parser
+        # silently dropped these, causing Codex CLI to halt mid-task.
+        _cmd_dsml = '<｜｜DSML｜｜tool_calls>\n  <｜｜DSML｜｜invoke name="exec_command">\n  <｜｜DSML｜｜parameter name="cmd" string="true">curl -sL --max-time 15 \'https://github.rommark.dev/api/v1/repos/admin/Z.AI-Chat-for-Android/contents/README.md\' 2>/dev/null</｜｜DSML｜｜parameter>\n  <｜｜DSML｜｜parameter name="sandbox_permissions" string="true">require_escalated</｜｜DSML｜｜parameter>\n  <｜｜DSML｜｜parameter name="justification" string="true">I need to get the README from the private repo to understand the Android app before building the landing page mockup.</｜｜DSML｜｜parameter>\n  </｜｜DSML｜｜invoke>\n  </｜｜DSML｜｜tool_calls>'
+        _calls_l = _parse_commandcode_text_tool_calls(_cmd_dsml)
+        _check("DSML name=cmd: mapped successfully", len(_calls_l) == 1, f"got {len(_calls_l)} calls")
+        if _calls_l:
+            _call_l = _calls_l[0]
+            _check("DSML name=cmd: name is exec_command", _call_l.get("name") == "exec_command", f"got {_call_l.get('name')}")
+            try:
+                _args_l = json.loads(_call_l.get("arguments", "{}"))
+                _check("DSML name=cmd: cmd extracted correctly", "curl -sL --max-time 15" in _args_l.get("cmd", ""), f"got {_args_l.get('cmd')!r}")
+                _check("DSML name=cmd: sandbox_permissions extracted", _args_l.get("sandbox_permissions") == "require_escalated", f"got {_args_l.get('sandbox_permissions')!r}")
+                _check("DSML name=cmd: justification extracted", "README" in _args_l.get("justification", ""), f"got {_args_l.get('justification')!r}")
+            except Exception as e:
+                _check(f"DSML name=cmd: arguments parsing error: {e}", False)
+
+        # Pattern M: explore_agent with nested JSON messages containing URL (FIX 23)
+        _explore_nested = '<explore_agent>\nmessages: [{"content": "Understand the Z.AI-Chat-for-Android repo at https://github.rommark.dev/admin/Z.AI-Chat-for-Android"}]\n</explore_agent>'
+        _calls_m = _parse_commandcode_text_tool_calls(_explore_nested)
+        _check("FIX23 explore nested JSON: parsed", len(_calls_m) == 1, f"got {len(_calls_m)} calls")
+        if _calls_m:
+            _args_m = json.loads(_calls_m[0].get("arguments", "{}"))
+            _check("FIX23 explore nested JSON: cmd has curl", "curl" in _args_m.get("cmd", ""), f"got {_args_m.get('cmd')!r}")
+            _check("FIX23 explore nested JSON: URL in cmd", "github.rommark.dev" in _args_m.get("cmd", ""), "missing URL in cmd")
+
+        # Pattern N: require_escalation block (FIX 24)
+        _esc_text = '<require_escalation>I need to run a command with elevated permissions to access the repository at https://github.rommark.dev/admin/Z.AI-Chat-for-Android</require_escalation>'
+        _calls_n = _parse_commandcode_text_tool_calls(_esc_text)
+        _check("FIX24 require_escalation: parsed", len(_calls_n) == 1, f"got {len(_calls_n)} calls")
+        if _calls_n:
+            _args_n = json.loads(_calls_n[0].get("arguments", "{}"))
+            _check("FIX24 require_escalation: name is exec_command", _calls_n[0].get("name") == "exec_command", f"got {_calls_n[0].get('name')}")
+            _check("FIX24 require_escalation: cmd has curl or echo", "curl" in _args_n.get("cmd", "") or "echo" in _args_n.get("cmd", ""), f"got {_args_n.get('cmd')!r}")
+
+        # Pattern N2: bare request_escalation_permission tag (FIX 24b)
+        _esc_bare = 'I want to proceed.\n<request_escalation_permission />\nPlease let me continue.'
+        _calls_n2 = _parse_commandcode_text_tool_calls(_esc_bare)
+        _check("FIX24b bare escalation: parsed", len(_calls_n2) == 1, f"got {len(_calls_n2)} calls")
+        if _calls_n2:
+            _check("FIX24b bare escalation: name is exec_command", _calls_n2[0].get("name") == "exec_command", f"got {_calls_n2[0].get('name')}")
+
+        # Pattern O: _build_explore_cmd module-level function (FIX 23/25)
+        _cmd_o, _just_o = _build_explore_cmd("https://github.rommark.dev/admin/Z.AI-Chat-for-Android")
+        _check("FIX23/25 _build_explore_cmd: returns cmd", _cmd_o is not None, "returned None")
+        _check("FIX23/25 _build_explore_cmd: has curl", _cmd_o and "curl" in _cmd_o, f"no curl in {_cmd_o!r}")
+        _check("FIX23/25 _build_explore_cmd: has api path", _cmd_o and "/api/v1/repos/" in _cmd_o, f"no api path in {_cmd_o!r}")
+
+        # Pattern O2: _build_explore_cmd with JSON array containing URL
+        _cmd_o2, _ = _build_explore_cmd('[{"content": "https://github.rommark.dev/admin/Z.AI-Chat-for-Android"}]')
+        _check("FIX23/25 _build_explore_cmd from JSON array: returns cmd", _cmd_o2 is not None, "returned None")
+        _check("FIX23/25 _build_explore_cmd from JSON array: has curl", _cmd_o2 and "curl" in _cmd_o2, f"no curl in {_cmd_o2!r}")
+
        print(f"[CC-SELF-TEST] Results: {_counts[0]} passed, {_counts[1]} failed",
              file=sys.stderr)
        if _counts[1]: