Files

admin 994c5481bf fix: crash loop after reboot - resilient error handlers + mask user service

Root causes:
1. uncaughtException/unhandledRejection called gracefulShutdown() -> process.exit(0)
   Any minor error killed the entire bot. Changed to LOG ONLY (Hermes/OpenCode pattern).
2. User-level systemd service was running alongside system-level, fighting for port 3001.
   Masked user service permanently.
3. Fragile new Promise(() => {}) keepalive replaced with setInterval-based keepalive.
4. Syntax error in uncaughtException handler (literal newline in single-quoted string).

Tested: 5 rapid consecutive restarts all pass. Uptime stable.

Co-Authored-By: zcode <noreply@zcode.dev>

2026-05-06 16:51:12 +00:00

14 KiB

Raw Blame History

Changelog

All notable changes to zCode CLI X will be documented in this file.

The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.

[2.0.2] - 2026-05-06

⚡ Performance

Agentic Task Execution — Hermes / OpenCode / Ruflo Inspired

Re-engineered the tool execution pipeline by studying three major open-source projects:

Sources studied:

Hermes Agent (NousResearch) — ToolCallGuardrailController with SHA256 signature-based loop detection, idempotent vs mutating tool classification, configurable warn/block/halt thresholds
OpenCode (anomalyco) — doom_loop detection, explicit "avoid bash for find/grep/cat" prompt, parallel bash call guidance built into tool descriptions
Ruflo (ruvnet) — parallel data extraction with deduplication

Before (v2.0.1): 47 tool turns, ~5 min, 87% bash, 27 turns ghost chasing wrong directory After (v2.0.2): 15 turns (7+8 delegate), ~2 min, 2-4 parallel calls/turn, 0 ghost chasing, 0 guardrail warnings

Changes:

Hermes-style ToolCallGuardrailController (src/bot/session-state.js)
- beforeCall() / afterCall() lifecycle (from Hermes ToolCallGuardrailController)
- SHA256 signature-based exact failure detection (from Hermes ToolCallSignature)
- Idempotent vs mutating tool classification (from Hermes IDEMPOTENT_TOOL_NAMES)
- Same-tool failure storm detection (warn after 3, halt after 8)
- Idempotent no-progress detection (warn when same result returned 2x, block after 5x)
- Automatic failure classification from tool results (from Hermes classify_tool_failure)
OpenCode-style tool selection guidance (src/bot/index.js system prompt)
- Explicit "avoid bash with find/grep/cat/head/tail/sed/awk" (from OpenCode shell/prompt.ts)
- "Use glob NOT find, use grep NOT grep, use file_read NOT cat" (from OpenCode)
- Parallel bash calls in single message (from OpenCode tool description)
Parallel tool execution — Promise.all() for independent calls (from Cursor)
Planning nudge injection — Pre-planning message before AI starts
Bash tool marked as LAST RESORT — with alternative tools listed in description
Full Hermes guardrail integration in tool execution loop — beforeCall checks, afterCall failure tracking, guidance appended to results

🐛 Fixed

Crash loop after reboot — uncaughtException and unhandledRejection handlers were calling gracefulShutdown() (which calls process.exit(0)), so ANY unhandled error killed the bot. Changed to LOG ONLY (Hermes/OpenCode pattern) — only SIGINT/SIGTERM trigger clean shutdown.
Dual systemd service war — User-level service (~/.config/systemd/user/zcode.service) was running alongside system-level service, both fighting for port 3001. Masked the user service permanently (ln -sf /dev/null zcode.service).
Fragile keepalive — await new Promise(() => {}) replaced with setInterval-based keepalive that's robust against V8 optimization.

📄 Files Changed

src/bot/session-state.js — Complete rewrite with Hermes guardrail controller (+200 lines)
src/bot/index.js — Parallel tool execution, system prompt overhaul, resilient error handlers (+160 lines)
src/bot/index.js — Fixed syntax error in uncaughtException handler (literal newline in string)
CHANGELOG.md — Updated with full v2.0.2 details
README.md — Updated header with v2.0.2 summary

[2.0.1] - 2026-05-06

🐛 Fixed

Critical: EADDRINUSE Crash Loop (Port Binding Race Condition)

Root Cause: The EADDRINUSE error handler used fuser to identify processes on port 3001. During systemd restart cycles, fuser returned the current process PID due to a race condition (the socket was half-open before the guard p !== process.pid could filter it). The process would kill itself, triggering a crash loop.

Additionally, two competing systemd services (system-level and user-level) were both trying to manage the same binary, creating a restart war where each instance killed the other.

Fix: Replaced the entire fuser-based port conflict resolution with a robust approach inspired by Next.js, Vite, and webpack-dev-server:

PID-file based stale detection — Read .zcode-bot.pid to identify the previous instance (no fuser, no race condition with the current process)
net.createServer port probe — Atomically test if a port is free using Node.js built-in net module (no external shell commands, no TOCTOU gap)
ss fallback — When pidfile is missing (deleted during graceful shutdown), use ss -tlnp to find the PID owning the port (kernel-authoritative, no race)
Wait loop with 300ms polling — After SIGTERM to stale process, poll until port is confirmed free before attempting to bind (up to 5s timeout)
Single-service architecture — Disabled the user-level systemd unit; only the system-level zcode.service manages the process, preventing dual-instance conflicts

Impact: The bot now survives rapid restart cycles (5 consecutive restarts tested), recovers cleanly from stale processes, and has zero EADDRINUSE crashes.

Secondary Fixes

Pidfile lock removed — The old acquirePidfile() killed any process with the stored PID, including the current process during restart races. Now pidfile is informational-only
WebSocket EADDRINUSE swallower removed — The wss.on('error') handler silently swallowed EADDRINUSE errors on the WS server, masking the real issue. Removed entirely
sequentialize middleware disabled — @grammyjs/runner's sequentialize caused incompatibility with systemd service management; replaced with a pass-through middleware

🎨 Improved

Telegram Message Formatting Overhaul

Enhanced the markdownToHtml converter in src/bot/message-sender.js to produce visually rich, well-structured Telegram messages:

Heading hierarchy — h1 gets 🚀 + separator line, h2 gets █ block marker, h3 gets ▸ triangle, h4 gets ● dot — all bold, visually distinct
Multi-line blockquotes — consecutive > lines now merge into a single <blockquote> element instead of one per line
Indented bullet lists — • with leading spaces for better readability
Table support — Markdown tables (| col | col |) rendered as <pre> blocks
Horizontal rules — --- and *** render as ──── separator lines
Code blocks — fenced code blocks get <pre><code> with language class attribute
Cleaner vertical spacing (excessive blank lines collapsed)

🔧 Changed

src/bot/index.js — Port binding logic completely rewritten (68 lines removed, 143 added)
src/bot/message-sender.js — markdownToHtml converter enhanced (13 lines removed, 41 added)
zcode.service (system) — Added EnvironmentFile, reduced RestartSec to 5s, added TimeoutStartSec=60
User-level systemd unit masked to prevent dual-service conflicts

[2.0.0] - 2026-05-06

🎉 Major Release - Ruflo Integration Complete

Complete integration of Ruflo's multi-agent orchestration system with comprehensive documentation update.

✨ Added

Core Features

Multi-Agent Swarm System
- SwarmCoordinator with 3 topologies: simple, hierarchical, swarm
- 9 agent roles: coder, tester, reviewer, architect, devops, security, researcher, designer, coordinator
- DAG-compatible task system with priorities and dependencies
- AgentOrchestrator for distributed task execution
Plugin System
- PluginManager with fault-isolated extension point routing
- PluginLoader with dependency-resolving batch loading
- 16 standard extension points:
  - tool.execute (before/after)
  - ai.response (before/after)
  - session.start / session.end
  - message.receive / message.send
  - memory.save / memory.load
  - agent.spawn / agent.terminate
  - cron.trigger
  - health.check
  - And more...
- BasePlugin with lifecycle hooks (initialize, shutdown)
Hook System
- Pre/post tool hooks for logging, validation, caching
- Pre/post AI hooks for prompt modification, response analysis
- Session lifecycle hooks (start, end, pause, resume)
- Priority-based execution order
- Zero latency impact (runs asynchronously)
Enhanced Memory Backend
- JSONBackend with typed entries, LRU eviction, text search
- InMemoryBackend with TTL auto-eviction for ephemeral data
- 7 memory types: lesson, pattern, preference, discovery, gotcha, context, ephemeral
- Smart eviction (old discoveries first, lessons/gotchas kept)

New Tools (6 Total)

swarm_spawn - Spawn new agent swarm with specified roles
swarm_execute - Execute current swarm task
swarm_distribute - Distribute work to swarm agents
swarm_state - Check swarm progress and status
swarm_terminate - Terminate all swarm agents
delegate_agent - Delegate task to specific agent role

Documentation

README.md - Complete rewrite (26,782 bytes, ~1,180 lines)
- Feature comparison table (zCode vs Hermes vs Claude vs Ruflo)
- Architecture diagrams (system overview, Ruflo integration, message flow)
- Usage examples for all commands
- Security guidelines and performance benchmarks
- Roadmap (v1.1, v1.2, v2.0)
INSTALLATION.md - New comprehensive setup guide (11,789 bytes, ~545 lines)
CREDITS.md - New attribution document (8,893 bytes, ~309 lines)
CONTRIBUTING.md - New contribution guide (9,574 bytes, ~461 lines)
REPO_UPDATE_SUMMARY.md - New update summary (7,450 bytes, ~205 lines)

Metadata

package.json - Enhanced with comprehensive metadata
- Version bumped to 2.0.0
- Added author, license, repository information
- Added 20+ keywords for discoverability
- Added funding and support links

🔄 Changed

Version Bump: 1.0.0 → 2.0.0 (major release)
README.md: Complete rewrite, 1,180 lines changed
package.json: Enhanced metadata, 55 lines modified
Documentation Structure: Organized into core, setup, and contributing sections

🛠️ Modified

src/plugins/ - New plugin system (4 files, ~23KB)
src/agents/ - Enhanced agent system (4 files, ~28KB)
src/bot/hooks.js - New hook system (4,900 bytes)
src/bot/memory-backend.js - Enhanced memory backend (8,077 bytes)
src/bot/index.js - Integrated all new systems (~17KB)

🧪 Added Tests

test-ruflo-smoke.mjs - Comprehensive smoke test suite
- Total: 53 tests, all passing ✅

🎯 Features Comparison

Feature	v1.0.0	v2.0.0	Change
24/7 Telegram Bot	✅	✅	Unchanged
Self-Learning Memory	✅	✅	Enhanced with LRU
Voice I/O	✅	✅	Unchanged
Self-Evolution	✅	✅	Unchanged
Multi-Agent Swarm	❌	✅	NEW
Plugin System	❌	✅	NEW
Hook System	❌	✅	NEW
Enhanced Memory	⚠️	✅	UPGRADED
18 Tools	✅	✅	Unchanged
9 Agent Roles	❌	✅	NEW
16 Extension Points	❌	✅	NEW
6 Swarm Tools	❌	✅	NEW
Documentation	⚠️	✅	COMPLETE

Legend: ✅ Full support | ⚠️ Partial support | ❌ Not available

[1.0.0] - 2026-05-04

🎉 Initial Release

✨ Added

Core Features
- 24/7 Telegram bot with grammy framework
- Self-learning memory (5 categories)
- Voice I/O (Vosk STT + node-edge-tts TTS)
- Self-evolution with 3-layer safety
- Intelligence Routing (unified agentic loop)
- RTK (Rust Token Killer) integration
Tools (18 Total)
- BashTool, FileEditTool, FileReadTool, FileWriteTool
- GitTool, WebSearchTool, WebFetchTool
- BrowserTool, VisionTool, TTSTool
- GrepTool, GlobTool, TaskCreateTool
- TaskUpdateTool, TaskListTool, SendMessageTool
- ScheduleCronTool, SelfEvolveTool
Agents (3 Roles)
- Code Reviewer
- System Architect
- DevOps Engineer
Skills
- code_review, bug_fix, refactor, documentation, testing
Documentation
- README.md, ARCHITECTURE.md, SERVICE_MAP.md
- QUICKSTART.md, TELEGRAM_SETUP.md, PERFORMANCE.md

🛠️ Technology Stack

AI Model: Z.AI GLM-5.1 (Coding Plan)
Telegram Framework: grammy
Web Server: Express
Logging: Winston
Voice STT: Vosk (offline)
Voice TTS: node-edge-tts
Token Optimization: RTK
Database: JSON-backed memory

📦 Installation

Node.js ≥ 20.0.0
npm ≥ 9.0.0
ffmpeg (for voice I/O)
Python 3.8+ (for Vosk)
systemd (for 24/7 service)

[Unreleased]

Planned for v2.1.0

Enhanced swarm topologies (federated, gossip)
Plugin marketplace
Advanced analytics dashboard
Custom agent training (LoRA fine-tuning)

Planned for v2.2.0

Web UI dashboard
Multi-language support (Spanish, French, German)
Distributed memory backend (Redis)
Kubernetes deployment
Horizontal scaling

Notes

Breaking Changes

v2.0.0: No breaking changes to existing functionality
All v1.0.0 features remain fully compatible
New features are additive only

Migration Guide

No migration needed! v2.0.0 is fully backward compatible with v1.0.0.

Known Issues

None reported.

Contributors

Roman (@uroma2) - Author, maintainer, primary developer
[More contributors coming soon]

zCode CLI X - The Ultimate Agentic Coding Assistant Hermes Agent × Claude Code × Ruflo × Opencode

14 KiB Raw Blame History Unescape Escape

Changelog

[2.0.2] - 2026-05-06

⚡ Performance

Agentic Task Execution — Hermes / OpenCode / Ruflo Inspired

🐛 Fixed

📄 Files Changed

[2.0.1] - 2026-05-06

🐛 Fixed

Critical: EADDRINUSE Crash Loop (Port Binding Race Condition)

Secondary Fixes

🎨 Improved

Telegram Message Formatting Overhaul

🔧 Changed

[2.0.0] - 2026-05-06

🎉 Major Release - Ruflo Integration Complete

✨ Added

Core Features

New Tools (6 Total)

Documentation

Metadata

🔄 Changed

🛠️ Modified

🧪 Added Tests

🎯 Features Comparison

[1.0.0] - 2026-05-04

🎉 Initial Release

✨ Added

🛠️ Technology Stack

📦 Installation

[Unreleased]

Planned for v2.1.0

Planned for v2.2.0

Notes

Breaking Changes

Migration Guide

Known Issues

Contributors

14 KiB

Raw Blame History