fix: eliminate EADDRINUSE crash loop with robust port binding

Root cause: the fuser-based EADDRINUSE handler killed the current process
because of a race condition during systemd restart cycles. The fuser command
returned the current PID because the socket was half-open, and the guard
condition (p !== process.pid) failed to filter it. Additionally, two competing
systemd services (system-level and user-level) created a restart war in which
each instance killed the other.

Fix approach (inspired by Next.js, Vite, webpack-dev-server):
- Replace fuser with a net.createServer port probe (no external commands)
- PID-file based stale detection + ss fallback for orphan detection
- Wait loop with 300ms polling after SIGTERM to the stale process
- Single-service architecture (disabled user-level unit)

Tested: 5 consecutive rapid restarts, 8+ minute uptime, zero crashes.

Co-Authored-By: zcode <noreply@zcode.dev>
CHANGELOG.md
@@ -7,6 +7,52 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

---

## [2.0.1] - 2026-05-06

### 🐛 Fixed

#### Critical: EADDRINUSE Crash Loop (Port Binding Race Condition)

**Root Cause**: The EADDRINUSE error handler used `fuser` to identify processes on port 3001.
During systemd restart cycles, `fuser` returned the current process PID due to a race condition
(the socket was half-open before the guard `p !== process.pid` could filter it). The process
would kill itself, triggering a crash loop.

Additionally, two competing systemd services (system-level and user-level) were both trying to
manage the same binary, creating a restart war where each instance killed the other.

**Fix**: Replaced the entire `fuser`-based port conflict resolution with a robust approach
inspired by Next.js, Vite, and webpack-dev-server:

1. **PID-file based stale detection**: Read `.zcode-bot.pid` to identify the previous instance
   (no `fuser`, no race condition with the current process)
2. **`net.createServer` port probe**: Atomically test if a port is free using Node.js built-in
   `net` module (no external shell commands, no TOCTOU gap)
3. **`ss` fallback**: When the pidfile is missing (deleted during graceful shutdown), use `ss -tlnp`
   to find the PID owning the port (kernel-authoritative, no race)
4. **Wait loop with 300ms polling**: After sending SIGTERM to the stale process, poll until the
   port is confirmed free before attempting to bind (up to 5s timeout)
5. **Single-service architecture**: Disabled the user-level systemd unit; only the system-level
   `zcode.service` manages the process, preventing dual-instance conflicts
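
The probe and wait loop from items 2 and 4 can be sketched roughly as follows. This is a minimal illustration, not the code from `src/bot/index.js`: the function names `isPortFree` and `waitForPortFree`, the option names, and the default host `127.0.0.1` are assumptions made for the example. Only the 300ms polling interval and the 5s timeout come from the changelog.

```javascript
const net = require('node:net');

// Resolve true if `port` can be bound on `host`, false on EADDRINUSE.
// Any other listen error is passed on to the caller as a rejection.
// (Hypothetical helper name; the default host is an assumption.)
function isPortFree(port, host = '127.0.0.1') {
  return new Promise((resolve, reject) => {
    const probe = net.createServer();
    probe.once('error', (err) => {
      if (err.code === 'EADDRINUSE') resolve(false);
      else reject(err);
    });
    probe.once('listening', () => {
      // Release the port again before reporting it as free.
      probe.close(() => resolve(true));
    });
    probe.listen(port, host);
  });
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Poll until the port is free or the timeout elapses.
// Defaults mirror the changelog: 300ms interval, 5s timeout.
async function waitForPortFree(port, { intervalMs = 300, timeoutMs = 5000, host } = {}) {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    if (await isPortFree(port, host)) return true;
    await sleep(intervalMs);
  }
  // One last check once the deadline has passed.
  return isPortFree(port, host);
}
```

A caller would typically send SIGTERM to the stale PID, `await waitForPortFree(3001)`, and only then call `listen` on the real server. With a probe of this kind, the real server's own `error` handler is still the final check, since another process could in principle take the port between the probe closing and the real bind.
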

**Impact**: The bot now survives rapid restart cycles (5 consecutive restarts tested),
recovers cleanly from stale processes, and showed zero EADDRINUSE crashes in testing.

#### Secondary Fixes

- **Pidfile lock removed**: The old `acquirePidfile()` killed any process with the stored PID,
  including the current process during restart races. The pidfile is now informational only.
- **WebSocket EADDRINUSE swallower removed**: The `wss.on('error')` handler silently swallowed
  EADDRINUSE errors on the WS server, masking the real issue. It has been removed entirely.
- **`sequentialize` middleware disabled**: `@grammyjs/runner`'s `sequentialize` caused
  incompatibility with systemd service management; it was replaced with a pass-through middleware.
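
One way to treat the pidfile as informational only, with an `ss` fallback for orphans, is sketched below. This is an illustration under stated assumptions rather than the project's implementation: the helper names `readStalePid`, `parseSsPids`, and `isAlive` are hypothetical, and the parser assumes the usual `ss -tlnp` column layout (state first, local address in the fourth column, `pid=` inside the process column). The point of the sketch is that the current PID is excluded at every step and nothing is killed as a side effect of reading.

```javascript
const fs = require('node:fs');

// Read a PID from the pidfile. Returns null if the file is missing,
// malformed, or contains our own PID. Never signals any process.
function readStalePid(pidfilePath, selfPid = process.pid) {
  let raw;
  try {
    raw = fs.readFileSync(pidfilePath, 'utf8');
  } catch {
    return null; // pidfile absent, e.g. removed during graceful shutdown
  }
  const pid = Number.parseInt(raw.trim(), 10);
  if (!Number.isInteger(pid) || pid <= 0 || pid === selfPid) return null;
  return pid;
}

// Extract PIDs listening on `port` from `ss -tlnp` output, excluding our own.
// Assumes column 4 (index 3) holds the local address:port.
function parseSsPids(ssOutput, port, selfPid = process.pid) {
  const pids = new Set();
  for (const line of ssOutput.split('\n')) {
    const cols = line.trim().split(/\s+/);
    if (cols[0] !== 'LISTEN' || !cols[3] || !cols[3].endsWith(`:${port}`)) continue;
    for (const match of line.matchAll(/pid=(\d+)/g)) {
      const pid = Number(match[1]);
      if (pid !== selfPid) pids.add(pid);
    }
  }
  return [...pids];
}

// True if a process with this PID exists. Signal 0 performs the
// existence check only and does not deliver a signal.
function isAlive(pid) {
  try {
    process.kill(pid, 0);
    return true;
  } catch (err) {
    return err.code === 'EPERM'; // exists but is owned by another user
  }
}
```

In use, the output of `ss -tlnp` would be captured (for example with `child_process.execFileSync`) and passed to `parseSsPids` only when `readStalePid` returns null.
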

### 🔧 Changed

- `src/bot/index.js`: Port binding logic completely rewritten (68 lines removed, 143 added)
- `zcode.service` (system): Added `EnvironmentFile`, reduced `RestartSec` to 5s,
  added `TimeoutStartSec=60`
- User-level systemd unit masked to prevent dual-service conflicts
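
For orientation, the three `zcode.service` settings named above would sit in the unit's `[Service]` section roughly as in this fragment. It is a sketch, not the actual unit file: the `EnvironmentFile` path is a placeholder (the changelog does not state it), and directives the changelog does not mention, such as `ExecStart` and `Restart`, are left out.

```ini
[Service]
# Placeholder path: the real location of the environment file is not given.
EnvironmentFile=/path/to/zcode.env
# Delay between restarts, reduced to 5 seconds.
RestartSec=5
# Allow up to 60 seconds for startup before systemd treats it as failed.
TimeoutStartSec=60
```

Note that `RestartSec` only takes effect when the unit also sets a `Restart=` policy, which is assumed here to exist already in the real unit.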

## [2.0.0] - 2026-05-06

### 🎉 Major Release - Ruflo Integration Complete