fix: eliminate EADDRINUSE crash loop with robust port binding

Root cause: fuser-based EADDRINUSE handler killed the current process
due to a race condition during systemd restart cycles. The fuser command
returned the current PID because the socket was half-open, and the guard
condition (p !== process.pid) failed to filter it.

Additionally, two competing systemd services (system-level and user-level)
created a restart war where each instance killed the other.

Fix approach (inspired by Next.js, Vite, webpack-dev-server):
- Replace fuser with net.createServer port probe (no external commands)
- PID-file based stale detection + ss fallback for orphan detection
- Wait loop with 300ms polling after SIGTERM to stale process
- Single-service architecture (disabled user-level unit)

Tested: 5 consecutive rapid restarts, 8+ minute uptime, zero crashes.

Co-Authored-By: zcode <noreply@zcode.dev>
admin
2026-05-06 12:47:36 +00:00
parent c164446a9c
commit 98ed33ba8f
4 changed files with 198 additions and 69 deletions

@@ -7,6 +7,52 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
---
## [2.0.1] - 2026-05-06
### 🐛 Fixed
#### Critical: EADDRINUSE Crash Loop (Port Binding Race Condition)
**Root Cause**: The EADDRINUSE error handler used `fuser` to identify processes on port 3001.
During systemd restart cycles, `fuser` returned the current process PID due to a race condition
(the socket was half-open before the guard `p !== process.pid` could filter it). The process
would kill itself, triggering a crash loop.
Additionally, two competing systemd services (system-level and user-level) were both trying to
manage the same binary, creating a restart war where each instance killed the other.
**Fix**: Replaced the entire `fuser`-based port conflict resolution with a robust approach
inspired by Next.js, Vite, and webpack-dev-server:
1. **PID-file based stale detection** — Read `.zcode-bot.pid` to identify the previous instance
(no `fuser`, no race condition with the current process)
2. **`net.createServer` port probe** — Atomically test if a port is free using Node.js built-in
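`net` module (no external shell commands; the probe is a real bind attempt, so the only remaining TOCTOU window is between closing the probe and the real `listen()` call)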
3. **`ss` fallback** — When pidfile is missing (deleted during graceful shutdown), use `ss -tlnp`
to find the PID owning the port (kernel-authoritative, no race)
4. **Wait loop with 300ms polling** — After SIGTERM to stale process, poll until port is confirmed
free before attempting to bind (up to 5s timeout)
5. **Single-service architecture** — Disabled the user-level systemd unit; only the system-level
`zcode.service` manages the process, preventing dual-instance conflicts
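
The probe and wait loop from steps 2 and 4 could look roughly like the sketch below. It is not the code from `src/bot/index.js`: the function names `isPortFree` and `waitForPortFree`, the option names, and the `127.0.0.1` default host are assumptions made for illustration. Only the 300ms interval and 5s timeout come from the list above.

```javascript
const net = require('node:net');

// Resolve true if `port` can be bound right now, false on EADDRINUSE.
// Any other error (e.g. EACCES) is passed to the caller.
function isPortFree(port, host = '127.0.0.1') {
  return new Promise((resolve, reject) => {
    const probe = net.createServer();
    probe.once('error', (err) => {
      if (err.code === 'EADDRINUSE') resolve(false);
      else reject(err);
    });
    probe.once('listening', () => {
      // Release the port right away so the real server can bind it.
      probe.close(() => resolve(true));
    });
    probe.listen(port, host);
  });
}

// Poll until the port is free or the timeout elapses.
// Resolves true if the port became free, false on timeout.
async function waitForPortFree(port, options = {}) {
  const { intervalMs = 300, timeoutMs = 5000, host = '127.0.0.1' } = options;
  const deadline = Date.now() + timeoutMs;
  for (;;) {
    if (await isPortFree(port, host)) return true;
    if (Date.now() >= deadline) return false;
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}
```

A caller would send SIGTERM to the stale process, `await waitForPortFree(3001)`, and bind only if it resolves `true`. Another process can still take the port between the last probe and the real bind, so the real server's own `error` handler remains necessary.
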
**Impact**: The bot now survives rapid restart cycles (5 consecutive restarts tested),
recovers cleanly from stale processes, and showed zero EADDRINUSE crashes over 8+ minutes of uptime in testing.
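
The stale-process detection from steps 1 and 3 can be sketched as follows. This is an illustration rather than the committed implementation: all helper names (`readPidfile`, `parseSsPid`, `isAlive`, `findStalePid`) are hypothetical. It assumes the caller has already captured the `ss -tlnp` output for the port in question, and that the output carries the usual `users:(("node",pid=1234,fd=20))` suffix.

```javascript
const fs = require('node:fs');

// Read a pidfile; return the PID, or null if the file is missing or malformed.
function readPidfile(path) {
  try {
    const pid = Number.parseInt(fs.readFileSync(path, 'utf8').trim(), 10);
    return Number.isInteger(pid) && pid > 0 ? pid : null;
  } catch {
    return null;
  }
}

// Extract the first `pid=<n>` from `ss -tlnp` output.
// Assumes the text was already narrowed to the listener on the port of interest.
function parseSsPid(ssOutput) {
  const match = /pid=(\d+)/.exec(ssOutput);
  return match ? Number(match[1]) : null;
}

// True if a process with `pid` exists. Signal 0 delivers nothing;
// it only performs the existence and permission checks.
function isAlive(pid) {
  try {
    process.kill(pid, 0);
    return true;
  } catch (err) {
    return err.code === 'EPERM'; // process exists but belongs to another user
  }
}

// Decide which PID, if any, should receive SIGTERM.
// The pidfile is consulted first; the ss output is used only when it is missing.
function findStalePid(pidfilePath, ssOutput = '') {
  const candidate = readPidfile(pidfilePath) ?? parseSsPid(ssOutput);
  // Never target PID 0 or the current process.
  if (!candidate || candidate === process.pid) return null;
  return isAlive(candidate) ? candidate : null;
}
```

The explicit `candidate === process.pid` check is what prevents the self-kill described in the root cause, regardless of which source produced the PID.
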
#### Secondary Fixes
- **Pidfile lock removed** — The old `acquirePidfile()` killed any process with the stored PID,
including the current process during restart races. The pidfile is now informational only.
- **WebSocket EADDRINUSE swallower removed** — The `wss.on('error')` handler silently swallowed
EADDRINUSE errors on the WS server, masking the real issue. Removed entirely
- **`sequentialize` middleware disabled** — `@grammyjs/runner`'s `sequentialize` caused
incompatibility with systemd service management; replaced with a pass-through middleware
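
For reference, a pass-through middleware in grammY's `(ctx, next)` style can be as small as the sketch below. The name `passThrough` is hypothetical, and the actual replacement in `src/bot/index.js` may differ.

```javascript
// Pass-through middleware: does no work of its own and hands control
// straight to the next handler in the chain.
const passThrough = (ctx, next) => next();
```

It would be registered where `sequentialize(...)` used to be, for example `bot.use(passThrough)`, assuming `bot` is the grammY `Bot` instance.
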
### 🔧 Changed
- `src/bot/index.js` — Port binding logic completely rewritten (68 lines removed, 143 added)
- `zcode.service` (system) — Added `EnvironmentFile`, reduced `RestartSec` to 5s,
added `TimeoutStartSec=60`
- User-level systemd unit masked to prevent dual-service conflicts
## [2.0.0] - 2026-05-06
### 🎉 Major Release - Ruflo Integration Complete