feat: add Vosk STT - offline voice-to-text, no API key needed
@@ -58,6 +58,59 @@ User message + AI response
| `/recall <query>` | Search memories by keyword |
| `/forget <id>` | Delete a specific memory |

### 🎤 Voice I/O (Speech-to-Text + Text-to-Speech)

Fully local voice processing. No API keys, no cloud services, no costs.

```
User sends voice message
        │
        ▼
 ┌──────────────┐
 │ Download OGG │ ← Telegram Bot API
 │   to /tmp    │
 └──────┬───────┘
        │
        ▼
 ┌──────────────┐
 │ ffmpeg → WAV │ ← 16kHz mono (Vosk requirement)
 │ (16kHz mono) │
 └──────┬───────┘
        │
        ▼
 ┌──────────────┐
 │ Vosk STT     │ ← Offline, ~200ms, 68MB model
 │ Python bridge│   Zero network calls
 └──────┬───────┘
        │
        ▼
 {"text": "...", "confidence": 0.95}
        │
        ▼
 Feed into chatWithAI → AI responds
 (optionally via TTS tool → voice reply)
```

| Component | Technology | Size | Latency | Cost |
|---|---|---|---|---|
| **STT** (voice→text) | [Vosk](https://alphacephei.com/vosk/) — offline speech recognition | 68MB model | ~200ms | Free |
| **TTS** (text→voice) | [node-edge-tts](https://github.com/yayuyokit/Edge-TTS-node) — Microsoft Edge voices | No download | ~2s | Free |
| **Audio conversion** | ffmpeg (system) | N/A | ~100ms | Free |
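The audio-conversion step in the table can be sketched as follows. This is a minimal illustration, not the repo's actual code: the helper names are hypothetical, but the ffmpeg flags shown (16 kHz sample rate, single channel) match what Vosk requires.

```python
import subprocess

def ffmpeg_cmd(src_ogg: str, dst_wav: str) -> list:
    """Build the ffmpeg invocation that converts a Telegram OGG Opus
    voice note into the 16 kHz mono PCM WAV that Vosk expects."""
    return [
        "ffmpeg", "-y",        # overwrite output if it already exists
        "-i", src_ogg,         # input: OGG Opus downloaded from Telegram
        "-ar", "16000",        # resample to 16 kHz
        "-ac", "1",            # downmix to mono
        "-f", "wav", dst_wav,  # output: PCM WAV
    ]

def convert_to_wav(src_ogg: str, dst_wav: str) -> None:
    """Run the conversion; raises CalledProcessError on ffmpeg failure."""
    subprocess.run(ffmpeg_cmd(src_ogg, dst_wav),
                   check=True, capture_output=True)
```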

**How it works:**

1. Telegram sends voice as OGG Opus. Bot downloads it to `/tmp`.
2. `scripts/stt.py` — Python bridge that converts to WAV (ffmpeg) and runs Vosk inference.
3. Returns JSON `{"text": "...", "confidence": 0.95}` to Node.js.
4. Transcribed text enters the normal `handleTextMessage()` pipeline — full AI response with streaming, tools, memory, self-correction.
5. AI can optionally use the `tts` tool to reply with a voice message.
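Steps 2–3 above can be sketched roughly like this. It is an approximation of what a bridge like `scripts/stt.py` would do, not the repo's actual code: the model directory and helper names are illustrative, and the confidence value is computed here as the mean of Vosk's per-word `conf` scores.

```python
import json
import sys
import wave

def mean_confidence(result: dict) -> float:
    """Average the per-word 'conf' values Vosk emits when SetWords(True)
    is enabled; return 0.0 for an empty utterance."""
    words = result.get("result", [])
    if not words:
        return 0.0
    return sum(w["conf"] for w in words) / len(words)

def transcribe(wav_path: str, model_dir: str) -> dict:
    # Imported lazily so the pure helper above works without Vosk installed.
    from vosk import Model, KaldiRecognizer

    wf = wave.open(wav_path, "rb")            # expects 16 kHz mono PCM
    rec = KaldiRecognizer(Model(model_dir), wf.getframerate())
    rec.SetWords(True)                        # include per-word confidences
    while True:
        chunk = wf.readframes(4000)
        if not chunk:
            break
        rec.AcceptWaveform(chunk)
    final = json.loads(rec.FinalResult())
    return {"text": final.get("text", ""),
            "confidence": round(mean_confidence(final), 2)}

if __name__ == "__main__":
    # Node.js spawns this script and reads the JSON line from stdout.
    print(json.dumps(transcribe(sys.argv[1], sys.argv[2])))
```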

**Why Vosk over Whisper:**

- **No GPU needed** — runs on CPU, ~200MB RAM (Whisper needs 1-4GB)
- **Fast** — ~200ms vs 5-10s for Whisper on CPU
- **Tiny model** — 68MB vs 1-3GB for Whisper
- **Offline** — zero network calls, zero API costs
- **Good enough** — ~95% accuracy for English speech

### 🧠 Intelligence Routing

The core of zCode CLI X's reliability: a unified agentic loop that handles both streaming and non-streaming requests through the same execution path — no more split paths that lose context or hang silently.
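The unified-loop idea can be illustrated with a small sketch. All names here are hypothetical, not the project's actual code: both transports are normalized into one chunk iterator, so a single loop is the only place that emits deltas or inspects the reply, whichever mode produced it.

```python
from typing import Iterable, Iterator, Union

def as_chunks(response: Union[str, Iterable[str]]) -> Iterator[str]:
    """Normalize both transports into one shape: a non-streaming reply
    becomes a single-chunk stream, a stream is passed through."""
    if isinstance(response, str):
        yield response
    else:
        yield from response

def run_loop(response: Union[str, Iterable[str]]) -> str:
    """One execution path for streaming and non-streaming replies."""
    parts = []
    for chunk in as_chunks(response):
        # Single place to emit deltas, detect tool calls, update context.
        parts.append(chunk)
    return "".join(parts)
```

Because every reply flows through `run_loop`, there is no second code path that can silently diverge from the first.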
@@ -464,6 +517,10 @@ Z.AI API (SSE)

| Telegram integration | ✅ Native bot + webhook + streaming | ✅ 2-way Telegram bridge | ❌ None |
| Discord | ✅ Native bot (discord.js) | ✅ Full Discord integration | ❌ None |
| Multi-channel delivery | ✅ Delivery hub (TG + DC + WS + log) | ✅ Cron→multi-platform | ❌ None |
| **Voice** | | | |
| Speech-to-Text | ✅ Vosk (offline, ~200ms, 68MB) | ⚠️ Whisper (needs GPU) | ❌ None |
| Text-to-Speech | ✅ Edge TTS (free, 100+ voices) | ✅ node-edge-tts | ❌ None |
| Voice→AI pipeline | ✅ Transcribe → full agentic loop | ⚠️ Separate pipeline | ❌ None |
| **Infrastructure** | | | |
| Model routing | ✅ Multi-provider | ✅ Multi-provider routing | ❌ Single model |
| Context compression | ✅ Compact pipeline | ✅ lean-ctx MCP (90% savings) | ❌ None |
@@ -485,6 +542,8 @@ Z.AI API (SSE)

- **Winston**: Structured logging
- **WebSocket**: Real-time updates
- **RTK**: Rust Token Killer (token optimization)
- **Vosk**: Offline speech recognition (STT, 68MB model, no API key)
- **ffmpeg**: Audio conversion (OGG → WAV for Vosk)

## 🤝 Contributing