Speech-to-text
25 languages via NVIDIA Parakeet TDT. ~15× faster than Whisper on Apple Silicon, ~2.5× on
CPU. Auto language detection, optional VAD for long audio, plain text or
LLM-friendly TOON output.
- EN · RU · DE · FR · ES
- + 20 more
- VAD opt-in
- JSON / TOON
Text-to-speech
Kokoro-82M for English, Vosk-TTS for Russian, plus ~180 macOS system voices.
Auto-routes by detected language. SSML preview supported.
Language detection
SpeechBrain ECAPA-TDNN identifies 107 languages from audio. Apple
NLLanguageRecognizer handles text. Both run locally.
Voice activity detection
Silero VAD v5 trims silence on long, sparse recordings. Auto-enabled past 120
seconds of audio.
Rust engine, zero deps
A single 20MB binary running FluidAudio (CoreML) on Apple Silicon and ort (ONNX) on
everything else. Audio decoded by Symphonia: WAV, MP3, OGG/Opus, FLAC, AAC, M4A.
No ffmpeg. No Python. No native Node addons.
OpenClaw-ready
Plug straight into your LLM agent as a voice-processing skill. Give Claude (or any
agent) ears and a voice in one install.