Speech-to-text
25 languages via NVIDIA Parakeet TDT. ~19× faster than Whisper on Apple Silicon,
~2.5× on CPU. Auto language detection, optional VAD for long audio,
timestamped segments, plain text or LLM-friendly TOON output.
- EN · RU · DE · FR · ES
- + 20 more
- --timestamps
- VAD opt-in
- JSON / TOON
New
Speaker diarization
Drop a meeting recording in, get back timestamped segments tagged with a stable
speaker ID. Powered by FluidAudio’s Sortformer model running on the Apple Neural
Engine via a Swift sidecar — ~4.6× real-time on a 71-minute Zoom call. v1.19
gives the cold ANE warmup extra headroom so the first --speakers call
on a fresh process no longer hits a false timeout.
- --speakers
- darwin-arm64
- Sortformer (CoreML)
- opt-in install
Text-to-speech
Kokoro-82M for English (CoreML on Apple Silicon, ONNX elsewhere) with acronym
auto-expand (FBI, EPAM, JSON), Vosk-TTS for Russian with stress markers, plus ~180
macOS system voices. Auto-routes by detected language. SSML
<prosody rate>, <break>, and
<emphasis> work on every backend — v1.21 closed the Apple
Silicon gap via FluidAudio 0.14.7’s model-native speed control.
OGG/Opus voice notes
Synthesize straight into the format Telegram, WhatsApp, Signal and Discord
expect for inline voice messages — ~24× smaller than WAV. Pick a bitrate; the
engine handles the OggOpus container, no ffmpeg required.
- --format ogg-opus
- 16 / 32 / 64 kbps
Timestamped transcripts
--json --timestamps returns segments with start and
end. Plays nicely with subtitles, meeting tools, and chunked LLM
summarisation. Programmatic API: transcribeWithTimestamps().
Language detection
107 languages from audio via SpeechBrain ECAPA-TDNN. Text language detection
runs through a precompiled Swift sidecar on macOS — 30–50 ms cold or warm.
Voice activity detection
Silero VAD v5 trims silence on long, sparse recordings. Auto-enabled past 120
seconds of audio.
Persistent --stdin-loop
One TTS process, many requests. Models stay in memory across calls so each
synthesis hits ~10 ms cold-start instead of ~800 ms. Ideal for chatbots,
live agents, and long agentic loops.
New
kesha record
Capture voice notes from the default microphone and pipe straight into
kesha --json to close the transcribe-then-reply loop. No
ffmpeg, no separate recorder process.
- kesha record
- | kesha --json
- WAV mono
New
Privacy-safe diagnostic logs
Opt-in local-only command events with paths and user-supplied text redacted.
kesha doctor surfaces status; kesha logs manages them;
kesha support-bundle bundles them for support. Mode
retain-on-failure keeps logs only when a command actually fails.
- kesha doctor
- kesha logs
- support-bundle
- local only
v1.23
Multilingual TTS
English + Spanish, French, Italian, and Portuguese on every platform via
CharsiuG2P (ByT5-tiny ONNX, CC-BY 4.0) with per-language number and acronym
normalizers — OTAN, OVNI, FIFA read as words. Mandarin Chinese
runs natively on Apple Silicon via FluidAudio 0.14.8’s tone-aware ANE variant.
Russian stays on Vosk-TTS. --lang es-ES for Castilian Spanish.
- en · es · fr · it · pt
- ru (Vosk)
- zh (macOS ANE)
- es-ES · es-419
v1.22
Structured error taxonomy
Every failure path prints a stable, machine-readable error [CODE]: …
line. kesha-engine --error-codes-json dumps the whole taxonomy; a
docs drift gate keeps docs/errors.md honest. Agents and CI can branch
on why something failed instead of regexing log soup.
- stable codes
- --error-codes-json
- exit codes
- agent-friendly
Rust engine, zero deps
A single Rust binary for macOS arm64, Linux x64, and Windows x64. FluidAudio
(CoreML) on Apple Silicon via dedicated sidecars (diarization, TTS, language
detection); ort (ONNX) elsewhere. Symphonia decodes WAV, MP3, OGG/Opus, FLAC,
AAC, M4A; pure-Rust opus + flacenc handle encode
— royalty-free codecs, no C dep, no patent exposure. Linux .deb and
.rpm packages signed with Sigstore.
No ffmpeg. No Python. No native Node addons.
OpenClaw-ready
Plug straight into your LLM agent as a voice-processing skill. Give Claude (or any
agent) ears and a voice in one install.