v1.23 · MIT licensed · macOS · Linux · Windows

Voice in, text out. A Rust toolkit that's ~19× faster than Whisper on Apple Silicon.

Small, fast, open-source audio models — speech-to-text with timestamped segments and speaker diarization for meetings, text-to-speech that exports straight to Telegram-ready OGG/Opus voice notes, voice activity detection, and language ID. CLI tools plus an OpenClaw skill for LLM agents. CoreML on Apple Silicon, ONNX on Linux + Windows. One Rust binary. No Python. No ffmpeg.

$ bun add -g @drakulavich/kesha-voice-kit
  • 25 STT languages
  • 107 Detection langs
  • ~20MB Single binary
  • 19× vs Whisper (M-series)

Capabilities

Everything voice, in a single CLI.

Four pipelines, one binary. Pick what you need; the rest stays out of your way.

Speech-to-text

25 languages via NVIDIA Parakeet TDT. ~19× faster than Whisper on Apple Silicon, ~2.5× on CPU. Auto language detection, optional VAD for long audio, timestamped segments, plain text or LLM-friendly TOON output.

  • EN · RU · DE · FR · ES
  • + 20 more
  • --timestamps
  • VAD opt-in
  • JSON / TOON
New

Speaker diarization

Drop a meeting recording in, get back timestamped segments tagged with a stable speaker ID. Powered by FluidAudio’s Sortformer model running on the Apple Neural Engine via a Swift sidecar — ~4.6× real-time on a 71-minute Zoom call. v1.19 gives the cold ANE warmup extra headroom so the first --speakers call on a fresh process no longer hits a false timeout.

  • --speakers
  • darwin-arm64
  • Sortformer (CoreML)
  • opt-in install
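Once segments carry a speaker ID, a common next step is collapsing them into speaker turns. The sketch below assumes each segment looks like `{ start, end, speaker, text }` with times in seconds; those field names are hypothetical, since the actual `--json --speakers` schema is not shown here and may differ.

```typescript
// Hypothetical segment shape; adjust to the real --json --speakers output.
interface DiarizedSegment {
  start: number;   // seconds from the start of the recording
  end: number;     // seconds from the start of the recording
  speaker: string; // stable speaker ID
  text: string;
}

// Merge consecutive segments from the same speaker into a single turn.
function mergeTurns(segments: DiarizedSegment[]): DiarizedSegment[] {
  const turns: DiarizedSegment[] = [];
  for (const seg of segments) {
    const last = turns[turns.length - 1];
    if (last && last.speaker === seg.speaker) {
      last.end = seg.end;
      last.text = `${last.text} ${seg.text}`;
    } else {
      turns.push({ ...seg }); // copy so the input array is not mutated
    }
  }
  return turns;
}

// Total speaking time per speaker, in seconds.
function talkTime(turns: DiarizedSegment[]): Map<string, number> {
  const totals = new Map<string, number>();
  for (const t of turns) {
    totals.set(t.speaker, (totals.get(t.speaker) ?? 0) + (t.end - t.start));
  }
  return totals;
}
```

Merged turns tend to read better in meeting notes, and the per-speaker totals give a quick view of who talked for how long.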

Text-to-speech

Kokoro-82M for English (CoreML on Apple Silicon, ONNX elsewhere) with acronym auto-expand (FBI, EPAM, JSON), Vosk-TTS for Russian with stress markers, plus ~180 macOS system voices. Auto-routes by detected language. SSML <prosody rate>, <break>, and <emphasis> work on every backend — v1.21 closed the Apple Silicon gap via FluidAudio 0.14.7’s model-native speed control.

OGG/Opus voice notes

Synthesize straight into the format Telegram, WhatsApp, Signal and Discord expect for inline voice messages — ~24× smaller than WAV. Pick a bitrate; the engine handles the OggOpus container, no ffmpeg required.

  • --format ogg-opus
  • 16 / 32 / 64 kbps
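The "~24× smaller" figure can be sanity-checked with back-of-the-envelope arithmetic. This sketch assumes the WAV baseline is 24 kHz, 16-bit, mono PCM and that the comparison is against the 16 kbps Opus setting; both are assumptions, and container overhead is ignored.

```typescript
// Bitrate of uncompressed PCM audio, in kbps.
function pcmKbps(sampleRateHz: number, bitsPerSample: number, channels: number): number {
  return (sampleRateHz * bitsPerSample * channels) / 1000;
}

// Approximate payload size in kilobytes for a clip at a constant bitrate.
function sizeKB(kbps: number, seconds: number): number {
  return (kbps * seconds) / 8;
}

// Assumed baseline: 24 kHz, 16-bit, mono WAV.
const wavKbps = pcmKbps(24000, 16, 1);

// Size ratio against each Opus bitrate offered by --format ogg-opus.
const ratios = [16, 32, 64].map((opusKbps) => ({
  opusKbps,
  ratio: wavKbps / opusKbps,
}));
```

Under these assumptions the WAV stream runs at 384 kbps, so the 16 kbps setting gives a 24× reduction and the 64 kbps setting a 6× reduction. A 10-second note at 16 kbps is about 20 KB of audio payload.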

Timestamped transcripts

--json --timestamps returns segments with start and end. Plays nicely with subtitles, meeting tools, and chunked LLM summarisation. Programmatic API: transcribeWithTimestamps().
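As an illustration of the subtitle use case, here is a minimal sketch that turns segments into SubRip (SRT) text. It assumes each segment is `{ start, end, text }` with times in seconds; the `text` field name and the unit are assumptions, not a documented schema.

```typescript
// Hypothetical segment shape; only start and end are confirmed by the docs.
interface Segment {
  start: number; // seconds
  end: number;   // seconds
  text: string;
}

// Format seconds as an SRT timestamp: HH:MM:SS,mmm
function srtTime(seconds: number): string {
  const ms = Math.round(seconds * 1000);
  const pad = (n: number, width: number) => String(n).padStart(width, "0");
  const h = Math.floor(ms / 3600000);
  const m = Math.floor((ms % 3600000) / 60000);
  const s = Math.floor((ms % 60000) / 1000);
  return `${pad(h, 2)}:${pad(m, 2)}:${pad(s, 2)},${pad(ms % 1000, 3)}`;
}

// Build an SRT document: numbered cues separated by blank lines.
function toSrt(segments: Segment[]): string {
  return segments
    .map((seg, i) => `${i + 1}\n${srtTime(seg.start)} --> ${srtTime(seg.end)}\n${seg.text}\n`)
    .join("\n");
}
```

The same segment array can be sliced into fixed time windows before being sent to an LLM, which keeps each summarisation request short.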

Language detection

107 languages from audio via SpeechBrain ECAPA-TDNN. Text language detection runs through a precompiled Swift sidecar on macOS — 30–50 ms cold or warm.

Voice activity detection

Silero VAD v5 trims silence on long, sparse recordings. Auto-enabled past 120 seconds of audio.

Persistent --stdin-loop

One TTS process, many requests. Models stay in memory across calls, so each synthesis starts in ~10 ms instead of paying an ~800 ms cold start. Ideal for chatbots, live agents, and long agentic loops.

New

kesha record

Capture voice notes from the default microphone and pipe straight into kesha --json to close the transcribe-then-reply loop. No ffmpeg, no separate recorder process.

  • kesha record
  • | kesha --json
  • WAV mono
New

Privacy-safe diagnostic logs

Opt-in, local-only command events with paths and user-supplied text redacted. kesha doctor surfaces their status, kesha logs manages them, and kesha support-bundle packages them for a bug report. The retain-on-failure mode keeps logs only when a command actually fails.

  • kesha doctor
  • kesha logs
  • support-bundle
  • local only
v1.23

Multilingual TTS

English + Spanish, French, Italian, and Portuguese on every platform via CharsiuG2P (ByT5-tiny ONNX, CC-BY 4.0) with per-language number and acronym normalizers — OTAN, OVNI, FIFA read as words. Mandarin Chinese runs natively on Apple Silicon via FluidAudio 0.14.8’s tone-aware ANE variant. Russian stays on Vosk-TTS. --lang es-ES for Castilian Spanish.

  • en · es · fr · it · pt
  • ru (Vosk)
  • zh (macOS ANE)
  • es-ES · es-419
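The routing described above can be pictured as a small lookup from language tag to backend. This is only an illustrative sketch: the backend labels, the function name, and the platform strings are hypothetical and do not reflect the engine's real identifiers.

```typescript
// Hypothetical backend labels, for illustration only.
type Backend = "kokoro" | "vosk-tts" | "fluidaudio-kokoro-zh";

// Pick a TTS backend from a language tag such as "es-ES" and a platform string.
// Returns null when no backend covers that combination.
function pickBackend(lang: string, platform: string): Backend | null {
  // Regional variants (es-ES, es-419) share the base language's backend.
  const [base = ""] = lang.toLowerCase().split("-");
  if (["en", "es", "fr", "it", "pt"].includes(base)) return "kokoro";
  if (base === "ru") return "vosk-tts";
  // Mandarin is described as running natively on Apple Silicon only.
  if (base === "zh") return platform === "darwin-arm64" ? "fluidaudio-kokoro-zh" : null;
  return null;
}
```

For example, `pickBackend("es-419", "linux-x64")` resolves to the Kokoro path, while `pickBackend("zh", "linux-x64")` returns null in this sketch.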
v1.22

Structured error taxonomy

Every failure path prints a stable, machine-readable error [CODE]: … line. kesha-engine --error-codes-json dumps the whole taxonomy; a docs drift gate keeps docs/errors.md honest. Agents and CI can branch on why something failed instead of regexing log soup.

  • stable codes
  • --error-codes-json
  • exit codes
  • agent-friendly
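Because the error line has a fixed `error [CODE]: …` shape, a caller can branch on the code with a few lines of parsing. The sketch below assumes the line starts with the literal word `error`; the code `E_EXAMPLE` used in the usage note is made up, and real codes should come from `--error-codes-json`.

```typescript
interface ParsedError {
  code: string;
  message: string;
}

// Parse one line of the form "error [CODE]: message"; returns null otherwise.
function parseErrorLine(line: string): ParsedError | null {
  const match = /^error \[([^\]]+)\]:\s*(.*)$/.exec(line.trim());
  if (!match) return null;
  return { code: match[1] ?? "", message: match[2] ?? "" };
}

// Scan captured stderr and return the first structured error, if any.
function firstError(stderr: string): ParsedError | null {
  for (const line of stderr.split(/\r?\n/)) {
    const parsed = parseErrorLine(line);
    if (parsed) return parsed;
  }
  return null;
}
```

An agent could call `firstError(stderr)` after a failed run and decide whether to retry, reinstall models, or surface the message, based on `code` rather than on the free-form text. Given the made-up input `error [E_EXAMPLE]: model not found`, the parser yields `code` `E_EXAMPLE` and `message` `model not found`.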

Rust engine, zero deps

A single Rust binary for macOS arm64, Linux x64, and Windows x64. FluidAudio (CoreML) on Apple Silicon via dedicated sidecars (diarization, TTS, language detection); ort (ONNX) elsewhere. Symphonia decodes WAV, MP3, OGG/Opus, FLAC, AAC, M4A; pure-Rust opus + flacenc handle encode — royalty-free codecs, no C dep, no patent exposure. Linux .deb and .rpm packages signed with Sigstore. No ffmpeg. No Python. No native Node addons.

OpenClaw-ready

Plug straight into your LLM agent as a voice-processing skill. Give Claude (or any agent) ears and a voice in one install.

Hear it

Four samples, two engines, two languages.

Real output from kesha say — Kokoro-82M for English, Vosk-TTS for Russian. SSML <prosody rate> renders the same text at different speeds without resampling artifacts. v1.21 brought prosody to Apple Silicon Kokoro too — EN, RU, and the v1.22 multilingual voices all honor <prosody>, <break>, and <emphasis>.

EN · Kokoro-82M medium (1.0×)

“Kesha turns voice into text in a single command — no Python, no ffmpeg.”

kesha say --voice en-am_michael "…" --out demo.flac
EN · Kokoro-82M <prosody rate="x-fast"> · 1.5×

Same text, x-fast prosody. Tempo only — pitch and timbre unchanged.

kesha say --voice en-am_michael --ssml '<speak><prosody rate="x-fast">…'
RU · Vosk-TTS medium (1.0×)

«Кеша превращает голос в текст одной командой — без зависимостей.»

kesha say --voice ru-vosk-m02 "…" --out demo.flac
RU · Vosk-TTS <prosody rate="slow"> · 0.75×

Same text, slow prosody. Useful for accessibility and dictation.

kesha say --voice ru-vosk-m02 --ssml '<speak><prosody rate="slow">…'

Performance

Faster than Whisper.
Quieter than your fans.

Compared against Whisper large-v3-turbo across Russian and English clips, with every engine left to auto-detect the language. CoreML on M-series runs roughly 19× faster than openai-whisper; the ONNX path on x86 CPUs still beats it by ~2.5×.

  • ~19× vs openai-whisper · M-series
  • ~2.5× vs openai-whisper · CPU x86
  • ~7× vs faster-whisper · M-series
Read the full benchmark
Bar chart comparing transcription speed: openai-whisper vs faster-whisper vs Kesha Voice Kit
Lower bars = faster. Source: BENCHMARK.md.

Quick start

Three commands to a working transcript.

Requires Bun ≥ 1.3 on macOS arm64 or Linux x64.

01

Install Bun

Skip if you already have it.

$ curl -fsSL https://bun.sh/install | bash
02

Install Kesha

Pulls the engine and base STT models, then warms the Neural Engine on macOS (--tts pulls Kokoro + multilingual voices, ~990 MB on darwin-arm64). With the warmup done, the first transcription runs in ~500 ms instead of hanging for 25 seconds. kesha init is the guided entry point for humans; kesha install is the explicit one for CI and agents. Both run the same code underneath.

$ bun add -g @drakulavich/kesha-voice-kit
$ kesha init             # humans (interactive)
$ kesha install --tts    # + multilingual voices
03

Transcribe

Pipe-friendly. Stdout = transcript, stderr = errors.

$ kesha audio.ogg
Свободу попугаям! Свободу!

# Plain transcript
$ kesha audio.ogg

# Text + language/confidence
$ kesha --format transcript audio.ogg

# Full JSON with lang fields
$ kesha --format json audio.ogg

# Compact, LLM-friendly TOON
$ kesha --toon audio.ogg

# Long / silence-heavy audio (auto past 120s)
$ kesha --vad lecture.m4a

# Timestamped segments (start/end per phrase)
$ kesha --json --timestamps interview.ogg

# Meeting transcript with speaker IDs (darwin-arm64)
$ kesha --json --vad --speakers meeting.m4a

# Warn if detected language differs
$ kesha --lang en interview.wav

# Record from default mic, then transcribe straight to JSON
$ kesha record | kesha --json

# Opt-in TTS pack: Kokoro (EN) + Vosk-TTS (RU), ~990MB
$ kesha install --tts

# Speak — voice auto-picks from text language
$ kesha say "Hello, world" > hello.wav
$ kesha say "Привет, мир" > privet.wav

# OGG/Opus voice notes (drop straight into Telegram)
$ kesha say --format ogg-opus "Уже бегу!" --out voice.ogg

# Variable-speed playback via SSML <prosody rate>
$ kesha say --voice en-am_michael --ssml \
    '<speak><prosody rate="x-fast">Read this fast.</prosody></speak>' \
    --out fast.wav

# Use a macOS system voice
$ kesha say --voice Milena "Чао!"

Want to hear what these voices sound like? Play the four samples ↗

import { transcribe, downloadModel } from "@drakulavich/kesha-voice-kit/core";

// One-time install
await downloadModel();

// Transcribe a file
const text = await transcribe("audio.ogg");

// With detection metadata
const result = await transcribe("audio.ogg", { format: "json" });
console.log(result.language, result.confidence);

# Multiple files — headers per file, like `head`
$ kesha freedom.ogg tahiti.ogg
=== freedom.ogg ===
Свободу попугаям! Свободу!

=== tahiti.ogg ===
Таити, Таити! Не были мы ни в какой Таити! Нас и тут неплохо кормят.

# Pipe transcripts into your favorite LLM
$ kesha --toon meeting.m4a | llm "Summarize action items"

# Status of installed engine + models
$ kesha status

# Diagnose env, log status, and CoreML caches
$ kesha doctor

# Bundle logs + manifests for a bug report
$ kesha support-bundle

What's inside

Nine models, one runtime.

Each model is best-in-class for its task and runs through kesha-engine — Rust, with CoreML sidecars on Apple Silicon and ONNX everywhere else. Nine multilingual voices ship in one bundle as of v1.22.

Model | Task | Size | Source
NVIDIA Parakeet TDT 0.6B v3 | Speech-to-text | ~2.5 GB | HuggingFace
SpeechBrain ECAPA-TDNN | Audio language detection | ~86 MB | HuggingFace
Apple NLLanguageRecognizer | Text language detection | built-in | macOS framework
Silero VAD v5 (opt-in) | Voice activity detection | ~2.3 MB | snakers4/silero-vad
Kokoro-82M (opt-in) | Text-to-speech · English | ~990 MB | HuggingFace
Vosk-TTS (opt-in) | Text-to-speech · Russian | bundled | alphacep/vosk-tts
FluidAudio Sortformer v2 (opt-in, darwin-arm64) | Speaker diarization | ~245 MB | FluidInference/FluidAudio
FluidAudio Kokoro, multilingual (opt-in, darwin-arm64) | Text-to-speech (CoreML) · en + native zh; es/fr/it/pt via Eng G2P | ~990 MB | FluidInference/FluidAudio

Audio decoding via Symphonia — WAV, MP3, OGG/Opus, FLAC, AAC, M4A.

Languages

25 for transcription. 107 for detection.

English Spanish German French Italian Portuguese Russian Polish Dutch Czech Slovak Slovenian Croatian Bulgarian Romanian Greek Hungarian Finnish Swedish Danish Estonian Latvian Lithuanian Ukrainian Maltese

Need detection only? VoxLingua107 covers 107 languages.

Integrations

Drop in where you already work.

OpenClaw

Give your LLM agent ears. Voice-processing skill, pre-wired.

Setup guide →

Raycast

macOS launcher commands that transcribe selected audio and speak the clipboard. They ship in the raycast/ directory.

Source & install →

Programmatic API

Import transcribe() and downloadModel() directly into any Bun/Node project.

See examples →

Air-gapped mirrors

Behind a corporate proxy or fully offline? Point Kesha at your model mirror.

Mirror docs →

Ready to ship voice?

Open source, MIT licensed, and improving every week. Star the repo or jump straight to the install.
