v1.5 · MIT licensed · CoreML + ONNX

Voice in, text out. A Rust toolkit that's ~15× faster than Whisper on Apple Silicon.

A collection of small, fast, open-source audio models — speech-to-text, text-to-speech, voice activity detection, language identification — packaged as CLI tools and an OpenClaw skill for LLM agents. One 20MB binary. No Python. No ffmpeg.

$ bun install -g @drakulavich/kesha-voice-kit
  • 25 STT languages
  • 107 detection languages
  • ~20 MB single binary
  • ~15× vs Whisper (M-series)

Capabilities

Everything voice, in a single CLI.

Four pipelines, one binary. Pick what you need; the rest stays out of your way.

Speech-to-text

25 languages via NVIDIA Parakeet TDT. ~15× faster than Whisper on Apple Silicon, ~2.5× on CPU. Auto language detection, optional VAD for long audio, plain text or LLM-friendly TOON output.

  • EN · RU · DE · FR · ES
  • + 20 more
  • VAD opt-in
  • JSON / TOON

Text-to-speech

Kokoro-82M for English, Vosk-TTS for Russian, plus ~180 macOS system voices. Auto-routes by detected language. SSML preview supported.

Language detection

SpeechBrain ECAPA-TDNN identifies 107 languages from audio. Apple NLLanguageRecognizer handles text. Both run locally.

Voice activity detection

Silero VAD v5 trims silence on long, sparse recordings. Auto-enabled past 120 seconds of audio.
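The 120-second cutoff can be reasoned about before ever calling the CLI. A minimal sketch — not part of the kit, all names here are illustrative — that reads a canonical 44-byte PCM WAV header to decide whether a recording crosses the auto-VAD threshold:

```typescript
// Sketch: does a recording cross the 120 s auto-VAD threshold?
// Assumes a canonical 44-byte PCM WAV header (no extra chunks).
function wavDurationSeconds(buf: Uint8Array): number {
  const view = new DataView(buf.buffer, buf.byteOffset, buf.byteLength);
  const channels = view.getUint16(22, true);      // e.g. 1 = mono
  const sampleRate = view.getUint32(24, true);    // e.g. 16000
  const bitsPerSample = view.getUint16(34, true); // e.g. 16
  const dataBytes = view.getUint32(40, true);     // size of the data chunk
  return dataBytes / (sampleRate * channels * (bitsPerSample / 8));
}

const VAD_AUTO_THRESHOLD_S = 120;

function shouldUseVad(buf: Uint8Array): boolean {
  return wavDurationSeconds(buf) > VAD_AUTO_THRESHOLD_S;
}

// Synthetic header for a 16 kHz mono 16-bit file holding `seconds` of audio.
function fakeWavHeader(seconds: number, sampleRate = 16000): Uint8Array {
  const buf = new Uint8Array(44);
  const view = new DataView(buf.buffer);
  view.setUint16(22, 1, true);                        // mono
  view.setUint32(24, sampleRate, true);               // sample rate
  view.setUint16(34, 16, true);                       // 16-bit samples
  view.setUint32(40, seconds * sampleRate * 2, true); // data chunk size
  return buf;
}

console.log(shouldUseVad(fakeWavHeader(150))); // true — long enough for auto-VAD
console.log(shouldUseVad(fakeWavHeader(30)));  // false
```

Real-world files often carry extra chunks (LIST, fact), so a production check should walk the RIFF chunk list rather than trust fixed offsets.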

Rust engine, zero deps

A single ~20 MB binary runs FluidAudio (CoreML) on Apple Silicon and ort (ONNX) on everything else. Audio is decoded by Symphonia: WAV, MP3, OGG/Opus, FLAC, AAC, M4A. No ffmpeg. No Python. No native Node addons.

OpenClaw-ready

Plug straight into your LLM agent as a voice-processing skill. Give Claude (or any agent) ears and a voice in one install.

Performance

Faster than Whisper.
Quieter than your fans.

Benchmarked against Whisper large-v3-turbo on Russian and English clips, with every engine left to auto-detect the language. The CoreML path on an M3 Pro comes in around 15× faster than openai-whisper; the ONNX path on x86 still beats it by ~2.5×.

  • ~15× vs openai-whisper · M3 Pro
  • ~2.5× vs openai-whisper · CPU x86
  • ~7× vs faster-whisper · M3 Pro
Read the full benchmark →
[Bar chart: transcription speed, openai-whisper vs faster-whisper vs Kesha Voice Kit. Lower bars = faster. Source: BENCHMARK.md.]

Quick start

Three commands to a working transcript.

Requires Bun ≥ 1.3 on macOS arm64 or Linux x64.

01

Install Bun

Skip if you already have it.

$ curl -fsSL https://bun.sh/install | bash
02

Install Kesha

Pulls the engine and base STT models.

$ bun install -g @drakulavich/kesha-voice-kit
$ kesha install
03

Transcribe

Pipe-friendly. Stdout = transcript, stderr = errors.

$ kesha audio.ogg
Свободу попугаям! Свободу!

# Text + language/confidence
$ kesha --format transcript audio.ogg

# Full JSON with lang fields
$ kesha --format json audio.ogg

# Compact, LLM-friendly TOON
$ kesha --toon audio.ogg

# Long / silence-heavy audio (auto past 120s)
$ kesha --vad lecture.m4a

# Warn if detected language differs
$ kesha --lang en interview.wav

# Opt-in TTS pack: Kokoro (EN) + Vosk-TTS (RU), ~990 MB
$ kesha install --tts

# Speak — voice auto-picks from text language
$ kesha say "Hello, world" > hello.wav
$ kesha say "Привет, мир" > privet.wav

# Use a macOS system voice
$ kesha say --voice Milena "Чао!"
import { transcribe, downloadModel } from "@drakulavich/kesha-voice-kit/core";

// One-time install
await downloadModel();

// Transcribe a file
const text = await transcribe("audio.ogg");

// With detection metadata
const result = await transcribe("audio.ogg", { format: "json" });
console.log(result.language, result.confidence);
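For batch jobs through the programmatic API, a small concurrency limiter keeps memory and CPU in check. A generic sketch — the stub worker below stands in for the kit's `transcribe()`; the limiter itself assumes nothing about Kesha:

```typescript
// Run an async worker over many items with at most `limit` in flight —
// e.g. transcribing a directory without decoding every file at once.
async function mapLimit<T, R>(
  items: T[],
  limit: number,
  worker: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;
  async function run(): Promise<void> {
    while (next < items.length) {
      const i = next++; // claim the next index (safe: single-threaded JS)
      results[i] = await worker(items[i]);
    }
  }
  // Spawn up to `limit` concurrent workers over the shared index.
  await Promise.all(Array.from({ length: Math.min(limit, items.length) }, run));
  return results;
}

// Demo with a stub worker; swap in (f) => transcribe(f) for real use.
const files = ["freedom.ogg", "tahiti.ogg", "meeting.m4a"];
const transcripts = await mapLimit(files, 2, async (f) => `transcript of ${f}`);
console.log(transcripts);
```

Results come back in input order regardless of which file finishes first, since each worker writes to its claimed index.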
# Multiple files — headers per file, like `head`
$ kesha freedom.ogg tahiti.ogg
=== freedom.ogg ===
Свободу попугаям! Свободу!

=== tahiti.ogg ===
Таити, Таити! Не были мы ни в какой Таити! Нас и тут неплохо кормят.

# Pipe transcripts into your favorite LLM
$ kesha --toon meeting.m4a | llm "Summarize action items"

# Status of installed engine + models
$ kesha status

What's inside

Six models, one runtime.

Each model is best-in-class for its task and runs through kesha-engine — Rust, with CoreML on Apple Silicon and ONNX everywhere else.

Model · Task · Size · Source
NVIDIA Parakeet TDT 0.6B v3 · Speech-to-text · ~2.5 GB · HuggingFace ↗
SpeechBrain ECAPA-TDNN · Audio language detection · ~86 MB · HuggingFace ↗
Apple NLLanguageRecognizer · Text language detection · built-in · macOS framework
Silero VAD v5 (opt-in) · Voice activity detection · ~2.3 MB · snakers4/silero-vad ↗
Kokoro-82M (opt-in) · Text-to-speech (English) · ~990 MB · HuggingFace ↗
Vosk-TTS (opt-in) · Text-to-speech (Russian) · bundled · alphacep/vosk-tts ↗

Audio decoding via Symphonia — WAV, MP3, OGG/Opus, FLAC, AAC, M4A.

Languages

25 for transcription. 107 for detection.

English · Spanish · German · French · Italian · Portuguese · Russian · Polish · Dutch · Czech · Slovak · Slovenian · Croatian · Bulgarian · Romanian · Greek · Hungarian · Finnish · Swedish · Danish · Estonian · Latvian · Lithuanian · Ukrainian · Maltese

Need detection only? VoxLingua107 covers 107 languages.

Integrations

Drop in where you already work.

OpenClaw

Give your LLM agent ears. Voice-processing skill, pre-wired.

Setup guide →

Raycast

macOS launcher commands: transcribe selected audio, speak the clipboard. From the raycast/ directory.

Source & install →

Programmatic API

Import transcribe() and downloadModel() directly into any Bun/Node project.

See examples →

Air-gapped mirrors

Behind a corporate proxy or fully offline? Point Kesha at your model mirror.

Mirror docs →

Ready to ship voice?

Open source, MIT licensed, and improving every week. Star the repo or jump straight to the install.
