The gist

Voicebox is an open-source, local-first AI voice studio created by Jamie Pine. It is a free alternative to cloud services such as ElevenLabs that runs the entire voice input/output loop on the user's machine. Voice cloning, speech generation, and system-wide dictation all happen locally, so voice data and models stay private and under the user's control.

What it does

Clone voices from a few seconds of audio or use over 50 preset voices.
Generate speech in 23 languages using seven different text-to-speech (TTS) engines.
Dictate into any application using a global hotkey and a local Whisper-based speech-to-text model.
Compose multi-track audio for podcasts or conversations in a timeline-based editor.
Apply post-processing effects like reverb, pitch shift, and compression to generated audio.
Integrate voice I/O into other agents and applications via a local REST API and MCP server.
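
For the REST API item, a client could look roughly like the sketch below. It is an illustration only: the base URL, port, endpoint path, and JSON field names (`text`, `voice_id`, `language`) are hypothetical placeholders rather than Voicebox's documented API, so check the project's own API reference before relying on any of them.

```python
import json
import urllib.request

# ASSUMPTION: the base URL, port, endpoint path, and JSON field names below are
# hypothetical placeholders, not Voicebox's documented API.
BASE_URL = "http://127.0.0.1:8000"


def build_generate_request(text, voice_id, language="en", base_url=BASE_URL):
    """Build (but do not send) a POST request asking a local server for speech."""
    payload = {"text": text, "voice_id": voice_id, "language": language}
    return urllib.request.Request(
        url=f"{base_url}/generate",  # hypothetical path
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate_speech(text, voice_id, out_path, timeout=60):
    """Send the request and write the response body to out_path.

    Assumes the (hypothetical) endpoint responds with raw audio bytes.
    """
    request = build_generate_request(text, voice_id)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        audio = response.read()
    with open(out_path, "wb") as handle:
        handle.write(audio)
    return out_path
```

Building the request separately from sending it lets an agent or script inspect the payload before any audio is generated. Since the server is local, the call would target a loopback address, in keeping with the local-first design.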

How it works

Voicebox is a self-hosted desktop application for macOS, Windows, and Linux. It runs AI models locally on the user's GPU, using MLX on Apple silicon, CUDA on NVIDIA hardware, or ROCm on AMD hardware. Users import an audio file to clone a voice, then type text to generate speech. The output is an audio file that can be processed with effects or arranged in a timeline. All processing, models, and data remain on the user's machine. The software is free and open-source.

Best for

Developers and content creators who need a private, highly customizable toolkit for voice I/O, such as building voice-enabled AI agents, producing podcasts, or creating game dialogue.

Watch out for

Performance depends on your local hardware, particularly the GPU. macOS and Windows have installers, but Linux users must build the application from source.