MOSS-TTS-Nano

MOSS-TTS-Nano is a 0.1B-parameter, open-source multilingual text-to-speech model for realtime, CPU-friendly speech generation and voice cloning. It supports 20 languages and has a simple local setup.


MOSS-TTS-Nano is an open-source, multilingual speech generation model from MOSI.AI and the OpenMOSS team. With only 0.1 billion parameters, it is designed for realtime speech generation and runs directly on a CPU without a dedicated GPU. This keeps the deployment stack simple enough for local demonstrations, web serving, and lightweight product integration.
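Whether a given machine achieves "realtime" generation is commonly judged by the real-time factor (RTF): synthesis time divided by the duration of the audio produced, where a value below 1 means audio is generated faster than it plays back. The following is a minimal sketch for measuring it. The `synthesize` callable is a hypothetical stand-in, assumed to return a sequence with one entry per audio frame; it is not the project's actual API.

```python
import time


def real_time_factor(synthesize, text, sample_rate=48_000):
    """Return synthesis time divided by the duration of the generated audio.

    `synthesize` is a hypothetical callable (an assumption, not the real
    MOSS-TTS-Nano API) that returns a sequence with one entry per frame.
    """
    start = time.perf_counter()
    frames = synthesize(text)
    elapsed = time.perf_counter() - start

    audio_seconds = len(frames) / sample_rate
    if audio_seconds == 0:
        raise ValueError("synthesize() returned no audio")
    return elapsed / audio_seconds
```

An RTF measured this way depends on the hardware, the text, and the backend, so it is best compared across runs on the same machine.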

The model uses a pure autoregressive Audio Tokenizer + LLM pipeline, and its inference workflow can be driven from either the terminal or a web demo. It natively outputs 48 kHz, 2-channel audio and handles long input texts through automatic chunked voice cloning. An ONNX CPU version is also available: it offers nearly double the processing efficiency of the original version, removes the PyTorch inference dependency, and runs smoothly on a single CPU core.
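To illustrate the idea of chunked generation for long inputs, here is a minimal sketch that splits text at sentence boundaries, synthesizes each chunk separately, and appends the results to a single 48 kHz, 2-channel WAV file. It is not the project's implementation: the splitting rule, the 16-bit sample width, the function names, and `fake_synthesize` (which only emits silence in place of the model) are all assumptions made for illustration.

```python
import re
import wave

SAMPLE_RATE = 48_000  # Hz, matching the output format described above
CHANNELS = 2          # stereo, matching the output format described above
SAMPLE_WIDTH = 2      # bytes per sample (16-bit PCM); an assumption


def split_into_chunks(text, max_chars=200):
    """Greedily pack whole sentences into chunks of up to max_chars.

    A single sentence longer than max_chars is kept whole.
    """
    sentences = re.findall(r"[^.!?。！？]+[.!?。！？]*", text)
    chunks, current = [], ""
    for sentence in (s.strip() for s in sentences):
        if not sentence:
            continue
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}" if current else sentence
    if current:
        chunks.append(current)
    return chunks


def fake_synthesize(chunk):
    """Hypothetical stand-in for the model: silent 16-bit stereo PCM bytes."""
    frames = SAMPLE_RATE * len(chunk) // 20  # arbitrary 50 ms per character
    return bytes(frames * CHANNELS * SAMPLE_WIDTH)


def synthesize_long_text(text, path, synthesize=fake_synthesize, max_chars=200):
    """Synthesize text chunk by chunk and write one continuous WAV file."""
    chunks = split_into_chunks(text, max_chars)
    with wave.open(path, "wb") as wav:
        wav.setnchannels(CHANNELS)
        wav.setsampwidth(SAMPLE_WIDTH)
        wav.setframerate(SAMPLE_RATE)
        for chunk in chunks:
            wav.writeframes(synthesize(chunk))
    return chunks
```

Splitting at sentence boundaries keeps each chunk short enough to generate quickly while avoiding cuts in the middle of a phrase. In a voice cloning setting, each chunk would presumably be conditioned on the same reference audio so the voice stays consistent across chunks.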

MOSS-TTS-Nano currently supports 20 languages, including Chinese and English. It provides multiple deployment options, including direct Python entrypoints (infer.py, app.py) and a packaged CLI for generation and serving, so developers can integrate multilingual speech generation into their projects with little setup.

This tool suits developers and engineers who prioritize a small model footprint, low latency, and CPU-based speech generation for realtime products. Its multilingual support, voice cloning, and simple local setup make it a good fit for lightweight browser extensions, local applications, or web services where resource efficiency and broad language coverage matter.