VoxCPM2 is an open-source text-to-speech model released by OpenBMB in 2026. It uses a 2-billion-parameter tokenizer-free diffusion autoregressive architecture to turn written text into natural, expressive speech across 30 languages.
Unlike traditional TTS systems that rely on phoneme dictionaries or discrete speech tokens, VoxCPM2 maps text directly to a continuous speech representation. This approach reduces pronunciation errors and produces smoother prosody, especially on long-form content and code-mixed text. The model is trained on more than 2 million hours of multilingual speech and outputs broadcast-quality 48 kHz audio through the super-resolution layer built into AudioVAE V2.
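To make the discrete-versus-continuous distinction concrete, here is a toy sketch in Python. It is not VoxCPM2's code and does not use its API. It snaps a smooth, made-up "speech feature" contour to a small hypothetical codebook and compares that with keeping the values continuous. The contour, the 8-entry codebook, and the helper names are all assumptions chosen for illustration.

```python
import math

def quantize(value, codebook):
    """Return the codebook entry closest to value (a stand-in for a discrete token)."""
    return min(codebook, key=lambda entry: abs(entry - value))

def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

# Toy feature contour: one period of a sine wave sampled at 50 points.
contour = [math.sin(2 * math.pi * t / 50) for t in range(50)]

# Hypothetical 8-entry codebook evenly spaced over [-1, 1].
# Real discrete-token systems use learned and much larger codebooks.
codebook = [-1 + 2 * i / 7 for i in range(8)]

# Discrete path: every value is snapped to its nearest codebook entry.
discrete = [quantize(x, codebook) for x in contour]

# Continuous path: values are kept as they are, with no snapping step.
continuous = list(contour)

quantization_error = mse(contour, discrete)
continuous_error = mse(contour, continuous)

print(f"distinct values after quantization: {len(set(discrete))}")
print(f"MSE with discrete tokens:  {quantization_error:.5f}")
print(f"MSE with continuous values: {continuous_error:.5f}")
```

The discrete path collapses the contour onto at most eight levels, so some detail is lost at the quantization step. The continuous path has no such step. A real model still has to predict the continuous values accurately, so this sketch only shows where the quantization loss comes from, not how VoxCPM2 performs.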
A single VoxCPM2 model covers three common workflows: pure text-to-speech with built-in voices, zero-shot cloning of any voice from a few seconds of audio, and Voice Design, which generates entirely new speakers from a natural-language description. Our hosted platform brings all three to your browser instantly, with no install, no GPU, and no setup. The underlying model is released under Apache 2.0, so outputs can be used in personal projects, commercial products, and paid services without per-character fees.
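One way to picture the three workflows is as a single request whose optional inputs decide the mode. The Python sketch below is purely illustrative. `SpeechRequest`, its field names, and `select_workflow` are hypothetical and are not part of VoxCPM2 or the hosted platform's API.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SpeechRequest:
    """Hypothetical request shape; all field names are assumptions for illustration."""
    text: str
    voice: Optional[str] = None              # name of a built-in voice
    reference_audio: Optional[str] = None    # path to a short clip of the voice to clone
    voice_description: Optional[str] = None  # natural-language description of a new speaker

def select_workflow(req: SpeechRequest) -> str:
    """Pick one of the three workflows based on which optional input is set."""
    if not req.text.strip():
        raise ValueError("text must not be empty")
    provided = [
        name
        for name, value in (
            ("voice", req.voice),
            ("reference_audio", req.reference_audio),
            ("voice_description", req.voice_description),
        )
        if value
    ]
    if len(provided) > 1:
        raise ValueError(f"choose only one of: {', '.join(provided)}")
    if req.reference_audio:
        return "voice_cloning"
    if req.voice_description:
        return "voice_design"
    # No conditioning input, or only a built-in voice name: plain text-to-speech.
    return "text_to_speech"

print(select_workflow(SpeechRequest("Hello there.", voice="narrator")))
print(select_workflow(SpeechRequest("Hello there.", reference_audio="sample.wav")))
print(select_workflow(SpeechRequest("Hello there.", voice_description="a calm, low-pitched storyteller")))
```

Treating the three inputs as mutually exclusive keeps the sketch simple. An actual interface may allow combinations, such as a cloned voice with extra style guidance.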