VoxCPM2: AI Voice Cloning & Custom Voice Design

Turn any text into natural-sounding speech in 30 languages — right in your browser. Clone a real voice from just 5 seconds of audio, or invent a brand-new one by describing it in plain English. Studio-quality sound, free to start, and ready for personal or commercial projects.

Each generation accepts up to 4,000 characters of text. Generation costs 1 credit per 100 characters.

What is VoxCPM2?

VoxCPM2 is an open-source text-to-speech model released by OpenBMB in 2026. It uses a 2-billion-parameter tokenizer-free diffusion autoregressive architecture to turn written text into natural, expressive speech across 30 languages.

Unlike traditional TTS systems that rely on phoneme dictionaries or discrete speech tokens, VoxCPM2 maps text directly to a continuous speech representation. This reduces pronunciation errors and produces smoother prosody, especially on long-form content and code-mixed text. The model is trained on more than 2 million hours of multilingual speech and outputs broadcast-quality 48kHz audio through AudioVAE V2's built-in super-resolution layer.

A single VoxCPM2 model covers three common workflows in one place: pure text-to-speech with built-in voices, zero-shot cloning of any voice from a few seconds of audio, and Voice Design that generates entirely new speakers from a natural-language description. Our hosted platform brings all three to your browser instantly — no install, no GPU, no setup. The underlying model is released under Apache 2.0, so outputs can be used in personal projects, commercial products, and paid services without per-character fees.

Key Features

Key Features of VoxCPM2

Four capabilities make VoxCPM2 a complete voice toolkit: multilingual coverage, novel voice creation, controllable cloning, and studio-grade output — all from a single model with one workflow.

🌍

30-Language Multilingual

VoxCPM2 supports 30 languages out of the box, from widely used languages such as English, Spanish, French, German, Portuguese, Russian, and Japanese to less-resourced ones like Swahili, Burmese, Khmer, and Lao. There is no need to set a language tag or switch checkpoints — paste your text and VoxCPM2 detects the language automatically. The same voice can speak fluently across languages, which makes cross-lingual dubbing, localization, and bilingual content production feel like a single continuous workflow.

🎨

Voice Design

Need a voice that does not exist yet? Just describe it. A prompt like `(a young woman, gentle and sweet voice)` is enough for VoxCPM2 to generate a brand-new speaker with the matching gender, age, tone, emotion, and pace — no reference audio required. Voice Design is well suited to character creation in games, branded narration for products, audiobook casts, animation, and any project where you want a distinctive voice but cannot or do not want to record one yourself.

🎛️

Controllable Cloning

Upload a 5- to 10-second audio sample and VoxCPM2 captures the speaker's timbre, then reproduces it on any text you provide. The default mode preserves the original delivery, while inline style cues such as `(slightly faster, cheerful tone)` let you steer emotion, pace, and expression without losing the voice's core identity. The output sounds like the same person speaking — just in a new context, a new language, or a new emotional register, depending on what your project needs.
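
As a rough illustration of how such a parenthesized cue might be attached to a script before it is pasted into the text box, the sketch below prefixes the text with the cue. The helper name `with_style_cue` is hypothetical, and placing the cue directly before the text with no separator is an assumption drawn from the example prompts on this page, not a documented rule.

```python
from typing import Optional


def with_style_cue(text: str, cue: Optional[str] = None) -> str:
    """Prefix `text` with a parenthesized style or voice description.

    Assumption: the cue goes immediately before the text it applies to.
    """
    if not cue:
        return text
    # Drop any parentheses the caller already typed so they are not doubled.
    cleaned = cue.strip().strip("()").strip()
    return f"({cleaned}){text}"


prompt = with_style_cue("Welcome back to the show.", "slightly faster, cheerful tone")
print(prompt)
```

Keeping the cue in a separate argument makes it easy to reuse one script with several deliveries and compare the results.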

🔊

48kHz Studio-Quality Output

VoxCPM2 accepts a 16kHz reference and produces 48kHz output natively through AudioVAE V2's built-in super-resolution, so you do not need a separate upsampler or post-processing chain. The audio retains natural breaths, micro-pauses, and prosodic detail that lower-sample-rate TTS systems typically smooth out. The result is broadcast-grade fidelity suitable for podcasts, audiobooks, video voiceovers, music production, and any setting where listeners would otherwise notice the difference between synthetic and real speech.

Live Demo Gallery

Hear VoxCPM2 in Action

The samples below are produced directly by VoxCPM2 — no editing, no post-processing. Each card shows the input text and the generated audio, so you can hear exactly how the model handles different languages, emotions, and voice modes before opening the playground yourself.

30 Languages Grid

Speak in 30 Languages, Natively

VoxCPM2 generates speech in 30 languages with automatic language detection — no model swapping, no manual tags, no extra configuration. Voice quality stays consistent across high-resource languages such as English, Mandarin, Spanish, French, German, and Japanese, and remains usable on lower-resource ones like Swahili, Lao, Khmer, and Burmese. Click any language below to hear a sample produced by the same model.

Most popular languages

English
Spanish
German
Japanese

How It Works

How VoxCPM2 Works in 3 Steps

Whether you are cloning a real voice, designing a new one, or just turning text into speech, VoxCPM2 follows the same three-step workflow. Most users go from blank page to a finished audio file in under a minute.

Step 1 — Provide a voice prompt

VoxCPM2 voice selection interface

Choose how you want the voice to sound. To clone a real speaker, upload a short reference audio clip — five to ten seconds is usually enough. To design a new voice instead, write a natural-language description such as `(a calm middle-aged man, deep voice)`. You can also leave this step empty and let VoxCPM2 use one of its built-in default voices.
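
Before uploading a reference clip, it can help to confirm that it falls in the five-to-ten-second range mentioned above. The following is a minimal sketch using Python's standard `wave` module, assuming the clip is an uncompressed WAV file. The function names and the strict 5 to 10 second bounds are illustrative choices, not platform requirements.

```python
import io
import wave


def clip_duration_seconds(wav_file) -> float:
    """Return the duration of a WAV file (path or file-like object) in seconds."""
    with wave.open(wav_file, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_suitable_reference(wav_file, min_s: float = 5.0, max_s: float = 10.0) -> bool:
    """Check that the clip length is inside the assumed 5 to 10 second window."""
    return min_s <= clip_duration_seconds(wav_file) <= max_s


def make_silent_wav(seconds: float, sample_rate: int = 16_000) -> io.BytesIO:
    """Build an in-memory mono 16-bit WAV of silence, used here only as demo input."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * int(seconds * sample_rate))
    buffer.seek(0)
    return buffer


print(is_suitable_reference(make_silent_wav(6)))  # a 6-second clip
print(is_suitable_reference(make_silent_wav(2)))  # a 2-second clip
```

Compressed formats such as MP3 would need a different reader; the duration check itself stays the same.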

Step 2 — Enter your target text

VoxCPM2 text input interface

Paste or type the content you want VoxCPM2 to speak, from a single sentence to a long-form script. Inline style cues like `(slightly faster, cheerful tone)` let you steer emotion, pace, and expression for specific phrases while keeping the underlying voice intact. The model automatically detects the language, so you do not need to label it.
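
For long-form scripts, the playground's stated limit of 4,000 characters per generation means the text has to be submitted in pieces. One possible way to prepare it is sketched below: split on sentence-ending punctuation and pack sentences into chunks that stay under the limit. The function name `split_script` and the splitting rule are assumptions for illustration; the platform does not prescribe a chunking method.

```python
import re

MAX_CHARS = 4000  # per-generation limit stated on this page


def split_script(text: str, max_chars: int = MAX_CHARS) -> list[str]:
    """Split text into chunks of at most max_chars, breaking between sentences."""
    # Split after ., !, ? (and their full-width forms) when followed by whitespace.
    sentences = re.split(r"(?<=[.!?。！？])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        # A single sentence longer than the limit is cut at the limit.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


script = "Hello world. " * 1000
parts = split_script(script)
print(len(parts), max(len(p) for p in parts))
```

Breaking at sentence boundaries keeps each generated clip ending on a natural pause, which makes the pieces easier to join afterwards.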

Step 3 — Generate, download

VoxCPM2 generated audio preview interface

Click generate and VoxCPM2 returns a 48kHz audio file within seconds. Preview it directly in the browser, download it as a standard FLAC file, or copy a share link. The output is ready to drop into a podcast, a video timeline, a game build, a learning platform, or any other project that needs natural-sounding speech.

Benchmark Performance

Benchmark Performance — State of the Art on Public Tests

VoxCPM2 has been evaluated on the most widely used speech synthesis benchmarks in 2026, posting state-of-the-art or top-tier scores on multilingual word error rate, speaker similarity, and instruction-following metrics. The numbers below come from the official paper (arXiv 2509.24650) and independent third-party evaluations — not internal tests. Each row links back to the underlying dataset so you can verify the result yourself.

| Benchmark | What it measures | VoxCPM2 result |
| --- | --- | --- |
| Seed-TTS-eval (EN / ZH) | Multilingual WER + speaker similarity | State of the art |
| CV3-eval | Cross-lingual voice transfer | Top tier |
| InstructTTSEval | Voice Design instruction-following | State of the art |
| MiniMax Multilingual Test | Low-resource language quality | Top tier |
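
For readers unfamiliar with word error rate (WER), it is conventionally computed as the number of word substitutions, deletions, and insertions needed to turn a transcript of the generated audio into the reference text, divided by the number of reference words. The sketch below is a generic textbook implementation for intuition only. It is not the scoring script used by any of the benchmarks above, which typically also apply a speech recognizer and text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of words in the reference."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One substitution (sat -> sit) and one deletion (the) over six reference words.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Lower is better: a WER of 0 means the transcript of the synthesized audio matches the input text word for word.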

Technical Specifications

For teams evaluating VoxCPM2 for self-hosting or integration, the headline specs are below. They describe the underlying model — which is the same model that powers our hosted platform. VoxCPM2 supports batch and streaming inference, with mature tooling for production deployment.

  • Parameters: 2B
  • Audio Output: 48kHz studio quality
  • Languages: 30
  • License: Apache 2.0
  • Architecture: Tokenizer-Free Diffusion Autoregressive
  • Backbone: MiniCPM-4
  • Reference Input: 16kHz
  • VRAM: ~8 GB
  • RTF (RTX 4090): 0.30 standard / 0.13 with Nano-vLLM
  • Training Data: 2M+ hours, multilingual
  • Streaming: Yes
  • Fine-tuning: LoRA + Full SFT
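
RTF (real-time factor) is conventionally the ratio of processing time to the duration of the audio produced, so a value below 1 means generation runs faster than real time. Assuming the figures above follow that convention, the short sketch below estimates how long a clip would take to synthesize. The function name is illustrative, and real timings will vary with hardware, batch size, and text.

```python
def synthesis_seconds(audio_seconds: float, rtf: float) -> float:
    """Estimate compute time from audio length and real-time factor."""
    return audio_seconds * rtf


one_minute = 60.0
print(synthesis_seconds(one_minute, 0.30))  # standard inference figure from the specs
print(synthesis_seconds(one_minute, 0.13))  # Nano-vLLM figure from the specs
```

Under these assumptions, a one-minute clip would take roughly 18 seconds with standard inference and about 8 seconds with Nano-vLLM on the listed GPU.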

Trusted by the Community

Trusted by Developers and Researchers Worldwide

VoxCPM2 is the open-source speech model behind our platform — downloaded more than 234,000 times a month on Hugging Face, starred by thousands of developers on GitHub, and published as a peer-reviewable paper on arXiv. The technology is transparent, auditable, and free of vendor lock-in.

  • 234K+ monthly Hugging Face downloads
  • 12K+ GitHub stars
  • 2M+ training hours
  • arXiv 2509.24650: peer-reviewable paper
  • Apache 2.0: fully open license

Use Cases

What People Build with VoxCPM2

Common production paths for VoxCPM2 across narration, localization, game voices, accessibility, agents, and education.

01

Podcast & Audiobook Narration

Produce hours of natural-sounding narration in a single voice without booking a recording studio. VoxCPM2 handles long-form text gracefully, preserves prosody across chapters, and supports consistent narrator identity. Solo podcasters and audiobook publishers use it to reduce production time from days of recording and editing to a single afternoon at the keyboard.

02

Voice Localization & Dubbing

Deliver translated scripts in any of the 30 supported languages while keeping the original speaker's voice. Upload a short reference clip and VoxCPM2 reproduces the timbre across languages, which is ideal for YouTubers expanding into international markets, online course creators serving multilingual students, and marketing teams localizing campaign assets without re-hiring voice talent in every region.

03

Game & Character Voice Design

Generate distinct voices for non-player characters by describing them in text — `a gruff dwarven blacksmith`, `a cheerful elven merchant`, `a quiet AI companion`. Indie game studios use Voice Design to populate large casts without contracting a full voice acting team. The same workflow scales from prototype builds to shipped titles.

04

Accessibility & Screen Readers

Give screen-reader users a high-fidelity, expressive output instead of the flat default voices that ship with most operating systems. The 48kHz audio reduces listener fatigue over long reading sessions, and the wide language coverage makes VoxCPM2 a strong choice for accessibility tools targeting global audiences with diverse language needs.

05

AI Agents & Voice Assistants

Power chatbots, customer support agents, and voice-first applications with VoxCPM2. Real-time streaming keeps response latency low enough for natural conversation, and Voice Design lets you give an agent a unique on-brand voice without licensing a third-party voice actor or paying recurring per-character fees elsewhere.

06

Education & Content Localization

Convert lectures, tutorials, and learning scripts into multiple languages with consistent narration. Teachers, ed-tech platforms, and corporate training teams use VoxCPM2 to extend the reach of existing course material without re-recording each version, while keeping a single recognizable voice across all language editions of the same course.

Comparison Table

How VoxCPM2 Compares

VoxCPM2 is one of several leading TTS systems available in 2026. The table below summarizes how it compares with ElevenLabs, F5-TTS, CosyVoice 2, and XTTS v2 on the dimensions most users care about — language coverage, output quality, voice design, cloning, license, and self-hosting. The headline: VoxCPM2 is currently the only system that combines open-source licensing, 30-language support, native 48kHz output, and text-driven voice design in a single model. Each comparison page below covers one competitor in detail.

| Feature | VoxCPM2 | ElevenLabs | F5-TTS | CosyVoice 2 | XTTS v2 |
| --- | --- | --- | --- | --- | --- |
| Open Source | ✅ Apache 2.0 | ❌ Closed | ✅ MIT | ✅ Apache 2.0 | ✅ CPML |
| Languages | 30 | 32 | 2 | 5+ | 17 |

48kHz Output❌ 24k❌ 24k❌ 24k
Voice Design from Text⚠️ partial
Zero-Shot Cloning
Streaming⚠️
Commercial Use✅ Free💰 Paid⚠️ Limited
Self-Host

Pricing

Simple Pricing — Pay Once, Credits Never Expire

VoxCPM2 uses a free starter tier and one-time credit packs — no subscription, no monthly bills, no auto-renew. Pick the pack that fits your generation volume and scale only when you need more.

1 credit ≈ 100 characters ≈ 8 seconds of speech. All plans include multilingual generation, voice design, and cloning workflows.
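
The conversion above can be sketched in a few lines to estimate what a script will cost before generating it. The constants come from the rate stated on this page. Rounding a partial block of 100 characters up to a whole credit is an assumption, since the page does not say how partial blocks are billed, and the function names are illustrative.

```python
import math

CHARS_PER_CREDIT = 100   # stated rate: 1 credit ≈ 100 characters
SECONDS_PER_CREDIT = 8   # stated rate: 1 credit ≈ 8 seconds of speech


def credits_needed(text: str) -> int:
    """Credits for a script, rounding partial blocks up (assumed billing rule)."""
    return math.ceil(len(text) / CHARS_PER_CREDIT)


def estimated_seconds(credits: int) -> int:
    """Approximate speech duration produced by a number of credits."""
    return credits * SECONDS_PER_CREDIT


script = "a" * 250
cost = credits_needed(script)
print(cost, estimated_seconds(cost))
print(estimated_seconds(800) / 3600)  # hours of speech in an 800-credit pack
```

At this rate an 800-credit pack works out to 6,400 seconds, or about 1.8 hours of speech.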

Free: $0

No credit card required. Generate your first voiceover in under 30 seconds.

  • 2 credits included
  • ≈ 200 characters
  • ≈ 16 seconds of speech
  • All 30 languages
  • Voice Cloning
  • Voice Design
  • MP3 / WAV export

Basic: $9.9

Great for first purchase

Perfect for short videos, ads, and trying things out.

  • 800 credits
  • ≈ 80,000 characters
  • ≈ 1.8 hours of speech
  • All 30 languages
  • Voice Cloning
  • Voice Design
  • MP3 / WAV export
  • Everything in Free
  • Commercial license
  • Email support
  • Credits never expire

Pro: $29.9 (regular $37.1, the cost of the same credits at the Basic rate)

Most popular. Save 20% per credit.

The pick for podcasters, YouTubers, and small studios.

  • 3,000 credits
  • ≈ 300,000 characters
  • ≈ 6.7 hours of speech
  • ≈ $0.010 / credit
  • Save 20% vs Basic
  • All 30 languages
  • Voice Cloning
  • Voice Design
  • Use Latest Voice Model
  • MP3 / WAV export
  • Commercial license
  • Email support
  • Credits never expire

Business: $49.9 (regular $99.9)

Best per-credit value. Save 50% off the regular price.

Built for audiobook narrators, course creators, and content studios.

  • 6,000 credits
  • ≈ 600,000 characters
  • ≈ 13.3 hours of speech
  • ≈ $0.008 / credit
  • All 30 languages
  • Voice Cloning
  • Voice Design
  • Use Latest Voice Model
  • MP3 / WAV export
  • Commercial license
  • Priority generation
  • Email support
  • Credits never expire

  • 7-Day Refund: money-back guarantee
  • Secure Payment: powered by Stripe
  • 24/7 Support: always here to help

One-time credit packs, no subscription

✓ Free 2 credits on signup · ✓ 7-day money-back guarantee · ✓ Used by creators in 90+ countries

Powering millions of top creators

VoxCPM2 ships continuous updates with new languages, performance improvements, and product features. The most recent releases are listed below.

  • Bring characters to life with Voice Cloning
  • Narrate your videos with Text to Speech
  • Create audiobooks with Story Studio
  • Explore 2M+ voices in the Voice Library


Frequently Asked Questions

Answers to the VoxCPM2 questions people ask before trying the browser playground.

VoxCPM2 is OpenBMB's open-source 2-billion-parameter tokenizer-free TTS model that generates 48kHz speech in 30 languages with built-in voice design and voice cloning.

Start Generating Speech with VoxCPM2

Type your script, pick a voice, hit generate. Studio-quality 48kHz speech in 30 languages — free, in your browser, no setup.