VoxCPM2 is an open-source tokenizer-free text-to-speech model from OpenBMB. It focuses on multilingual speech generation, voice design, voice cloning, and high-quality 48kHz output.

Is VoxCPM2 the same as OmniVoice?

No. VoxCPM2 is the underlying speech model that this page covers. OmniVoice is the browser product that presents voice generation, cloning, and related audio tools.

Can VoxCPM2 clone a voice?

Yes. VoxCPM2 supports voice cloning from short reference audio. In production, you should only clone voices you own or have explicit permission to use.

What languages does VoxCPM2 support?

VoxCPM2 supports 30 languages, from widely used ones to lower-resource languages such as Swahili, Burmese, Khmer, and Lao, which helps with localization and global creator workflows.

Can I use VoxCPM2 audio commercially?

Commercial use depends on your plan, source material rights, and applicable terms. Always make sure you have rights to scripts, voices, and generated outputs.

Can VoxCPM2 be self-hosted?

Yes. The VoxCPM2 model is released as open source under Apache 2.0, so teams that need infrastructure control can self-host it.

VoxCPM2: AI Voice Cloning & Custom Voice Design

Turn any text into natural-sounding speech in 30 languages — right in your browser. Clone a real voice from just 5 seconds of audio, or invent a brand-new one by describing it in plain English. Studio-quality sound, free to start, and ready for personal or commercial projects.

Limit: 4,000 characters per generation. Cost: 1 credit per 100 characters.

What is VoxCPM2?

VoxCPM2 is an open-source text-to-speech model released by OpenBMB in 2026. It uses a 2-billion-parameter tokenizer-free diffusion autoregressive architecture to turn written text into natural, expressive speech across 30 languages.

Unlike traditional TTS systems that rely on phoneme dictionaries or discrete speech tokens, VoxCPM2 maps text directly to a continuous speech representation. This reduces pronunciation errors and produces smoother prosody, especially on long-form content and code-mixed text. The model is trained on more than 2 million hours of multilingual speech and outputs broadcast-quality 48kHz audio through AudioVAE V2's built-in super-resolution layer.

A single VoxCPM2 model covers three common workflows in one place: pure text-to-speech with built-in voices, zero-shot cloning of a voice from a few seconds of audio, and Voice Design that generates entirely new speakers from a natural-language description. Our hosted platform brings all three to your browser with no install, no GPU, and no setup. The underlying model is released under Apache 2.0, so teams that self-host it can use it in personal projects, commercial products, and paid services without per-character fees; the hosted platform is billed in credits.

Key Features of VoxCPM2

Four capabilities make VoxCPM2 a complete voice toolkit: multilingual coverage, novel voice creation, controllable cloning, and studio-grade output — all from a single model with one workflow.

🌍

30-Language Multilingual

VoxCPM2 supports 30 languages out of the box, from widely used languages such as English, Spanish, French, German, Portuguese, Russian, and Japanese to less-resourced ones like Swahili, Burmese, Khmer, and Lao. There is no need to set a language tag or switch checkpoints — paste your text and VoxCPM2 detects the language automatically. The same voice can speak fluently across languages, which makes cross-lingual dubbing, localization, and bilingual content production feel like a single continuous workflow.

🎨

Voice Design

Need a voice that does not exist yet? Just describe it. A prompt like `(a young woman, gentle and sweet voice)` is enough for VoxCPM2 to generate a brand-new speaker with the matching gender, age, tone, emotion, and pace — no reference audio required. Voice Design is well suited to character creation in games, branded narration for products, audiobook casts, animation, and any project where you want a distinctive voice but cannot or do not want to record one yourself.
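When many characters need descriptions, the parenthesized form shown above can be assembled programmatically. The sketch below is a hypothetical helper (`voice_prompt` is not part of VoxCPM2); it only formats a string in the `(trait, trait)` style used on this page and makes no claim about how the model weighs each trait.

```python
def voice_prompt(*traits: str) -> str:
    """Join free-text traits into the parenthesized form used on this page.

    Hypothetical helper: it only builds a string and does not call VoxCPM2.
    """
    # Drop empty entries and surrounding whitespace so the output stays tidy.
    cleaned = [t.strip() for t in traits if t and t.strip()]
    if not cleaned:
        raise ValueError("at least one trait is required")
    return "(" + ", ".join(cleaned) + ")"


if __name__ == "__main__":
    print(voice_prompt("a young woman", "gentle and sweet voice"))
```

Keeping traits as separate arguments makes it easy to vary one attribute, such as age or pace, while holding the rest of a character description fixed.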

🎛️

Controllable Cloning

Upload a 5- to 10-second audio sample and VoxCPM2 captures the speaker's timbre, then reproduces it on any text you provide. The default mode preserves the original delivery, while inline style cues such as `(slightly faster, cheerful tone)` let you steer emotion, pace, and expression without losing the voice's core identity. The output sounds like the same person speaking — just in a new context, a new language, or a new emotional register, depending on what your project needs.

🔊

48kHz Studio-Quality Output

VoxCPM2 accepts a 16kHz reference and produces 48kHz output natively through AudioVAE V2's built-in super-resolution, so you do not need a separate upsampler or post-processing chain. The audio retains natural breaths, micro-pauses, and prosodic detail that lower-sample-rate TTS systems typically smooth out. The result is broadcast-grade fidelity suitable for podcasts, audiobooks, video voiceovers, music production, and any setting where listeners would otherwise notice the difference between synthetic and real speech.

Live Demo Gallery

Hear VoxCPM2 in Action

The samples below are produced directly by VoxCPM2 — no editing, no post-processing. Each card shows the input text and the generated audio, so you can hear exactly how the model handles different languages, emotions, and voice modes before opening the playground yourself.

30 Languages Grid

Speak in 30 Languages, Natively

VoxCPM2 generates speech in 30 languages with automatic language detection — no model swapping, no manual tags, no extra configuration. Voice quality stays consistent across high-resource languages such as English, Mandarin, Spanish, French, German, and Japanese, and remains usable on lower-resource ones like Swahili, Lao, Khmer, and Burmese. Click any language below to hear a sample produced by the same model.

How VoxCPM2 Works in 3 Steps

Whether you are cloning a real voice, designing a new one, or just turning text into speech, VoxCPM2 follows the same three-step workflow. Most users go from blank page to a finished audio file in under a minute.

Step 1 — Provide a voice prompt

Choose how you want the voice to sound. To clone a real speaker, upload a short reference audio clip — five to ten seconds is usually enough. To design a new voice instead, write a natural-language description such as `(a calm middle-aged man, deep voice)`. You can also leave this step empty and let VoxCPM2 use one of its built-in default voices.
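Before uploading a reference clip, it can help to confirm its length and sample rate. The sketch below uses only the Python standard library and assumes the clip is an uncompressed PCM WAV file; the function names and the 5 to 10 second window (taken from the guideline above) are illustrative, not a VoxCPM2 requirement or API.

```python
import io
import wave


def clip_info(wav_bytes: bytes) -> tuple[int, float]:
    """Return (sample_rate_hz, duration_seconds) for a PCM WAV file."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wf:
        rate = wf.getframerate()
        return rate, wf.getnframes() / rate


def is_suitable_reference(wav_bytes: bytes, min_s: float = 5.0, max_s: float = 10.0) -> bool:
    """Check the clip length against the 5-10 s guideline from this page."""
    _, duration = clip_info(wav_bytes)
    return min_s <= duration <= max_s


def make_silent_wav(seconds: float, rate: int = 16_000) -> bytes:
    """Build a silent 16-bit mono WAV in memory, used here only as sample input."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wf.setframerate(rate)
        wf.writeframes(b"\x00\x00" * int(seconds * rate))
    return buf.getvalue()


if __name__ == "__main__":
    sample = make_silent_wav(6.0)
    print(clip_info(sample), is_suitable_reference(sample))
```

With a real recording, read the file with `open(path, "rb").read()` and pass the bytes to `is_suitable_reference`.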

Step 2 — Enter your target text

Paste or type the content you want VoxCPM2 to speak, from a single sentence to a long-form script. Inline style cues like `(slightly faster, cheerful tone)` let you steer emotion, pace, and expression for specific phrases while keeping the underlying voice intact. The model automatically detects the language, so you do not need to label it.
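Long-form scripts may exceed the 4,000-character per-generation limit stated on this page. One way to handle that is to split the script at sentence boundaries and generate each piece separately. The sketch below is a minimal, hypothetical pre-processing helper; it assumes sentences end with `.`, `!`, or `?` followed by whitespace, which will not hold for every language.

```python
import re

MAX_CHARS = 4000  # per-generation limit stated on this page


def split_script(text: str, limit: int = MAX_CHARS) -> list[str]:
    """Split text into chunks of at most `limit` characters.

    Breaks fall after sentence-ending punctuation where possible; a single
    sentence longer than `limit` is cut at the limit as a fallback.
    """
    if limit < 1:
        raise ValueError("limit must be positive")
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= limit:
            current = candidate
            continue
        if current:
            chunks.append(current)
        # Hard-cut sentences that exceed the limit on their own.
        while len(sentence) > limit:
            chunks.append(sentence[:limit])
            sentence = sentence[limit:]
        current = sentence
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    for chunk in split_script("One. Two. Three.", limit=10):
        print(repr(chunk))
```

Splitting on sentence boundaries rather than at a fixed offset avoids cutting a word or phrase in half, which would otherwise produce an audible break between generated files.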

Step 3 — Generate, download

Click generate and VoxCPM2 returns a 48kHz audio file within seconds. Preview it directly in the browser, download it as an MP3 or WAV file (the export formats listed in the plans), or copy a share link. The output is ready to drop into a podcast, a video timeline, a game build, a learning platform, or any other project that needs natural-sounding speech.

Benchmark Performance — State of the Art on Public Tests

VoxCPM2 has been evaluated on the most widely used speech synthesis benchmarks in 2026, posting state-of-the-art or top-tier scores on multilingual word error rate, speaker similarity, and instruction-following metrics. The numbers below come from the official paper (arXiv 2509.24650) and independent third-party evaluations — not internal tests. Each row links back to the underlying dataset so you can verify the result yourself.

Benchmark	What it measures	VoxCPM2 result
Seed-TTS-eval (EN / ZH)	Multilingual WER + speaker similarity	State of the art
CV3-eval	Cross-lingual voice transfer	Top tier
InstructTTSEval	Voice Design instruction-following	State of the art
MiniMax Multilingual Test	Low-resource language quality	Top tier

Technical Specifications

For teams evaluating VoxCPM2 for self-hosting or integration, the headline specs are below. They describe the underlying model — which is the same model that powers our hosted platform. VoxCPM2 supports batch and streaming inference, with mature tooling for production deployment.

Specification	Value
Parameters	2B
Audio Output	48kHz studio quality
Languages	30
License	Apache 2.0
Architecture	Tokenizer-Free Diffusion Autoregressive
Backbone	MiniCPM-4
Reference Input	16kHz
VRAM	~8 GB
RTF (RTX 4090)	0.30 standard / 0.13 with Nano-vLLM
Training Data	2M+ hours, multilingual
Streaming	Yes
Fine-tuning	LoRA + Full SFT
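To make the RTF figures concrete: real-time factor is conventionally the processing time divided by the duration of the audio produced, so values below 1 mean faster than real time. Assuming the figures above follow that convention, the sketch below estimates compute time for a clip. The function name is illustrative, and actual timings will depend on hardware, batch size, and text length.

```python
def synthesis_seconds(audio_seconds: float, rtf: float) -> float:
    """Estimated compute time, assuming RTF = processing time / audio duration."""
    if audio_seconds < 0 or rtf <= 0:
        raise ValueError("audio_seconds must be >= 0 and rtf must be > 0")
    return audio_seconds * rtf


if __name__ == "__main__":
    # RTF values are the RTX 4090 figures quoted in the specifications.
    for label, rtf in (("standard", 0.30), ("Nano-vLLM", 0.13)):
        print(f"{label}: {synthesis_seconds(60.0, rtf):.1f} s to generate 60 s of audio")
```

Under this reading, one minute of speech takes roughly 18 seconds of compute at RTF 0.30 and roughly 8 seconds at RTF 0.13.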

Trusted by the Community

Trusted by Developers and Researchers Worldwide

VoxCPM2 is the open-source speech model behind our platform — downloaded more than 234,000 times a month on Hugging Face, starred by thousands of developers on GitHub, and published as a peer-reviewable paper on arXiv. The technology is transparent, auditable, and free of vendor lock-in.

Metric	Value
Monthly Hugging Face Downloads	234K+
GitHub Stars	12K+
Training Hours	2M+
Peer-Reviewable Paper	arXiv 2509.24650
Fully Open License	Apache 2.0

Use Cases

What People Build with VoxCPM2

Common production paths for VoxCPM2 across narration, localization, game voices, accessibility, agents, and education.

Podcast & Audiobook Narration

Produce hours of natural-sounding narration in a single voice without booking a recording studio. VoxCPM2 handles long-form text gracefully, preserves prosody across chapters, and supports consistent narrator identity. Solo podcasters and audiobook publishers use it to reduce production time from days of recording and editing to a single afternoon at the keyboard.

Voice Localization & Dubbing

Reuse one voice across 30 languages while keeping the original speaker's timbre. Upload a short reference clip and VoxCPM2 reproduces that timbre in each language, which is ideal for YouTubers expanding into international markets, online course creators serving multilingual students, and marketing teams localizing campaign assets without re-hiring voice talent in every region.

Game & Character Voice Design

Generate distinct voices for non-player characters by describing them in text — `a gruff dwarven blacksmith`, `a cheerful elven merchant`, `a quiet AI companion`. Indie game studios use Voice Design to populate large casts without contracting a full voice acting team. The same workflow scales from prototype builds to shipped titles.

Accessibility & Screen Readers

Give screen-reader users a high-fidelity, expressive output instead of the flat default voices that ship with most operating systems. The 48kHz audio reduces listener fatigue over long reading sessions, and the wide language coverage makes VoxCPM2 a strong choice for accessibility tools targeting global audiences with diverse language needs.

AI Agents & Voice Assistants

Power chatbots, customer support agents, and voice-first applications with VoxCPM2. Real-time streaming keeps response latency low enough for natural conversation, and Voice Design lets you give an agent a unique on-brand voice without licensing a third-party voice actor or paying recurring per-character fees elsewhere.

Education & Content Localization

Convert lectures, tutorials, and learning scripts into multiple languages with consistent narration. Teachers, ed-tech platforms, and corporate training teams use VoxCPM2 to extend the reach of existing course material without re-recording each version, while keeping a single recognizable voice across all language editions of the same course.

Comparison Table

How VoxCPM2 Compares

VoxCPM2 is one of several leading TTS systems available in 2026. The table below summarizes how it compares with ElevenLabs, F5-TTS, CosyVoice 2, and XTTS v2 on the dimensions most users care about: language coverage, output quality, voice design, cloning, license, and self-hosting. Among the systems in the table, VoxCPM2 is the only one that combines open-source licensing, 30-language support, native 48kHz output, and text-driven voice design in a single model. Each comparison page below covers one competitor in detail.

Feature	VoxCPM2	ElevenLabs	F5-TTS	CosyVoice 2	XTTS v2
Open Source	✅ Apache 2.0	❌ Closed	✅ MIT	✅ Apache 2.0	⚠️ CPML (non-commercial)
Languages	30	32	2	5+	17
48kHz Output	✅	✅	❌ 24k	❌ 24k	❌ 24k
Voice Design from Text	✅	❌	❌	⚠️ partial	❌
Zero-Shot Cloning	✅	✅	✅	✅	✅
Streaming	✅	✅	❌	✅	⚠️
Commercial Use	✅ Free	💰 Paid	✅	⚠️ Limited	❌ Non-commercial
Self-Host	✅	❌	✅	✅	✅

Pricing

Simple Pricing — Pay Once, Credits Never Expire

VoxCPM2 uses a free starter tier and one-time credit packs — no subscription, no monthly bills, no auto-renew. Pick the pack that fits your generation volume and scale only when you need more.

1 credit ≈ 100 characters ≈ 8 seconds of speech. All plans include multilingual generation, voice design, and cloning workflows.
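The conversion above can be turned into a quick estimator for budgeting a script. The sketch below uses the two ratios from this page (100 characters per credit, about 8 seconds per credit); rounding partial credits up is an assumption, since the page does not say how fractions of a credit are billed.

```python
CHARS_PER_CREDIT = 100   # from the pricing note on this page
SECONDS_PER_CREDIT = 8   # approximate, from the same note


def credits_needed(char_count: int) -> int:
    """Credits for a script, assuming partial credits round up (an assumption)."""
    if char_count < 0:
        raise ValueError("char_count must be non-negative")
    # Integer ceiling division avoids floating-point rounding surprises.
    return (char_count + CHARS_PER_CREDIT - 1) // CHARS_PER_CREDIT


def estimated_seconds(char_count: int) -> float:
    """Approximate speech duration for a script of the given length."""
    return char_count / CHARS_PER_CREDIT * SECONDS_PER_CREDIT


if __name__ == "__main__":
    chars = 80_000
    hours = estimated_seconds(chars) / 3600
    print(f"{chars} characters: {credits_needed(chars)} credits, about {hours:.1f} hours")
```

For an 80,000-character script this gives 800 credits and about 1.8 hours of speech, which matches the Basic pack figures.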

Free: $0

No credit card required. Generate your first voiceover in under 30 seconds.

2 credits included
≈ 200 characters
≈ 16 seconds of speech
All 30 languages
Voice Cloning
Voice Design
MP3 / WAV export

Basic: $9.90

Great for a first purchase. Suited to short videos, ads, and trying things out.

800 credits
≈ 80,000 characters
≈ 1.8 hours of speech
All 30 languages
Voice Cloning
Voice Design
MP3 / WAV export
Everything in Free
Commercial license
Email support
Credits never expire

Powering millions of top creators

VoxCPM2 ships continuous updates with new languages, performance improvements, and product features. The most recent releases are listed below.

Bring characters to life with Voice Cloning

Narrate your videos with Text to Speech

Create audiobooks with Story Studio

Explore 2M+ voices in the Voice Library

Frequently Asked Questions

Answers to the VoxCPM2 questions people ask before trying the browser playground.

What is VoxCPM2?

VoxCPM2 is OpenBMB's open-source, 2-billion-parameter, tokenizer-free TTS model that generates 48kHz speech in 30 languages with built-in voice design and voice cloning.

Start Generating Speech with VoxCPM2

Type your script, pick a voice, hit generate. Studio-quality 48kHz speech in 30 languages — free, in your browser, no setup.

