Expressive Long-Form Voice Synthesis

Generate up to 90 minutes of multi-speaker conversational audio with emergent singing. Built on Qwen2.5 + continuous speech tokenization + diffusion.

90 min · Max generation
4 Speakers · Per session
7.5 Hz · Token rate

Capabilities

🎭

Expressive Dialogue

Generate multi-turn conversations with natural emotion, hesitation, and turn-taking. Not just text-reading—real dialogue.

👥

Multi-Speaker

Up to 4 distinct voices in one generation. Each cloned from a short reference audio. Speaker transitions are seamless.

🎶

Emergent Singing

The model can produce melodic vocals without explicit singing training. A surprising capability of the continuous tokenizer.

🌍

Multi-Language

EN, ZH, JA, KO, and European languages. Code-switch between languages within a single utterance.

⏱️

Ultra Long-Form

90-minute generation in one pass. 7.5 Hz continuous tokens with 3200× compression preserve quality over long durations.

🔧

LLM-Powered

Built on Qwen2.5 language model backbone. Understands context, handles complex scripts, and follows instructions.

How VibeVoice Works: Architecture & Applications

Architecture: LLM Meets Speech

VibeVoice combines a large language model (Qwen2.5) with continuous speech tokenizers at 7.5 Hz and a flow-matching diffusion head. Traditional TTS systems convert text to mel-spectrograms frame by frame; VibeVoice instead generates compact speech tokens autoregressively, then decodes them to waveform. This enables understanding of context, dialogue flow, and even emotional subtext in the input script.
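As a rough mental model, the loop described above can be sketched in a few lines. This is a toy illustration only, not the VibeVoice API: the token dimension, function names, and stub bodies are invented for clarity. It shows the shape of the pipeline, with an autoregressive token loop feeding a diffusion-style decode step that expands each 7.5 Hz token back into 3,200 audio samples.

```python
def llm_next_token(context):
    """Stand-in for the Qwen2.5 backbone: emit one continuous speech token.
    (Hypothetical 64-dim token; the real dimensionality is not specified here.)"""
    return [0.0] * 64

def diffusion_decode(tokens, sample_rate=24_000, token_rate=7.5):
    """Stand-in for the flow-matching diffusion head: tokens -> waveform.
    Each token expands to sample_rate / token_rate = 3200 samples."""
    samples_per_token = int(sample_rate / token_rate)
    return [0.0] * (len(tokens) * samples_per_token)

def generate(seconds, token_rate=7.5):
    """Autoregressively emit tokens at 7.5 Hz, then decode to audio samples."""
    tokens = []
    for _ in range(int(seconds * token_rate)):
        tokens.append(llm_next_token(tokens))
    return diffusion_decode(tokens)

audio = generate(seconds=2)
print(len(audio))  # 2 s of 24 kHz audio = 48,000 samples
```

The key point the sketch makes concrete: the expensive autoregressive loop runs at 7.5 steps per second of audio, while the heavy lifting of waveform reconstruction happens in the decoder.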

The continuous tokenizer achieves 3200× compression: one second of 24 kHz audio (24,000 samples) is represented as ~7.5 tokens. This extreme compression is what enables 90-minute generation: the model needs only about 27,000 tokens per hour, or roughly 40,500 for a full 90-minute session, well within the context window of modern LLMs.
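The arithmetic is easy to verify from the two figures quoted in this section (24 kHz sample rate, 7.5 Hz token rate):

```python
SAMPLE_RATE = 24_000   # samples per second (24 kHz audio)
TOKEN_RATE = 7.5       # continuous speech tokens per second

# Samples represented by one token: the 3200x compression factor
compression = SAMPLE_RATE / TOKEN_RATE
print(compression)                     # 3200.0

# Token budget for long-form generation
tokens_per_hour = TOKEN_RATE * 3600
tokens_90_min = TOKEN_RATE * 90 * 60
print(tokens_per_hour, tokens_90_min)  # 27000.0 40500.0
```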

Multi-Speaker Dialogue Generation

In a multi-speaker scenario, the input script uses speaker tags (e.g., [Speaker A], [Speaker B]) to indicate who is speaking. Each speaker is primed with a 5–15 second reference audio. The model clones the voice characteristics and maintains speaker identity throughout the conversation, including natural turn-taking pauses and overlapping speech when appropriate.
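A script in this style might look like the following. This is a hypothetical example: the exact tag syntax, file layout, and reference-audio pairing are assumptions for illustration and should be checked against the project's documentation.

```python
# Illustrative multi-speaker script using the [Speaker X] tag convention.
script = """\
[Speaker A]: Welcome back to the show. Today we're talking long-form TTS.
[Speaker B]: Thanks for having me. Ninety minutes in one pass is wild.
[Speaker A]: Let's get into how the tokenizer makes that possible.
"""

# Each speaker is primed with a 5-15 second reference clip (paths hypothetical).
speakers = {
    "Speaker A": "ref_audio/host_10s.wav",
    "Speaker B": "ref_audio/guest_08s.wav",
}

# Quick sanity check: count turns per speaker before submitting the script.
turns = {}
for line in script.splitlines():
    if line.startswith("["):
        name = line[1:line.index("]")]
        turns[name] = turns.get(name, 0) + 1
print(turns)  # {'Speaker A': 2, 'Speaker B': 1}
```

Validating the script up front (tags present, every tagged speaker has a reference clip) is cheap insurance before committing to a generation run that may last tens of minutes.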

Practical Applications

Podcast producers use VibeVoice to generate full-length episodes from scripts, complete with host and guest voices. Audiobook publishers generate chapter-length narration with consistent narrator voice over 8+ hours. Localization teams produce dubbed dialogues for videos in multiple languages from a single script. Corporate training departments create interactive audio modules with role-played scenarios.

Comparison with Other TTS Models

Compared to Bark (Suno), VibeVoice handles much longer content—Bark is limited to ~15 seconds per generation. Compared to XTTS (Coqui), VibeVoice's LLM backbone provides better contextual understanding and emotion. Compared to IndexTTS2, VibeVoice's strength is in long-form and multi-speaker scenarios, while IndexTTS2 excels at precise duration control for dubbing. The two are complementary rather than competitors.

Getting Started

Clone the repository from GitHub, install dependencies with pip install -r requirements.txt, and download model weights from Hugging Face. A Gradio demo is provided for quick experimentation. For production deployment, use the included FastAPI server with streaming output support.

Who Uses VibeVoice Pro

  • Podcast producers — full-episode generation from scripts with distinct host/guest voices
  • Audiobook publishers — chapter-length narration with consistent character voices
  • Localization teams — multi-language dubbing from a single source script
  • Corporate training — interactive audio modules with role-played dialogues
  • Creative writers — hearing their scripts performed by AI voices before recording

Frequently Asked Questions

How long can VibeVoice generate in one pass?

Up to 90 minutes of continuous speech, thanks to the 7.5 Hz tokenizer with 3200× compression.

How many speakers can it handle?

Up to 4 distinct speakers per generation. Each cloned from a short reference audio.

Can it sing?

Yes. VibeVoice exhibits emergent singing capability—melodic vocals without explicit singing training data.

What languages are supported?

EN, ZH, JA, KO, and major European languages. Code-switching is supported within a single generation.

Is it open-source?

Model weights and inference code are publicly available on GitHub and Hugging Face.

About VibeVoice Pro

VibeVoice Pro is built for creators and studios who need long-form, multi-speaker voice synthesis. The LLM-powered architecture delivers natural dialogue with emotional range, while the continuous tokenizer enables generation lengths far beyond what typical open-source TTS systems support.