Expressive Long-Form Voice Synthesis

Generate up to 90 minutes of multi-speaker conversational audio with emergent singing. Built on Qwen2.5 + continuous speech tokenization + diffusion.

90 min · Max generation
4 Speakers · Per session
7.5 Hz · Token rate

Capabilities

🎭

Expressive Dialogue

Generate multi-turn conversations with natural emotion, hesitation, and turn-taking. Not just text-reading—real dialogue.

👥

Multi-Speaker

Up to 4 distinct voices in one generation. Each cloned from a short reference audio. Speaker transitions are seamless.

🎶

Emergent Singing

The model can produce melodic vocals without explicit singing training. A surprising capability of the continuous tokenizer.

🌍

Multi-Language

EN, ZH, JA, KO, and European languages. Code-switch between languages within a single utterance.

⏱️

Ultra Long-Form

90-minute generation in one pass. 7.5 Hz continuous tokens with 3200× compression preserve quality over long durations.

🔧

LLM-Powered

Built on Qwen2.5 language model backbone. Understands context, handles complex scripts, and follows instructions.

How VibeVoice Works: Architecture & Applications

Architecture: LLM Meets Speech

VibeVoice combines a large language model (Qwen2.5) with continuous speech tokenizers at 7.5 Hz and a flow-matching diffusion head. Traditional TTS systems convert text to mel-spectrograms frame by frame; VibeVoice instead generates compact speech tokens autoregressively, then decodes them to waveform. This enables understanding of context, dialogue flow, and even emotional subtext in the input script.
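As a rough mental model, the loop described above can be sketched in a few lines. This is a toy illustration only, not the VibeVoice API: the token dimension, function names, and stub bodies are invented for clarity. It shows the shape of the pipeline, with an autoregressive token loop feeding a diffusion-style decode step that expands each 7.5 Hz token back into 3,200 audio samples.

```python
def llm_next_token(context):
    """Stand-in for the Qwen2.5 backbone: emit one continuous speech token.
    (Hypothetical 64-dim token; the real dimensionality is not specified here.)"""
    return [0.0] * 64

def diffusion_decode(tokens, sample_rate=24_000, token_rate=7.5):
    """Stand-in for the flow-matching diffusion head: tokens -> waveform.
    Each token expands to sample_rate / token_rate = 3200 samples."""
    samples_per_token = int(sample_rate / token_rate)
    return [0.0] * (len(tokens) * samples_per_token)

def generate(seconds, token_rate=7.5):
    """Autoregressively emit tokens at 7.5 Hz, then decode to audio samples."""
    tokens = []
    for _ in range(int(seconds * token_rate)):
        tokens.append(llm_next_token(tokens))
    return diffusion_decode(tokens)

audio = generate(seconds=2)
print(len(audio))  # 2 s of 24 kHz audio = 48,000 samples
```

The key point the sketch makes concrete: the expensive autoregressive loop runs at 7.5 steps per second of audio, while the heavy lifting of waveform reconstruction happens in the decoder.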

The continuous tokenizer achieves 3200× compression: one second of 24 kHz audio (24,000 samples) is represented as ~7.5 tokens. This extreme compression is what enables 90-minute generation: the model needs only about 27,000 tokens per hour, or roughly 40,500 for a full 90-minute session, well within the context window of modern LLMs.
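The arithmetic is easy to verify from the two figures quoted in this section (24 kHz sample rate, 7.5 Hz token rate):

```python
SAMPLE_RATE = 24_000   # samples per second (24 kHz audio)
TOKEN_RATE = 7.5       # continuous speech tokens per second

# Samples represented by one token: the 3200x compression factor
compression = SAMPLE_RATE / TOKEN_RATE
print(compression)                     # 3200.0

# Token budget for long-form generation
tokens_per_hour = TOKEN_RATE * 3600
tokens_90_min = TOKEN_RATE * 90 * 60
print(tokens_per_hour, tokens_90_min)  # 27000.0 40500.0
```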

Multi-Speaker Dialogue Generation

In a multi-speaker scenario, the input script uses speaker tags (e.g., [Speaker A], [Speaker B]) to indicate who is speaking. Each speaker is primed with a 5–15 second reference audio. The model clones the voice characteristics and maintains speaker identity throughout the conversation, including natural turn-taking pauses and overlapping speech when appropriate.
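A script in this style might look like the following. This is a hypothetical example: the exact tag syntax, file layout, and reference-audio pairing are assumptions for illustration and should be checked against the project's documentation.

```python
# Illustrative multi-speaker script using the [Speaker X] tag convention.
script = """\
[Speaker A]: Welcome back to the show. Today we're talking long-form TTS.
[Speaker B]: Thanks for having me. Ninety minutes in one pass is wild.
[Speaker A]: Let's get into how the tokenizer makes that possible.
"""

# Each speaker is primed with a 5-15 second reference clip (paths hypothetical).
speakers = {
    "Speaker A": "ref_audio/host_10s.wav",
    "Speaker B": "ref_audio/guest_08s.wav",
}

# Quick sanity check: count turns per speaker before submitting the script.
turns = {}
for line in script.splitlines():
    if line.startswith("["):
        name = line[1:line.index("]")]
        turns[name] = turns.get(name, 0) + 1
print(turns)  # {'Speaker A': 2, 'Speaker B': 1}
```

Validating the script up front (tags present, every tagged speaker has a reference clip) is cheap insurance before committing to a generation run that may last tens of minutes.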

Practical Applications

Podcast producers use VibeVoice to generate full-length episodes from scripts, complete with host and guest voices. Audiobook publishers generate chapter-length narration with consistent narrator voice over 8+ hours. Localization teams produce dubbed dialogues for videos in multiple languages from a single script. Corporate training departments create interactive audio modules with role-played scenarios.

Comparison with Other TTS Models

Compared to Bark (Suno), VibeVoice handles much longer content—Bark is limited to ~15 seconds per generation. Compared to XTTS (Coqui), VibeVoice's LLM backbone provides better contextual understanding and emotion. Compared to IndexTTS2, VibeVoice's strength is in long-form and multi-speaker scenarios, while IndexTTS2 excels at precise duration control for dubbing. The two are complementary rather than competitors.

Getting Started

Clone the repository from GitHub, install dependencies with pip install -r requirements.txt, and download model weights from Hugging Face. A Gradio demo is provided for quick experimentation. For production deployment, use the included FastAPI server with streaming output support.

Who Uses VibeVoice Pro

  • Podcast producers — full-episode generation from scripts with distinct host/guest voices
  • Audiobook publishers — chapter-length narration with consistent character voices
  • Localization teams — multi-language dubbing from a single source script
  • Corporate training — interactive audio modules with role-played dialogues
  • Creative writers — hearing their scripts performed by AI voices before recording

Frequently Asked Questions

How long can VibeVoice generate in one pass?

Up to 90 minutes of continuous speech, thanks to the 7.5 Hz tokenizer with 3200× compression.

How many speakers can it handle?

Up to 4 distinct speakers per generation. Each cloned from a short reference audio.

Can it sing?

Yes. VibeVoice exhibits emergent singing capability—melodic vocals without explicit singing training data.

What languages are supported?

EN, ZH, JA, KO, and major European languages. Code-switching is supported within a single generation.

Is it open-source?

Model weights and inference code are publicly available on GitHub and Hugging Face.

About VibeVoice Pro

VibeVoice Pro is built for creators and studios who need long-form, multi-speaker voice synthesis. The LLM-powered architecture delivers natural dialogue with emotional range, while the continuous tokenizer enables generation lengths far beyond what typical open-source TTS systems support.