
Audio & Voice Generation

modelBridge supports a full audio pipeline — text-to-speech, text-to-audio, audio-to-audio, voice conversion, and more. Generate professional voiceover, sound effects, or music without leaving Premiere Pro, and the result imports directly to your timeline on the correct audio track.

| Category | What It Does | Example Use Case |
| --- | --- | --- |
| Text to Speech | Convert written text to spoken audio | Voiceover narration, dialogue, placeholder VO |
| Text to Audio | Generate sound from a text description | Sound effects, ambient audio, music |
| Audio to Audio | Transform existing audio | Voice conversion, enhancement, noise removal |
| Speech to Speech | Real-time voice transformation | Change a voice while preserving timing and emotion |
| Video to Audio | Generate audio from video content | Lip sync, music visualization |

All audio categories use the same schema-driven interface as image and video models — the plugin reads the model’s API specification and generates the appropriate form controls automatically.
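To make the schema-driven idea concrete, here is a minimal sketch of how a parameter specification might be mapped to form controls. The `ParamSpec` and `Control` shapes are hypothetical, not modelBridge's actual types:

```typescript
// Hypothetical sketch: deriving a form control from a model parameter spec.
// Field names and control kinds are illustrative only.
type ParamSpec = {
  name: string;
  type: "string" | "number" | "boolean" | "enum";
  options?: string[]; // present for enum parameters (e.g. voice names)
  min?: number;
  max?: number;
};

type Control =
  | { kind: "text"; name: string }
  | { kind: "slider"; name: string; min: number; max: number }
  | { kind: "checkbox"; name: string }
  | { kind: "dropdown"; name: string; options: string[] };

function controlFor(p: ParamSpec): Control {
  switch (p.type) {
    case "enum":
      return { kind: "dropdown", name: p.name, options: p.options ?? [] };
    case "number":
      return { kind: "slider", name: p.name, min: p.min ?? 0, max: p.max ?? 1 };
    case "boolean":
      return { kind: "checkbox", name: p.name };
    default:
      return { kind: "text", name: p.name };
  }
}
```

A voice-selection enum would become a dropdown, a speed parameter a slider, and so on, without any per-model UI code.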

To generate a voiceover with text to speech:

  1. Select a TTS model from the model dropdown (search for “tts”, “speech”, or “elevenlabs”)
  2. Write your script in the prompt field
  3. Adjust voice parameters — each model exposes different controls (voice selection, speed, emotion, language)
  4. Check the cost estimate — updates live as you adjust parameters
  5. Click Generate — the audio downloads and imports to your Project Bin

When you click Import to Timeline, audio results are placed on the first available audio track at the playhead position. This is automatic — you do not need to manually route audio to the correct track type.
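The "first available audio track" rule can be sketched as follows. The `AudioTrack` and `Clip` shapes are simplified stand-ins; Premiere Pro's actual scripting API differs:

```typescript
// Hypothetical sketch of "first available audio track" selection.
type Clip = { start: number; end: number }; // times in seconds
type AudioTrack = { locked: boolean; clips: Clip[] };

// Returns the index of the first unlocked audio track with no clip
// overlapping the playhead position, or -1 if every track is occupied.
function firstAvailableTrack(tracks: AudioTrack[], playhead: number): number {
  return tracks.findIndex(
    (t) =>
      !t.locked &&
      !t.clips.some((c) => c.start <= playhead && playhead < c.end)
  );
}
```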

Before importing, you can preview audio results with an inline player directly in the result card. The player includes play/pause controls and a progress bar. Only one audio preview can play at a time — starting a new one pauses the previous.
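The one-preview-at-a-time behavior amounts to a small coordination rule. A minimal sketch, with a hypothetical `Player` interface standing in for the real inline player component:

```typescript
// Hypothetical sketch: starting a new preview pauses the previous one.
interface Player {
  playing: boolean;
  play(): void;
  pause(): void;
}

class PreviewManager {
  private current: Player | null = null;

  play(p: Player): void {
    // Pause whichever preview was playing before starting the new one.
    if (this.current && this.current !== p) this.current.pause();
    this.current = p;
    p.play();
  }
}
```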

ElevenLabs Eleven v3 supports expressive tags that control emotion, accent, and delivery. modelBridge surfaces these through a tag bar directly above the prompt field:

  • Emotion chips — click to insert tags like [excited], [whispers], [laughs], [sad], [angry] at the cursor position
  • Accent dropdown — common accents (British, American, Australian, Indian, etc.) plus a custom text option, inserting [accent: British] or your custom accent tag
  • Collapsible — the tag bar collapses with a chevron toggle when you do not need it

The tag bar is a shortcut, not a limitation. You can type any custom tag directly in the prompt — [cheerful], [slowly], [in a deep voice] — the bar just makes the most common tags one-click accessible.
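Inserting a tag at the cursor is straightforward string surgery. A sketch, modeling the prompt as a string plus a cursor offset (the real implementation works against the prompt field's selection state):

```typescript
// Hypothetical sketch: insert an expressive tag at the cursor position,
// followed by a space so the surrounding text stays readable.
function insertTag(prompt: string, cursor: number, tag: string): string {
  return prompt.slice(0, cursor) + tag + " " + prompt.slice(cursor);
}

// The accent dropdown would produce its tag the same way:
function accentTag(accent: string): string {
  return `[accent: ${accent}]`;
}
```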

Non-v3 ElevenLabs models show contextual writing tips — guidance on how that specific model interprets prompt formatting, punctuation for pacing, and natural speech patterns. These tips appear automatically when the model is selected.

Beyond voiceover, text-to-audio models generate sound effects and music from descriptions:

  • “Thunderstorm with distant sirens”
  • “Footsteps on gravel, slow and deliberate”
  • “Upbeat electronic music, 120 BPM, no vocals”

The same workflow applies — describe the sound, adjust parameters, generate, and the result imports to your project.

For voice conversion, select a voice clip on your timeline, choose a voice conversion model, and click Generate. The AI-converted voice imports directly to the first available audio track. No export, no browser, no re-import.

This is a 3-click workflow:

  1. Select a voice clip on the timeline
  2. Choose an audio-to-audio model
  3. Click Generate

Dual Mode works with audio models the same way it works with video. Generate the same script with two different TTS models side by side, preview both results, and import the one you prefer. This is useful for comparing voices or finding the right tone for a narration.

Audio generations are tracked in the Billing tab alongside video and image generations, with the same cost estimate and actual cost display. Token-based TTS models show their rate honestly — “Token-based pricing · ~$2.50 / 1M tokens” — rather than converting to a fabricated per-second rate.
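A token-based estimate is simple to compute directly, which is why no per-second conversion is needed. A sketch using the ~$2.50 / 1M tokens figure from the example above; the characters-per-token ratio is an illustrative assumption, not a published rate:

```typescript
// Hypothetical sketch: estimate TTS cost from token count.
// tokensPerChar = 0.25 assumes roughly 4 characters per token.
function estimateTokenCost(
  script: string,
  usdPerMillionTokens: number,
  tokensPerChar = 0.25
): number {
  const tokens = script.length * tokensPerChar;
  return (tokens / 1_000_000) * usdPerMillionTokens;
}
```

Because the estimate is a pure function of the script and the model's rate, it can be recomputed live as the prompt changes.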