
Audio & Voice Generation

modelBridge supports a full audio pipeline — text-to-speech, text-to-audio, audio-to-audio, voice conversion, and more. Generate professional voiceover, sound effects, or music without leaving Premiere Pro, and the result imports directly to your timeline on the correct audio track.

| Category | What It Does | Example Use Case |
| --- | --- | --- |
| Text to Speech | Convert written text to spoken audio | Voiceover narration, dialogue, placeholder VO |
| Text to Audio | Generate sound from a text description | Sound effects, ambient audio, music |
| Audio to Audio | Transform existing audio | Voice conversion, enhancement, noise removal |
| Speech to Speech | Real-time voice transformation | Change a voice while preserving timing and emotion |
| Video to Audio | Generate audio from video content | Lip sync, music visualization |

All audio categories use the same schema-driven interface as image and video models — the plugin reads the model’s API specification and generates the appropriate form controls automatically.
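To make the schema-driven idea concrete, here is a minimal sketch of how a parameter specification might be mapped to form controls. The `ParamSpec` and `Control` shapes are hypothetical, not modelBridge's actual types:

```typescript
// Hypothetical sketch: deriving a form control from a model parameter spec.
// Field names and control kinds are illustrative only.
type ParamSpec = {
  name: string;
  type: "string" | "number" | "boolean" | "enum";
  options?: string[]; // present for enum parameters (e.g. voice names)
  min?: number;
  max?: number;
};

type Control =
  | { kind: "text"; name: string }
  | { kind: "slider"; name: string; min: number; max: number }
  | { kind: "checkbox"; name: string }
  | { kind: "dropdown"; name: string; options: string[] };

function controlFor(p: ParamSpec): Control {
  switch (p.type) {
    case "enum":
      return { kind: "dropdown", name: p.name, options: p.options ?? [] };
    case "number":
      return { kind: "slider", name: p.name, min: p.min ?? 0, max: p.max ?? 1 };
    case "boolean":
      return { kind: "checkbox", name: p.name };
    default:
      return { kind: "text", name: p.name };
  }
}
```

A voice-selection enum would become a dropdown, a speed parameter a slider, and so on, without any per-model UI code.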

To generate a voiceover with text to speech:

  1. Select a TTS model from the model dropdown (search for “tts”, “speech”, or “elevenlabs”)
  2. Write your script in the prompt field
  3. Adjust voice parameters — each model exposes different controls (voice selection, speed, emotion, language)
  4. Check the cost estimate — updates live as you adjust parameters
  5. Click Generate — the audio downloads and imports to your Project Bin

When you click Import to Timeline, audio results are placed on the first available audio track at the playhead position. This is automatic — you do not need to manually route audio to the correct track type.
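The "first available audio track" rule can be sketched as follows. The `AudioTrack` and `Clip` shapes are simplified stand-ins; Premiere Pro's actual scripting API differs:

```typescript
// Hypothetical sketch of "first available audio track" selection.
type Clip = { start: number; end: number }; // times in seconds
type AudioTrack = { locked: boolean; clips: Clip[] };

// Returns the index of the first unlocked audio track with no clip
// overlapping the playhead position, or -1 if every track is occupied.
function firstAvailableTrack(tracks: AudioTrack[], playhead: number): number {
  return tracks.findIndex(
    (t) =>
      !t.locked &&
      !t.clips.some((c) => c.start <= playhead && playhead < c.end)
  );
}
```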

Before importing, you can preview audio results with an inline player directly in the result card. The player includes play/pause controls and a progress bar. Only one audio preview can play at a time — starting a new one pauses the previous.
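The one-preview-at-a-time behavior amounts to a small coordination rule. A minimal sketch, with a hypothetical `Player` interface standing in for the real inline player component:

```typescript
// Hypothetical sketch: starting a new preview pauses the previous one.
interface Player {
  playing: boolean;
  play(): void;
  pause(): void;
}

class PreviewManager {
  private current: Player | null = null;

  play(p: Player): void {
    // Pause whichever preview was playing before starting the new one.
    if (this.current && this.current !== p) this.current.pause();
    this.current = p;
    p.play();
  }
}
```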

ElevenLabs Eleven v3 supports expressive tags that control emotion, accent, and delivery. modelBridge surfaces these through a tag bar directly above the prompt field:

  • Emotion chips — click to insert tags like [excited], [whispers], [laughs], [sad], [angry] at the cursor position
  • Accent dropdown — common accents (British, American, Australian, Indian, etc.) plus a custom text option, inserting [accent: British] or your custom accent tag
  • Collapsible — the tag bar collapses with a chevron toggle when you do not need it

The tag bar is a shortcut, not a limitation. You can type any custom tag directly in the prompt — [cheerful], [slowly], [in a deep voice] — the bar just makes the most common tags one-click accessible.
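Inserting a tag at the cursor is straightforward string surgery. A sketch, modeling the prompt as a string plus a cursor offset (the real implementation works against the prompt field's selection state):

```typescript
// Hypothetical sketch: insert an expressive tag at the cursor position,
// followed by a space so the surrounding text stays readable.
function insertTag(prompt: string, cursor: number, tag: string): string {
  return prompt.slice(0, cursor) + tag + " " + prompt.slice(cursor);
}

// The accent dropdown would produce its tag the same way:
function accentTag(accent: string): string {
  return `[accent: ${accent}]`;
}
```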

Non-v3 ElevenLabs models show contextual writing tips — guidance on how that specific model interprets prompt formatting, punctuation for pacing, and natural speech patterns. These tips appear automatically when the model is selected.

Beyond voiceover, text-to-audio models generate sound effects and music from descriptions:

  • “Thunderstorm with distant sirens”
  • “Footsteps on gravel, slow and deliberate”
  • “Upbeat electronic music, 120 BPM, no vocals”

The same workflow applies — describe the sound, adjust parameters, generate, and the result imports to your project.

For voice conversion, select a voice clip on your timeline, choose a voice conversion model, and click Generate. The AI-converted voice imports directly to the first available audio track. No export, no browser, no re-import.

This is a 3-click workflow:

  1. Select a voice clip on the timeline
  2. Choose an audio-to-audio model
  3. Click Generate

Dual Mode works with audio models the same way it works with video. Generate the same script with two different TTS models side by side, preview both results, and import the one you prefer. This is useful for comparing voices or finding the right tone for a narration.

Audio generations are tracked in the Billing tab alongside video and image generations, with the same cost estimate and actual cost display. Token-based TTS models show their rate honestly — “Token-based pricing · ~$2.50 / 1M tokens” — rather than converting to a fabricated per-second rate.
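A token-based estimate is simple to compute directly, which is why no per-second conversion is needed. A sketch using the ~$2.50 / 1M tokens figure from the example above; the characters-per-token ratio is an illustrative assumption, not a published rate:

```typescript
// Hypothetical sketch: estimate TTS cost from token count.
// tokensPerChar = 0.25 assumes roughly 4 characters per token.
function estimateTokenCost(
  script: string,
  usdPerMillionTokens: number,
  tokensPerChar = 0.25
): number {
  const tokens = script.length * tokensPerChar;
  return (tokens / 1_000_000) * usdPerMillionTokens;
}
```

Because the estimate is a pure function of the script and the model's rate, it can be recomputed live as the prompt changes.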