VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild
¹The University of Texas at Austin
²Rembrand
TL;DR
VoiceCraft is a token-infilling neural codec language model that achieves state-of-the-art performance on both speech editing and zero-shot text-to-speech (TTS) on in-the-wild data, including audiobooks, internet videos, and podcasts.
To clone an unseen voice or edit a recording, VoiceCraft needs only a few seconds of the voice.
Speech Editing with VoiceCraft
Guess which part is synthesized!
V...
Read more at jasonppy.github.io