"University of Texas Debuts VoiceCraft, a Groundbreaking Neural Codec Language Model for Zero-Shot Speech Editing and Text-to-Speech"

VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild

¹The University of Texas at Austin ²Rembrand

TL;DR: VoiceCraft is a token-infilling neural codec language model that achieves state-of-the-art performance on both speech editing and zero-shot text-to-speech (TTS) on in-the-wild data, including audiobooks, internet videos, and podcasts. To clone an unseen voice or edit a recording, VoiceCraft needs only a few seconds of that voice.

Speech Editing with VoiceCraft

Guess which part is synthesized!

Original Transcript Original Audio Edited Transcript V...