Tortoise is primarily an autoregressive decoder model combined with a diffusion model.

Voice customization guide

Tortoise was specifically trained to be a multi-speaker model. It accomplishes this by consulting reference clips. These reference clips are recordings of a speaker that you provide to guide speech generation. The clips are used to determine many properties of the output, such as the pitch and tone of the voice, speaking speed, and even speaking defects like a lisp or stuttering. The reference clips are also used to determine non-voice aspects of the audio output, like volume, background noise, recording quality and reverb.

Reference clips are passed to the generation call through the voice_samples argument:

tts_with_preset("your text here", voice_samples=reference_clips, preset='fast')

This repo comes with several pre-packaged voices. Voices prefixed with "train_" came from the training set and perform far better than the others. If your goal is high quality speech, I recommend you pick one of them. To see what Tortoise can do for zero-shot mimicking, take a look at the others.

To add new voices to Tortoise, you will need to do the following:

1. Gather audio clips of your speaker(s). Good sources are YouTube interviews (you can use youtube-dl to fetch the audio), audiobooks or podcasts. Guidelines for good clips are given below.
2. Cut your clips into ~10 second segments. More clips are better, but I only experimented with up to 5 in my testing.
3. Save the clips as WAV files in floating point format with a 22,050 Hz sample rate.

As mentioned above, your reference clips have a profound impact on the output of Tortoise. Some guidelines for picking good clips:

- Avoid clips with background music, noise or reverb. These clips were removed from the training dataset, so Tortoise is unlikely to do well with them.
- Avoid clips recorded through an amplification system, such as speeches. These generally have distortion caused by the amplification.
- Avoid clips that have excessive stuttering, stammering or words like "uh" or "like" in them.
- Try to find clips spoken the way you want your output to sound. For example, if you want to hear your target voice read an audiobook, try to find clips of them reading a book.
- The text being spoken in the clips does not matter, but diverse text does seem to perform better.
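The clip preparation steps for a new voice (cutting a recording into roughly 10 second segments and storing samples in floating point at 22,050 Hz) can be sketched in plain Python. This is only an illustration under stated assumptions, not part of Tortoise: the helper names `plan_segments` and `pcm16_to_float`, and the `min_seconds` cutoff for dropping a short trailing remainder, are hypothetical choices made for this example.

```python
# A minimal sketch, not Tortoise code: helper names and the min_seconds
# cutoff are assumptions made for illustration.

def plan_segments(num_samples, sample_rate=22050, segment_seconds=10.0, min_seconds=5.0):
    """Return (start, end) sample-index pairs that cut a recording into
    segments of about `segment_seconds` each.

    A trailing remainder shorter than `min_seconds` is dropped
    (an assumed policy, not something the guide prescribes).
    """
    if num_samples < 0 or sample_rate <= 0 or segment_seconds <= 0:
        raise ValueError("num_samples must be >= 0; rate and length must be > 0")

    seg_len = int(round(segment_seconds * sample_rate))
    min_len = int(round(min_seconds * sample_rate))

    segments = []
    start = 0
    while start < num_samples:
        end = min(start + seg_len, num_samples)
        if end - start >= min_len:
            segments.append((start, end))
        start = end
    return segments


def pcm16_to_float(samples):
    """Convert 16-bit integer PCM samples to floats in the range [-1.0, 1.0)."""
    return [s / 32768.0 for s in samples]
```

For a 25 second recording at 22,050 Hz, `plan_segments(25 * 22050)` yields two 10 second segments and one 5 second segment. Each segment would then be written to its own floating point WAV file with an audio library of your choice before being used as a reference clip.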