Link Mycroft to my own TTS?

jjohnston7 · September 4, 2018, 5:23pm

Is there a way to link Mycroft to my own TTS server? For example, if I already had Capti installed, could I somehow make Mycroft use that engine and its voices? With Mimic 2 coming soon, maybe it’s not worth it, but Capti seems to have a lot of available voices… I’m not tied to Capti; I would LOVE something where I could make my own voice…

baconator · September 5, 2018, 5:38pm

Capti offline looks like it runs on Windows/Mac, but it doesn’t seem to have an API or any network connectivity method. Mycroft needs to be able to pass it text and get back an audio file, and that doesn’t seem to be something Capti lets you automate.
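To make the "pass it text and get back an audio file" requirement concrete, here is a minimal sketch of what a client for a network TTS server could look like. The server address, the `/synthesize` path, and the `text` and `voice` query parameters are all hypothetical names made up for illustration. This is not Mycroft's actual TTS plugin interface, and nothing in this thread suggests Capti exposes an endpoint like it.

```python
import urllib.parse
import urllib.request

# Hypothetical endpoint: replace with whatever your TTS server really exposes.
TTS_BASE_URL = "http://localhost:5002/synthesize"


def build_tts_url(text, voice="default", base_url=TTS_BASE_URL):
    """Return the GET URL that asks the (assumed) server to speak `text`."""
    query = urllib.parse.urlencode({"text": text, "voice": voice})
    return f"{base_url}?{query}"


def fetch_audio(text, out_path, voice="default", base_url=TTS_BASE_URL):
    """Request audio for `text` and save the response body to `out_path`."""
    url = build_tts_url(text, voice, base_url)
    with urllib.request.urlopen(url, timeout=30) as response:
        audio_bytes = response.read()
    with open(out_path, "wb") as f:
        f.write(audio_bytes)
    return out_path
```

Any engine that can answer a request like this over the network (or from a command line) is a candidate; an engine that only works through a desktop GUI is not.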

With Mimic 2 you could make your own voice. It takes quite a number of hours of voice samples (at least 5; 20+ is better), then you’d have to train a model on them (good GPU(s) + time), and be ready to tweak things a bit and probably retrain with additional samples to improve the results. Running it also takes a pretty good chunk of resources.

jjohnston7 · September 5, 2018, 9:09pm

Sounds good. I’m not tied to any of that. The Mimic 2 audio samples sound good! I heard there was talk about Mycroft developing a way to upload files and create my own voice that way? Otherwise, is there a how-to on creating my own voice now using Mimic 2? I’m very new, so talk slow and simple… lol

baconator · September 5, 2018, 10:55pm

For Mimic 2, it’s a matter of formatting your voice samples in one of a couple of ways or, if you’re so inclined, writing your own wrapper for the pre-processing. Try to keep clips between 1 and 10 seconds long. Having numbers and letters as part of the corpus may help.
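As a rough sketch of the 1 to 10 second guideline, the snippet below flags clips that fall outside that range. It uses only Python's standard `wave` module, so it assumes uncompressed PCM WAV files; the function names are made up for illustration and are not part of Mimic 2.

```python
import wave


def clip_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def clips_outside_range(paths, min_s=1.0, max_s=10.0):
    """Return (path, duration) pairs whose length falls outside the range."""
    flagged = []
    for path in paths:
        duration = clip_seconds(path)
        if not (min_s <= duration <= max_s):
            flagged.append((path, duration))
    return flagged
```

Running `clips_outside_range` over your recordings before pre-processing gives you a list of clips to re-record or split.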

See the “Training” section. You’d have to create samples that fit either the LJ or M-AILABS format to use those pre-processors. Once you’ve got that done, you get to train (and monitor) your model. This is where the fun begins. Depending on how much data you have, what hardware you’re running on, how clean the data is, a bit of luck, the phase of the moon, and what version of the code you’re running, it may run as long as you let it, or bomb out somewhere sooner. Even if it does keep running, you may have to stop it if large loss spikes occur during training, and restart from an earlier step. You’ll also probably have to spend a lot of time adjusting hparams before things progress very far.
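For the LJ route, the public LJ Speech dataset is laid out as a `wavs/` folder of clips plus a `metadata.csv` file with one pipe-separated line per clip: clip ID, transcription, normalized transcription. Below is a minimal sketch that writes such a file from your own (clip ID, text) pairs. The helper name is made up, it does no text normalization (the same text goes in both columns), and you should check the pre-processor in the version of the code you are running before relying on it.

```python
import os


def write_lj_metadata(dataset_dir, entries):
    """Write metadata.csv with one 'id|text|normalized text' line per clip.

    `entries` is a list of (clip_id, text) pairs; each clip is expected to
    already exist as <dataset_dir>/wavs/<clip_id>.wav.
    """
    os.makedirs(os.path.join(dataset_dir, "wavs"), exist_ok=True)
    metadata_path = os.path.join(dataset_dir, "metadata.csv")
    with open(metadata_path, "w", encoding="utf-8") as f:
        for clip_id, text in entries:
            if "|" in text:
                raise ValueError(f"'|' is the field separator: {clip_id}")
            # No normalization is done here; the same text fills both columns.
            f.write(f"{clip_id}|{text}|{text}\n")
    return metadata_path
```

For example, `write_lj_metadata("my_voice", [("clip_0001", "Hello there.")])` produces a line reading `clip_0001|Hello there.|Hello there.` and expects the audio at `my_voice/wavs/clip_0001.wav`.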

After every 1000 steps you get a wav and a png on disk that you can review. At first they are ugly. And by ugly I mean unusable. Depending on how well things are going, by 5000 steps you may see some semblance of cohesion in the graph and hear something vaguely resembling words in the audio.

If all goes well, after a few hundred thousand steps, you should have a model trained that sounds like…your voice.

With no restarts or loss spikes, I can get between 40k and 60k steps per day on a small (5-hour) dataset with an Nvidia 1070.
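To put those numbers together: the arithmetic below assumes "a few hundred thousand steps" means 300,000, which is only an illustrative figure, and that training runs with no restarts at the 40k to 60k steps per day quoted above.

```python
def training_days(target_steps, steps_per_day):
    """Days of uninterrupted training needed to reach `target_steps`."""
    return target_steps / steps_per_day


# Assumption for illustration: 300,000 steps stands in for "a few hundred thousand".
TARGET_STEPS = 300_000

for rate in (40_000, 60_000):
    days = training_days(TARGET_STEPS, rate)
    print(f"{rate} steps/day -> {days:.1f} days")
```

That works out to roughly 5 to 7.5 days on that hardware, before counting any time lost to restarts or hparam changes.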

jjohnston7 · October 29, 2018, 9:10pm

@baconator Thanks! So would I use a Linux box (Ubuntu, say) to train, or is it best on Windows? I have a dual-GPU computer I can use that runs Windows 7, but I can convert it to Ubuntu if needed. The GPUs are both RX 580s; one has 8 GB of RAM and the other has 4 GB… That computer is just sitting around, so I can set it up however is best/suggested…

baconator · October 29, 2018, 9:27pm

As far as I know you have to train under Linux; I haven’t seen any Windows info on that yet.
If TensorFlow recognizes both GPUs, it could use them. I haven’t done dual-GPU training myself; I usually run one set of params on one GPU and a second set on the other. Something I read the other week mentioned that ROCm driver support in TF has matured enough to be usable, so that’s a plus. You’ll have to tweak the batch size, outputs per step, and so on to get the right sizing for your setup.