Mimic II pre-trained model

Does Mycroft share any of the pre-trained models for Mimic II? I’ve found the models built for the original Mimic here: https://github.com/MycroftAI/mimic/tree/development/voices, but I can’t find anything for Mimic II.

The model itself isn’t necessarily open source. You can access it once you configure mimic2 on your instance of Mycroft, of course. There’s probably some value in having the LJSpeech corpus modeled and available at some point as well.

I’ve looked at the publicly available datasets. I was hoping to avoid training a model from scratch, because training on a CPU will take weeks to get decent results, and running on a GPU is pretty expensive.

If you wait a couple weeks, I might be able to give the LJSpeech set a try.

That’s right, at the moment the pre-trained Mimic 2 voice, the Kusal voice, is not available - as it’s a premium offering for our Subscribers ($2 a month, good value!)


Hey, I was going to make a new thread but decided I might as well ask here.
How does making a voice model work with Mimic II? Can it be any recording with a transcription, or does it have to be a certain set of phrases? Is it possible to set up a way to volunteer your voice? Or perhaps use long audio recordings like Librivox? I don’t have the best voice but I would love to attempt to create a voice (mostly to study how it works, I have to have a goal). I love Mycroft but I really hate the Alan Pope voice. Of course, I understand why premium voices are necessary, but I would love more voices.

It can be any set of phrases, but more is almost always better. The Mimic II repo doesn’t provide any tools for collecting text and voice data; you’ll have to do that on your own, and also write your own pre-processor.
Details on how to do that are in the repo https://github.com/MycroftAI/mimic2/blob/master/TRAINING_DATA.md
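As a starting point for that pre-processor, here is a minimal sketch that builds an LJSpeech-style `metadata.csv` from a folder of paired clip files (`0001.wav` + `0001.txt`). The paired-file layout is an assumption about how you collected your data; the `id|text|normalized_text` line format mirrors the LJSpeech dataset, which the existing preprocessing scripts already understand.

```python
# Sketch: build an LJSpeech-style metadata.csv from paired clip files
# (clips/0001.wav + clips/0001.txt). The folder layout is assumed;
# the "id|text|normalized_text" format follows the LJSpeech dataset.
import os

def build_metadata(clip_dir, out_path):
    rows = []
    for name in sorted(os.listdir(clip_dir)):
        if not name.endswith(".txt"):
            continue
        clip_id = name[:-4]  # "0001.txt" -> "0001"
        with open(os.path.join(clip_dir, name), encoding="utf-8") as f:
            text = f.read().strip()
        # Use the raw text for both columns; real data may need number
        # and abbreviation expansion in the normalized column.
        rows.append(f"{clip_id}|{text}|{text}")
    with open(out_path, "w", encoding="utf-8") as f:
        f.write("\n".join(rows) + "\n")
    return len(rows)
```

From there you would point the repo’s preprocessing at the generated file, per the TRAINING_DATA doc linked above.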

OK, so I’m also trying to see if I’m up to the challenge of making a voice. Assuming I have all the necessary audio and the preprocessor, what do I do then? How long will it take on an average laptop? And this is probably a question for a team member, but is it legal to use a Librivox recording?

At M-AILABS you can find audio material for training voices (some of it based on Librivox), ready to run with Tacotron/Mimic and free to use. Still, it might be a nice move to contact the speaker and notify them that a voice assistant is speaking with their voice.

But beware: with a normal desktop CPU it will take weeks or months to get usable results. A GPU (GTX 1080 or better) is highly recommended.

All Librivox recordings are in the public domain, so you should be fine. I’m using a particular one for my testing, in fact. LJSpeech and M-AILABS both use Librivox as a data source.

The Google Tacotron voices were built with 20-44 hours of high-quality, highly regulated recordings from a professional voice artist. I believe the Kusal voice is 16 hours of high-quality recordings from a well-trained speaker. LJSpeech is done from 128 kbps MP3 files, converted to WAV. I have 13 hours (9.5k clips) so far for my dataset and it’s being cranky about working well.

The transcriptions should be as accurate as possible. Any significant number of proper names or unusual pronunciations should probably be added to your local CMU-Dict file. Basically, review the LJSpeech dataset to compare your data with. My set has about 11k distinct words, of which 900 were not in CMU-Dict. Of those 900, the majority are proper nouns and their possessive contractions, followed by “un”-prefixed words, then “s”-, “ly”-, and “ies”-suffixed words, then misspellings or odd variations (UK vs US spellings).

The vocal speed should be as uniform as possible, and the assortment of clips should be spread as evenly across the time range (1-10 seconds) as possible. The formatting of your data should probably follow one of the existing dataset types, so you can more easily reuse the preprocessing scripts. Mimic2’s analyze function can serve you well for evaluating your dataset.
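Checking your transcripts against CMU-Dict, as described above, is easy to automate. A minimal sketch (the file paths and the LJSpeech-style `id|text|...` metadata format are assumptions; point it at your own copy of the dictionary file):

```python
# Sketch: list transcript words missing from a local CMUdict file,
# i.e. the proper nouns and oddities you may want to add by hand.
# Assumes metadata.csv lines look like "id|text|normalized" and the
# dictionary lines start with "word  PH ON EMES" (";;;" = comment).
import re

def oov_words(metadata_path, cmudict_path):
    with open(cmudict_path, encoding="latin-1") as f:
        known = {line.split()[0].lower() for line in f
                 if line.strip() and not line.startswith(";;;")}
    words = set()
    with open(metadata_path, encoding="utf-8") as f:
        for line in f:
            text = line.split("|")[1]
            words.update(re.findall(r"[a-z']+", text.lower()))
    return sorted(words - known)
```

Running this over your metadata gives you the out-of-vocabulary list to triage (proper nouns, prefixed/suffixed forms, misspellings) before training.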

You will not want to train on a laptop, unless you hate your laptop and never want to end up with anything usable. Get a GCP or AWS GPU instance and train there if nothing else. An Nvidia 1070 or 1080 can be used for training; lower-end GPUs will run you into more and more issues the further down the hardware range you go. A single 1070 does about 30-50k steps per day, and you will want to train until you find the point of overtraining - for large datasets this is probably 300k+ steps. When training with a good dataset, you should see alignment by 25k steps. Based on your hardware, you will want to adjust the hparams in various directions: less data, lower training rate; more device memory, larger batch size or fewer outputs per step. There are a few dozen knobs and buttons to tweak along the way.

Also feel free to check the chat server’s machine learning channel as well.


So it seems this isn’t going to be my project, though keep us updated about your voice. Mostly I just wanted a better-quality voice for myself, however I could obtain that. Do you know of anyone who has made a better Mimic 2 voice?

The Kusal voice is quite good (see Kathy’s comment above). Other than that, no, I don’t know of anyone who’s made public a voice model for it yet.
(eta) Looking at the LJSpeech data, I will try and model that to a reasonable length this weekend, and see what comes out.

Hi Kathy

I have subscribed for $2 a month for one year.
Where can I find the Kusal voice, and how can I proceed to test the mimic2 TTS model offline on an average PC?

I am building a chatbot where I am using DeepSpeech for STT and mimic1 for TTS, but I really want to replace the TTS with mimic2, as its voice samples sound a lot better than mimic1’s.

Any help will be greatly appreciated.

Thanks in advance

If you’re running a chatbot, you’ll want to run a local copy of mimic2 and probably build your own model. You can pull the LJSpeech dataset down if nothing else.
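Once a local mimic2 instance is up (the repo includes a demo server script), a chatbot can fetch audio from it over HTTP. A minimal sketch, assuming the demo server’s `/synthesize?text=` endpoint and a hypothetical port 3000 (check the server script for the actual port on your setup):

```python
# Sketch: fetch synthesized speech from a locally running mimic2
# demo server. The host/port below is an assumption - adjust to
# wherever your instance is listening.
import urllib.parse
import urllib.request

def synthesis_url(text, host="http://localhost:3000"):
    # The /synthesize?text= endpoint is what the demo server exposes.
    return f"{host}/synthesize?text=" + urllib.parse.quote(text)

def synthesize(text, out_path="out.wav", host="http://localhost:3000"):
    with urllib.request.urlopen(synthesis_url(text, host)) as resp:
        with open(out_path, "wb") as f:
            f.write(resp.read())
    return out_path
```

Your chatbot would then play back the saved WAV instead of calling mimic1.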

Hi there, we aren’t able to provide pre-trained Mimic2 models to download at this stage, however as baconator said there are a few good open data sets that can be used to train your own.

Check out:

Thanks baconator and gez-mycroft.

Has anyone trained and used it offline, and are there detailed instructions to follow?

How about the voice itself - how do I proceed to use my own voice, or is there anything available and open source that can be utilized?

Do we know, if I use a public dataset for training, how long it will take? I am planning to use an AWS machine to train.

If you want to use your own voice, we have also open sourced the Mimic Recording Studio; however, this is no small undertaking. You would need to record around 15,000-20,000 phrases. There are tips on that GitHub repo to get the best outcome.

The Mimic2 instructions on the repo are the most detailed we have and there is an active Mimic Channel on Chat if you run into trouble. They will also have some more realistic numbers on how long training has been taking. It varies greatly depending on the training parameters and hardware that you use.

Is there a way to take a recording of somebody, split it into short WAV files, and use them as input to Mimic Recording Studio?

This is my dilemma:
I would love to use my grandmother’s voice for Mycroft. Getting her to actually sit and read a series of phrases is most likely NOT going to happen. What I can do is set her up with a Bluetooth lapel mic and record her throughout the day. With that recording, I will break it into separate files and input them into the studio.

Is this something that can be accomplished?
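The splitting part can at least be roughed out in code. Here is a minimal, pure-stdlib sketch that cuts a long mono 16-bit WAV at silent gaps, to get toward the short (1-10 second) utterances the training tools expect. The amplitude threshold and gap lengths are guesses you would have to tune per recording; a dedicated audio library would handle this more robustly.

```python
# Sketch: split a long mono 16-bit WAV into clips at silent gaps.
# Thresholds (amplitude, gap length, minimum clip length) are
# assumptions to tune against your actual recording conditions.
import array
import wave

def split_at_silence(path, out_prefix, thresh=500, min_gap=0.4, min_clip=1.0):
    with wave.open(path, "rb") as w:
        rate = w.getframerate()
        samples = array.array("h", w.readframes(w.getnframes()))
    gap_len = int(min_gap * rate)
    clips, start, quiet = [], None, 0
    for i, s in enumerate(samples):
        if abs(s) >= thresh:          # loud sample: inside speech
            if start is None:
                start = i
            quiet = 0
        elif start is not None:       # quiet sample inside a clip
            quiet += 1
            if quiet >= gap_len:      # long enough gap: close the clip
                clips.append((start, i - quiet))
                start, quiet = None, 0
    if start is not None:
        clips.append((start, len(samples)))
    paths = []
    for n, (a, b) in enumerate(clips):
        if (b - a) / rate < min_clip:
            continue                  # too short to be a usable utterance
        out = f"{out_prefix}{n:04d}.wav"
        with wave.open(out, "wb") as w:
            w.setnchannels(1)
            w.setsampwidth(2)
            w.setframerate(rate)
            w.writeframes(samples[a:b].tobytes())
        paths.append(out)
    return paths
```

Each clip would still need a matching transcription before it is usable as training data.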


It can be done, but it will still be tedious work over a long timeframe. Besides cutting the audio into snippets, you need to transcribe the speech into text files.
The result will most likely be poor, because you need a consistent recording level, intonation, and voice quality - something you cannot achieve when recording in such an uncontrolled situation.


I’m well aware that it will be extremely tedious, but I also cannot think of a better way to preserve her memory. I think there are some very good broadcast-quality microphones out there that could get a relatively good recording. If it can be done this way, I think I will try.

Next question: just using a voice model, will it be necessary to use a GPU? Or can I still use the online TTS and just use a local voice?