Mimic 3 High Quality

darkfiberiru · August 31, 2022, 3:43pm

So all the Mimic 3 models are listed as low quality? Does this mean medium, high, or even ultra quality models are a possibility today?

I’m curious about both the quality of these models and their real-time factor on different hardware. Obviously, with low quality running at a real-time factor of 0.5 on a Pi 4, I assume we can’t go much higher there. But different use cases may allow different hardware.

synesthesiam · September 6, 2022, 3:15pm

Yes! There are quite a few hyperparameters that can be tuned on the VITS model. VITS is a combination of GlowTTS and HiFi-GAN, so Mimic 3’s current notion of “low” and “high” quality maps to the v3 and v1 HiFi-GAN configs, respectively.

However, there are other parameters that also influence quality and real-time factor. I’m investigating the effects of changing the following parameters right now:

Audio sample rate (currently 22050 Hz, testing 16000 Hz)
Number of hidden/inter/filter channels
Whether or not the input has a “0” after every symbol (interspersed padding)
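
To make the last item concrete, here is a minimal sketch of interspersed padding, assuming the input is a list of integer symbol ids and that 0 is the pad id. The function name and the exact placement of the pad are illustrative assumptions, not Mimic 3's actual code.

```python
def intersperse_padding(symbol_ids, pad_id=0):
    """Return a new list with pad_id inserted after every symbol id.

    Hypothetical helper for illustration; the real preprocessing
    may place the pad differently (for example, also before the
    first symbol).
    """
    padded = []
    for symbol in symbol_ids:
        padded.append(symbol)
        padded.append(pad_id)
    return padded


if __name__ == "__main__":
    ids = [5, 7, 9]
    print(intersperse_padding(ids))
```

Padding this way doubles the length of the input sequence, so the model has twice as many input positions to process. That is one reason dropping it can help the real-time factor.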

With a reduced sample rate, fewer channels, and no padding, I can get a real-time factor of about 0.3 on a Raspberry Pi 4, so there is definitely room for improvement!
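
For reference, real-time factor is the time spent synthesizing divided by the duration of the audio produced, so anything below 1.0 is faster than real time. A rough sketch of how one might measure it is below. The `synthesize` callable is a hypothetical stand-in for whatever TTS call you are timing, assumed to return a sequence of audio samples; it is not a Mimic 3 API.

```python
import time


def audio_duration_seconds(num_samples, sample_rate):
    """Duration of a mono audio clip given its sample count and rate in Hz."""
    return num_samples / sample_rate


def real_time_factor(synthesis_seconds, audio_seconds):
    """Synthesis time divided by audio duration (lower is faster)."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return synthesis_seconds / audio_seconds


def measure_rtf(synthesize, text, sample_rate):
    """Time one call to a synthesize(text) function and return its RTF.

    `synthesize` is a placeholder (assumption) that returns a
    sequence of audio samples at `sample_rate`.
    """
    start = time.perf_counter()
    samples = synthesize(text)
    elapsed = time.perf_counter() - start
    duration = audio_duration_seconds(len(samples), sample_rate)
    return real_time_factor(elapsed, duration)
```

For example, 1.5 seconds of compute for 3 seconds of audio gives 1.5 / 3 = 0.5. Note that at 16000 Hz the vocoder has to produce fewer samples per second of speech than at 22050 Hz, which is part of why the lower rate helps.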

Going the other direction, a “high” quality model can likely be improved by increasing the number of channels. If you have input audio with a higher sample rate, it should be possible to train a model at that rate (e.g., 44.1 kHz). I haven’t tested anything in this range though, since I’m focused on what will run well on the Mark II.