Mimic2 Speed Boost - Response Caching

Originally published at: http://mycroft.ai/blog/mimic2-speed-boost-response-caching/

Mycroft is constantly striving to build a better Voice User Experience for the Community. We’ve introduced the Precise Wake Word spotter. We’ll shortly be deploying the Skill Marketplace. We’re spending the next few months improving skills and usability leading up to Mark II’s delivery.

We’ve also developed the Mimic2 Speech Synthesis engine, opening up Mycroft to more natural speech and more voices to choose from. However, unlike Mimic1, Mimic2 can’t run on a Raspberry Pi. It requires a GPU to generate speech fast enough to be useful. So, we’re hosting it on our servers to provide voices for those who don’t have their own GPU. Hosting it centrally also allows us to deploy a feature that improves how quickly Mycroft responds to requests: response caching.

By caching common responses, the Mimic2 engine doesn’t need to re-synthesize those responses every time they’re called for. Say 100 people every morning ask for the weather in Portland, Oregon within an hour or so of each other. Previously, each of them would have called the Mimic2 service separately, generating the same forecast 100 times. Now, the 5:30 am early riser may be the first to request Portland’s weather and will have it generated by Mimic2. Her request means the other 99 people get the cached audio sent straight to their speakers, shortening the time to response.

The Mimic2 service now checks its cache before synthesizing anything. If the requested response has already been generated, it sends that audio directly to the user instead of re-synthesizing it. Otherwise, it generates the utterance as usual.
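As a rough illustration of that check-then-synthesize flow, here is a minimal Python sketch. It is not Mycroft's server code: `cache_key`, `speak`, and the `voice` parameter are hypothetical names, the SHA-256 key is just one reasonable way to index phrases, and a plain dictionary stands in for whatever storage the real service uses.

```python
import hashlib
from typing import Callable, Dict


def cache_key(text: str, voice: str = "default") -> str:
    """Build a stable key from the voice name and the exact text to speak.

    Hypothetical scheme: the real service may key its cache differently.
    """
    return hashlib.sha256(f"{voice}\n{text}".encode("utf-8")).hexdigest()


def speak(
    text: str,
    synthesize: Callable[[str], bytes],
    cache: Dict[str, bytes],
    voice: str = "default",
) -> bytes:
    """Return audio for `text`, synthesizing only on a cache miss."""
    key = cache_key(text, voice)
    audio = cache.get(key)
    if audio is None:
        # Cache miss: pay the (slow, GPU-bound) synthesis cost once...
        audio = synthesize(text)
        # ...and keep the result so later requests can skip it.
        cache[key] = audio
    return audio
```

In the Portland example, the first caller triggers `synthesize`; the other 99 callers with the identical text and voice are served from `cache` without touching the GPU.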

Tell me more

To learn a bit more about this feature, I checked in with Mycroft’s CTO Steve Penrod.

Give us an overview of this; what exactly is going on here and why?

Steve - Mimic2’s initial implementation would generate every single Text to Speech request; the neural network running on a GPU generating a fresh audio representation for each requested phrase. This straightforward approach is what we call an MVP (Minimum Viable Product) -- it did exactly what we needed it to do, but nothing else.

The nature of a voice assistant is that it often repeats stock phrases – “You are welcome” or “All alarms cleared”. We can (and do) cache those at the device level. But more impactful to Mimic2 is the fact that the server is generating responses for many devices. So as rare as the phrase “It is 11:02” or “Currently sunny and 72 degrees” might be for a single user, with thousands of devices interacting you start to get collisions for even those dynamic phrases.

Implementing a simple cache allows us to use a cheap and near infinite resource – disk space – to enhance the system without adding limited and expensive GPU resources.

How does Mycroft decide what responses get cached?

Steve - There really is no decision -- everything Mimic2 generates goes into the cache. The caching scheme places the most recent request at the top of the stack whether it was newly generated or pulled out of the cache. When we start to run out of space in the cache we simply clear out the bottom of the stack and throw away the oldest generated phrases.
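The scheme Steve describes behaves like a least-recently-used (LRU) cache. The sketch below shows that idea in Python under some loud assumptions: `PhraseCache`, `get_audio`, and `cached_phrases` are hypothetical names, the limit is a simple entry count rather than disk space, and everything lives in memory. It illustrates the eviction policy only, not Mimic2's actual implementation.

```python
from collections import OrderedDict
from typing import Callable, List


class PhraseCache:
    """LRU cache sketch: most recently requested phrase sits at the 'top'."""

    def __init__(self, capacity: int, synthesize: Callable[[str], bytes]):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity        # assumed limit: number of phrases
        self.synthesize = synthesize    # stand-in for the TTS engine
        self._entries: "OrderedDict[str, bytes]" = OrderedDict()

    def get_audio(self, phrase: str) -> bytes:
        if phrase in self._entries:
            # Cache hit: move the phrase back to the top of the stack.
            self._entries.move_to_end(phrase)
            return self._entries[phrase]
        # Cache miss: generate the audio; new entries land at the top.
        audio = self.synthesize(phrase)
        self._entries[phrase] = audio
        # Out of space: drop phrases from the bottom of the stack,
        # i.e. the ones requested least recently.
        while len(self._entries) > self.capacity:
            self._entries.popitem(last=False)
        return audio

    def cached_phrases(self) -> List[str]:
        """Phrases currently cached, least recently requested first."""
        return list(self._entries)
```

With a capacity of two, requesting "A", "B", "A", then "C" evicts "B" rather than "A", because the second request for "A" moved it back to the top even though it was generated first.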

Anything else the Community should know about this?

Steve - From a technical perspective, this is a great example of how all the old tricks are still useful even in the machine learning world.

A few have asked me if there is any privacy concern, but I don’t see any. Generated utterances have no association with a user account or skill. So even if we cache the phrase “Your balance is twelve fifty” there is no way to determine who initially created the interaction that generated that response, what the question was that elicited it, or even what skill was invoked to generate the output. It is impossible to tell if that balance was referring to a checking account, Steam credits, or the number of calories I have left on my diet plan.

From the user perspective, this is just a great performance boost!

Finally, what are the speed benefits of this system?

Steve - It is hundreds of times faster to retrieve a phrase from a cache than it is to generate it. As a bonus, the more Mycroft is deployed the more effective this will be with dynamic content, as cache hit likelihood will increase with more users. There is network overhead that still exists, but we are expecting TTS response time to be cut in half on average. Though, aren’t you the guy in charge of metrics around here?

The Data

He’s right. So, I took a look at the metrics for our Opted-In user base, using the skills with the slowest Time to Response from the first Mycroft Benchmark. We deployed Mimic2 caching on August 31. My sample was all Mimic2 interactions for the 15 days leading up to August 31 and the 15 days after. For Time to Response (T2R), the Mimic2 cache is showing an average reduction from 12.27 seconds to 8.99 seconds, a drop of over 25%. That's not quite the 50% cut Steve mentioned, but T2R also includes factors other than speech generation, like skill handling (for example, turning on lights with an IoT skill).

The new Mimic2 cache improved Time to Response by 25% on average.

When looking at Text to Speech (TTS) generation time in our range, we decreased on average from 6.44 seconds to 2.99 seconds. That’s a 53.5% reduction!

After implementing a cache, Mimic2 spent more than 50% less time on Text to Speech generation on average.

Everything we do for Mycroft is aimed at improving the experience for our Community. If you haven’t given Mimic2 a try yet, you can set it as your voice for Mycroft at https://home.mycroft.ai/#/setting/basic under “Voice”. While you’re there, why not Opt-In to Mycroft’s Open Dataset? Then, you can help make Mycroft better just by using it! We’re regularly updating Mimic2’s initial model, so if you run into words or phrases it stumbles on, let us know below.