Originally published at: https://mycroft.ai/blog/the-mycroft-benchmark/
Machine Learning requires data to improve. The best source of that data is through our Community, who Opt-In to share the data from their interactions with Mycroft. That allows us at Mycroft AI to build Open Datasets of tagged speech to improve Wake Word spotting, Speech to Text, and Text to Speech. But, to improve the software that utilizes those engines, we need a different kind of data to analyze. How does Mycroft compare to other voice assistants and smart speakers? How does Mycroft itself improve over time? How can you help?
Benchmarking Mycroft
A benchmark is important for a number of reasons, first and foremost being that it gives us a baseline of Mycroft's performance on a given date that we can compare against once changes are in place. Then, as needed, we can compare different configurations of Mycroft, new platforms and hardware for Mycroft, and our competition.

Over the last couple of weeks, we've been preparing and conducting a repeatable benchmark of Mycroft against other voice assistants in the field. This will be a new addition to the Mycroft Open Dataset; not tagged speech or intent samples, but a standard process and set of metrics that anyone can use to measure Mycroft and other voice assistants quantitatively. Below, I'll report on the results of the first iteration, where we compared a Mycroft Mark I to a first-generation Amazon Echo and a Google Home.
The Process
To conduct this benchmark, we had to put together a series of questions, which wasn't as easy as it sounds. Voice assistants are an emerging technology, so there aren't yet industry standards for this. Who better to set that standard than the Open player? We prepared a starter set of 14 questions based on the observed usage of Skills by Opted-In Mycroft users (more on that later), taking into consideration industry-reported most-used Skills from places like Voicebot. That first run of questions was:
- How are you?
- What time is it?
- How is the weather?
- What is tomorrow's forecast?
- Wikipedia search Abraham Lincoln
- Tell me a joke
- Tell me the news
- Say "Peter Piper picked a peck of pickled peppers"
- Set volume to 50 percent
- What is the height of the Eiffel Tower?
- Play Capital Cities Radio on Pandora
- Who is the lead singer of the Rolling Stones?
- Set a 2 minute timer
- Add eggs to my shopping list
We revised and reordered that list slightly for the benchmark runs themselves; the final set of requests was:
- Tell me about Abraham Lincoln
- What is the height of the Eiffel Tower?
- Who is the lead singer of the Rolling Stones?
- How is the weather?
- What is tomorrow's forecast?
- Play Capital Cities Radio on Pandora
- Play Safe and Sound by Capital Cities on Spotify
- Set a 2-minute timer
- Set an alarm for tomorrow morning at 7:00
- What time is it?
- Tell me the news
- Add eggs to my shopping list
- Set volume to 5
- How are you?
- Tell me a joke
- Say/repeat/Simon says "[random sentence]"
Issuing all the requests to each assistant took about 45 minutes. To get the best idea of when requests ended and responses started, I imported the audio into Audacity and used the waveforms to determine five points:
- The Wake Word
- The end of the request
- The beginning of the response
- The start of 'real info'
- The end of the response
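To make those measurements concrete, here is a minimal sketch of how the five waveform timestamps turn into the timing metrics reported below. The `Interaction` class and the sample numbers are my own illustration, not part of any Mycroft tooling; the timestamps themselves come from reading the waveform in Audacity as described above.

```python
from dataclasses import dataclass

@dataclass
class Interaction:
    """Timestamps (in seconds) read off the Audacity waveform for one request."""
    wake_word: float        # the Wake Word
    request_end: float      # the end of the request
    response_start: float   # the beginning of the response
    real_info_start: float  # the start of 'real info'
    response_end: float     # the end of the response

def time_to_response(i: Interaction) -> float:
    """Seconds from the end of the spoken request to the first audible response."""
    return i.response_start - i.request_end

def time_to_real_info(i: Interaction) -> float:
    """Seconds from the end of the spoken request to where the answer actually starts."""
    return i.real_info_start - i.request_end

def lead_in(i: Interaction) -> float:
    """Seconds the assistant speaks before getting to the real info."""
    return i.real_info_start - i.response_start

# A made-up example interaction, not a real measurement.
sample = Interaction(wake_word=0.0, request_end=1.8,
                     response_start=3.1, real_info_start=3.7, response_end=5.2)
print(round(time_to_response(sample), 2))   # 1.3
print(round(time_to_real_info(sample), 2))  # 1.9
print(round(lead_in(sample), 2))            # 0.6
```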
The Results
Now to the good stuff, or in this case, the "room for improvement" stuff. Here are the results from the first Mycroft Benchmark.
Time to Response
One of the biggest points we wanted to track was "Time to Response." In this context, that means the time from the end of the spoken request to the beginning of an audible response. We tracked that across the 14 questions using the new Mimic2 American Male voice. We found that Mycroft currently responds an average of 3.3x slower than Google and Amazon. On average for our sample, Alexa responded to requests in 1.66 seconds, Google Assistant in 1.45 seconds, and Mycroft in 5.03 seconds.
Time to Real Info
The next thing we decided to track was when the voice assistant's response actually began answering the question it was asked. As mentioned above, this is a subjective judgment for the time being, but it still offers some interesting data. On average, Alexa started providing real info 3.02 seconds after the request finished, Google at 3.55 seconds, and Mycroft at 5.7 seconds.

We can see that the graph is a good bit tighter here, and in one case, "Tell me the news," Mycroft actually comes out on top. My presumption is that Mycroft's competition adds some phrasing to the beginning of responses that require API hits or pulling up a stream. That also explains the outlier in Google's response to the news query: a nearly 16-second notification about being able to search for specific topics or news sources.

I also took a quick look at the time between the response starting and the assistant providing real info. On average, Alexa spoke for 1.36 seconds before providing real info, Google Assistant for 2.1 seconds, and Mycroft for 0.66 seconds.
Where to go from here
This benchmark was especially helpful in comparing Mycroft objectively to Google and Amazon. Eventually, we'll be able to broaden it to others in the space. Now the trick is figuring out how to improve the experience, then returning to this benchmark periodically to reassess.

For improvements to the experience, we have another source of metrics from which we'll be able to get actionable information: the Mycroft Metrics Service.
Our Opted-In Community Members have timing information for their interactions with Mycroft anonymously uploaded to a database for analysis. This is how we determined the Mycroft Community's most used Skills (that is, the Opted-In users' most used Skills) for the 14 questions of the Benchmark. Beyond Skill usage, we can see which steps are carried out in an interaction and how long each step takes. From there, we can determine which steps of a Mycroft interaction take the longest and work to speed them up or find creative improvements to the Voice User Experience.
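To give a sense of the kind of analysis this enables, here is a minimal sketch that ranks interaction steps by average duration. The record layout and the step names are hypothetical stand-ins, not the actual Metrics Service schema.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-step timing records; the real Metrics Service schema may differ.
# Each record: (interaction_id, step_name, duration_in_seconds)
records = [
    ("a1", "wake_word_detected", 0.4),
    ("a1", "speech_to_text", 1.9),
    ("a1", "intent_matched", 0.3),
    ("a1", "skill_handler", 0.8),
    ("a1", "text_to_speech", 1.6),
    ("b2", "wake_word_detected", 0.5),
    ("b2", "speech_to_text", 2.2),
    ("b2", "intent_matched", 0.2),
    ("b2", "skill_handler", 1.1),
    ("b2", "text_to_speech", 1.4),
]

# Group durations by step, then rank steps from slowest to fastest on average.
durations_by_step = defaultdict(list)
for _interaction_id, step, duration in records:
    durations_by_step[step].append(duration)

slowest_first = sorted(durations_by_step.items(),
                       key=lambda item: mean(item[1]), reverse=True)

for step, durations in slowest_first:
    print(f"{step:20s} avg {mean(durations):.2f}s over {len(durations)} interactions")
```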
We'll also revise the benchmark to be more explicit in comparing the timing of responses. It's likely we'll create one or more subjective measures for quality of response. As Skills expand, the number of questions will certainly grow too.
There's also the question of where this information will live and be available to the community. The blog is a great place for explaining a new process, but it isn't great for storing and displaying data. We've had some Skill Data published on GitHub since May. A repo and/or GitHub.io page will likely become the home for data, graphs, and more regular updates on Mycroft Metrics and Benchmarking. That will make it free and available for anyone to use, whether you're comparing the speed of your local system to others, planning an improvement to Mycroft Core to speed up interactions, or creating a visualization for research. This data is Open and yours to use. Since that will take some time to set up, here is a Google Sheet to give you immediate access to the first round of data.
How can you help?
I'm so glad you asked! As I mentioned, metrics come back only for Community Members who have Opted-In to the Open Dataset. So the best way to help is to Opt-In and use Mycroft! That way, we get a population of interactions that is as broad as possible. People on different networks, in different locations, using different devices, and interacting with Mycroft in different ways provide the best information for Mycroft and the community to make decisions on.

To Opt-In:
- Go to home.mycroft.ai and Log In
- In the top right corner of the screen, click your name
- Select "Settings" from the menu. You'll arrive at the Basic Settings page
- Scroll to the bottom and, once you've read about the Open Dataset, check "I agree" to Opt-In
- That's it!
Have an idea to improve Mycroft's metrics and benchmarking? Maybe a question on the process? Let us know on the forum.