"this can be very easily done fully client-side": maybe, if you have the voice model and an inference engine that runs well on devices. Mozilla doesn't have that yet, so this experiment uses a backend running a Kaldi server and a model that uses too much memory to run locally.
Once DeepSpeech is ready I'm pretty sure they will switch to that, and ultimately to on-device voice recognition with PipSqueak (expected to be an inference engine usable on devices). Unfortunately, none of these projects is far enough along to be usable yet.
Common Voice is mostly related to DeepSpeech, as it will help gather data to train the engine.
> Actually, I think this can be very easily done fully client-side, with good accuracy. Even on Android, the voice recognition can run client-side / offline.
I'm not sure I'd say it's easy; you will certainly trade off accuracy against a state-of-the-art server model. Among other things, Firefox users are not going to download gigabytes of recognition model, so it would have to be a lot smaller than the server ones.
Very possibly it will be slower too: the servers would most likely use GPUs for at least parts of the recognition, and it would not be easy to guarantee the same hardware on the millions of PCs Firefox runs on.
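To make the download-size concern concrete, here is some back-of-envelope arithmetic. The parameter count is an illustrative assumption, not DeepSpeech's actual size:

```javascript
// Rough model-size estimate: parameters × bytes per parameter.
// The 120M-parameter figure below is an illustrative assumption,
// not the real size of any Mozilla model.
function modelSizeMB(numParams, bytesPerParam) {
  return (numParams * bytesPerParam) / (1024 * 1024);
}

console.log(modelSizeMB(120e6, 4)); // 32-bit floats: ~458 MB, far too big to ship with a browser
console.log(modelSizeMB(120e6, 1)); // 8-bit quantized: ~114 MB, closer, but still a hefty download
```

Even aggressive quantization leaves a download orders of magnitude larger than a typical extension, which is why a server-side model is the pragmatic starting point.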
It's interesting that Mozilla is running its own speech recognition system. I wonder whether it would actually be usable in practice. The problem is, I couldn't find any kind of online demo.
I was especially interested in the Voice Fill (speech recognition) technology. Landing page: https://testpilot.firefox.com/experiments/voice-fill
It seems the project is here: https://github.com/mozilla/speaktome/
It seems this actually is a web service. From the code (https://github.com/mozilla/speaktome/blob/master/extension/c...), I see: const STT_SERVER_URL = "https://speaktome.services.mozilla.com";
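For illustration, here is a hedged sketch of how an extension might call such an endpoint. The content type, the sample-rate header, and the response shape are my assumptions, not the actual speaktome protocol:

```javascript
// Hypothetical sketch of talking to the speech-to-text endpoint above.
// Headers and response shape are assumptions; the real protocol may differ.
const STT_SERVER_URL = "https://speaktome.services.mozilla.com";

// Build fetch() options for an audio upload (kept pure so it is easy to test).
function buildSttRequest(audioBytes, sampleRate) {
  return {
    method: "POST",
    headers: {
      "Content-Type": "audio/wav",         // assumed encoding
      "X-Sample-Rate": String(sampleRate), // hypothetical header
    },
    body: audioBytes,
  };
}

// Usage inside the extension (assumed JSON response shape):
//   const resp = await fetch(STT_SERVER_URL, buildSttRequest(wavBytes, 16000));
//   const { text } = await resp.json();
```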
Actually, I think this can be very easily done fully client-side, with good accuracy. Even on Android, the voice recognition can run client-side / offline.
I wonder if the project is in any way related to their DeepSpeech project (https://github.com/mozilla/DeepSpeech). Maybe they use DeepSpeech on the server side? Elsewhere they call it Pipsqueak; I'm not sure whether that is yet another project.
And maybe also related is their common voice project (https://voice.mozilla.org/). Recent discussion here on HN (https://news.ycombinator.com/item?id=14794654).
Some more information also here: https://research.mozilla.org/machine-learning/