For some time now, we’ve been searching the Internet and asking Google questions with our voices alone. It’s been quite accurate ever since it launched with Android 4.1 Jelly Bean, but the speech recognition software behind it recently got a lot better, with Google announcing that they had reduced transcription errors by 50%. Now, Françoise Beaufays, a Research Scientist at Google, details just how that improvement was achieved. The answer lies in the same technology powering Google Photos: deep learning.
Google Voice was launched before Android 4.1 Jelly Bean, and while it’s not available to everyone, it’s still used by many who haven’t moved over to Project Fi. In recent years, however, we’ve been able to talk to Google in a much more natural and human manner, and Artificial Intelligence has been the driving force behind Google Voice’s newfound accuracy. As Beaufays notes, the original version that launched in 2009 was built using a Gaussian Mixture Model (GMM) acoustic model, and having the models adapt to the user’s voice helped make it fairly accurate. To improve upon this, however, they rebuilt the listening engine and used Deep Learning to create a whole new system.
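To get a feel for what a GMM acoustic model does, here’s a minimal sketch, not Google’s actual implementation: each phoneme state is scored by a weighted mixture of Gaussians over an acoustic feature value. The component weights, means, and variances below are made up for illustration.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def gmm_likelihood(x, components):
    """Likelihood of a feature value under a mixture of Gaussians.

    components: list of (weight, mean, variance) tuples whose weights sum to 1.
    """
    return sum(w * gaussian_pdf(x, m, v) for w, m, v in components)

# Hypothetical acoustic model for one phoneme state: two mixture components.
phoneme_state = [(0.6, 1.0, 0.5), (0.4, 3.0, 1.0)]

# Score a made-up acoustic feature value against that state; the recognizer
# would compare such scores across states to pick the most likely phoneme.
score = gmm_likelihood(2.0, phoneme_state)
```

Adapting the model to a user’s voice, as the 2009 system did, amounts to nudging those means and variances toward the statistics of that speaker’s audio.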
As Beaufays notes, they decided to “retrain both the acoustic and language models, and to do so using existing voicemails”, and by using Recurrent Neural Networks (RNNs) they were able to create a short-term memory of sorts, letting the system determine what a voicemail is actually saying in the context of what came before it. This recognition isn’t reserved for Google Voice alone, as we now talk to our smartwatches, smartphones and tablets as well. The whole blog post is an entertaining read for those interested in just how Deep Learning is used at Google. With the recent announcement of Alphabet Inc., and Google being spun off so to speak, we might soon see “D for Deep Learning” if this rapid pace of development continues.
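That “short-term memory” is the defining feature of an RNN: each step feeds the previous hidden state back in, so earlier audio frames influence how later ones are interpreted. A toy sketch with a single-unit Elman-style cell (made-up weights, nothing like Google’s production network) shows the effect:

```python
import math

def rnn_step(x, h, w_x, w_h, b):
    """One step of a simple Elman RNN: the new hidden state mixes the
    current input with the previous hidden state."""
    return math.tanh(w_x * x + w_h * h + b)

def run_rnn(inputs, w_x=0.5, w_h=0.9, b=0.0):
    """Run the cell over a sequence, returning the hidden state at each step."""
    h = 0.0  # hidden state starts empty
    states = []
    for x in inputs:
        h = rnn_step(x, h, w_x, w_h, b)
        states.append(h)
    return states

# The same final input yields different states depending on history:
with_context = run_rnn([1.0, 0.0, 0.0])
no_context = run_rnn([0.0, 0.0, 0.0])
# with_context[-1] differs from no_context[-1] even though the last input
# (0.0) is identical, because the hidden state carries earlier steps forward.
```

In a real transcription system this memory is what lets the network resolve a sound ambiguously heard now using the words heard moments earlier.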