Google Pixel device owners will soon be able to enjoy instant voice transcription using on-device neural network technology within Gboard. Google and others have been working on new ways of detecting and transcribing speech since 2014, and with the rise of dedicated AI hardware in mobile chipsets, users will soon be able to enjoy the fruits of those labors. The recognition model is smaller than ever, and all of the processing happens right on the device, meaning you get more customization, you don't need a network connection to use the feature, and you don't have to deal with finicky servers. It should be noted that the feature is debuting in Gboard for now and may well stay exclusive to the app. It's also starting its rollout with US English; more dialects and languages may come in the future, but there's no official word on that at this time.
This development hinged on shrinking the language model used for recognizing, matching, and stringing together bits of speech. Google's old in-house model was around 2GB in size and took extensive computing power to search and process with any speed, hence the need to be online. A later model weighed in at around 450MB. The new model, developed in tandem with new natural language processing techniques to facilitate its use, tips the scales at just 88MB. That's more than light enough to fit on most mobile devices and be searched and processed locally, and it sits at the center of a fully offline, neural-powered speech transcription engine.
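The article doesn't spell out how that shrinkage was achieved, but weight quantization is one widely used way to cut a neural model's storage footprint. The sketch below is purely illustrative and makes no claim about Google's actual pipeline; it just shows how storing float32 weights as 8-bit integers plus a scale factor reduces size roughly fourfold.

```python
# Illustrative only: post-training weight quantization is one common way to
# shrink a neural model's footprint. This is not Google's actual pipeline,
# just a toy example of the idea using NumPy.
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Map float32 weights to int8 plus a per-tensor scale factor."""
    scale = np.abs(weights).max() / 127.0
    q = np.round(weights / scale).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float32 weights from the int8 representation."""
    return q.astype(np.float32) * scale

# A toy "layer" of float32 weights.
w = np.random.randn(1024, 1024).astype(np.float32)
q, scale = quantize_int8(w)

print(f"float32 size: {w.nbytes / 1e6:.1f} MB")   # ~4.2 MB
print(f"int8 size:    {q.nbytes / 1e6:.1f} MB")   # ~1.0 MB, roughly 4x smaller
print(f"max abs error: {np.abs(w - dequantize(q, scale)).max():.4f}")
```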
Older speech recognition and natural language processing pipelines had three steps. First, small acoustic snippets were mapped to phonemes, or parts of words; those phonemes were then assembled into words; and finally, the words were strung together and compared against sentence models to find the closest match. In contrast, the new model works directly on sentence fragments and is built on an entirely new processing architecture, essentially an evolved form of connectionist temporal classification (CTC), a more general method for training neural networks on audio.
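To make that contrast concrete, here's a toy sketch of the two approaches. Everything in it is a hypothetical stand-in (hard-coded lookup tables and a fake end-to-end model) rather than real recognition code; the point is only to show where the old pipeline's three stages sit versus a single end-to-end model.

```python
# Toy contrast between the classic multi-stage pipeline and an end-to-end
# recognizer. All components are stand-ins, not real models.

ACOUSTIC_TO_PHONEME = {"h-frame": "HH", "e-frame": "EH", "l-frame": "L", "o-frame": "OW"}
PHONEMES_TO_WORD = {("HH", "EH", "L", "OW"): "hello"}
SENTENCE_MODEL = {("hello",): 0.9, ("hello", "world"): 0.05}  # toy sentence scores

def classic_pipeline(frames):
    # Stage 1: acoustic model maps small audio snippets to phonemes.
    phonemes = tuple(ACOUSTIC_TO_PHONEME[f] for f in frames)
    # Stage 2: pronunciation model assembles phonemes into a word.
    word = PHONEMES_TO_WORD[phonemes]
    # Stage 3: language model scores candidate sentences and keeps the best.
    candidates = [c for c in SENTENCE_MODEL if word in c]
    return max(candidates, key=lambda c: SENTENCE_MODEL[c])

def end_to_end(frames):
    # A single learned model consumes audio and emits text directly;
    # faked here with a constant so the example runs.
    return "hello" if frames else ""

frames = ["h-frame", "e-frame", "l-frame", "o-frame"]
print(classic_pipeline(frames))  # ('hello',)
print(end_to_end(frames))        # 'hello'
```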
The new model falls under the category of recurrent neural network transducers (RNN-T), which analyze input on a continuous, piece-by-piece basis. That's a far cry from older models, which relied heavily on context to match phonemes to word parts and words and, in most cases, needed an entire sentence before they could generate output. The new approach allows for faster, more precise transcription, and less jank when it comes to partial sentences, pauses, and other unconventional situations.
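Here's a minimal sketch of what that streaming behavior looks like from the application side. The recognizer class is a stub standing in for a real RNN-Transducer, whose internals the article doesn't detail; the point is that text is emitted chunk by chunk as audio arrives, rather than after a whole sentence.

```python
# Sketch of streaming recognition: audio arrives in small chunks and the
# recognizer emits text incrementally instead of waiting for a full sentence.
from typing import Iterator, List

class StubStreamingRecognizer:
    """Pretend RNN-T: keeps internal state and emits tokens as chunks arrive."""
    def __init__(self):
        self._buffer: List[str] = []

    def accept_chunk(self, chunk: str) -> List[str]:
        # A real transducer would run its encoder and prediction networks here
        # and decide whether to emit a token or wait for more audio.
        self._buffer.append(chunk)
        return [chunk.upper()]  # emit something per chunk to show streaming

def microphone_chunks() -> Iterator[str]:
    # Stand-in for an audio capture loop delivering short chunks.
    yield from ["hel", "lo ", "wor", "ld"]

recognizer = StubStreamingRecognizer()
transcript = ""
for chunk in microphone_chunks():
    for token in recognizer.accept_chunk(chunk):
        transcript += token
        print(f"partial transcript: {transcript}")  # updates as the user speaks
```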
This new input method may well prove revolutionary. AI assistants will understand people better; speech-to-text searches, messages, and other text input will be faster and more accurate than ever; and full voice control of a device could easily become feasible. Gadgets like the Sony Xperia Ear have attempted something like that in the past, but this new model, and the on-device neural networking and machine learning it relies on, could be just what the technology needs to go mainstream, even if that will take a while.