MIT researchers have created a new AI program capable of not only recognizing objects in images and words in speech at the same time, but of actively blending the two to understand and use both more effectively. The program analyzes images paired with audio captions, then puts those two resources together to figure out which objects correspond to which parts of the caption. In testing, it demonstrates this by highlighting the relevant areas and objects in an image as the caption describes them. According to the researchers behind the project, this is a more natural and organic approach than conventional speech recognition or image recognition training. Essentially, the AI learns the way a human would, which could make it more flexible and thus more capable in the future.
The program expands on a previous model that could match words and phrases to themed collections of images, such as colors and archetypes. It uses two convolutional neural networks, one processing the speech input and one processing the image input, with a higher layer that combines their outputs and builds associations between them. The researchers showed the model both correct and incorrect pairings to help it learn to discern connections, or the lack thereof, on its own.
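To make that architecture concrete, here is a minimal sketch of the two-branch idea in PyTorch. It is an illustration under stated assumptions, not the researchers' exact design: the layer sizes, the mel-spectrogram input format, the dot-product "matchmap" combining layer, and the margin value are all placeholders chosen for readability.

```python
# Minimal sketch: two CNN branches plus a higher layer that associates
# image regions with audio frames. All dimensions are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ImageBranch(nn.Module):
    """CNN that maps an RGB image to a grid of embedding vectors."""
    def __init__(self, dim=512):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, dim, 3, stride=2, padding=1),
        )

    def forward(self, img):            # img: (B, 3, H, W)
        return self.conv(img)          # (B, dim, H', W')

class AudioBranch(nn.Module):
    """1-D CNN that maps a spectrogram to a sequence of embedding vectors."""
    def __init__(self, n_mels=40, dim=512):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_mels, 128, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv1d(128, 256, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv1d(256, dim, 5, stride=2, padding=2),
        )

    def forward(self, spec):           # spec: (B, n_mels, T)
        return self.conv(spec)         # (B, dim, T')

def matchmap(img_feats, aud_feats):
    """The 'higher layer': dot product between every image location and
    every audio frame, associating picture regions with moments of speech."""
    B, D, H, W = img_feats.shape
    i = img_feats.view(B, D, H * W)                      # (B, D, HW)
    return torch.einsum('bdi,bdt->bit', i, aud_feats)   # (B, HW, T')

def similarity(img_feats, aud_feats):
    """Pool the matchmap into one score per image/caption pair:
    best-matching region per frame, averaged over time."""
    return matchmap(img_feats, aud_feats).amax(dim=1).mean(dim=1)  # (B,)

def contrastive_loss(img_feats, aud_feats, margin=1.0):
    """Mirror the correct/incorrect training: matched pairs should score
    higher than mismatched ones, made here by rolling the batch by one."""
    pos = similarity(img_feats, aud_feats)
    neg = similarity(img_feats, aud_feats.roll(1, dims=0))
    return F.relu(margin - pos + neg).mean()

# Usage: a batch of 4 images with their (hypothetical) spoken captions.
imgs = torch.randn(4, 3, 224, 224)
specs = torch.randn(4, 40, 1024)       # 40 mel bins x 1024 frames
loss = contrastive_loss(ImageBranch()(imgs), AudioBranch()(specs))
```

Note that the matchmap is also what would drive the highlighting behavior described above: a high score at a given image location and audio frame marks the region being described at that moment in the caption.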
The implications of this project are vast. Not only could the approach speed up speech recognition and image recognition training in future models, it could also pave the way for convolutional-neural-network AI that mimics the human brain not only in structure but also in how it learns. Theoretically, that opens the path to things like AI with common sense, which might know that it's bad to drive a car off a cliff, or AI that recognizes and reacts appropriately to human emotions, such as knowing that a crying child could be comforted by saying or doing something a child that age would find funny. Improved AI-based translation is also a possibility, since such a model could potentially learn words and their other-language counterparts at the same time, from the same material, even when a language lacks enough transcribed speech for conventional speech recognition or translation training.