Google's AI Now 94% Accurate at Describing a Photo

If you use Google Photos, you’ll know that it already does an impressive job of identifying and grouping pictures of individual people and types of objects, creating albums to help you find that one photo you’re searching for. This is only part of the image analysis capability that Google has been developing over the past few years, though. In particular, the Google Brain team has been developing a software system that can automatically generate a textual caption describing any image presented to it. Now they’re making the latest version of that captioning system available as open source in the TensorFlow software library.

The software was first developed in 2014, and an earlier version won joint first place at the Microsoft COCO 2015 image captioning competition. The updated system is said to be faster to train and to produce more detailed and accurate descriptions of an image. Its accuracy is reported to have improved from an original 89.6% to 93.9% over a range of images. And this isn’t just classifying an image into a certain category, but describing what the system thinks are the important features contained within the image.

The system uses machine learning techniques to train itself on a large number of example images. The example images have already been captioned by humans, and Google’s system gradually learns how those captions relate to the elements of an image, including aspects such as color or size. It also learns how elements within the image relate to each other. So, for example, if there are three objects in an image, it doesn’t just list them but attempts to describe their relative positioning or whether one object is acting on another.

We have already seen some benefits of this type of complex image analysis in being able to manage a large photo collection more easily. There are many other potential applications. As we begin to use natural language more frequently to interact with personal assistants and messaging bots, a device’s ability to associate an image with a description would help to enhance that interaction.