
YouTube Passes 1 Billion Auto-Captioned Videos

YouTube tries to automatically caption most videos using specially built speech recognition technology, with varying success and accuracy, but the fact that the service now boasts over 1 billion videos with automatically generated captions is nothing to sneeze at. The service takes a while to caption each video, but the billion videos it has managed to caption thus far represent a significant chunk of the roughly 300 million new videos that crop up each year. Factor in the large number of videos that aren’t in languages YouTube can caption, videos containing music without lyrics or with lyrics in an unsupported language, videos that YouTube Heroes choose to caption by hand, and videos with little to no speech, and the estimated coverage gets pretty impressive.
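To make that coverage claim concrete, here is a minimal back-of-the-envelope sketch in Python. Only the 1 billion auto-captioned figure comes from the report; every other number is a placeholder assumption used purely to illustrate the arithmetic, not real YouTube data.

```python
# Back-of-the-envelope coverage estimate. Only AUTO_CAPTIONED is a reported
# figure; all other values are hypothetical placeholders for illustration.

AUTO_CAPTIONED = 1_000_000_000        # videos auto-captioned so far (reported)

TOTAL_VIDEOS = 5_000_000_000          # hypothetical total library size
UNSUPPORTED_LANGUAGE = 1_500_000_000  # hypothetical: languages the system can't caption
MUSIC_OR_NO_SPEECH = 1_200_000_000    # hypothetical: music-only or speech-free videos
MANUALLY_CAPTIONED = 300_000_000      # hypothetical: captioned by hand (e.g. YouTube Heroes)

# Videos that are actually candidates for automatic captioning.
eligible = TOTAL_VIDEOS - UNSUPPORTED_LANGUAGE - MUSIC_OR_NO_SPEECH - MANUALLY_CAPTIONED
coverage = AUTO_CAPTIONED / eligible

print(f"Estimated auto-caption coverage: {coverage:.0%} of eligible videos")
```

With these made-up exclusions, the same 1 billion videos would cover around half of everything the system could plausibly caption, which is the sense in which the coverage figure is "pretty impressive."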

Those captions aren’t just spreading to more videos; they’re also getting more accurate every day. YouTube’s auto-captioning algorithm is based not only on speech recognition but also on machine learning, so as it is fed more data, and as the parts that can be improved manually are tweaked by experienced coders, its ability to automatically caption videos keeps improving. The two example photos attached below illustrate this, with one featuring an inaccurate caption that shows the system taking some time to catch up and getting a number of words wrong. For people who are deaf or hard of hearing, as well as anyone who watches YouTube videos in environments where they can’t or shouldn’t have the sound on, this is a pretty big deal.

Automated captions do happen through machine learning magic, but the system gets a bit of help in learning and in making captions as accurate as possible. Once a video has been automatically captioned, its uploader has a chance to review the captions. If a creator chooses to correct them, those corrections feed the captioning system more data about not only what different speech sounds like and which words it corresponds to, but also where its own flaws lie, which tendencies it needs to correct, and how best to improve. A toy sketch of that feedback loop follows below.
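For a sense of how uploader corrections can become usable training signal, here is a minimal, hypothetical Python sketch. The `word_error_rate` and `collect_training_pairs` helpers are invented for illustration, and nothing here reflects YouTube’s actual infrastructure; the sketch simply pairs each automatic caption with its corrected version, measures how far apart they are, and keeps the pairs where the uploader actually changed something.

```python
# Hypothetical sketch of a correction feedback loop: compare automatic captions
# with uploader-corrected ones and keep the changed pairs as training examples.

def word_error_rate(auto: str, corrected: str) -> float:
    """Word-level edit distance between the auto caption and the correction."""
    a, c = auto.lower().split(), corrected.lower().split()
    # Standard dynamic-programming edit distance over words.
    dist = [[0] * (len(c) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        dist[i][0] = i
    for j in range(len(c) + 1):
        dist[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(c) + 1):
            cost = 0 if a[i - 1] == c[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,        # deletion
                             dist[i][j - 1] + 1,        # insertion
                             dist[i - 1][j - 1] + cost)  # substitution
    return dist[len(a)][len(c)] / max(len(c), 1)

def collect_training_pairs(reviewed):
    """Keep only the pairs where the uploader actually changed something."""
    return [(auto, fixed) for auto, fixed in reviewed if word_error_rate(auto, fixed) > 0]

# Hypothetical reviewed captions: (automatic caption, uploader-corrected caption).
reviewed = [
    ("the whether today is sunny", "the weather today is sunny"),
    ("subscribe to the channel", "subscribe to the channel"),
]
for auto, fixed in collect_training_pairs(reviewed):
    print(f"WER {word_error_rate(auto, fixed):.2f}: '{auto}' -> '{fixed}'")
```

In this toy setup, only the first pair survives (one substituted word, so a word error rate of 0.20), which mirrors the idea in the paragraph above: corrected captions tell the system both what the right transcription was and which kinds of mistakes it tends to make.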