The system, which has been trained on thousands of hours of BBC News programmes, has been developed in collaboration with Google’s DeepMind AI division.
“Watch, Attend and Spell”, as the system has been called, can now watch silent speech and get about 50% of the words correct. That may not sound too impressive – but when the researchers supplied the same clips to professional lip-readers, they got only 12% of words right.
Joon Son Chung, a doctoral student at Oxford University’s Department of Engineering, explained to me just how challenging a task this is. “Words like mat, bat and pat all have similar mouth shapes.” It’s context that helps his system – or indeed a professional lip reader – to understand what word is being spoken.
“What the system does,” explains Joon, “is to learn things that come together, in this case the mouth shapes and the characters and what the likely upcoming characters are.”
The BBC supplied the Oxford researchers with clips from Breakfast, Newsnight, Question Time and other BBC news programmes, with subtitles aligned with the lip movements of the speakers. Then a neural network combining state-of-the-art image and speech recognition set to work to learn how to lip-read.
After examining 118,000 sentences in the clips, the system now has 17,500 words stored in its vocabulary. Because it has been trained on the language of news, it is now quite good at understanding that “Prime” will often be followed by “Minister” and “European” by “Union”, but much less adept at recognising words not spoken by newsreaders.