Google researchers unveil ‘VLOGGER’, an AI that can animate a single still photos to allow them to talk

Described in a research paper titled “VLOGGER: Multimodal Diffusion for Embodied Avatar Synthesis,” the AI model can take a photo of a person and an audio clip as input, and then output a video that matches the audio, showing the person speaking the words and making corresponding facial expressions, head movements and hand gestures. The videos are not perfect, with some artifacts, but represent a significant leap in the ability to animate still images.

VLOGGER generates photorealistic videos of talking and gesturing avatars from a single image. (Credit:

A breakthrough in synthesizing talking heads

The researchers, led by Enric Corona at Google Research, leveraged a type of machine learning model called diffusion models to achieve the novel result. Diffusion models have recently shown remarkable performance at generating highly realistic images from text descriptions. By extending them into the video domain and training on a vast new dataset, the team was able to create an AI system that can bring photos to life in a highly convincing way.

“In contrast to previous work, our method does not require training for each person, does not rely on face detection and cropping, generates the complete image (not just the face or the lips), and considers a broad spectrum of scenarios (e.g. visible torso or diverse subject identities) that are critical to correctly synthesize humans who communicate,” the authors wrote.

A key enabler was the curation of a huge new dataset called MENTOR containing over 800,000 diverse identities and 2,200 hours of video — an order of magnitude larger than what was previously available. This allowed VLOGGER to learn to generate videos of people with varied ethnicities, ages, clothing, poses and surroundings without bias.

The paper demonstrates VLOGGER’s ability to automatically dub videos into other languages by simply swapping out the audio track, to seamlessly edit and fill in missing frames in a video, and to create full videos of a person from a single photo.

[…] One could imagine actors being able to license detailed 3D models of themselves that could be used to generate new performances. The technology could also be used to create photorealistic avatars for virtual reality and gaming. And it might enable the creation of AI-powered virtual assistants and chatbots that are more engaging and expressive.[…] the technology also has the potential for misuse, for example in creating deepfakes — synthetic media in which a person in a video is replaced with someone else’s likeness. As these AI-generated videos become more realistic and easier to create, it could exacerbate the challenges around misinformation and digital fakery.


Source: Google researchers unveil ‘VLOGGER’, an AI that can bring still photos to life | VentureBeat

Robin Edgar

Organisational Structures | Technology and Science | Military, IT and Lifestyle consultancy | Social, Broadcast & Cross Media | Flying aircraft