Video Machine-learning experts have built a neural network that can manipulate facial movements in videos to create fake footage – in which people appear to say something they never actually said.
It could be used to create convincing yet faked announcements and confessions seemingly uttered by the rich and powerful as well as the average and mediocre, producing a new class of fake news and further separating us all from reality… if it works well enough, naturally.
It’s not quite like Deepfakes, which perversely superimposed the faces of famous actresses and models onto the bodies of raunchy X-rated movie stars.
Instead of mapping faces onto different bodies, though, this latest AI technology controls the target’s face, and manipulates it into copying the head movements and facial expressions of a source. In one of the examples, Barack Obama acts as the source and Vladimir Putin as the target. So it looks as though a speech given by Obama was instead given by Putin.
Obama’s facial expressions are mapped onto Putin’s face using this latest AI technique … Image credit: Hyeongwoo Kim et al
A paper describing the technique, which popped up online at the end of last month, claims to produce realistic results. The method was developed by Hyeongwoo Kim, Pablo Garrido, Ayush Tewari, Weipeng Xu, Justus Thies, Matthias Nießner, Patrick Pérez, Christian Richardt, Michael Zollhöfer, and Christian Theobalt.
The Deepfakes Reddit forum, which has since been shut down, was flooded with people posting tragically bad computer-generated videos of celebs’ blurry and twitchy faces pasted onto porno babes using machine-learning software, with mismatched eyebrows and skittish movements. You could, after a few seconds, tell they were bogus, basically.
A previous similar project created a video of someone pretending to say something he or she hadn’t through lip-synching and an audio clip. Again, the researchers used Barack Obama as an example. But the results weren’t completely convincing since the lip movements didn’t always align properly.
That’s less of a problem with this new approach, however. It’s, supposedly, the first model that can transfer the full three-dimensional head position, head rotation, face expression, eye gaze and blinking from a source onto a portrait video of a target, according to the paper.
Controlling the target head
It uses a series of landmarks to reconstruct a face so it can track the head and facial movements to capture facial expressions for the input source video and output target video for every frame. A facial representation method computes the parameters of the face for both videos.
Next, these parameters are slightly modified and copied from the source to the target face for a realistic mapping. Synthetic images of the target’s face are rendered using an Nvidia GeForce GTX Titan X GPU.
The rendering part is where the generative adversarial network comes in. The training data comes from the tracked video frames of the target video sequence. The goal is to generate fake images that are as good as enough as the ones in the target video frames to trick a discriminator network.
Only about two thousand frames – which amounts to a minute of footage – is enough to train the network. At the moment, it’s only the facial expressions that can be modified realistically. It doesn’t copy the upper body, and cannot deal with backgrounds that change too much.