What if you could manipulate the facial features of a historical figure, a politician, or a CEO realistically and convincingly using nothing but a webcam and an illustrated or photographic still image? A tool called MarioNETte that was recently developed by researchers at Seoul-based Hyperconnect accomplishes this, thanks in part to cutting-edge machine learning techniques. The researchers claim it outperforms all baselines even where there’s “significant” mismatch between the face to be manipulated and the person doing the manipulating.
MarioNETte is technically a face reenactment tool: it aims to synthesize a face animated by the movements of another person (the “driver”) while preserving the appearance of the face being animated (the “target”). The idea isn’t new, but previous approaches either (1) required a few minutes of training data and could only reenact predefined targets, or (2) distorted the target’s features when dealing with large poses.
MarioNETte advances the state of the art by incorporating three novel components: an image attention block, target feature alignment, and a landmark transformer. The attention block allows the model to attend to relevant positions of mapped physical features, while target feature alignment mitigates artifacts, warping, and distortion. The landmark transformer, for its part, adapts the geometry of the driver’s poses to that of the target without the need for labeled data, in contrast to approaches that require human-annotated examples.
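To make the idea of "attending to relevant positions" concrete, here is a minimal sketch of generic scaled dot-product attention in plain Python. It is not MarioNETte's actual attention block. The function names `softmax` and `attend` are hypothetical, and treating driver positions as queries and target positions as keys and values is an assumption made purely for illustration.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Generic scaled dot-product attention (illustrative, not the paper's block).

    queries: feature vectors, assumed here to come from driver positions
    keys:    feature vectors, assumed here to come from target positions
    values:  the content vectors paired with each key
    Returns one output vector per query: a weighted average of the values,
    weighted by how well the query matches each key.
    """
    dim = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(dim).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
            for k in keys
        ]
        weights = softmax(scores)
        # Blend the value vectors using the attention weights.
        blended = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(blended)
    return outputs


if __name__ == "__main__":
    keys = [[1.0, 0.0], [0.0, 1.0]]
    values = [[1.0, 0.0], [0.0, 1.0]]
    # A query that strongly matches the first key pulls mostly from the first value.
    print(attend([[10.0, 0.0]], keys, values))
```

A query that matches no key in particular (for example, all zeros) gets equal weights, so its output is simply the average of the values.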
The researchers trained and tested MarioNETte using VoxCeleb1 and CelebV, two open source corpora of celebrity photos and videos. The models and baselines were trained using 1,251 different celebrities from VoxCeleb1 and tested on a set compiled by sampling 2,083 image sets from 100 randomly selected VoxCeleb1 videos (plus 2,000 sets from every celebrity in CelebV).
The result? Empirically, across up to eight target images, MarioNETte surpassed all other models on every metric save one (peak signal-to-noise ratio, or PSNR). In a separate user study in which 100 volunteers were asked to pick one of two images generated by different models based on quality and realism, MarioNETte’s output ranked higher than all baselines.
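For readers unfamiliar with PSNR (peak signal-to-noise ratio), it is a standard image-quality measure derived from the mean squared error between a reference image and a generated one; higher values mean the two are closer pixel for pixel. The sketch below uses the textbook formula on small grayscale images stored as nested lists. The function name `psnr` and the 8-bit `max_value` default are assumptions for illustration, and this is not the researchers' evaluation code.

```python
import math


def psnr(reference, generated, max_value=255.0):
    """Peak signal-to-noise ratio, in decibels, between two grayscale images.

    Both images are lists of rows of pixel intensities with identical shapes.
    max_value is the largest possible pixel value (255 assumed for 8-bit images).
    """
    if len(reference) != len(generated):
        raise ValueError("images must have the same number of rows")

    squared_error = 0.0
    pixel_count = 0
    for ref_row, gen_row in zip(reference, generated):
        if len(ref_row) != len(gen_row):
            raise ValueError("rows must have the same length")
        for ref_pixel, gen_pixel in zip(ref_row, gen_row):
            squared_error += (ref_pixel - gen_pixel) ** 2
            pixel_count += 1

    if pixel_count == 0:
        raise ValueError("images must contain at least one pixel")

    mse = squared_error / pixel_count
    if mse == 0:
        return math.inf  # identical images: no noise at all
    return 10.0 * math.log10(max_value ** 2 / mse)


if __name__ == "__main__":
    reference = [[100, 100], [100, 100]]
    generated = [[110, 90], [110, 90]]
    print(round(psnr(reference, generated), 2))
```

Because PSNR rewards pixel-level fidelity rather than perceived realism, a model can trail on it while still producing output that people judge as more convincing.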
The researchers leave to future work improving the landmark transformer to make reenactments even more convincing. “[Our] proposed method [does] not need [an] additional fine-tuning phase for identity adaptation, which significantly increases the usefulness of the model when deployed in the wild,” wrote the coauthors of a preprint paper detailing MarioNETte’s architecture and validation. “Our experiments including human evaluation suggest the excellence of the proposed method.”
The work might enable videographers to cheaply animate figures without motion tracking equipment. But it might also be abused to create highly realistic deepfakes, which take a person in an existing image or video and replace them with someone else’s likeness.
In less than a year, the number of deepfake videos online has jumped 84%. Meanwhile, respondents in a Pew Research Center survey said they expect 57% of news shared on social media to be “largely inaccurate.” Amid concerns about deepfakes, about three-quarters of people in the U.S. favor steps to restrict altered videos and images, and companies such as Google and Facebook have released data sets and AI models designed to detect deepfakes.