Eerily realistic: Microsoft’s new AI model makes images talk, sing

APR 20, 2024

Microsoft has developed an artificial intelligence model that turns a single image of a person’s face and an audio clip into a video with accurate lip-syncing, facial expressions, and head movements. Built by a team of AI researchers at Microsoft Research Asia, the new model is called VASA-1.

“We introduce VASA, a framework for generating lifelike talking faces with appealing visual affective skills (VAS) given a single static image and a speech audio clip,” said the team in a research paper. “Our premiere model, VASA-1, is capable of not only producing lip movements that are exquisitely synchronized with the audio, but also capturing a large spectrum of facial nuances and natural head motions that contribute to the perception of authenticity and liveliness.”

The team claims that its method not only delivers high video quality with realistic facial and head dynamics but also supports the online generation of 512×512 videos at up to 40 FPS with negligible starting latency. According to the researchers, this paves the way for real-time engagements with lifelike avatars that emulate human conversational behaviors.
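To put the quoted figures in perspective, here is a minimal back-of-the-envelope sketch in Python. It is not drawn from the VASA-1 code, which has not been released. The function names are hypothetical, and the only inputs are the 512×512 resolution and the "up to 40 FPS" rate cited by the team.

```python
def frame_budget_ms(fps: float) -> float:
    """Time available to produce one frame, in milliseconds."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return 1000.0 / fps


def pixels_per_second(width: int, height: int, fps: float) -> float:
    """Number of pixels that must be generated each second."""
    return width * height * fps


if __name__ == "__main__":
    # Figures quoted by the researchers: 512x512 output at up to 40 FPS.
    width, height, fps = 512, 512, 40
    print(f"Per-frame budget: {frame_budget_ms(fps):.1f} ms")
    print(f"Pixels per second: {pixels_per_second(width, height, fps):,.0f}")
```

At the peak rate of 40 FPS, each frame has to be ready in 25 milliseconds, which amounts to roughly 10.5 million pixels per second at 512×512.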

VASA, short for Visual Affective Skills Animator, can transform any static image, whether photographed, painted, or drawn, into “exquisitely synchronized” animations.

VASA can generate eerily realistic video in which the newly animated subject not only lip-syncs accurately to a supplied voice track but also displays varied facial expressions and natural head movements, all from a single static headshot.

The team used the publicly available VoxCeleb2 dataset, which contains video clips of over 6,000 real-life celebrities. After discarding clips that showed multiple individuals or were of low quality, the team trained the model on the processed dataset.

Remarkably, the model can handle inputs outside the training set, such as artistic photos and non-English speech.

Mona Lisa singing

Using audio of Anne Hathaway’s “Paparazzi,” the researchers experimented with animating the Mona Lisa.

The researchers also claimed that the AI system can work in real time and shared a clip showing the tool instantly animating pictures with facial expressions and head movements.

The model offers control over gaze, distance, and emotions in the generated video.

The researchers said the model can take in audio of any length and generate a talking face that matches the clip.

Impersonation fears

While the model’s capabilities raise impersonation fears, the researchers are adamant that their intention with the tool is not to enhance deepfaking.

“We are exploring visual affective skill generation for virtual, interactive characters, NOT impersonating any person in the real world,” they wrote in a blog post.

Product won’t be released

The research team maintains that the model will be used for education and to provide companionship. The researchers have also refused to release the code that powers the model.

The team emphasized their interest in applying the new technique to advance forgery detection, adding that videos generated by VASA contain identifiable artifacts.

“We have no plans to release an online demo, API, product, additional implementation details, or any related offerings until we are certain that the technology will be used responsibly and in accordance with proper regulations,” they added.

Details of the team’s research were published on the preprint server arXiv.