The TikTok logo displayed on a smartphone with the ByteDance logo visible in the background.
Sometimes there’s a disconnect between what research teams are doing with AI and what people on the street encounter when they read the news.
This seems to be the case with new announcements of a model created by ByteDance, the maker of TikTok, that uses minimal inputs to create full-body video animations complete with gestures, hyper-realistic facial features, and realistic motion.
If you look at the research papers, this technology is described in sort of a prosaic way:
“OmniHuman significantly outperforms existing methods, generating extremely realistic human videos based on weak signal inputs, especially audio,” the authors write. “It supports image inputs of any aspect ratio, whether they are portraits, half-body, or full-body images, delivering more lifelike and high-quality results across various scenarios.”
Citing a “mixed conditioning strategy,” the authors note the versatility of the model:
“OmniHuman supports various visual and audio styles. It can generate realistic human videos at any aspect ratio and body proportion (portrait, half-body, full-body all in one), with realism stemming from comprehensive aspects including motion, lighting, and texture details.”
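To make that “mixed conditioning strategy” a bit more concrete, here’s a minimal sketch of the general idea as I read it from the paper: conditioning signals of different strengths (text, audio, pose) are randomly kept or dropped during training, with the strongest signals dropped most often so the model also learns to animate from weak inputs like audio alone. The names and ratios below are my own illustrative assumptions, not ByteDance’s published code or numbers.

```python
import random

# Hypothetical sketch of one mixed-conditioning training step.
# Stronger conditions (e.g., pose) are dropped more often, so the
# model is forced to learn from weaker signals like audio or text.
CONDITION_KEEP_PROB = {
    "text": 0.9,    # weakest signal: almost always kept
    "audio": 0.5,   # medium-strength signal
    "pose": 0.2,    # strongest signal: usually dropped
}

def sample_active_conditions(available: dict) -> dict:
    """Randomly keep or drop each conditioning signal for one training step."""
    active = {}
    for name, signal in available.items():
        if signal is not None and random.random() < CONDITION_KEEP_PROB[name]:
            active[name] = signal
    return active

# Example: a clip annotated with text and audio might train this step
# on audio alone, teaching the model to generate motion from audio only.
batch = {"text": "a person singing", "audio": "<audio features>", "pose": None}
print(sample_active_conditions(batch))
```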
Those who are interested can click through a series of videos at the bottom of the project’s presentation page, showing some of these virtual subjects singing, lecturing, and showing off their uncanny digital incarnations.
So I went over and asked ChatGPT about OmniHuman and competing models. I had identified one called CyberHost based on a research paper that came up in a Google search. Here’s how CyberHost’s authors describe its design:
“The key design of CyberHost is the Region Codebook Attention mechanism, which improves the generation quality of facial and hand animations by integrating fine-grained local features with learned motion pattern priors. Furthermore, we have developed a suite of human-prior-guided training strategies, including body movement map, hand clarity score, pose-aligned reference feature, and local enhancement supervision, to improve synthesis results.”
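Reading that, the core trick seems to be cross-attention between local region features (faces, hands) and a learned bank of motion-pattern “priors.” Below is a toy sketch of that general mechanism, assuming a PyTorch setting; the dimensions, names, and residual wiring are my guesses for illustration, not CyberHost’s actual implementation.

```python
import torch
import torch.nn as nn

# Toy sketch of the general idea behind "Region Codebook Attention":
# local region tokens (say, hand or face crops) attend over a small,
# learned codebook of motion-pattern embeddings shared across videos.
class RegionCodebookAttention(nn.Module):
    def __init__(self, dim: int = 256, codebook_size: int = 64):
        super().__init__()
        # Learned "motion pattern priors" (illustrative size).
        self.codebook = nn.Parameter(torch.randn(codebook_size, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, region_feats: torch.Tensor) -> torch.Tensor:
        # region_feats: (batch, num_region_tokens, dim), e.g. hand patches.
        b = region_feats.shape[0]
        codebook = self.codebook.unsqueeze(0).expand(b, -1, -1)
        # Queries come from the region; keys/values from the learned priors,
        # so each region token is refined by the closest motion patterns.
        refined, _ = self.attn(region_feats, codebook, codebook)
        return region_feats + refined  # residual connection

# Example usage with dummy hand-region tokens:
layer = RegionCodebookAttention()
hands = torch.randn(2, 16, 256)   # batch of 2, 16 tokens per region
print(layer(hands).shape)          # torch.Size([2, 16, 256])
```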
I used the new voice setting to talk with ChatGPT about the two models and the differences between them.
Basically, my AI chat partner explained to me that both models can generate a full video from one image and an audio stream. However, she (I selected the feminine voice option titled “Maple”) suggested that CyberHost might be better for generating high-quality avatars, and that each model would have its own strengths and weaknesses. OmniHuman, she suggested, might be better suited to working across multiple platforms or channels.
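Neither model exposes a simple public API that I’m aware of, but conceptually the interface both papers describe boils down to something like this purely hypothetical sketch: one reference image plus one audio track in, a rendered video out.

```python
from dataclasses import dataclass

# A hypothetical interface for this class of model. This is NOT
# OmniHuman's or CyberHost's real API; it only makes the inputs
# and outputs the papers describe concrete.
@dataclass
class AvatarRequest:
    reference_image: str        # one portrait, half-body, or full-body photo
    driving_audio: str          # speech or singing audio track
    aspect_ratio: str = "9:16"  # OmniHuman claims support for any ratio

def generate_talking_video(request: AvatarRequest) -> str:
    """Stand-in for a model call; a real system would return a video path."""
    raise NotImplementedError("placeholder for a single-image + audio model")
```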
All of this seemed pretty nonchalant as an explanation of the technology, but elsewhere on the web there was more concern, and really a sort of negative hype around the idea that we might be opening a Pandora’s box of deepfakes.
I came across multiple videos of human Cassandras warning about the potential for deepfake problems.
I also saw that the OmniHuman team had to take down a video of Taylor Swift singing a song that she never sang, and checking that out, it did look really realistic.
The upshot, as explained by the technology’s critics, is that prior tools gave you clunky, less realistic results, whereas now you can produce fairly extensive videos of someone without detailed inputs.
It is easy to see how this technology could be misused. Right now, there’s a gatekeeper, in that ByteDance can delist results from the Internet, but it seems like it’s only a matter of time until one of these generators comes along that’s open source, and then the gloves are off.
In point of fact, there’s some OmniHuman source code on GitHub, so… are we already there?
How on earth would you keep people from generating compromising videos of their enemies, or from misleading the public with fake political news?
On the other hand, as the research papers show, the technology has been around for a while. It’s just getting better incrementally. So some of the hysteria may be unfounded, or at least misdirected.
Stay tuned for more on where we’re going with generative AI: multimodal capabilities, and video expressions that we could only have dreamed of a decade ago.