Google’s AI research lab DeepMind has leaped ahead of other currently available AI video generation models with Veo 2, according to VIP+ sources who were among the limited pool of filmmakers and creatives given early access on Dec. 16 to beta test the model in Google Labs’ VideoFX toolkit.
Sources agreed Veo 2 was superior even to OpenAI’s Sora, which has generally been regarded as the most impressive U.S. video model.
DeepMind has had a “huge number of requests for access” to Veo 2, but researchers have “prioritized filmmakers using AI and creatives” as they seek direct feedback on using Veo 2 as a creative tool, a DeepMind spokesperson told VIP+ via email. “We want to see what these professionals can do with this technology. We must give people an opportunity to experience the technology firsthand so we can learn and improve it.”
The release strategy for Veo 2 remains unknown, though DeepMind indicated it was working with YouTube and Google Cloud to deploy Veo 2 as it did with the original Veo, which has already been integrated into YouTube’s Dream Screen feature and is also available on Google Cloud’s Vertex AI in private preview.
Beyond any publicly accessible release for Veo 2, DeepMind said it was still intent on partnering with creative industries, including filmmakers, as it developed its tools, pointing to its experimental work with Donald Glover’s creative studio Gilga. “We are continuing deep conversations with a wide variety of filmmakers and will explore doing partnerships where mutually beneficial,” said Neil Parris, head of filmmaker partnerships at Google, in an email.
There is no recognized benchmark or consensus method for evaluating video generation models. Still, some criteria are regarded as critical, particularly for professional use, as VIP+ has discussed.
Filmmakers highlighted Veo 2 as performing better than other video models in the following areas:
Image Quality & Realism
Despite some lingering imperfections with complex motion and physics, sources felt Veo 2’s photorealism and physics realism far surpassed outputs from other video models. DeepMind likewise highlighted more realistic motion and physics and improved detail fidelity relative to the first version of Veo.
“Veo 2 is a whole other class. I’ve never seen video that looks that realistic come out of AI,” said Jason Zada, filmmaker and founder of AI studio Secret Level. Whereas video model outputs often need heavy post-production editing, Zada noted raw outputs from Veo 2 haven’t needed any image cleanup or color correction.
Video from Veo 2 is beginning to fool the eye, its imperfections becoming less easily perceptible, particularly to untrained viewers. “Many shots pass the ‘visual Turing test,’ in which most people would not be able to distinguish that it’s completely synthetic,” said filmmaker Paul Trillo, strategic partner at AI studio Asteria.
For example, Trillo described how Veo 2 produced a black stallion on fire running down a boardwalk before jumping off a dock into the water. Overall, the model successfully simulated the horse’s musculature and the behavior of fire whipping in the wind and then being extinguished by the water. “That would be a very hard job for VFX, and it did a very good job with that.”
Prompt Adherence
Sources each said that Veo 2 excels at adhering to even very complicated text prompts in ways that other video models simply don’t. DeepMind also referenced improving prompt adherence as a key differentiator. “Previous iterations [of Veo] had a hard time following camera motion or even when too much detail is provided,” said a DeepMind spokesperson.
Text-to-video generators often struggle to output video that accurately follows specific instructions contained in text prompts, especially when a prompt is complex or detailed. This has been a common frustration, requiring users to go through multiple rounds of trial and error before producing a usable clip or shot. This kind of iteration is almost entirely wasteful. “You do 30 or 40 different variations to try to get that one thing you first asked for,” said Zada.
“It’s the first time I’ve really felt like an AI image or video tool is actually creating what’s in my head — and sometimes better,” said Trillo. “I’ve been giving it incredibly specific prompts with multiple characters, and it’s been able to keep it all coherent.”
Testers were impressed by how much the model “listened,” understanding and producing even very nuanced details provided in the wording of a prompt.
“Any word you change changes the output you get. You can put little things in, like a misstep [person tripping a little], and it does it,” said Daniel Barak, VP and global executive director at R/GA, who shared his short made with Veo 2 called “Lynx.”
Improved consistency in text-to-video might also be attributable to improved prompt adherence, though DeepMind wasn’t explicit about how such consistency emerged from the model. Filmmaker sources said Veo 2 generated more consistent characters or objects from one discrete generation to the next, something other video models have also struggled to do, and found they could achieve better consistency by using precise, identical descriptive wording across their prompts.
For example, Zada noted that for his widely circulated short “The Heist,” simply repeating “1970s green car” in the prompts kept Veo 2 generating a similar car across multiple subsequent outputs.
Getting consistent characters, objects, environments or styles from video models has required methods like image-to-video, video-to-video or fine-tuning. But Veo 2 suggests such consistency may eventually emerge reliably from text-to-video generation alone, making prompt engineering merely a stopgap for a fixable failing of text-to-video.
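Purely as an illustrative sketch of the repeated-wording technique testers described: the idea is to copy one descriptive anchor phrase verbatim into every shot prompt rather than paraphrasing it. The code below assumes Google’s google-genai Python SDK, through which Veo models are callable; the model ID, anchor phrase and shot prompts are assumptions for illustration, not the testers’ actual workflow (they used VideoFX, not an API).

```python
# Illustrative sketch, not a confirmed Veo 2 workflow: reuse one exact,
# repeated anchor phrase across prompts to encourage consistent subjects
# between discrete generations (cf. Zada's "1970s green car").
import time

from google import genai

client = genai.Client()  # reads the API key from the environment

# Hypothetical anchor phrase, copied verbatim into every shot prompt.
ANCHOR = "a weathered 1970s green muscle car with a cream racing stripe"

shots = [
    f"Wide shot: {ANCHOR} idles under a flickering streetlight at night.",
    f"Tracking shot: {ANCHOR} speeds down a rain-slicked highway.",
    f"Close-up: a gloved hand opens the driver's door of {ANCHOR}.",
]

for i, prompt in enumerate(shots):
    # Assumed model ID; video generation is a long-running operation.
    operation = client.models.generate_videos(
        model="veo-2.0-generate-001",
        prompt=prompt,
    )
    while not operation.done:
        time.sleep(10)
        operation = client.operations.get(operation)
    clip = operation.response.generated_videos[0]
    client.files.download(file=clip.video)
    clip.video.save(f"shot_{i}.mp4")
```

The design point is simply that the anchor phrase is never reworded between generations; per the sources, paraphrasing the description is what breaks subject consistency from one clip to the next.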
Prompt adherence only faltered when prompts clashed with built-in guardrails. Google’s VideoFX policy “currently disallows photorealistic children and well-known individuals, like politicians and celebrities,” said the DeepMind spokesperson, adding that DeepMind intends to further improve the model’s safety with additional policies, filters against harmful content and training-data cleaning to mitigate bias. “Some prompts can challenge these tools’ guardrails, and we remain committed to continually enhancing and refining the safeguards we have in place.”
While it’s important that versions of these tools restrict the clearest types of potential abuse, including pornographic or excessively violent outputs or the most egregious copyright- or likeness-infringing derivatives, sources also sometimes experienced the guardrails as a creative constraint.
For example, the model refused to generate a car crash needed for a scene or fingers pulling the trigger of a gun. In some cases, testers described being able to “jailbreak” (trick) the model with euphemistic wording (e.g., prompting “red liquid” for blood), a workaround that sometimes allowed them to bypass guardrails and produce the desired outputs anyway.
Another complication for professional users is that it’s unclear what Veo 2 was trained on, though it’s hard to ignore the possibility that researchers used videos from YouTube, which shares a parent company, Google, with DeepMind. DeepMind didn’t specify the sources of its training data for Veo 2, saying only that the model was trained on “high-quality video-description pairs,” explained as a video paired with a description of what happens in it.
One source described successfully prompting the model for recognizable IP, including “Star Wars” and specifically “The Mandalorian,” which Veo 2 readily reproduced, strongly suggesting the model was trained on copyrighted shows and movies (with the caveat that it’s possible Google has licensed such data without announcing a deal).
If technical guardrails fail to prevent such outputs from being created, Google’s terms of service would seem to prohibit their use. “All VideoFX users must agree to our Generative AI Additional Terms of Service, which requires users to respect the rights of others, including privacy and intellectual property rights. Users can request the removal of images under our policies or applicable laws,” wrote the DeepMind spokesperson.
More on AI video generation from VIP+ ...
• Video Generation Model Evaluation: Veo 2, Sora, Pika 2.0, Ray2
• Coming Jan. 27: Why film, TV and VFX studios are in a state of limbo over video generation models