Vidu
China's Shengshu Technology and Tsinghua University have unveiled Vidu, a text-to-video model capable of
generating 16-second clips at 1080p resolution with a single click.
The announcement was made at the 2024 Zhongguancun Forum in Beijing, where the developers positioned Vidu as
a strong competitor to OpenAI's Sora.
Its 1080p output matches Sora's resolution, although OpenAI has shown Sora generating clips of up to a minute.
Vidu is based on a Universal Vision Transformer (U-ViT) architecture, which the company says allows it to
simulate the real physical world with multi-camera view generation.
This architecture was reportedly developed by the Shengshu Technology team in September 2022 and as such
would predate the diffusion transformer (DiT) architecture used by Sora.
According to the company, Vidu can generate videos with complex scenes adhering to real-world physics, such
as realistic lighting and shadows, and detailed facial expressions.
The model also demonstrates a rich imagination, creating non-existent, surreal content with depth and
complexity.
Vidu's multi-camera capabilities allow for the generation of dynamic shots, seamlessly transitioning between
long shots, close-ups, and medium shots within a single scene.
In its demo, the company recreated scenes similar to those OpenAI shared during the release of Sora.
Vidu is an impressive accomplishment and a testament to China's rapid progress in AI research, but a
side-by-side comparison with Sora shows that its generated videos fall short of Sora's realism and
visual fidelity.
Even so, the temporal consistency Vidu achieves is commendable, and the technology has room for further
refinement over time.
Source: 每日英语听力 (Daily English Listening)
Script | 夏嘉宝
Narration | 王心羽
Review | 胡海燕 盛雨濛 李雪英
Layout | 田卓雯
