Generative video AI has made significant progress, but it often still struggles to depict the movements of humans and objects in a physically realistic way, i.e., correctly and causally. Meta has now introduced a new model, VideoJAM, that addresses these issues: according to Meta's researchers, and unlike competing video AIs, it does not prioritize rendering quality over motion during training.
VideoJAM consists of two central components: during training, it predicts both the pixels and the associated motion from a single learned representation; at inference time, the model's own evolving motion prediction is used as a guidance signal to steer generation toward coherent movement. Notably, VideoJAM can be applied to any video model with minimal adaptations, without requiring changes to the training data or model scale, and it achieves the current state of the art in motion consistency while simultaneously improving visual quality. VideoJAM thus demonstrates that integrating appearance and motion can improve both the coherence and the overall quality of generated videos.
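To make the idea of a single shared representation with two prediction targets concrete, here is a minimal, self-contained PyTorch sketch of such a joint training objective. All class names, dimensions, and the `motion_weight` parameter are illustrative assumptions for this sketch and do not reflect Meta's actual architecture or loss formulation.

```python
# Minimal sketch of a joint appearance + motion training objective in the
# spirit of VideoJAM. All module and tensor names are illustrative
# assumptions, not Meta's implementation.
import torch
import torch.nn as nn


class JointAppearanceMotionHead(nn.Module):
    """Toy backbone that predicts both pixels and motion (e.g. optical flow)
    from one shared latent representation."""

    def __init__(self, latent_dim: int = 256, pixel_dim: int = 3, motion_dim: int = 2):
        super().__init__()
        self.backbone = nn.Sequential(            # shared representation
            nn.Linear(latent_dim, latent_dim), nn.GELU(),
            nn.Linear(latent_dim, latent_dim), nn.GELU(),
        )
        self.pixel_head = nn.Linear(latent_dim, pixel_dim)    # appearance branch
        self.motion_head = nn.Linear(latent_dim, motion_dim)  # motion branch

    def forward(self, z: torch.Tensor):
        h = self.backbone(z)
        return self.pixel_head(h), self.motion_head(h)


def joint_loss(model, z, target_pixels, target_motion, motion_weight=1.0):
    """Combined objective: errors in both the reconstructed appearance and the
    predicted motion are penalized, so training cannot trade motion coherence
    away for pixel fidelity alone."""
    pred_pixels, pred_motion = model(z)
    appearance_loss = nn.functional.mse_loss(pred_pixels, target_pixels)
    motion_loss = nn.functional.mse_loss(pred_motion, target_motion)
    return appearance_loss + motion_weight * motion_loss


if __name__ == "__main__":
    model = JointAppearanceMotionHead()
    z = torch.randn(8, 256)                # stand-in latent tokens
    target_pixels = torch.randn(8, 3)      # stand-in RGB targets
    target_motion = torch.randn(8, 2)      # stand-in flow targets
    loss = joint_loss(model, z, target_pixels, target_motion)
    loss.backward()
    print(f"joint loss: {loss.item():.4f}")
```

The key design point the sketch illustrates is that both prediction heads read from the same backbone output, so the shared representation must encode motion as well as appearance rather than optimizing for visual fidelity alone.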
Meta demonstrates VideoJAM's superiority over current competing models in a qualitative comparison with leading AI models (the proprietary Sora, Kling, and Runway Gen3) and with the base model from which VideoJAM was fine-tuned (DiT-30B), using representative prompts: