Abstract

Modern video generators produce visually compelling clips but still struggle with physical and motion consistency, limiting their use as reliable world simulators. Existing remedies often rely on external simulators, teacher models, or curated physics-focused data. We explore a complementary self-supervised direction: extracting motion cues from the unlabeled videos already used to train video diffusion models. We propose LaMo, which formulates a latent motion prior over frame-to-frame latent changes conditioned on the current latent and prompt. This prior is exposed through two lightweight readouts: a macro motion drift used during training as a Motion Drift Loss, and a learned micro motion field used during sampling as Motion Prior Guidance. Both components are plug-and-play with existing video diffusion backbones, requiring no architectural or I/O changes. On VideoPhy and VideoPhy2, LaMo improves CogVideoX backbones and outperforms recent physics-aware baselines that use external supervision. On VBench, it preserves overall generation quality while improving motion-related dimensions. These results suggest that unlabeled video contains useful motion supervision for improving physical fidelity in modern video diffusion models.

[Figure: LaMo teaser]

Method

[Figure: LaMo method overview]

Motion Drift Loss

LaMo extracts a stable macro motion target from clean VAE latents by spatially averaging each tau-lag latent difference into a channel-wise drift vector. During fine-tuning, a scale-normalized L2 loss trains the drift of the denoiser's predicted clean latents to match this target. The loss is anchored on x0_hat and down-weighted at high noise levels, where motion estimates are unreliable.
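The drift readout and loss described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the released implementation: the function names, the default lag `tau=2`, the choice to normalize by the mean squared magnitude of the target drift, and the linear `1 - sigma` noise weighting are all assumptions made for the example.

```python
import numpy as np

def motion_drift(latents, tau):
    """Spatially average each tau-lag latent difference.

    latents: array of shape (T, C, H, W), with tau >= 1.
    Returns one channel-wise drift vector per frame pair, shape (T - tau, C).
    """
    diff = latents[tau:] - latents[:-tau]
    return diff.mean(axis=(2, 3))

def motion_drift_loss(x0_hat, x0, tau=2, sigma=0.0, eps=1e-6):
    """Scale-normalized L2 loss between predicted and target drift.

    x0_hat: clean latents predicted by the denoiser, shape (T, C, H, W).
    x0:     clean VAE latents of the training video, same shape.
    sigma:  noise level in [0, 1]; higher noise gets a smaller weight (assumed form).
    """
    target = motion_drift(x0, tau)
    pred = motion_drift(x0_hat, tau)
    # Assumed normalization: divide by the target's mean squared magnitude
    # so that clips with large and small motion contribute comparably.
    scale = np.mean(target ** 2) + eps
    normalized_l2 = np.mean((pred - target) ** 2) / scale
    # Assumed weighting: linear decay with the noise level.
    weight = max(0.0, 1.0 - sigma)
    return weight * normalized_l2
```

With this form, a prediction that reproduces the target drift exactly gives zero loss, and the same prediction error counts for less at `sigma=0.8` than at `sigma=0.2`.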

Motion Prior Guidance

LaMo complements the macro target with a prompt-conditioned CNN that predicts spatial motion fields from clean latent pairs. At inference, this frozen predictor defines a motion-consistency loss after the CFG mix, and its gradient updates the noise prediction rather than directly editing latents. Guidance is used only in the lower-noise window, where predicted motion is reliable.
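The sampling-time update can be illustrated with a toy NumPy sketch. Several pieces here are assumptions rather than details from this page: an epsilon-prediction parameterization with x_t = alpha * x0 + sigma * eps, a fixed `target_motion` array standing in for the frozen CNN's output (in the actual method the gradient would flow through the predictor via autograd), a simple quadratic consistency loss, and the hypothetical parameters `step`, `cfg_scale`, and `sigma_max`.

```python
import numpy as np

def cfg_mix(eps_uncond, eps_cond, scale):
    """Standard classifier-free guidance mix of two noise predictions."""
    return eps_uncond + scale * (eps_cond - eps_uncond)

def motion_loss_and_grad(x0_hat, target_motion, tau):
    """Toy motion-consistency loss and its analytic gradient w.r.t. x0_hat.

    loss = 0.5 * sum(((x0_hat[t + tau] - x0_hat[t]) - target_motion[t]) ** 2)
    """
    resid = (x0_hat[tau:] - x0_hat[:-tau]) - target_motion
    loss = 0.5 * np.sum(resid ** 2)
    grad = np.zeros_like(x0_hat)
    grad[tau:] += resid    # contribution of each later frame in a pair
    grad[:-tau] -= resid   # contribution of each earlier frame in a pair
    return loss, grad

def guided_eps(x_t, eps_uncond, eps_cond, alpha, sigma, target_motion,
               tau=1, cfg_scale=6.0, step=0.1, sigma_max=0.5):
    """Apply the guidance gradient to the noise prediction after the CFG mix."""
    eps = cfg_mix(eps_uncond, eps_cond, cfg_scale)
    if sigma > sigma_max:
        # Outside the lower-noise window: leave the prediction untouched.
        return eps
    # Clean-latent estimate implied by the current noise prediction.
    x0_hat = (x_t - sigma * eps) / alpha
    _, grad_x0 = motion_loss_and_grad(x0_hat, target_motion, tau)
    # Chain rule: d(x0_hat)/d(eps) = -sigma / alpha.
    grad_eps = -(sigma / alpha) * grad_x0
    # Gradient step on the noise prediction; x_t itself is not edited.
    return eps - step * grad_eps
```

Because the update acts on the noise prediction, the sampler's own step rule still produces the next latent, which is the distinction the paragraph above draws against directly editing latents.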

Qualitative Results

Each prompt below is shown as a side-by-side video comparison between CogVideoX-5B and LaMo-5B.

Milk spilling into coffee creates waves.
Coin spins rapidly on a wooden table.
A honey dipper drizzles honey onto Greek yogurt.
An apple falls into a vat of cider, sending up a spray.
A pro surfer sails smoothly on the wave-kissed waters.
Tablecloth is draped over the dining table.
Pouring beer into a glass, creating white foam.
A mountain biker descends fast through a dirt trail.

Physical Motion Fidelity

VideoPhy Results

SA measures semantic adherence and PC measures physical commonsense.

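| Methods | Extra Supervision | Solid-Solid SA | Solid-Solid PC | Solid-Fluid SA | Solid-Fluid PC | Fluid-Fluid SA | Fluid-Fluid PC | Overall SA | Overall PC |
|---|---|---|---|---|---|---|---|---|---|
| CogVideoX-2B | - | 49.6 | 13.3 | 71.2 | 28.1 | 60.0 | 50.9 | 60.5 | 25.6 |
| DreamWorld-1.3B | VGGT, DINOv2 | 54.5 | 24.5 | 48.6 | 25.4 | 60.1 | 32.7 | 52.9 | 26.2 |
| MoAlign-2B (paper) | VideoMAE | 24.7 | 31.7 | 66.9 | 40.7 | 67.3 | 56.4 | 49.3 | 39.4 |
| MoAlign-2B (reimpl.) | VideoMAE | 54.6 | 18.3 | 73.5 | 31.9 | 66.2 | 55.7 | 64.5 | 30.1 |
| VideoREPA-2B | VideoMAEv2 | 52.4 | 18.2 | 77.4 | 32.2 | 60.0 | 52.7 | 64.2 | 29.7 |
| LaMo-2B (Ours) | Self-supervised | 58.7 | 16.8 | 74.7 | 32.2 | 69.1 | 67.3 | 67.2 | 31.4 |
| CogVideoX-5B | - | 62.9 | 19.6 | 76.0 | 33.6 | 72.7 | 61.8 | 70.0 | 32.3 |
| PhyT2V-5B | o1-preview | - | - | - | - | - | - | 61 | 37 |
| WISA-5B | Qwen2VL | - | - | - | - | - | - | 67 | 38 |
| PHANTOM-5B | V-JEPA2 | - | - | - | - | - | - | 47.5 | 37.9 |
| MoAlign-5B (reimpl.) | VideoMAE | 62.5 | 26.3 | 79.6 | 38.4 | 78.0 | 76.2 | 72.2 | 39.4 |
| VideoREPA-5B | VideoMAEv2 | 58.0 | 28.0 | 82.9 | 39.0 | 80.0 | 74.5 | 72.1 | 40.1 |
| LaMo-5B (Ours) | Self-supervised | 62.9 | 26.6 | 80.8 | 41.1 | 78.2 | 78.2 | 73.0 | 41.0 |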

Key Component Ablation

| Component | SA | PC |
|---|---|---|
| Baseline | 64.8 | 35.2 |
| + Motion Drift Loss | 71.8 | 39.0 |
| + Motion Prior Guidance | 68.9 | 38.4 |
| LaMo-5B (full) | 73.0 | 41.0 |

Design Choice Ablation

| Variant | SA | PC |
|---|---|---|
| Dense motion loss | 64.2 | 34.6 |
| Raw L2 motion loss | 67.3 | 36.1 |
| Adj-frame lag (tau=1) | 68.0 | 37.5 |
| Direct latent edit | 62.8 | 31.6 |
| No predictor aug. | 66.4 | 37.6 |
| LaMo-5B (Ours) | 73.0 | 41.0 |

VideoPhy2 Results

| Methods | SA | PC |
|---|---|---|
| CogVideoX-2B | 21.0 | 68.0 |
| PHANTOM-5B | 27.8 | 71.7 |
| MoAlign-2B (paper) | 28.8 | 75.0 |
| MoAlign-2B (reimpl.) | 24.6 | 73.1 |
| VideoREPA-2B | 21.0 | 72.5 |
| LaMo-2B (Ours) | 25.4 | 75.4 |

General Video Quality

VBench Results

LaMo-5B improves on CogVideoX-5B in every reported dimension, covering quality, semantic, motion, and consistency scores.

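| Method | Motion Smoothness | Multiple Objects | Object Class | Overall Consistency | Scene | Spatial Relationship | Temporal Flickering | Quality Score | Semantic Score | Total Score |
|---|---|---|---|---|---|---|---|---|---|---|
| CogVideoX-5B | 97.6 | 50.4 | 78.7 | 25.0 | 40.3 | 52.3 | 97.3 | 80.5 | 68.7 | 78.2 |
| LaMo-5B (Ours) | 98.2 | 51.6 | 82.0 | 25.7 | 42.6 | 62.2 | 98.4 | 81.9 | 70.7 | 79.6 |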

Interpretability Analysis

[Figure: Motion drift and motion field heatmap interpretability]
[Figure: Additional motion drift and motion field heatmap interpretability]

Motion Drift Heatmap

The drift heatmap visualizes where local latent changes align with the macro drift direction supervised by Motion Drift Loss. Its response concentrates on physically active regions, indicating that the training-time readout captures the dominant event-level motion rather than static appearance.
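One plausible way to compute such a heatmap is a per-location cosine similarity between the local latent change and the channel-wise drift vector. The sketch below assumes that reading; the function name, the cosine measure, and the clipping of negative alignment to zero are illustrative choices, not details confirmed by this page.

```python
import numpy as np

def drift_heatmap(z_t, z_t_tau, eps=1e-8):
    """Alignment between local latent changes and the macro drift direction.

    z_t, z_t_tau: latents of two frames tau apart, each of shape (C, H, W).
    Returns an (H, W) map with values in [0, 1].
    """
    diff = z_t_tau - z_t                 # local latent change per location
    drift = diff.mean(axis=(1, 2))       # channel-wise macro drift, shape (C,)
    # Dot product between the drift vector and the change at every location.
    dot = np.einsum("c,chw->hw", drift, diff)
    norms = np.linalg.norm(diff, axis=0) * np.linalg.norm(drift) + eps
    # Keep only positive alignment; opposing or static regions map to 0.
    return np.clip(dot / norms, 0.0, None)
```

In this sketch, static regions have a zero latent difference and therefore zero response, so the map highlights only locations whose change points along the drift direction.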

Motion Field Heatmap

The field heatmap shows the spatial response of the frozen motion predictor used by Motion Prior Guidance after subtracting a no-motion baseline. It localizes the moving object or interaction region, supporting the claim that LaMo allocates motion where the physical change actually happens.

Conclusions and Limitations

Conclusions

LaMo extracts a self-supervised latent motion prior from ordinary video and exposes it as a training-time Motion Drift Loss and sampling-time Motion Prior Guidance. Without simulators, teacher models, or physics annotations, it improves physical fidelity on VideoPhy and VideoPhy2 while preserving general video quality on VBench.

Limitations

LaMo encourages plausible latent motion, but it is not a constraint-satisfying simulator. Its gains depend on video-data coverage and on current VideoPhy-style model-judged metrics, which do not isolate contact, deformation, fluids, conservation, or long-horizon stability.

Citation

@article{jiang2026lamo,
  title={LaMo: Self-Supervised Latent Motion Priors for Physical Realism in Video Generation},
  author={Bo Jiang and Depu Meng and Yihan Hu and Yichen Xie and Tianshuo Xu and Wei Zhan},
  journal={arXiv preprint arXiv:2605.23878},
  year={2026}
}