LaMo: Self-Supervised Latent Motion Priors

Abstract

Modern video generators produce visually compelling clips but still struggle with physical and motion consistency, limiting their use as reliable world simulators. Existing remedies often rely on external simulators, teacher models, or curated physics-focused data. We explore a complementary self-supervised direction: extracting motion cues from the unlabeled videos already used to train video diffusion models. We propose LaMo, which formulates a latent motion prior over frame-to-frame latent changes conditioned on the current latent and prompt. This prior is exposed through two lightweight readouts: a macro motion drift used during training as a Motion Drift Loss, and a learned micro motion field used during sampling as Motion Prior Guidance. Both components are plug-and-play with existing video diffusion backbones, requiring no architectural or I/O changes. On VideoPhy and VideoPhy2, LaMo improves CogVideoX backbones and outperforms recent physics-aware baselines that use external supervision. On VBench, it preserves overall generation quality while improving motion-related dimensions. These results suggest that unlabeled video contains useful motion supervision for improving physical fidelity in modern video diffusion models.

Method

Motion Drift Loss

LaMo extracts a stable macro motion target from clean VAE latents by spatially averaging each tau-lag latent difference into a channel-wise drift vector. During fine-tuning, a scale-normalized L2 loss trains the drift of the denoiser's predicted clean latents to match this target. The loss is anchored on x0_hat and down-weighted at high noise levels, where motion estimates are unreliable.
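The drift readout and loss described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the default lag `tau=2`, the normalization by the squared norm of the target drift, and the `(1 - noise_level)^2` down-weighting are all hypothetical choices that the text does not specify.

```python
import numpy as np

def macro_drift(latents, tau):
    """Channel-wise drift: spatial mean of each tau-lag latent difference.

    latents: array of shape (T, C, H, W); returns shape (T - tau, C).
    """
    if tau < 1 or tau >= latents.shape[0]:
        raise ValueError("tau must satisfy 1 <= tau < T")
    diff = latents[tau:] - latents[:-tau]
    return diff.mean(axis=(2, 3))

def motion_drift_loss(x0_hat, x0, tau=2, noise_level=0.0, eps=1e-8):
    """Scale-normalized L2 between predicted and target drift.

    x0_hat: predicted clean latents; x0: clean VAE latents.
    noise_level in [0, 1]; the weighting schedule is an assumption.
    """
    d_pred = macro_drift(x0_hat, tau)
    d_tgt = macro_drift(x0, tau)
    sq_err = np.sum((d_pred - d_tgt) ** 2)
    scale = np.sum(d_tgt ** 2) + eps      # assumed normalizer: target drift energy
    weight = (1.0 - noise_level) ** 2     # assumed down-weighting at high noise
    return weight * sq_err / scale
```

Dividing by the target drift energy makes the loss insensitive to the overall latent scale, which is one plausible reading of "scale-normalized".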

Motion Prior Guidance

LaMo complements the macro target with a prompt-conditioned CNN that predicts spatial motion fields from clean latent pairs. At inference, this frozen predictor defines a motion-consistency loss after the CFG mix, and its gradient updates the noise prediction rather than directly editing latents. Guidance is used only in the lower-noise window, where predicted motion is reliable.
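One way to picture this guidance step is the NumPy sketch below. It is only an illustration under several assumptions: the frozen predictor is replaced by a fixed `target_motion` array, the motion-consistency loss is a plain squared error on adjacent-frame differences of x0_hat, the gradient is taken with respect to x0_hat while holding the noise prediction fixed, and the window is a simple threshold `sigma_max`. None of these names or choices come from the paper.

```python
import numpy as np

def motion_consistency_grad(x0_hat, target_motion):
    """Loss 0.5 * ||(x0[t+1] - x0[t]) - target||^2 and its gradient w.r.t. x0_hat.

    x0_hat: (T, C, H, W); target_motion: (T - 1, C, H, W).
    """
    residual = (x0_hat[1:] - x0_hat[:-1]) - target_motion
    loss = 0.5 * np.sum(residual ** 2)
    grad = np.zeros_like(x0_hat)
    grad[1:] += residual      # derivative through the later frame
    grad[:-1] -= residual     # derivative through the earlier frame
    return loss, grad

def guided_eps(x_t, eps_uncond, eps_cond, alpha, sigma, target_motion,
               cfg_scale=6.0, weight=1.0, sigma_max=0.6):
    """Apply guidance to the noise prediction after the CFG mix."""
    eps = eps_uncond + cfg_scale * (eps_cond - eps_uncond)   # CFG mix
    if sigma > sigma_max:     # outside the lower-noise window: no guidance
        return eps
    x0_hat = (x_t - sigma * eps) / alpha                     # predicted clean latent
    _, grad = motion_consistency_grad(x0_hat, target_motion)
    # Shift the noise prediction; the latent x_t itself is not edited.
    return eps + weight * sigma * grad
```

Because x0_hat = (x_t - sigma * eps) / alpha, adding `weight * sigma * grad` to the noise prediction moves the implied clean latent a small step down the loss gradient, and the sampler then consumes the adjusted prediction as usual.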

Qualitative Results

Each prompt below is shown for both CogVideoX-5B and LaMo-5B.

Milk spilling into coffee creates waves.

Coin spins rapidly on a wooden table.

A honey dipper drizzles honey onto Greek yogurt.

An apple falls into a vat of cider, sending up a spray.

A pro surfer sails smoothly on the wave-kissed waters.

Tablecloth is draped over the dining table.

Pouring beer into a glass, creating white foam.

A mountain biker descends fast through a dirt trail.

Physical Motion Fidelity

VideoPhy Results

SA measures semantic adherence and PC measures physical commonsense.

Methods	Extra Supervision	Solid-Solid SA	Solid-Solid PC	Solid-Fluid SA	Solid-Fluid PC	Fluid-Fluid SA	Fluid-Fluid PC	Overall SA	Overall PC
CogVideoX-2B	-	49.6	13.3	71.2	28.1	60.0	50.9	60.5	25.6
DreamWorld-1.3B	VGGT, DINOv2	54.5	24.5	48.6	25.4	60.1	32.7	52.9	26.2
MoAlign-2B (paper)	VideoMAE	24.7	31.7	66.9	40.7	67.3	56.4	49.3	39.4
MoAlign-2B (reimpl.)	VideoMAE	54.6	18.3	73.5	31.9	66.2	55.7	64.5	30.1
VideoREPA-2B	VideoMAEv2	52.4	18.2	77.4	32.2	60.0	52.7	64.2	29.7
LaMo-2B (Ours)	Self-supervised	58.7	16.8	74.7	32.2	69.1	67.3	67.2	31.4
CogVideoX-5B	-	62.9	19.6	76.0	33.6	72.7	61.8	70.0	32.3
PhyT2V-5B	o1-preview	-	-	-	-	-	-	61	37
WISA-5B	Qwen2VL	-	-	-	-	-	-	67	38
PHANTOM-5B	V-JEPA2	-	-	-	-	-	-	47.5	37.9
MoAlign-5B (reimpl.)	VideoMAE	62.5	26.3	79.6	38.4	78.0	76.2	72.2	39.4
VideoREPA-5B	VideoMAEv2	58.0	28.0	82.9	39.0	80.0	74.5	72.1	40.1
LaMo-5B (Ours)	Self-supervised	62.9	26.6	80.8	41.1	78.2	78.2	73.0	41.0

Key Component Ablation

Component	SA	PC
Baseline	64.8	35.2
+ Motion Drift Loss	71.8	39.0
+ Motion Prior Guidance	68.9	38.4
LaMo-5B (full)	73.0	41.0

Design Choice Ablation

Variant	SA	PC
Dense motion loss	64.2	34.6
Raw L2 motion loss	67.3	36.1
Adj-frame lag (tau=1)	68.0	37.5
Direct latent edit	62.8	31.6
No predictor aug.	66.4	37.6
LaMo-5B (Ours)	73.0	41.0

VideoPhy2 Results

Methods	SA	PC
CogVideoX-2B	21.0	68.0
PHANTOM-5B	27.8	71.7
MoAlign-2B (paper)	28.8	75.0
MoAlign-2B (reimpl.)	24.6	73.1
VideoREPA-2B	21.0	72.5
LaMo-2B (Ours)	25.4	75.4

General Video Quality

VBench Results

LaMo-5B improves on CogVideoX-5B in every reported VBench dimension, spanning quality, semantic, motion, and consistency scores.

Method	Motion Smoothness	Multiple Objects	Object Class	Overall Consistency	Scene	Spatial Relationship	Temporal Flickering	Quality Score	Semantic Score	Total Score
CogVideoX-5B	97.6	50.4	78.7	25.0	40.3	52.3	97.3	80.5	68.7	78.2
LaMo-5B (Ours)	98.2	51.6	82.0	25.7	42.6	62.2	98.4	81.9	70.7	79.6

Interpretability Analysis

Motion drift and motion field heatmap interpretability

Additional motion drift and motion field heatmap interpretability

Motion Drift Heatmap

The drift heatmap visualizes where local latent changes align with the macro drift direction supervised by Motion Drift Loss. Its response concentrates on physically active regions, indicating that the training-time readout captures the dominant event-level motion rather than static appearance.
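A heatmap of this kind could be computed, for example, as the per-location cosine similarity between the local latent change and the macro drift vector. The sketch below assumes that reading; the function name, the default `tau=1`, and the use of cosine similarity are illustrative assumptions rather than the authors' visualization code.

```python
import numpy as np

def drift_heatmap(latents, tau=1, eps=1e-8):
    """Cosine alignment between local tau-lag changes and the macro drift.

    latents: (T, C, H, W); returns (T - tau, H, W) with values in [-1, 1].
    """
    diff = latents[tau:] - latents[:-tau]             # local changes, (T-tau, C, H, W)
    drift = diff.mean(axis=(2, 3), keepdims=True)     # macro drift, (T-tau, C, 1, 1)
    dot = (diff * drift).sum(axis=1)                  # channel-wise dot product
    norms = np.linalg.norm(diff, axis=1) * np.linalg.norm(drift, axis=1)
    return dot / (norms + eps)                        # static locations map to ~0
```

Locations with no latent change have a zero numerator and therefore a response of zero, so static regions stay dark under this definition.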

Motion Field Heatmap

The field heatmap shows the spatial response of the frozen motion predictor used by Motion Prior Guidance after subtracting a no-motion baseline. It localizes the moving object or interaction region, supporting the claim that LaMo allocates motion where the physical change actually happens.
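The baseline subtraction can be illustrated as follows. In this sketch the no-motion baseline is assumed to be the predictor's output when the same latent is passed twice, and `toy_predictor` is a hypothetical stand-in for the frozen, prompt-conditioned CNN (the prompt input is omitted).

```python
import numpy as np

def toy_predictor(z_a, z_b):
    """Hypothetical stand-in for the frozen motion predictor.

    The constant offset mimics a nonzero response even without motion.
    """
    return (z_b - z_a) + 0.1

def field_heatmap(predictor, z_a, z_b):
    """Spatial response after subtracting an assumed no-motion baseline.

    z_a, z_b: latents of shape (C, H, W); returns (H, W).
    """
    response = predictor(z_a, z_b)
    baseline = predictor(z_a, z_a)     # same latent twice: no motion
    return np.linalg.norm(response - baseline, axis=0)
```

Subtracting the baseline removes any response the predictor produces regardless of motion, so the remaining magnitude reflects only the change between the two latents.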

Conclusions and Limitations

Conclusions

LaMo extracts a self-supervised latent motion prior from ordinary video and exposes it as a training-time Motion Drift Loss and sampling-time Motion Prior Guidance. Without simulators, teacher models, or physics annotations, it improves physical fidelity on VideoPhy and VideoPhy2 while preserving general video quality on VBench.

Limitations

LaMo encourages plausible latent motion, but it is not a constraint-satisfying simulator. Its gains depend on video-data coverage and on current VideoPhy-style model-judged metrics, which do not isolate contact, deformation, fluids, conservation, or long-horizon stability.

Citation

@article{jiang2026lamo,
  title={LaMo: Self-Supervised Latent Motion Priors for Physical Realism in Video Generation},
  author={Bo Jiang and Depu Meng and Yihan Hu and Yichen Xie and Tianshuo Xu and Wei Zhan},
  journal={arXiv preprint arXiv:2605.23878},
  year={2026}
}