Title: VLA-JEPA: Enhancing Vision-Language-Action Model with Latent World Model
Authors: Jingwen Sun, Wenyao Zhang, Zekun Qi, Shaojie Ren, Zezhi Liu, Hanxin Zhu, Guangzhong Sun, Xin Jin, Zhibo Chen
arXiv: https://arxiv.org/abs/2602.10098
This paper critiques latent-action pretraining pipelines that still leak, or overfit to, pixel-level nuisance dynamics. Its JEPA-style answer is leakage-free latent prediction: future frames are encoded only as targets, never as student inputs, forcing representation learning toward action-relevant state transitions.
The architecture is intentionally simple: stage one learns predictive latent dynamics from the current observation to a future latent target; stage two attaches and fine-tunes an action head. By predicting in latent space, the method reduces sensitivity to camera motion and background variation.
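A minimal PyTorch sketch of that two-stage layout, assuming MLP encoders and an EMA-updated target encoder; all module names and dimensions here are illustrative, not the paper's actual architecture:

```python
import copy
import torch.nn as nn

class VLAJEPA(nn.Module):
    """Illustrative skeleton of the two-stage design (names are assumptions)."""
    def __init__(self, obs_dim: int = 2048, latent_dim: int = 512, act_dim: int = 7):
        super().__init__()
        # Student encoder: current observation -> latent state.
        self.encoder = nn.Sequential(
            nn.Linear(obs_dim, latent_dim), nn.GELU(), nn.Linear(latent_dim, latent_dim),
        )
        # Predictor: rolls the latent forward to the future step (stage one).
        self.predictor = nn.Sequential(
            nn.Linear(latent_dim, latent_dim), nn.GELU(), nn.Linear(latent_dim, latent_dim),
        )
        # Target encoder: gradient-free copy of the student, updated by EMA.
        # It encodes future frames as regression targets only; the future
        # never enters the student's input stream.
        self.target_encoder = copy.deepcopy(self.encoder).requires_grad_(False)
        # Action head attached and fine-tuned in stage two.
        self.action_head = nn.Linear(latent_dim, act_dim)
```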
Schematically, the stage-one objective is $\mathcal{L} = \lVert p_\phi(f_\theta(o_t)) - f_{\bar\theta}(o_{t+k}) \rVert_2^2$, where $f_{\bar\theta}$ is the target encoder and $p_\phi$ is the student predictor, conditioned on the present observation only. The no-leakage design is the central claim: supervision comes from the future, but information flow at inference time remains causal.
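Continuing the sketch above, a hypothetical stage-one update. The MSE loss form and the EMA decay value are assumptions; the key constraint, that the future frame passes only through the gradient-free target encoder, mirrors the no-leakage design:

```python
import torch
import torch.nn.functional as F

def jepa_step(model: VLAJEPA, obs_t: torch.Tensor, obs_future: torch.Tensor,
              ema_decay: float = 0.996) -> torch.Tensor:
    """One leakage-free latent-prediction update (sketch, not the paper's code)."""
    # Student path: encode the present, predict the future latent.
    pred = model.predictor(model.encoder(obs_t))
    # Target path: the future frame is encoded without gradients, so it
    # supervises the student but never feeds its forward pass.
    with torch.no_grad():
        target = model.target_encoder(obs_future)
    loss = F.mse_loss(pred, target)
    loss.backward()  # optimizer step omitted for brevity
    # EMA update: the target encoder tracks a slow-moving average of the student.
    with torch.no_grad():
        for p_t, p_s in zip(model.target_encoder.parameters(),
                            model.encoder.parameters()):
            p_t.mul_(ema_decay).add_(p_s, alpha=1.0 - ema_decay)
    return loss
```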
Results across LIBERO, LIBERO-Plus, SimplerEnv, and real-robot manipulation show better robustness and generalization than prior latent-action baselines. The practical takeaway is that representation causality constraints can matter as much as model scale when pretraining VLAs on internet video.
Graph: Paper Node 2602.10098