Title: VLA-JEPA: Enhancing Vision-Language-Action Model with Latent World Model
Authors: Jingwen Sun, Wenyao Zhang, Zekun Qi, Shaojie Ren, Zezhi Liu, Hanxin Zhu, Guangzhong Sun, Xin Jin, Zhibo Chen
arXiv: https://arxiv.org/abs/2602.10098
This paper critiques latent-action pretraining pipelines that still leak, or overfit to, pixel-level nuisance dynamics. Its JEPA-style answer is leakage-free latent prediction: future frames are encoded only as targets, never as student inputs, forcing representation learning toward action-relevant state transitions.
The architecture is intentionally simple: stage one learns predictive latent dynamics from the current observation to a future latent target; stage two attaches and fine-tunes an action head. By predicting in latent space, the method reduces sensitivity to camera motion and background variation.
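A minimal PyTorch sketch of that two-stage layout, assuming MLP encoders and an EMA-updated target encoder; all module names and dimensions here are illustrative, not the paper's actual architecture:

```python
import copy
import torch.nn as nn

class VLAJEPA(nn.Module):
    """Illustrative skeleton of the two-stage design (names are assumptions)."""
    def __init__(self, obs_dim: int = 2048, latent_dim: int = 512, act_dim: int = 7):
        super().__init__()
        # Student encoder: current observation -> latent state.
        self.encoder = nn.Sequential(
            nn.Linear(obs_dim, latent_dim), nn.GELU(), nn.Linear(latent_dim, latent_dim),
        )
        # Predictor: rolls the latent forward to the future step (stage one).
        self.predictor = nn.Sequential(
            nn.Linear(latent_dim, latent_dim), nn.GELU(), nn.Linear(latent_dim, latent_dim),
        )
        # Target encoder: gradient-free copy of the student, updated by EMA.
        # It encodes future frames as regression targets only; the future
        # never enters the student's input stream.
        self.target_encoder = copy.deepcopy(self.encoder).requires_grad_(False)
        # Action head attached and fine-tuned in stage two.
        self.action_head = nn.Linear(latent_dim, act_dim)
```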
Schematically, the stage-one objective is $\mathcal{L} = \lVert p_\phi(f_\theta(o_t)) - f_{\bar\theta}(o_{t+k}) \rVert_2^2$, where $f_{\bar\theta}$ is the target encoder and $p_\phi$ is the student predictor, conditioned on the present observation only. The no-leakage design is the central claim: supervision comes from the future, but information flow at inference time remains causal.
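Continuing the sketch above, a hypothetical stage-one update. The MSE loss form and the EMA decay value are assumptions; the key constraint, that the future frame passes only through the gradient-free target encoder, mirrors the no-leakage design:

```python
import torch
import torch.nn.functional as F

def jepa_step(model: VLAJEPA, obs_t: torch.Tensor, obs_future: torch.Tensor,
              ema_decay: float = 0.996) -> torch.Tensor:
    """One leakage-free latent-prediction update (sketch, not the paper's code)."""
    # Student path: encode the present, predict the future latent.
    pred = model.predictor(model.encoder(obs_t))
    # Target path: the future frame is encoded without gradients, so it
    # supervises the student but never feeds its forward pass.
    with torch.no_grad():
        target = model.target_encoder(obs_future)
    loss = F.mse_loss(pred, target)
    loss.backward()  # optimizer step omitted for brevity
    # EMA update: the target encoder tracks a slow-moving average of the student.
    with torch.no_grad():
        for p_t, p_s in zip(model.target_encoder.parameters(),
                            model.encoder.parameters()):
            p_t.mul_(ema_decay).add_(p_s, alpha=1.0 - ema_decay)
    return loss
```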
Results across LIBERO, LIBERO-Plus, SimplerEnv, and real-robot manipulation show better robustness and generalization than prior latent-action baselines. The practical takeaway is that representation causality constraints can matter as much as model scale when pretraining VLAs on internet video.
Graph: Paper Node 2602.10098