Title: ST4VLA: Spatially Guided Training for Vision-Language-Action Models
Authors: Jinhui Ye, Fangjing Wang, Ning Gao, Junqiu Yu, Yangkun Zhu, Bin Wang, Jinyu Zhang, Weiyang Jin, Yanwei Fu, Feng Zheng, Yilun Chen, Jiangmiao Pang
arXiv: https://arxiv.org/abs/2602.10109

ST4VLA addresses a common failure mode in VLA systems: strong multimodal understanding that is not backed by stable spatial grounding during policy learning. The paper proposes a dual-stage training recipe that preserves spatial priors all the way through to action generation.

Stage 1 performs spatial grounding pretraining via point, box, and trajectory prediction on web and robot data. Stage 2 adds spatially guided action post-training, where explicit spatial prompting encourages the action heads to condition on richer geometric intent rather than on language tokens alone.
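The two stages and their active objectives can be sketched as a small configuration. This is purely illustrative: the stage names, objective labels, and data tags below are my shorthand for the description above, not the paper's actual API or terminology.

```python
# Hypothetical sketch of the dual-stage recipe; names are illustrative only.
STAGES = {
    "stage1_spatial_pretraining": {
        # grounding objectives on mixed web + robot data
        "objectives": ["point_prediction", "box_prediction", "trajectory_prediction"],
        "data": ["web", "robot"],
    },
    "stage2_spatially_guided_posttraining": {
        # action learning with explicit spatial prompts; grounding
        # objectives stay active so spatial priors are not discarded
        "objectives": ["action_imitation", "point_prediction",
                       "box_prediction", "trajectory_prediction"],
        "data": ["robot"],
    },
}

def active_objectives(stage: str) -> list[str]:
    """Return the objectives trained in a given stage."""
    return STAGES[stage]["objectives"]
```

The point of the sketch is that stage 2 is a superset of stage 1's grounding objectives plus action imitation, rather than a clean switch from grounding to control.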

The key design choice is optimization coupling: the spatial and action objectives are co-optimized so the model does not forget grounding when it switches to policy learning. Reported gains on Google Robot and WidowX are large, and robustness improves under unseen objects, instruction paraphrases, and long-horizon perturbations.
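Why co-optimization prevents forgetting can be seen in a one-parameter toy problem: if the policy phase trains on the action loss alone, the optimum drifts to the action-only solution; a weighted joint objective keeps the grounding term in the solution. Everything here (the quadratic losses, the weight `lam`) is a made-up illustration of the coupling idea, not the paper's actual objective.

```python
# Toy illustration of co-optimized objectives, not the paper's loss.
def grad_combined(w: float, lam: float = 0.5) -> float:
    """Gradient of (w - 2)^2 + lam * (w - 1)^2.

    (w - 2)^2 stands in for the action objective (optimum w = 2);
    (w - 1)^2 stands in for the spatial grounding objective (optimum w = 1).
    """
    return 2.0 * (w - 2.0) + lam * 2.0 * (w - 1.0)

# Plain gradient descent on the joint objective.
w = 0.0
for _ in range(200):
    w -= 0.1 * grad_combined(w)

# With lam = 0.5 the joint optimum is (2 + 0.5 * 1) / 1.5 = 5/3:
# a compromise that retains the grounding term instead of drifting
# all the way to the action-only optimum at w = 2.
```

Training only on the action term (lam = 0) would converge to w = 2 and discard the grounding optimum entirely, which is the forgetting failure the coupling is meant to avoid.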

For your own VLA framing, this is compelling support for "pretrain geometry, then preserve geometry," as opposed to treating spatial prediction as a disposable auxiliary pretext task. The design pattern is likely reusable in manipulation policies that need language flexibility without spatial drift.

Graph: Paper Node 2602.10109