Title: BagelVLA: Enhancing Long-Horizon Manipulation via Interleaved Vision-Language-Action Generation
Authors: Yucheng Hu, Jianke Zhang, Yuanfei Luo, Yanjiang Guo, Xiaoyu Chen, Xinshu Sun, Kun Feng, Qingzhou Lu, Sheng Chen, Yangang Zhang, Wei Li, Jianyu Chen
arXiv: https://arxiv.org/abs/2602.09849

BagelVLA tackles a practical integration gap: many VLA approaches bolt on either textual planning or visual forecasting, but not both inside the control loop. This work unifies linguistic reasoning, visual prediction, and action generation in a single interleaved sequence model.

A key efficiency component is Residual Flow Guidance (RFG). Instead of expensive iterative denoising, the model initializes the denoising process from the current observation and runs a single denoising step to extract predictive visual cues, which then guide action generation at low latency.
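A minimal sketch of the single-step idea, under loose assumptions: a flow model predicts a velocity field, and instead of integrating many steps from noise, we take one Euler step starting from the current observation's latent. All names (`flow_model`, `rfg_feature`) are illustrative, not the paper's API.

```python
import numpy as np

def flow_model(x, t):
    # Stand-in velocity field: in practice a learned network predicting
    # the flow toward the future-frame latent. Here a linear toy flow.
    target = np.ones_like(x)
    return target - x

def rfg_feature(obs_latent, t0=0.0, t1=1.0):
    """Single-step denoising initialized from the observation latent.

    Returns a predicted future latent and the residual between it and
    the current observation, usable as a forecast-derived cue.
    """
    v = flow_model(obs_latent, t0)
    predicted_future = obs_latent + (t1 - t0) * v  # one Euler step, no iteration
    residual = predicted_future - obs_latent       # predictive visual cue
    return predicted_future, residual

obs = np.zeros(4)          # toy latent for the current observation
future, cue = rfg_feature(obs)
```

The design point is cost: one forward pass replaces an iterative sampler, so the forecast cue can be computed inside the control loop at every step.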

Action generation is conditioned on two streams: the intermediate reasoning state (textual plan) and the forecast-derived feature from RFG. The interleaving claim is that action quality improves when these two streams are synchronized temporally, rather than treated as detached side modules.
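The interleaving can be pictured as one sequence that alternates the three streams per control step, so each action token sees the reasoning and forecast produced at the same step. The token schema below is an assumption for illustration, not the paper's exact tokenization.

```python
from dataclasses import dataclass

@dataclass
class Step:
    reasoning: str       # textual sub-goal for this step
    forecast: list       # RFG visual cue for this step
    action: list         # action chunk emitted at this step

def interleave(steps):
    """Flatten per-step streams into one temporally synchronized sequence."""
    seq = []
    for s in steps:
        # Reasoning and forecast precede the action they condition.
        seq += [("lang", s.reasoning), ("vision", s.forecast), ("action", s.action)]
    return seq

steps = [
    Step("grasp handle", [0.1], [0.0]),
    Step("pull drawer", [0.2], [0.5]),
]
seq = interleave(steps)
```

The contrast with "detached side modules" is that here the plan and forecast are positioned in-sequence immediately before each action, so stale planning or forecasting cannot silently drift from control.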

Reported results show significant improvements on simulated and real long-horizon manipulation benchmarks, especially multi-stage tasks. The transferable idea is to treat planning and forecasting as online conditioning signals, not offline artifacts.

Graph: Paper Node 2602.09849