Title: Rethinking Visual-Language-Action Model Scaling: Alignment, Mixture, and Regularization
Authors: Ye Wang, Sipeng Zheng, Hao Luo, Wanpeng Zhang, Haoqi Yuan, Chaoyi Xu, Haiweng Xu, Yicheng Feng, Mingyang Yu, Zhiyu Kang, Zongqing Lu, Qin Jin
arXiv: https://arxiv.org/abs/2602.09722

Rather than proposing yet another VLA variant, this paper asks whether the standard “just scale data” doctrine actually survives robotics heterogeneity. The authors run a controlled ablation suite around a representative flow-matching VLA and compare design choices under matched settings in simulation and on real robots (a minimal flow-matching sketch follows for context).
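For readers unfamiliar with the backbone style, a flow-matching action head trains a network to predict the velocity that transports Gaussian noise to an action sample along a straight path. This is an illustrative sketch under assumed shapes and names, not the paper's implementation:

```python
import torch
import torch.nn as nn

class FlowMatchingActionHead(nn.Module):
    """Minimal flow-matching action head (illustrative, not the paper's code).

    Learns a velocity field v(x_t, t, cond) so that integrating from noise
    at t=0 to t=1 yields an action vector conditioned on backbone features.
    """
    def __init__(self, action_dim, cond_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(action_dim + cond_dim + 1, hidden),
            nn.SiLU(),
            nn.Linear(hidden, hidden),
            nn.SiLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, x_t, t, cond):
        # x_t: (B, action_dim), t: (B, 1), cond: (B, cond_dim)
        return self.net(torch.cat([x_t, t, cond], dim=-1))

def flow_matching_loss(head, actions, cond):
    """Linear-interpolant flow-matching objective: predict (actions - noise)."""
    noise = torch.randn_like(actions)
    t = torch.rand(actions.shape[0], 1, device=actions.device)
    x_t = (1 - t) * noise + t * actions   # point on the straight path
    target_v = actions - noise            # constant velocity of that path
    pred_v = head(x_t, t, cond)
    return ((pred_v - target_v) ** 2).mean()
```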

Three findings stand out. First, physical alignment matters more than raw volume: an end-effector-relative unified action representation is critical for cross-embodiment transfer (sketched below). Second, naïve pooling across robot datasets can produce negative transfer, so performance is not monotonic in the embodiment mixture. Third, common regularizers (sensory dropout, staged fine-tuning) are less reliable at scale than intuition suggests.
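Concretely, aligning action spaces can mean converting each dataset's absolute poses into deltas expressed in the current end-effector frame, which strips away base-frame and embodiment conventions. A hedged sketch of that idea (the function name and pose layout are assumptions, not the paper's exact transform):

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

def to_ee_relative(ee_pose_t, ee_pose_t1):
    """Express the next end-effector pose relative to the current one.

    Each pose is (position[3], quaternion[4] in xyzw order). Returning the
    delta in the current end-effector frame is the kind of unified action
    space the paper argues is critical for cross-embodiment transfer.
    """
    p_t, q_t = ee_pose_t[:3], ee_pose_t[3:]
    p_t1, q_t1 = ee_pose_t1[:3], ee_pose_t1[3:]
    r_t = R.from_quat(q_t)
    # Translation delta, rotated into the current EE frame.
    dp = r_t.inv().apply(p_t1 - p_t)
    # Rotation delta as a current-frame relative rotation.
    dq = (r_t.inv() * R.from_quat(q_t1)).as_quat()
    return np.concatenate([dp, dq])
```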

This reframes scaling behavior: it is governed by representation-alignment and data-composition terms, not only by model and data size (a hedged equation sketch follows). A practical bonus is the Grouped Blind Ensemble protocol, which reduces operator and evaluator bias in real-robot benchmarking.
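One minimal way to write that framing down is a generic loss-scaling form with explicit alignment and composition terms. The symbols below (α for alignment quality, m for mixture weights, A and T for the penalty terms) are illustrative assumptions, not the paper's notation:

```latex
% Hypothetical sketch, not the paper's equation: standard power-law scaling
% plus an alignment gap A(alpha) and a (possibly negative) cross-embodiment
% transfer term T(m) from the data mixture m.
\[
  L(N, D, \alpha, m) \;\approx\;
    \underbrace{\frac{a}{N^{\beta}} + \frac{b}{D^{\gamma}}}_{\text{standard scaling}}
  \;+\; \underbrace{A(\alpha)}_{\text{alignment gap}}
  \;+\; \underbrace{T(m)}_{\text{mixture transfer}}
\]
```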
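The note does not spell out how Grouped Blind Ensemble scheduling works; a plausible sketch, assuming trials are batched into groups where every policy appears once per task and identities are hidden behind anonymous labels (all names hypothetical):

```python
import random
from collections import defaultdict

def grouped_blind_ensemble(policies, tasks, trials_per_cell, seed=0):
    """Hypothetical sketch of a grouped, blinded evaluation schedule.

    Builds groups in which every policy runs once per task, shuffles the
    within-group order, and hides policy identity behind a blind label so
    operators and scorers cannot bias a known model.
    """
    rng = random.Random(seed)
    schedule = []
    for task in tasks:
        for group_idx in range(trials_per_cell):
            group = [(task, policy) for policy in policies]
            rng.shuffle(group)  # blind the within-group order
            for slot, (t, policy) in enumerate(group):
                schedule.append({
                    "task": t,
                    "group": group_idx,
                    "slot": slot,
                    "blind_id": f"{t}-g{group_idx}-s{slot}",
                    "_policy": policy,  # revealed only after scoring
                })
    return schedule

def unblind_scores(schedule, scores):
    """Aggregate per-policy means after all blind trials are rated."""
    totals = defaultdict(list)
    for entry in schedule:
        totals[entry["_policy"]].append(scores[entry["blind_id"]])
    return {p: sum(v) / len(v) for p, v in totals.items()}
```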

For pipeline decisions, this is directly actionable evidence against indiscriminate dataset aggregation. The recommendation is to invest first in cross-embodiment action-space alignment and mixture diagnostics, then scale data only where the transfer signs are positive (one possible diagnostic is sketched below).
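One cheap way to obtain those transfer signs before committing to scale is a leave-one-source-out check: compare target-embodiment validation loss with and without each candidate source dataset. A hypothetical helper, assuming you supply a `train_and_eval(train_sets, target)` callback that trains a small proxy model and returns target validation loss:

```python
def mixture_diagnostics(datasets, target, train_and_eval):
    """Flag source datasets whose inclusion hurts the target embodiment.

    For each source, retrain without it and compare target validation loss;
    a positive gain means removal helps, i.e. a negative-transfer suspect.
    """
    baseline = train_and_eval(datasets, target)
    report = {}
    for name in datasets:
        ablated = [d for d in datasets if d != name]
        loss_without = train_and_eval(ablated, target)
        report[name] = baseline - loss_without  # > 0 => negative transfer
    return report

# Scale data only for sources with report[name] <= 0 (no negative transfer).
```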

Graph: Paper Node 2602.09722