How Fast Can I Run My VLA? Demystifying VLA Inference Performance with VLA-Perf

Title: How Fast Can I Run My VLA? Demystifying VLA Inference Performance with VLA-Perf
Authors: Wenqi Jiang, Jason Clemons, Karu Sankaralingam, Christos Kozyrakis
arXiv: https://arxiv.org/abs/2602.18397

Problem framing

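When a VLA is deployed on a real robot, latency is the hardest constraint, yet there is currently no systematic tool for profiling its performance. VLA-Perf explicitly benchmarks the combined space of model architecture × inference system × hardware to answer one question: under which configurations can a VLA run in real time?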

Core method

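The work builds an end-to-end inference performance analysis stack. It breaks inference into stages such as vision encoding, language context processing, and action decoding; measures throughput, latency, GPU memory usage, and sensitivity to batch size; and pinpoints system-level bottlenecks.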
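To make the per-stage measurement idea concrete, here is a minimal timing-harness sketch. It is not VLA-Perf's code or API: `profile_stages` and the three `fake_*` callables are hypothetical names, and the stages are CPU busy work standing in for real model components. On a GPU, each callable would also have to wait for the device to finish before returning, or the timings would only capture kernel launch time.

```python
import statistics
import time
from typing import Callable, Dict


def profile_stages(
    stages: Dict[str, Callable[[], object]],
    warmup: int = 3,
    iters: int = 20,
) -> Dict[str, Dict[str, float]]:
    """Time each stage in isolation; return median and p95 latency in ms."""
    report: Dict[str, Dict[str, float]] = {}
    for name, run_stage in stages.items():
        # Warm-up runs are discarded so one-time setup cost is not counted.
        for _ in range(warmup):
            run_stage()
        samples_ms = []
        for _ in range(iters):
            start = time.perf_counter()
            run_stage()
            samples_ms.append((time.perf_counter() - start) * 1e3)
        samples_ms.sort()
        # Simple rank-based index for the 95th percentile, clamped to the last sample.
        p95_index = min(len(samples_ms) - 1, int(0.95 * len(samples_ms)))
        report[name] = {
            "median_ms": statistics.median(samples_ms),
            "p95_ms": samples_ms[p95_index],
        }
    return report


# Hypothetical stand-ins for real model stages (assumption: CPU busy work only).
def fake_vision_encoder() -> int:
    return sum(i * i for i in range(20_000))


def fake_language_context() -> int:
    return sum(range(10_000))


def fake_action_decoder() -> int:
    return sum(range(5_000))


report = profile_stages(
    {
        "vision": fake_vision_encoder,
        "language": fake_language_context,
        "action": fake_action_decoder,
    },
    warmup=2,
    iters=10,
)
for name, stats in report.items():
    print(f"{name}: median={stats['median_ms']:.3f} ms, p95={stats['p95_ms']:.3f} ms")
```

Reporting a tail percentile alongside the median is a design choice of this sketch: a control loop misses deadlines on slow iterations, not on average ones.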

Key equations and mechanisms

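The paper leans toward performance engineering; its core metric is a decomposition of the latency budget:

$$T_{\text{e2e}} = T_{\text{vision}} + T_{\text{fusion}} + T_{\text{action}} + T_{\text{runtime-overhead}}$$

Deployability is then assessed against the real-time constraint $T_{\text{e2e}} \leq T_{\text{ctrl}}$, where $T_{\text{ctrl}}$ is the time budget of the control loop.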
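The budget decomposition and real-time check can be sketched in a few lines. This is an illustration only: the `LatencyBudget` class is a hypothetical name, the stage latencies are made-up numbers rather than results from the paper, and the control budget is assumed to equal the control period, 1000 / control frequency in milliseconds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LatencyBudget:
    """Per-stage latencies in milliseconds (illustrative values only)."""

    vision_ms: float
    fusion_ms: float
    action_ms: float
    runtime_overhead_ms: float

    def end_to_end_ms(self) -> float:
        # End-to-end latency is the sum of the four terms in the decomposition.
        return (
            self.vision_ms
            + self.fusion_ms
            + self.action_ms
            + self.runtime_overhead_ms
        )

    def meets_control_rate(self, control_hz: float) -> bool:
        # Assumption: the control budget is one control period, 1000 / f ms.
        control_period_ms = 1000.0 / control_hz
        return self.end_to_end_ms() <= control_period_ms


budget = LatencyBudget(
    vision_ms=12.0, fusion_ms=30.0, action_ms=8.0, runtime_overhead_ms=5.0
)
print(budget.end_to_end_ms())          # 55.0
print(budget.meets_control_rate(10))   # True: 55 ms fits in a 100 ms period
print(budget.meets_control_rate(20))   # False: 55 ms exceeds a 50 ms period
```

The same made-up configuration is deployable at 10 Hz but not at 20 Hz, which is why the constraint has to be checked against the target control rate rather than in isolation.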

Experiment reading guide

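Look first at the curves showing how model scale and runtime optimizations (quantization, parallelism, caching) affect $T_{\text{e2e}}$, and use them to locate the boundary between "it runs" and "it runs reliably".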
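One possible way to operationalize that boundary is sketched below. The criterion is an assumption of this note, not something the paper defines: a configuration "runs" if its median latency fits the budget, and "runs reliably" only if its 95th-percentile latency also fits. The configuration names and latency values are hypothetical.

```python
def classify(median_ms: float, p95_ms: float, budget_ms: float) -> str:
    """Label a configuration by how its latency compares with the budget."""
    if p95_ms <= budget_ms:
        return "runs reliably"
    if median_ms <= budget_ms:
        return "runs, but misses deadlines"
    return "too slow"


# Hypothetical (median_ms, p95_ms) pairs; not measurements from the paper.
configs = {
    "fp16": (120.0, 150.0),
    "int8": (80.0, 110.0),
    "int8 + cache": (45.0, 70.0),
}
budget_ms = 100.0  # Assumed budget, e.g. a 10 Hz control loop.

verdicts = {
    name: classify(median_ms, p95_ms, budget_ms)
    for name, (median_ms, p95_ms) in configs.items()
}
for name, verdict in verdicts.items():
    print(f"{name}: {verdict}")
```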

Limitations

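This is a measurement framework and does not directly improve policy quality; its conclusions depend on the hardware stack and the implementation version.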

Future work

The framework could be coupled with policy training so that a latency-aware objective is built directly into model design.

Replication angle

Run your candidate VLA through VLA-Perf first, then decide whether it is worth large-scale experiments on a real robot.

Graph: Paper Node 2602.18397