Spatial reasoning has been such a massive bottleneck for VLMs, so moving away from standard fine-tuning toward code-based action logic is a smart pivot. Relying on code execution…
Spatial reasoning has been such a massive bottleneck for VLMs, so moving away from standard fine-tuning toward code-based action logic is a smart pivot. Relying on code execution to handle spatial relationships is way more robust than just hoping the model learned enough visual priors during pre-training.
Tbh, training-free frameworks are the move because they sidestep those massive compute costs and potential catastrophic forgetting. If this actually generalizes across complex environments without requiring a custom dataset for every new task, we might finally get agents that can actually navigate a UI or a physical space without hallucinating coordinates.
Pretty cool stuff.