TinyVLA: Toward Fast, Data-Efficient Vision-Language-Action Models for Robotic Manipulation

IF 4.6 2区计算机科学 Q2 ROBOTICS

IEEE Robotics and Automation Letters Pub Date : 2025-02-24 DOI:10.1109/LRA.2025.3544909

Junjie Wen;Yichen Zhu;Jinming Li;Minjie Zhu;Zhibin Tang;Kun Wu;Zhiyuan Xu;Ning Liu;Ran Cheng;Chaomin Shen;Yaxin Peng;Feifei Feng;Jian Tang

{"title":"TinyVLA: Toward Fast, Data-Efficient Vision-Language-Action Models for Robotic Manipulation","authors":"Junjie Wen;Yichen Zhu;Jinming Li;Minjie Zhu;Zhibin Tang;Kun Wu;Zhiyuan Xu;Ning Liu;Ran Cheng;Chaomin Shen;Yaxin Peng;Feifei Feng;Jian Tang","doi":"10.1109/LRA.2025.3544909","DOIUrl":null,"url":null,"abstract":"Vision-Language-Action (VLA) models have shown remarkable potential in visuomotor control and instruction comprehension through end-to-end learning processes. However, current VLA models face significant challenges: they are slow during inference and require extensive pre-training on large amounts of robotic data, making real-world deployment difficult. In this letter, we introduce a new family of compact vision-language-action models, called TinyVLA, which offers two key advantages over existing VLA models: (1) faster inference speeds, and (2) improved data efficiency, eliminating the need for pre-training stage. Our framework incorporates two essential components to build TinyVLA: (1) initializing the policy backbone with robust, high-speed multimodal models, and (2) integrating a diffusion policy decoder during fine-tuning to enable precise robot actions. We conducted extensive evaluations of TinyVLA in both simulation and on real robots, demonstrating that our approach significantly outperforms the state-of-the-art VLA model, OpenVLA, in terms of speed and data efficiency, while delivering comparable or superior performance. Additionally, TinyVLA exhibits strong generalization capabilities across various dimensions, including language instructions, novel objects, unseen positions, changes in object appearance, background variations, and environmental shifts, often matching or exceeding the performance of OpenVLA. We believe that TinyVLA offers an interesting perspective on utilizing pre-trained multimodal models for policy learning.","PeriodicalId":13241,"journal":{"name":"IEEE Robotics and Automation Letters","volume":"10 4","pages":"3988-3995"},"PeriodicalIF":4.6000,"publicationDate":"2025-02-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Robotics and Automation Letters","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10900471/","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"ROBOTICS","Score":null,"Total":0}

引用次数: 0

Abstract

Vision-Language-Action (VLA) models have shown remarkable potential in visuomotor control and instruction comprehension through end-to-end learning processes. However, current VLA models face significant challenges: they are slow during inference and require extensive pre-training on large amounts of robotic data, making real-world deployment difficult. In this letter, we introduce a new family of compact vision-language-action models, called TinyVLA, which offers two key advantages over existing VLA models: (1) faster inference speeds, and (2) improved data efficiency, eliminating the need for pre-training stage. Our framework incorporates two essential components to build TinyVLA: (1) initializing the policy backbone with robust, high-speed multimodal models, and (2) integrating a diffusion policy decoder during fine-tuning to enable precise robot actions. We conducted extensive evaluations of TinyVLA in both simulation and on real robots, demonstrating that our approach significantly outperforms the state-of-the-art VLA model, OpenVLA, in terms of speed and data efficiency, while delivering comparable or superior performance. Additionally, TinyVLA exhibits strong generalization capabilities across various dimensions, including language instructions, novel objects, unseen positions, changes in object appearance, background variations, and environmental shifts, often matching or exceeding the performance of OpenVLA. We believe that TinyVLA offers an interesting perspective on utilizing pre-trained multimodal models for policy learning.

查看原文本刊更多论文

求助全文

约1分钟内获得全文求助全文

来源期刊

IEEE Robotics and Automation Letters Computer Science-Computer Science Applications

CiteScore

9.60

自引率

15.40%

发文量

1428

期刊介绍： The scope of this journal is to publish peer-reviewed articles that provide a timely and concise account of innovative research ideas and application results, reporting significant theoretical findings and application case studies in areas of robotics and automation.