InterACT: Inter-dependency Aware Action Chunking with Hierarchical Attention Transformers for Bimanual Manipulation
Andrew Lee, Ian Chuang, Ling-Yuan Chen, Iman Soltani
arXiv - CS - Robotics, 2024-09-12. https://doi.org/arxiv-2409.07914
Abstract
We present InterACT: Inter-dependency aware Action Chunking with Hierarchical Attention Transformers, a novel imitation learning framework for bimanual manipulation that integrates hierarchical attention to capture inter-dependencies between dual-arm joint states and visual inputs. InterACT consists of a Hierarchical Attention Encoder and a Multi-arm Decoder, both designed to enhance information aggregation and coordination. The encoder processes multi-modal inputs through segment-wise and cross-segment attention mechanisms, while the decoder leverages synchronization blocks to refine individual action predictions, providing the counterpart's prediction as context. Our experiments on a variety of simulated and real-world bimanual manipulation tasks demonstrate that InterACT significantly outperforms existing methods. Detailed ablation studies validate the contributions of key components of our work, including the impact of CLS tokens, cross-segment encoders, and synchronization blocks.
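
The abstract describes two attention mechanisms: segment-wise attention that summarizes each input segment with a CLS token before cross-segment attention fuses the summaries, and a decoder synchronization block in which each arm's action chunk attends to the counterpart arm's prediction as context. The PyTorch sketch below illustrates this wiring under stated assumptions; the module names (SegmentEncoder, CrossSegmentEncoder, SyncBlock), dimensions, and layer counts are illustrative choices of ours, not the paper's implementation.

```python
# Minimal sketch of an InterACT-style hierarchy (assumptions, not the
# authors' code): segment-wise attention with CLS tokens, cross-segment
# attention over the CLS summaries, and a synchronization step between
# per-arm action chunks.
import torch
import torch.nn as nn

D = 256  # shared embedding width (assumed for this sketch)

class SegmentEncoder(nn.Module):
    """Segment-wise attention: self-attention within one input segment
    (e.g., one arm's joint-state tokens), summarized by a prepended CLS token."""
    def __init__(self, d=D, heads=4, layers=2):
        super().__init__()
        self.cls = nn.Parameter(torch.zeros(1, 1, d))
        block = nn.TransformerEncoderLayer(d, heads, batch_first=True)
        self.enc = nn.TransformerEncoder(block, layers)

    def forward(self, tokens):                      # tokens: (B, T, D)
        cls = self.cls.expand(tokens.size(0), -1, -1)
        out = self.enc(torch.cat([cls, tokens], dim=1))
        return out[:, 0], out[:, 1:]                # CLS summary, refined tokens

class CrossSegmentEncoder(nn.Module):
    """Cross-segment attention: the CLS summaries of all segments attend to
    one another, propagating dependencies across arms and modalities."""
    def __init__(self, d=D, heads=4, layers=2):
        super().__init__()
        block = nn.TransformerEncoderLayer(d, heads, batch_first=True)
        self.enc = nn.TransformerEncoder(block, layers)

    def forward(self, summaries):                   # summaries: (B, S, D)
        return self.enc(summaries)

class SyncBlock(nn.Module):
    """Synchronization block: one arm's action chunk cross-attends to the
    counterpart arm's current prediction, which serves as context."""
    def __init__(self, d=D, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.norm = nn.LayerNorm(d)

    def forward(self, own, counterpart):            # both: (B, chunk, D)
        ctx, _ = self.attn(own, counterpart, counterpart)
        return self.norm(own + ctx)

# Toy forward pass: encode two proprioceptive segments, fuse their CLS
# summaries, then run one synchronization step between per-arm chunks.
B, T, chunk = 2, 10, 8
seg, cross, sync = SegmentEncoder(), CrossSegmentEncoder(), SyncBlock()
left_cls, _ = seg(torch.randn(B, T, D))
right_cls, _ = seg(torch.randn(B, T, D))
fused = cross(torch.stack([left_cls, right_cls], dim=1))    # (B, 2, D)
left_act = torch.randn(B, chunk, D)   # stand-in per-arm decoder outputs
right_act = torch.randn(B, chunk, D)
left_ref = sync(left_act, right_act)  # refine left arm with right as context
right_ref = sync(right_act, left_act) # and vice versa
```

In the paper, the encoder and decoder are trained end-to-end with action chunking; here the per-arm decoder outputs are random stand-ins, so the snippet only demonstrates how the attention pathways connect, not the full training pipeline.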