{"title":"Simultaneous text and gesture generation for social robots with small language models.","authors":"Alessio Galatolo, Katie Winkle","doi":"10.3389/frobt.2025.1581024","DOIUrl":null,"url":null,"abstract":"<p><strong>Introduction: </strong>As social robots gain advanced communication capabilities, users increasingly expect coherent verbal and non-verbal behaviours. Recent work has shown that Large Language Models (LLMs) can support autonomous generation of such multimodal behaviours. However, current LLM-based approaches to non-verbal behaviour often involve multi-step reasoning with large, closed-source models-resulting in significant computational overhead and limiting their feasibility in low-resource or privacy-constrained environments.</p><p><strong>Methods: </strong>To address these limitations, we propose a novel method for simultaneous generation of text and gestures with minimal computational overhead compared to plain text generation. Our system does not produce low-level joint trajectories, but instead predicts high-level communicative intentions, which are mapped to platform-specific expressions. Central to our approach is the introduction of lightweight, robot-specific \"gesture heads\" derived from the LLM's architecture, requiring no pose-based datasets and enabling generalisability across platforms.</p><p><strong>Results: </strong>We evaluate our method on two distinct robot platforms: Furhat (facial expressions) and Pepper (bodily gestures). Experimental results demonstrate that our method maintains behavioural quality while introducing negligible computational and memory overhead. Furthermore, the gesture heads operate in parallel with the language generation component, ensuring scalability and responsiveness even on small or locally deployed models.</p><p><strong>Discussion: </strong>Our approach supports the use of Small Language Models for multimodal generation, offering an effective alternative to existing high-resource methods. By abstracting gesture generation and eliminating reliance on platform-specific motion data, we enable broader applicability in real-world, low-resource, and privacy-sensitive HRI settings.</p>","PeriodicalId":47597,"journal":{"name":"Frontiers in Robotics and AI","volume":"12 ","pages":"1581024"},"PeriodicalIF":2.9000,"publicationDate":"2025-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12122315/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Frontiers in Robotics and AI","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.3389/frobt.2025.1581024","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2025/1/1 0:00:00","PubModel":"eCollection","JCR":"Q2","JCRName":"ROBOTICS","Score":null,"Total":0}
Citations: 0
Abstract
Introduction: As social robots gain advanced communication capabilities, users increasingly expect coherent verbal and non-verbal behaviours. Recent work has shown that Large Language Models (LLMs) can support autonomous generation of such multimodal behaviours. However, current LLM-based approaches to non-verbal behaviour often involve multi-step reasoning with large, closed-source models, resulting in significant computational overhead and limiting their feasibility in low-resource or privacy-constrained environments.
Methods: To address these limitations, we propose a novel method for simultaneous generation of text and gestures with minimal computational overhead compared to plain text generation. Our system does not produce low-level joint trajectories, but instead predicts high-level communicative intentions, which are mapped to platform-specific expressions. Central to our approach is the introduction of lightweight, robot-specific "gesture heads" derived from the LLM's architecture, requiring no pose-based datasets and enabling generalisability across platforms.
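The abstract gives no implementation detail beyond this description, but the core idea can be illustrated with a minimal sketch: a small classification head that reuses the LLM's hidden states to predict a high-level communicative intention rather than joint trajectories. The hidden size, intent labels, pooling strategy, and module structure below are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

# Hypothetical inventory of high-level communicative intentions; the paper's
# actual label set is not given in the abstract.
INTENTS = ["neutral", "emphasis", "agreement", "question", "joy", "concern"]

class GestureHead(nn.Module):
    """Small classifier attached to a (frozen) LLM's hidden states.

    It predicts a high-level communicative intention instead of low-level
    joint trajectories, so no pose-based dataset is required.
    """
    def __init__(self, hidden_size: int, num_intents: int = len(INTENTS)):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(hidden_size, hidden_size // 4),
            nn.GELU(),
            nn.Linear(hidden_size // 4, num_intents),
        )

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_size) from the base LLM.
        # Pool the last token's state for the current chunk of generated text.
        return self.proj(hidden_states[:, -1, :])  # (batch, num_intents)
```

Because the head consumes hidden states the LLM computes anyway, the added cost is a single small projection per generated chunk, which is consistent with the abstract's claim of minimal overhead relative to plain text generation.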
Results: We evaluate our method on two distinct robot platforms: Furhat (facial expressions) and Pepper (bodily gestures). Experimental results demonstrate that our method maintains behavioural quality while introducing negligible computational and memory overhead. Furthermore, the gesture heads operate in parallel with the language generation component, ensuring scalability and responsiveness even on small or locally deployed models.
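The abstract does not specify how predicted intentions are mapped to each platform, but a plausible sketch is a per-robot lookup from intent labels to expression or animation tags. The labels and tag names below are placeholders chosen for illustration, not the mappings used in the paper.

```python
# Illustrative intent-to-behaviour tables for the two evaluated platforms.
# Real deployments would drive the Furhat and Pepper SDKs; the entries here
# are placeholder identifiers.
FURHAT_EXPRESSIONS = {
    "emphasis": "BrowRaise",
    "agreement": "Nod",
    "joy": "BigSmile",
    "concern": "BrowFrown",
    "question": "HeadTilt",
    "neutral": None,
}

PEPPER_GESTURES = {
    "emphasis": "gesture_explain",
    "agreement": "gesture_nod",
    "joy": "gesture_enthusiastic",
    "concern": "gesture_thinking",
    "question": "gesture_shrug",
    "neutral": None,
}

def behaviour_for(intent: str, platform: str) -> str | None:
    """Resolve a high-level intent to a platform-specific expression tag."""
    table = FURHAT_EXPRESSIONS if platform == "furhat" else PEPPER_GESTURES
    return table.get(intent)
```

Keeping the mapping outside the model is what allows the same gesture head to serve different robots: only the lookup table changes per platform, which matches the abstract's claim of generalisability without platform-specific motion data.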
Discussion: Our approach supports the use of Small Language Models for multimodal generation, offering an effective alternative to existing high-resource methods. By abstracting gesture generation and eliminating reliance on platform-specific motion data, we enable broader applicability in real-world, low-resource, and privacy-sensitive HRI settings.
About the journal
Frontiers in Robotics and AI publishes rigorously peer-reviewed research covering all theory and applications of robotics, technology, and artificial intelligence, from biomedical to space robotics.