{"title":"Simultaneous text and gesture generation for social robots with small language models.","authors":"Alessio Galatolo, Katie Winkle","doi":"10.3389/frobt.2025.1581024","DOIUrl":null,"url":null,"abstract":"<p><strong>Introduction: </strong>As social robots gain advanced communication capabilities, users increasingly expect coherent verbal and non-verbal behaviours. Recent work has shown that Large Language Models (LLMs) can support autonomous generation of such multimodal behaviours. However, current LLM-based approaches to non-verbal behaviour often involve multi-step reasoning with large, closed-source models-resulting in significant computational overhead and limiting their feasibility in low-resource or privacy-constrained environments.</p><p><strong>Methods: </strong>To address these limitations, we propose a novel method for simultaneous generation of text and gestures with minimal computational overhead compared to plain text generation. Our system does not produce low-level joint trajectories, but instead predicts high-level communicative intentions, which are mapped to platform-specific expressions. Central to our approach is the introduction of lightweight, robot-specific \"gesture heads\" derived from the LLM's architecture, requiring no pose-based datasets and enabling generalisability across platforms.</p><p><strong>Results: </strong>We evaluate our method on two distinct robot platforms: Furhat (facial expressions) and Pepper (bodily gestures). Experimental results demonstrate that our method maintains behavioural quality while introducing negligible computational and memory overhead. Furthermore, the gesture heads operate in parallel with the language generation component, ensuring scalability and responsiveness even on small or locally deployed models.</p><p><strong>Discussion: </strong>Our approach supports the use of Small Language Models for multimodal generation, offering an effective alternative to existing high-resource methods. By abstracting gesture generation and eliminating reliance on platform-specific motion data, we enable broader applicability in real-world, low-resource, and privacy-sensitive HRI settings.</p>","PeriodicalId":47597,"journal":{"name":"Frontiers in Robotics and AI","volume":"12 ","pages":"1581024"},"PeriodicalIF":2.9000,"publicationDate":"2025-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12122315/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Frontiers in Robotics and AI","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.3389/frobt.2025.1581024","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2025/1/1 0:00:00","PubModel":"eCollection","JCR":"Q2","JCRName":"ROBOTICS","Score":null,"Total":0}
Citations: 0
Abstract
Introduction: As social robots gain advanced communication capabilities, users increasingly expect coherent verbal and non-verbal behaviours. Recent work has shown that Large Language Models (LLMs) can support autonomous generation of such multimodal behaviours. However, current LLM-based approaches to non-verbal behaviour often involve multi-step reasoning with large, closed-source models, resulting in significant computational overhead and limiting their feasibility in low-resource or privacy-constrained environments.
Methods: To address these limitations, we propose a novel method for simultaneous generation of text and gestures with minimal computational overhead compared to plain text generation. Our system does not produce low-level joint trajectories, but instead predicts high-level communicative intentions, which are mapped to platform-specific expressions. Central to our approach is the introduction of lightweight, robot-specific "gesture heads" derived from the LLM's architecture, requiring no pose-based datasets and enabling generalisability across platforms.
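The abstract gives no implementation detail beyond this description, but the core idea can be illustrated with a minimal sketch: a small classification head that reuses the LLM's hidden states to predict a high-level communicative intention rather than joint trajectories. The hidden size, intent labels, pooling strategy, and module structure below are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

# Hypothetical inventory of high-level communicative intentions; the paper's
# actual label set is not given in the abstract.
INTENTS = ["neutral", "emphasis", "agreement", "question", "joy", "concern"]

class GestureHead(nn.Module):
    """Small classifier attached to a (frozen) LLM's hidden states.

    It predicts a high-level communicative intention instead of low-level
    joint trajectories, so no pose-based dataset is required.
    """
    def __init__(self, hidden_size: int, num_intents: int = len(INTENTS)):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(hidden_size, hidden_size // 4),
            nn.GELU(),
            nn.Linear(hidden_size // 4, num_intents),
        )

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_size) from the base LLM.
        # Pool the last token's state for the current chunk of generated text.
        return self.proj(hidden_states[:, -1, :])  # (batch, num_intents)
```

Because the head consumes hidden states the LLM computes anyway, the added cost is a single small projection per generated chunk, which is consistent with the abstract's claim of minimal overhead relative to plain text generation.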
Results: We evaluate our method on two distinct robot platforms: Furhat (facial expressions) and Pepper (bodily gestures). Experimental results demonstrate that our method maintains behavioural quality while introducing negligible computational and memory overhead. Furthermore, the gesture heads operate in parallel with the language generation component, ensuring scalability and responsiveness even on small or locally deployed models.
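The abstract does not specify how predicted intentions are mapped to each platform, but a plausible sketch is a per-robot lookup from intent labels to expression or animation tags. The labels and tag names below are placeholders chosen for illustration, not the mappings used in the paper.

```python
# Illustrative intent-to-behaviour tables for the two evaluated platforms.
# Real deployments would drive the Furhat and Pepper SDKs; the entries here
# are placeholder identifiers.
FURHAT_EXPRESSIONS = {
    "emphasis": "BrowRaise",
    "agreement": "Nod",
    "joy": "BigSmile",
    "concern": "BrowFrown",
    "question": "HeadTilt",
    "neutral": None,
}

PEPPER_GESTURES = {
    "emphasis": "gesture_explain",
    "agreement": "gesture_nod",
    "joy": "gesture_enthusiastic",
    "concern": "gesture_thinking",
    "question": "gesture_shrug",
    "neutral": None,
}

def behaviour_for(intent: str, platform: str) -> str | None:
    """Resolve a high-level intent to a platform-specific expression tag."""
    table = FURHAT_EXPRESSIONS if platform == "furhat" else PEPPER_GESTURES
    return table.get(intent)
```

Keeping the mapping outside the model is what allows the same gesture head to serve different robots: only the lookup table changes per platform, which matches the abstract's claim of generalisability without platform-specific motion data.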
Discussion: Our approach supports the use of Small Language Models for multimodal generation, offering an effective alternative to existing high-resource methods. By abstracting gesture generation and eliminating reliance on platform-specific motion data, we enable broader applicability in real-world, low-resource, and privacy-sensitive HRI settings.
About the journal
Frontiers in Robotics and AI publishes rigorously peer-reviewed research covering all theory and applications of robotics, technology, and artificial intelligence, from biomedical to space robotics.