Cheng Charles Ma, Kevin Hyekang Joo, Alexandria K. Vail, Sunreeta Bhattacharya, Álvaro Fernández García, Kailana Baker-Matsuoka, Sheryl Mathew, Lori L. Holt, Fernando De la Torre
{"title":"Multimodal Fusion with LLMs for Engagement Prediction in Natural Conversation","authors":"Cheng Charles Ma, Kevin Hyekang Joo, Alexandria K. Vail, Sunreeta Bhattacharya, Álvaro Fernández García, Kailana Baker-Matsuoka, Sheryl Mathew, Lori L. Holt, Fernando De la Torre","doi":"arxiv-2409.09135","DOIUrl":null,"url":null,"abstract":"Over the past decade, wearable computing devices (``smart glasses'') have\nundergone remarkable advancements in sensor technology, design, and processing\npower, ushering in a new era of opportunity for high-density human behavior\ndata. Equipped with wearable cameras, these glasses offer a unique opportunity\nto analyze non-verbal behavior in natural settings as individuals interact. Our\nfocus lies in predicting engagement in dyadic interactions by scrutinizing\nverbal and non-verbal cues, aiming to detect signs of disinterest or confusion.\nLeveraging such analyses may revolutionize our understanding of human\ncommunication, foster more effective collaboration in professional\nenvironments, provide better mental health support through empathetic virtual\ninteractions, and enhance accessibility for those with communication barriers. In this work, we collect a dataset featuring 34 participants engaged in\ncasual dyadic conversations, each providing self-reported engagement ratings at\nthe end of each conversation. We introduce a novel fusion strategy using Large\nLanguage Models (LLMs) to integrate multiple behavior modalities into a\n``multimodal transcript'' that can be processed by an LLM for behavioral\nreasoning tasks. Remarkably, this method achieves performance comparable to\nestablished fusion techniques even in its preliminary implementation,\nindicating strong potential for further research and optimization. This fusion\nmethod is one of the first to approach ``reasoning'' about real-world human\nbehavior through a language model. Smart glasses provide us the ability to\nunobtrusively gather high-density multimodal data on human behavior, paving the\nway for new approaches to understanding and improving human communication with\nthe potential for important societal benefits. The features and data collected\nduring the studies will be made publicly available to promote further research.","PeriodicalId":501541,"journal":{"name":"arXiv - CS - Human-Computer Interaction","volume":"49 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Human-Computer Interaction","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.09135","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0
Abstract
Over the past decade, wearable computing devices (``smart glasses'') have
undergone remarkable advancements in sensor technology, design, and processing
power, ushering in a new era of opportunity for high-density human behavior
data. Equipped with wearable cameras, these glasses offer a unique opportunity
to analyze non-verbal behavior in natural settings as individuals interact. Our
focus lies in predicting engagement in dyadic interactions by scrutinizing
verbal and non-verbal cues, aiming to detect signs of disinterest or confusion.
Leveraging such analyses may revolutionize our understanding of human
communication, foster more effective collaboration in professional
environments, provide better mental health support through empathetic virtual
interactions, and enhance accessibility for those with communication barriers. In this work, we collect a dataset featuring 34 participants engaged in
casual dyadic conversations, each providing self-reported engagement ratings at
the end of each conversation. We introduce a novel fusion strategy using Large
Language Models (LLMs) to integrate multiple behavior modalities into a
``multimodal transcript'' that can be processed by an LLM for behavioral
reasoning tasks. Remarkably, this method achieves performance comparable to
established fusion techniques even in its preliminary implementation,
indicating strong potential for further research and optimization. This fusion
method is one of the first to approach ``reasoning'' about real-world human
behavior through a language model. Smart glasses provide us the ability to
unobtrusively gather high-density multimodal data on human behavior, paving the
way for new approaches to understanding and improving human communication with
the potential for important societal benefits. The features and data collected
during the studies will be made publicly available to promote further research.