Applying a Transformer-based machine-learning model to classify caregiver and infant behaviours during dyadic interactions

Alexander Turner, Aly Magassouba, Sobanawartiny Wijeakumar

Infant Behavior & Development, Volume 82 (2026), Article 102175. DOI: 10.1016/j.infbeh.2025.102175
Abstract
Multimodal caregiver-infant interactions have both concurrent and long-term impacts on child attention, cognitive and social skills. These behaviours are typically coded manually by human researchers, an approach that is susceptible to observer bias, dependent on inter-rater reliability, and demanding of substantial time and resources. In this study, we aimed to develop a multimodal machine-learning model capable of automatically detecting and classifying multimodal behaviours from video recordings of caregivers and their infants (N = 81; infant mean age = 251.3 ± 34.9 days) engaging with objects. We focused on caregiver scaffolding, caregiver intrusiveness, infant object engagement and infant distractibility. Low-level features from audio, video, and pose data were extracted using specific AI models and input into a Transformer-based architecture capable of learning temporal patterns across modalities. Our findings revealed a significant contrast in model performance depending on how the data were partitioned. Following previous research, when the dataset was split such that data from all dyads contributed to the training, validation, and test sets, the models achieved notably high classification accuracy of over 98 %. However, when tested on entirely unseen dyads, performance dropped markedly to around 55 %. These results suggest that the models did not learn the behaviours of interest but instead relied on video-specific or dyad-specific details, underscoring key generalizability challenges in applying Transformer-based models to complex, multimodal behavioural data. Nonetheless, this work lays a foundation for future research aimed at refining these models and extending their applicability across diverse caregiving contexts.
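The contrast the abstract draws hinges on how dyads are allocated across splits. A minimal sketch of the two partitioning schemes, using scikit-learn's GroupShuffleSplit, is shown below; the feature matrix X, labels y, and dyad_ids are hypothetical placeholders, not the paper's actual data or pipeline. Grouping by dyad guarantees that no dyad contributes clips to both training and test sets, which is the held-out-dyad condition under which accuracy fell to around 55 %.

```python
# Hypothetical illustration of the two data-partitioning schemes the abstract
# contrasts. X, y, and dyad_ids are synthetic placeholders.
import numpy as np
from sklearn.model_selection import GroupShuffleSplit, train_test_split

rng = np.random.default_rng(0)
n_clips, n_features, n_dyads = 1000, 64, 81
X = rng.normal(size=(n_clips, n_features))          # per-clip multimodal features
y = rng.integers(0, 4, size=n_clips)                # four behaviour classes
dyad_ids = rng.integers(0, n_dyads, size=n_clips)   # which dyad each clip came from

# Scheme 1: clips are shuffled freely, so every dyad can appear in both
# train and test (the setting in which accuracy exceeded 98 %).
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Scheme 2: whole dyads are held out, so the test set contains only
# unseen dyads (the setting in which accuracy dropped to ~55 %).
gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(gss.split(X, y, groups=dyad_ids))
assert set(dyad_ids[train_idx]).isdisjoint(dyad_ids[test_idx])
```

Under Scheme 1 the model can exploit dyad-specific or video-specific regularities shared between splits; Scheme 2 is the stricter test of whether the behaviours themselves were learned.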
About the journal
Infant Behavior & Development publishes empirical (fundamental and clinical), theoretical, methodological and review papers. Brief reports dealing with behavioral development during infancy (up to 3 years) will also be considered. Papers of an inter- and multidisciplinary nature, for example neuroscience, non-linear dynamics and modelling approaches, are particularly encouraged. Areas covered by the journal include cognitive development, emotional development, perception, perception-action coupling, motor development and socialisation.