Applying a Transformer-based machine-learning model to classify caregiver and infant behaviours during dyadic interactions

Alexander Turner, Aly Magassouba, Sobanawartiny Wijeakumar

Infant Behavior & Development, Volume 82 (2026), Article 102175. DOI: 10.1016/j.infbeh.2025.102175
Abstract
Multimodal caregiver-infant interactions have both concurrent and long-term impacts on child attention, cognitive and social skills. These behaviours are typically coded manually by human researchers, an approach that is susceptible to observer bias, dependent on inter-rater reliability, and demanding of substantial time and resources. In this study, we aimed to develop a multimodal machine-learning model capable of automatically detecting and classifying multimodal behaviours from video recordings of caregivers and their infants (N = 81; infant mean age = 251.3 ± 34.9 days) engaging with objects. We focused on caregiver scaffolding, caregiver intrusiveness, infant object engagement and infant distractibility. Low-level features from audio, video, and pose data were extracted using specific AI models and input into a Transformer-based architecture capable of learning temporal patterns across modalities. Our findings revealed a significant contrast in model performance depending on how the data were partitioned. Following previous research, when the dataset was split such that data from all dyads contributed to the training, validation, and test sets, the models achieved notably high classification accuracy of over 98 %. However, when tested on entirely unseen dyads, performance dropped markedly to around 55 %. These results suggest that the models did not learn the behaviours of interest but instead relied on video-specific or dyad-specific details, underscoring key generalizability challenges in applying Transformer-based models to complex, multimodal behavioural data. Nonetheless, this work lays a foundation for future research aimed at refining these models and extending their applicability across diverse caregiving contexts.
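The contrast the abstract draws hinges on how dyads are allocated across splits. A minimal sketch of the two partitioning schemes, using scikit-learn's GroupShuffleSplit, is shown below; the feature matrix X, labels y, and dyad_ids are hypothetical placeholders, not the paper's actual data or pipeline. Grouping by dyad guarantees that no dyad contributes clips to both training and test sets, which is the held-out-dyad condition under which accuracy fell to around 55 %.

```python
# Hypothetical illustration of the two data-partitioning schemes the abstract
# contrasts. X, y, and dyad_ids are synthetic placeholders.
import numpy as np
from sklearn.model_selection import GroupShuffleSplit, train_test_split

rng = np.random.default_rng(0)
n_clips, n_features, n_dyads = 1000, 64, 81
X = rng.normal(size=(n_clips, n_features))          # per-clip multimodal features
y = rng.integers(0, 4, size=n_clips)                # four behaviour classes
dyad_ids = rng.integers(0, n_dyads, size=n_clips)   # which dyad each clip came from

# Scheme 1: clips are shuffled freely, so every dyad can appear in both
# train and test (the setting in which accuracy exceeded 98 %).
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Scheme 2: whole dyads are held out, so the test set contains only
# unseen dyads (the setting in which accuracy dropped to ~55 %).
gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(gss.split(X, y, groups=dyad_ids))
assert set(dyad_ids[train_idx]).isdisjoint(dyad_ids[test_idx])
```

Under Scheme 1 the model can exploit dyad-specific or video-specific regularities shared between splits; Scheme 2 is the stricter test of whether the behaviours themselves were learned.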
About the journal
Infant Behavior & Development publishes empirical (fundamental and clinical), theoretical, methodological and review papers. Brief reports dealing with behavioral development during infancy (up to 3 years) will also be considered. Papers of an inter- and multidisciplinary nature, for example neuroscience, non-linear dynamics and modelling approaches, are particularly encouraged. Areas covered by the journal include cognitive development, emotional development, perception, perception-action coupling, motor development and socialisation.