Yueran Pan;Biyuan Chen;Wenxing Liu;Ming Cheng;Dong Zhang;Hongzhu Deng;Xiaobing Zou;Ming Li
DOI: 10.1109/TCSS.2025.3563733
Published: 2025-06-04 in IEEE Transactions on Computational Social Systems, vol. 12, no. 5, pp. 3647-3659 (impact factor 4.5; JCR Q1, Computer Science, Cybernetics)
Assessing the Expressive Language Levels of Autistic Children in Home Intervention
The World Health Organization (WHO) has established the caregiver skill training (CST) program, designed to equip families of children diagnosed with autism spectrum disorder with essential caregiving skills. The joint engagement rating inventory (JERI) protocol evaluates participants’ engagement levels within the CST initiative. Traditionally, rating the expressive language level and use (EXLA) item in JERI relies on retrospective video analysis conducted by qualified professionals, incurring substantial labor costs. This study introduces a multimodal behavioral signal-processing framework that automatically analyzes both child and caregiver behaviors to rate EXLA. First, raw audio and video signals are segmented into short intervals via voice activity detection, speaker diarization, and speaker age classification, serving the dual purpose of eliminating nonspeech content and tagging each segment with its respective speaker. Next, we extract an array of audio-visual features, encompassing our proposed interpretable, hand-crafted textual features, end-to-end audio embeddings, and end-to-end video embeddings. Finally, these features are fused at the feature level to train a linear regression model that predicts EXLA scores. Our framework has been evaluated on the largest in-the-wild database currently available under the CST program. Experimental results indicate that the proposed system achieves a Pearson correlation coefficient of 0.768 against the expert ratings, evidencing promising performance comparable to that of human experts.
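The final stage of the pipeline described above, feature-level fusion of multimodal features followed by linear regression and evaluation against expert ratings with a Pearson correlation coefficient, can be sketched as follows. This is an illustrative sketch only: the feature blocks, dimensions, and scores below are synthetic placeholders, not the authors' data or implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for per-session feature blocks (placeholders for the
# paper's hand-crafted textual features, audio embeddings, video embeddings).
n_sessions = 200
text_feats = rng.normal(size=(n_sessions, 8))
audio_emb = rng.normal(size=(n_sessions, 16))
video_emb = rng.normal(size=(n_sessions, 16))

# Feature-level fusion: concatenate the modalities into one vector per session.
X = np.hstack([text_feats, audio_emb, video_emb])

# Synthetic "expert" EXLA scores with a linear dependence plus noise.
w_true = rng.normal(size=X.shape[1])
y = X @ w_true + 0.5 * rng.normal(size=n_sessions)

# Train/test split, then ordinary least squares with an intercept column.
train, test = slice(0, 150), slice(150, None)
A_train = np.hstack([X[train], np.ones((150, 1))])
coef, *_ = np.linalg.lstsq(A_train, y[train], rcond=None)

A_test = np.hstack([X[test], np.ones((n_sessions - 150, 1))])
y_pred = A_test @ coef

# Evaluate with the Pearson correlation coefficient, as in the paper.
r = np.corrcoef(y_pred, y[test])[0, 1]
print(f"Pearson r = {r:.3f}")
```

With synthetic data built from a true linear model, the correlation is near 1; the paper's reported 0.768 reflects the much harder real-world prediction task.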
Journal Introduction:
IEEE Transactions on Computational Social Systems focuses on topics such as the modeling, simulation, analysis, and understanding of social systems from a quantitative and/or computational perspective. "Systems" include man-man, man-machine, and machine-machine organizations and adversarial situations, as well as social media structures and their dynamics. More specifically, the journal publishes articles on modeling the dynamics of social systems, methodologies for incorporating and representing socio-cultural and behavioral aspects in computational modeling, analysis of social system behavior and structure, and paradigms for social systems modeling and simulation. The journal also features articles on social network dynamics, social intelligence and cognition, social systems design and architectures, socio-cultural modeling and representation, and computational behavior modeling, together with their applications.