Rajamanickam Yuvaraj, Jack S Fogarty, Ratnavel Rajalakshmi, Ritika Sarkar
Neurocomputing, vol. 657, Article 131694. Published 2025-10-03. DOI: 10.1016/j.neucom.2025.131694. URL: https://www.sciencedirect.com/science/article/pii/S0925231225023665
Classroom activity recognition using hybrid 3D-CNNs and visualization of action features with Grad-CAM
In the era of advanced computer vision technology, it is possible to use automatic methods to detect and classify student and teacher activities in classroom environments, providing novel approaches to study or evaluate the quality of teaching or learning. However, to date, there has been little research developing and testing these methods to work towards an optimal activity recognition system. This paper proposes an automated framework using a 3D-convolutional neural network (CNN) to recognize classroom activities, including teacher and student behaviors, from classroom videos. The 3D-CNN captured spatiotemporal features from the video data. Then, an extreme learning machine (ELM) classifier was trained over the 3D-CNN features to recognize different activities in the classroom. Multi-layer perceptron (MLP) and support vector machine (SVM) classifiers were also examined in comparison to ELM. Gradient-weighted class activation mapping (Grad-CAM) was employed to provide visual explanations of what information the highest performing model learned from videos to classify classroom activities. To evaluate each model, classifications were carried out on the EduNet dataset, containing annotated classroom activities featuring students and teachers. Classroom videos from the internet were also utilized to further evaluate the performance of the proposed frameworks. The proposed 3D-CNN+ELM model achieved a maximum average recognition accuracy of 88.17 % on EduNet, as estimated by 5-fold cross-validation, which is 5.87 % higher than the standard baseline I3D-ResNet-50 model proposed by the EduNet authors. The model also achieved an accuracy of 80.00 % when applied to an independent dataset of videos sourced from the internet, indicating reasonable reliability and generalizability. The Grad-CAM outcomes indicate that the model focuses on valid features to determine its recognition; however, in some cases, the recognition can still be incorrect. 
With its high level of performance, the proposed automated framework may assist in providing information on a range of classroom actions, which may offer preliminary insights to support the evaluation of classroom teaching and learning in real-world educational environments.
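The classification stage described above, an ELM trained on fixed feature vectors and scored with 5-fold cross-validation, can be sketched as follows. This is a minimal illustration and not the authors' implementation: the names `ELMClassifier`, `make_toy_features`, and `cross_val_accuracy` are hypothetical, the tanh activation, hidden-layer size, and ridge term are assumptions, and synthetic Gaussian clusters stand in for the 3D-CNN features, which are not available here.

```python
import numpy as np


class ELMClassifier:
    """Minimal extreme learning machine: random hidden layer, closed-form output weights."""

    def __init__(self, n_hidden=64, reg=1e-2, seed=0):
        self.n_hidden = n_hidden  # assumed hidden size, not taken from the paper
        self.reg = reg            # assumed ridge term for numerical stability
        self.rng = np.random.default_rng(seed)

    def _hidden(self, X):
        # Fixed random projection followed by a tanh non-linearity.
        return np.tanh(X @ self.W + self.b)

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y)
        self.classes_ = np.unique(y)
        # One-hot targets, one column per class.
        T = (y[:, None] == self.classes_[None, :]).astype(float)
        # Hidden weights are drawn once and never trained.
        dim = X.shape[1]
        self.W = self.rng.normal(size=(dim, self.n_hidden)) / np.sqrt(dim)
        self.b = self.rng.normal(size=self.n_hidden)
        H = self._hidden(X)
        # Output weights from regularized least squares: (H^T H + reg I) beta = H^T T.
        A = H.T @ H + self.reg * np.eye(self.n_hidden)
        self.beta = np.linalg.solve(A, H.T @ T)
        return self

    def predict(self, X):
        H = self._hidden(np.asarray(X, dtype=float))
        return self.classes_[np.argmax(H @ self.beta, axis=1)]


def make_toy_features(n_per_class=60, n_classes=3, dim=16, seed=0):
    """Synthetic stand-in for extracted video features (an assumption for this sketch)."""
    rng = np.random.default_rng(seed)
    centers = rng.normal(scale=3.0, size=(n_classes, dim))
    y = np.repeat(np.arange(n_classes), n_per_class)
    X = centers[y] + rng.normal(size=(len(y), dim))
    return X, y


def cross_val_accuracy(X, y, k=5, seed=0):
    """Mean accuracy over k shuffled, non-overlapping folds."""
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, k)
    scores = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = ELMClassifier(seed=seed).fit(X[train], y[train])
        scores.append(np.mean(model.predict(X[test]) == y[test]))
    return float(np.mean(scores))


if __name__ == "__main__":
    X, y = make_toy_features()
    print(f"5-fold accuracy on toy features: {cross_val_accuracy(X, y):.3f}")
```

Because only the output weights are fitted, and in closed form, training reduces to a single linear solve. The accuracy printed here reflects the toy data only and says nothing about the results reported for EduNet.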
About the journal:
Neurocomputing publishes articles describing recent fundamental contributions in the field of neurocomputing. Neurocomputing theory, practice and applications are the essential topics being covered.