Classroom activity recognition using hybrid 3D-CNNs and visualization of action features with Grad-CAM

IF 6.5 · Zone 2 (Computer Science) · Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Neurocomputing Pub Date : 2025-10-03 DOI:10.1016/j.neucom.2025.131694

Rajamanickam Yuvaraj , Jack S Fogarty , Ratnavel Rajalakshmi , Ritika Sarkar

Neurocomputing, Volume 657, Article 131694. Journal Article. https://www.sciencedirect.com/science/article/pii/S0925231225023665

Citations: 0

Abstract

In the era of advanced computer vision technology, it is possible to use automatic methods to detect and classify student and teacher activities in classroom environments, providing novel approaches to study or evaluate the quality of teaching or learning. However, to date, there has been little research developing and testing these methods to work towards an optimal activity recognition system. This paper proposes an automated framework using a 3D-convolutional neural network (CNN) to recognize classroom activities, including teacher and student behaviors, from classroom videos. The 3D-CNN captured spatiotemporal features from the video data. Then, an extreme learning machine (ELM) classifier was trained over the 3D-CNN features to recognize different activities in the classroom. Multi-layer perceptron (MLP) and support vector machine (SVM) classifiers were also examined in comparison to ELM. Gradient-weighted class activation mapping (Grad-CAM) was employed to provide visual explanations of what information the highest performing model learned from videos to classify classroom activities. To evaluate each model, classifications were carried out on the EduNet dataset, containing annotated classroom activities featuring students and teachers. Classroom videos from the internet were also utilized to further evaluate the performance of the proposed frameworks. The proposed 3D-CNN+ELM model achieved a maximum average recognition accuracy of 88.17 % on EduNet, as estimated by 5-fold cross-validation, which is 5.87 % higher than the standard baseline I3D-ResNet-50 model proposed by the EduNet authors. The model also achieved an accuracy of 80.00 % when applied to an independent dataset of videos sourced from the internet, indicating reasonable reliability and generalizability. The Grad-CAM outcomes indicate that the model focuses on valid features to determine its recognition; however, in some cases, the recognition can still be incorrect. 
With its high level of performance, the proposed automated framework may assist in providing information on a range of classroom actions, which may offer preliminary insights to support the evaluation of classroom teaching and learning in real-world educational environments.
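
The classification stage described above (an ELM trained on fixed 3D-CNN features and scored by 5-fold cross-validation) can be illustrated with a minimal sketch. The abstract does not give the ELM configuration, so everything below is an assumption: the class name `ELMClassifier`, the helper `cross_val_accuracy`, the tanh activation, the ridge term `reg`, and the hidden-layer size are all hypothetical, and randomly generated vectors stand in for the 3D-CNN features. It shows the generic ELM idea only: hidden weights are drawn at random and left untrained, and the output weights are obtained in closed form by regularized least squares.

```python
import numpy as np


class ELMClassifier:
    """Generic single-hidden-layer ELM (illustrative; not the paper's exact model)."""

    def __init__(self, n_hidden=256, reg=1e-2, seed=0):
        self.n_hidden = n_hidden
        self.reg = reg  # ridge term, an assumption for numerical stability
        self.rng = np.random.default_rng(seed)

    def _hidden(self, X):
        # Random projection followed by a fixed nonlinearity.
        return np.tanh(X @ self.W + self.b)

    def fit(self, X, y):
        n_features = X.shape[1]
        self.classes_ = np.unique(y)
        # Input weights and biases are drawn once and never trained.
        self.W = self.rng.normal(scale=1.0 / np.sqrt(n_features),
                                 size=(n_features, self.n_hidden))
        self.b = self.rng.normal(size=self.n_hidden)
        H = self._hidden(X)
        # One-hot encode the class labels as regression targets.
        T = (y[:, None] == self.classes_[None, :]).astype(float)
        # Closed-form output weights: beta = (H^T H + reg * I)^-1 H^T T
        A = H.T @ H + self.reg * np.eye(self.n_hidden)
        self.beta = np.linalg.solve(A, H.T @ T)
        return self

    def predict(self, X):
        scores = self._hidden(X) @ self.beta
        return self.classes_[np.argmax(scores, axis=1)]


def cross_val_accuracy(X, y, k=5, seed=0, **elm_kwargs):
    """Mean accuracy over k shuffled folds (plain, non-stratified split)."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    scores = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = ELMClassifier(seed=seed, **elm_kwargs).fit(X[train], y[train])
        scores.append(np.mean(model.predict(X[test]) == y[test]))
    return float(np.mean(scores))


# Synthetic stand-in for 3D-CNN feature vectors: 4 well-separated classes.
rng = np.random.default_rng(42)
n_classes, per_class, dim = 4, 50, 32
centers = rng.normal(scale=5.0, size=(n_classes, dim))
X = np.vstack([c + rng.normal(size=(per_class, dim)) for c in centers])
y = np.repeat(np.arange(n_classes), per_class)

mean_acc = cross_val_accuracy(X, y, k=5, n_hidden=64)
print(f"mean 5-fold accuracy on synthetic features: {mean_acc:.3f}")
```

Because only the output weights are solved for, training reduces to one linear solve per fold, which is the usual argument for pairing an ELM with a fixed feature extractor. The accuracy printed here reflects the synthetic data only and says nothing about the EduNet results reported above.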




Source journal: Neurocomputing (Engineering & Technology, Computer Science: Artificial Intelligence)

CiteScore: 13.10
Self-citation rate: 10.00%
Articles published: 1382
Review time: 70 days

Journal description: Neurocomputing publishes articles describing recent fundamental contributions in the field of neurocomputing. Neurocomputing theory, practice and applications are the essential topics being covered.