Multi-Level Feature Fusion in CNN-Based Human Action Recognition: A Case Study on EfficientNet-B7.

IF 2.7 · Q3 · Imaging Science & Photographic Technology
Pitiwat Lueangwitchajaroen, Sitapa Watcharapinchai, Worawit Tepsan, Sorn Sooksatra
{"title":"Multi-Level Feature Fusion in CNN-Based Human Action Recognition: A Case Study on EfficientNet-B7.","authors":"Pitiwat Lueangwitchajaroen, Sitapa Watcharapinchai, Worawit Tepsan, Sorn Sooksatra","doi":"10.3390/jimaging10120320","DOIUrl":null,"url":null,"abstract":"<p><p>Accurate human action recognition is becoming increasingly important across various fields, including healthcare and self-driving cars. A simple approach to enhance model performance is incorporating additional data modalities, such as depth frames, point clouds, and skeleton information, while previous studies have predominantly used late fusion techniques to combine these modalities, our research introduces a multi-level fusion approach that combines information at early, intermediate, and late stages together. Furthermore, recognizing the challenges of collecting multiple data types in real-world applications, our approach seeks to exploit multimodal techniques while relying solely on RGB frames as the single data source. In our work, we used RGB frames from the NTU RGB+D dataset as the sole data source. From these frames, we extracted 2D skeleton coordinates and optical flow frames using pre-trained models. We evaluated our multi-level fusion approach with EfficientNet-B7 as a case study, and our methods demonstrated significant improvement, achieving 91.5% in NTU RGB+D 60 dataset accuracy compared to single-modality and single-view models. Despite their simplicity, our methods are also comparable to other state-of-the-art approaches.</p>","PeriodicalId":37035,"journal":{"name":"Journal of Imaging","volume":"10 12","pages":""},"PeriodicalIF":2.7000,"publicationDate":"2024-12-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11677249/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Imaging","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.3390/jimaging10120320","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"IMAGING SCIENCE & PHOTOGRAPHIC TECHNOLOGY","Score":null,"Total":0}
引用次数: 0

Abstract

Accurate human action recognition is becoming increasingly important across various fields, including healthcare and self-driving cars. A simple way to enhance model performance is to incorporate additional data modalities, such as depth frames, point clouds, and skeleton information. While previous studies have predominantly used late fusion techniques to combine these modalities, our research introduces a multi-level fusion approach that combines information at the early, intermediate, and late stages. Furthermore, recognizing the challenges of collecting multiple data types in real-world applications, our approach exploits multimodal techniques while relying solely on RGB frames as the single data source. In our work, we used RGB frames from the NTU RGB+D dataset as the sole data source. From these frames, we extracted 2D skeleton coordinates and optical flow frames using pre-trained models. We evaluated our multi-level fusion approach with EfficientNet-B7 as a case study, and our methods demonstrated significant improvement, achieving 91.5% accuracy on the NTU RGB+D 60 dataset compared to single-modality and single-view models. Despite their simplicity, our methods are also comparable to other state-of-the-art approaches.
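To make the three fusion levels concrete, below is a minimal PyTorch sketch, not the authors' code. `SmallBackbone` is an illustrative stand-in for the per-stream EfficientNet-B7, and the 2-channel optical-flow and 1-channel skeleton-heatmap inputs, stream names, and embedding sizes are assumptions about how the RGB-derived modalities might be encoded.

```python
# Sketch of multi-level (early + intermediate + late) fusion over three
# RGB-derived streams: raw frames, optical flow, and 2D-skeleton heatmaps.
# All channel counts and dimensions here are illustrative assumptions.
import torch
import torch.nn as nn

class SmallBackbone(nn.Module):
    """Stand-in CNN; the paper uses EfficientNet-B7 per stream."""
    def __init__(self, in_ch):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(in_ch, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 128)
        )

    def forward(self, x):
        feat = self.stem(x)   # intermediate feature map (used for mid fusion)
        emb = self.head(feat) # per-stream embedding (used for late fusion)
        return feat, emb

class MultiLevelFusion(nn.Module):
    def __init__(self, num_classes=60):  # NTU RGB+D 60 has 60 classes
        super().__init__()
        # Early fusion: one backbone sees RGB (3) + flow (2) + skeleton (1)
        # stacked channel-wise into a single 6-channel input.
        self.early = SmallBackbone(in_ch=6)
        # Separate per-stream backbones feed the intermediate and late stages.
        self.rgb, self.flow, self.skel = (
            SmallBackbone(3), SmallBackbone(2), SmallBackbone(1))
        # Intermediate fusion: concatenated feature maps -> joint embedding.
        self.mid_head = nn.Sequential(
            nn.Conv2d(64 * 3, 64, 1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 128)
        )
        # Late fusion: concatenate all five embeddings, then classify.
        self.classifier = nn.Linear(128 * 5, num_classes)

    def forward(self, rgb, flow, skel):
        _, e_early = self.early(torch.cat([rgb, flow, skel], dim=1))
        f_r, e_r = self.rgb(rgb)
        f_f, e_f = self.flow(flow)
        f_s, e_s = self.skel(skel)
        e_mid = self.mid_head(torch.cat([f_r, f_f, f_s], dim=1))
        return self.classifier(torch.cat([e_early, e_r, e_f, e_s, e_mid], dim=1))

# Usage: a batch of 2 frames, each stream at 224x224 resolution.
model = MultiLevelFusion()
logits = model(torch.randn(2, 3, 224, 224),
               torch.randn(2, 2, 224, 224),
               torch.randn(2, 1, 224, 224))
print(logits.shape)  # torch.Size([2, 60])
```

The key design point the sketch illustrates is that fusion happens three times: at the input (channel stacking), at the feature-map level (concatenation before pooling), and at the decision level (joint classification over all embeddings), rather than only once at the end as in conventional late fusion.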

Source journal
Journal of Imaging (Medicine: Radiology, Nuclear Medicine and Imaging)
CiteScore: 5.90
Self-citation rate: 6.20%
Articles published per year: 303
Review time: 7 weeks