Evaluating the Performance of Mobile-Convolutional Neural Networks for Spatial and Temporal Human Action Recognition Analysis

Robotics · IF 2.9 · Q2 (Robotics)
Pub Date: 2023-12-08 · DOI: 10.3390/robotics12060167
Stavros N. Moutsis, Konstantinos A. Tsintotas, Ioannis Kansizoglou, Antonios Gasteratos
{"title":"Evaluating the Performance of Mobile-Convolutional Neural Networks for Spatial and Temporal Human Action Recognition Analysis","authors":"Stavros N. Moutsis, Konstantinos A. Tsintotas, Ioannis Kansizoglou, Antonios Gasteratos","doi":"10.3390/robotics12060167","DOIUrl":null,"url":null,"abstract":"Human action recognition is a computer vision task that identifies how a person or a group acts on a video sequence. Various methods that rely on deep-learning techniques, such as two- or three-dimensional convolutional neural networks (2D-CNNs, 3D-CNNs), recurrent neural networks (RNNs), and vision transformers (ViT), have been proposed to address this problem over the years. Motivated by the fact that most of the used CNNs in human action recognition present high complexity, and the necessity of implementations on mobile platforms that are characterized by restricted computational resources, in this article, we conduct an extensive evaluation protocol over the performance metrics of five lightweight architectures. In particular, we examine how these mobile-oriented CNNs (viz., ShuffleNet-v2, EfficientNet-b0, MobileNet-v3, and GhostNet) execute in spatial analysis compared to a recent tiny ViT, namely EVA-02-Ti, and a higher computational model, ResNet-50. Our models, previously trained on ImageNet and BU101, are measured for their classification accuracy on HMDB51, UCF101, and six classes of the NTU dataset. The average and max scores, as well as the voting approaches, are generated through three and fifteen RGB frames of each video, while two different rates for the dropout layers were assessed during the training. Last, a temporal analysis via multiple types of RNNs that employ features extracted by the trained networks is examined. Our results reveal that EfficientNet-b0 and EVA-02-Ti surpass the other mobile-CNNs, achieving comparable or superior performance to ResNet-50.","PeriodicalId":37568,"journal":{"name":"Robotics","volume":"83 24","pages":""},"PeriodicalIF":2.9000,"publicationDate":"2023-12-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Robotics","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.3390/robotics12060167","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"ROBOTICS","Score":null,"Total":0}
Citations: 0

Abstract

Human action recognition is a computer vision task that identifies how a person or a group of people acts in a video sequence. Various methods relying on deep-learning techniques, such as two- and three-dimensional convolutional neural networks (2D-CNNs, 3D-CNNs), recurrent neural networks (RNNs), and vision transformers (ViTs), have been proposed to address this problem over the years. Motivated by the high complexity of most CNNs used for human action recognition, and by the need for implementations on mobile platforms with restricted computational resources, in this article we conduct an extensive evaluation of the performance of five lightweight architectures. In particular, we examine how these mobile-oriented CNNs (viz., ShuffleNet-v2, EfficientNet-b0, MobileNet-v3, and GhostNet) perform in spatial analysis compared to a recent tiny ViT, namely EVA-02-Ti, and a computationally heavier model, ResNet-50. Our models, previously trained on ImageNet and BU101, are measured for their classification accuracy on HMDB51, UCF101, and six classes of the NTU dataset. Average- and max-score fusion, as well as voting, are computed over three and fifteen RGB frames of each video, while two different dropout rates are assessed during training. Finally, a temporal analysis via multiple types of RNNs that operate on features extracted by the trained networks is examined. Our results reveal that EfficientNet-b0 and EVA-02-Ti surpass the other mobile CNNs, achieving performance comparable or superior to ResNet-50.
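To make the three frame-level fusion schemes concrete (average score, max score, and majority voting), here is a minimal sketch of spatial-stream inference. It assumes PyTorch and torchvision, uses MobileNetV3-Small as a stand-in for any of the evaluated backbones, and the `classify_video` helper is hypothetical; the paper's models are fine-tuned on the action datasets, whereas this one is freshly initialized.

```python
import torch
import torch.nn.functional as F
from torchvision.models import mobilenet_v3_small

NUM_CLASSES = 51   # e.g., HMDB51
NUM_FRAMES = 15    # the paper evaluates 3- and 15-frame settings

# Stand-in backbone; in the paper the networks are fine-tuned beforehand.
model = mobilenet_v3_small(num_classes=NUM_CLASSES)
model.eval()

@torch.no_grad()
def classify_video(frames: torch.Tensor) -> dict:
    """frames: (NUM_FRAMES, 3, H, W) preprocessed RGB frames from one video."""
    probs = F.softmax(model(frames), dim=1)               # (NUM_FRAMES, NUM_CLASSES)
    avg_pred = probs.mean(dim=0).argmax().item()          # average-score fusion
    max_pred = probs.max(dim=0).values.argmax().item()    # max-score fusion
    vote_pred = probs.argmax(dim=1).mode().values.item()  # per-frame majority vote
    return {"average": avg_pred, "max": max_pred, "voting": vote_pred}

# Dummy input for illustration; real frames would be resized and normalized first.
print(classify_video(torch.randn(NUM_FRAMES, 3, 224, 224)))
```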
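The temporal analysis reuses the trained CNN as a per-frame feature extractor and feeds the resulting sequence to an RNN. The sketch below, under the same assumptions, uses an LSTM as a representative RNN; the hidden size is illustrative, and the 576-dimensional feature width matches torchvision's pooled MobileNetV3-Small output.

```python
import torch
import torch.nn as nn
from torchvision.models import mobilenet_v3_small

NUM_CLASSES = 51
NUM_FRAMES = 15
FEAT_DIM = 576  # pooled feature width of torchvision's MobileNetV3-Small

# Reuse the spatial backbone as a frame-level feature extractor.
backbone = mobilenet_v3_small()
backbone.classifier = nn.Identity()  # drop the classification head
backbone.eval()

class TemporalHead(nn.Module):
    """LSTM over the per-frame feature sequence; sizes are illustrative."""
    def __init__(self, feat_dim=FEAT_DIM, hidden=256, num_classes=NUM_CLASSES):
        super().__init__()
        self.rnn = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, num_classes)

    def forward(self, feats):         # feats: (B, T, feat_dim)
        _, (h_n, _) = self.rnn(feats)
        return self.fc(h_n[-1])       # classify from the last hidden state

head = TemporalHead()
video = torch.randn(1, NUM_FRAMES, 3, 224, 224)  # one dummy video clip
with torch.no_grad():
    feats = backbone(video.flatten(0, 1)).view(1, NUM_FRAMES, FEAT_DIM)
    print(head(feats).argmax(dim=1))  # predicted action class
```

Swapping `nn.LSTM` for `nn.GRU` (or a bidirectional variant) yields other common RNN choices without changing the surrounding code.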
Source Journal

Robotics (Mathematics - Control and Optimization)
CiteScore: 6.70 · Self-citation rate: 8.10% · Annual publications: 114 · Review time: 11 weeks
Journal Description: Robotics publishes original papers, technical reports, case studies, review papers, and tutorials on all aspects of robotics. Special Issues devoted to important topics in advanced robotics are published from time to time. The journal particularly welcomes emerging methodologies and techniques that bridge theoretical studies and applications and have significant potential for real-world use. It provides a forum for information exchange between professionals, academicians, and engineers working in the area of robotics, helping them to disseminate research findings and to learn from each other's work. Suitable topics include, but are not limited to:
- intelligent robotics, mechatronics, and biomimetics
- novel and biologically-inspired robotics
- modelling, identification and control of robotic systems
- biomedical, rehabilitation and surgical robotics
- exoskeletons, prosthetics and artificial organs
- AI, neural networks and fuzzy logic in robotics
- multimodality human-machine interaction
- wireless sensor networks for robot navigation
- multi-sensor data fusion and SLAM