Deep-learning-based ConvLSTM and LRCN networks for human activity recognition

IF 2.6 · CAS Tier 4 (Computer Science) · JCR Q2 (COMPUTER SCIENCE, INFORMATION SYSTEMS)
Muhammad Hassan Khan, Muhammad Ahtisham Javed, Muhammad Shahid Farid
{"title":"基于深度学习的ConvLSTM和LRCN网络用于人类活动识别","authors":"Muhammad Hassan Khan,&nbsp;Muhammad Ahtisham Javed,&nbsp;Muhammad Shahid Farid","doi":"10.1016/j.jvcir.2025.104469","DOIUrl":null,"url":null,"abstract":"<div><div>Human activity recognition (HAR) has received significant research attention lately due to its numerous applications in automated systems such as human-behavior assessment, visual surveillance, healthcare, and entertainment. The objective of a vision-based HAR system is to understand human behavior in video data and determine the action being performed. This paper presents two end-to-end deep networks for human activity recognition, one based on the Convolutional Long Short Term Memory (ConvLSTM) and the other based on Long-term Recurrent Convolution Network (LRCN). The ConvLSTM (Shi et al., 2015) network exploits convolutions that help to extract spatial features considering their temporal correlations (i.e., spatiotemporal prediction). The LRCN (Donahue et al., 2015) fuses the advantages of simple convolution layers and LSTM layers into a single model to adequately encode the spatiotemporal data. Usually, the CNN and LSTM models are used independently: the CNN is used to separate the spatial information from the frames in the first phase. The characteristics gathered by CNN can later be used by the LSTM model to anticipate the video’s action. Rather than building two separate networks and making the whole process computationally inexpensive, we proposed a single LRCN-based network that binds CNN and LSTM layers together into a single model. Additionally, the TimeDistributed layer was introduced in the network which plays a vital role in the encoding of action videos and achieving the highest recognition accuracy. A side contribution of the paper is the evaluation of different convolutional neural network variants including 2D-CNN, and 3D-CNN, for human action recognition. An extensive experimental evaluation of the proposed deep network is carried out on three large benchmark action datasets: UCF50, HMDB51, and UCF-101 action datasets. The results reveal the effectiveness of the proposed algorithms; particularly, our LRCN-based algorithm outperformed the current state-of-the-art, achieving the highest recognition accuracy of 97.42% on UCF50, 73.63% on HMDB51, and 95.70% UCF101 datasets.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"110 ","pages":"Article 104469"},"PeriodicalIF":2.6000,"publicationDate":"2025-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Deep-learning-based ConvLSTM and LRCN networks for human activity recognition\",\"authors\":\"Muhammad Hassan Khan,&nbsp;Muhammad Ahtisham Javed,&nbsp;Muhammad Shahid Farid\",\"doi\":\"10.1016/j.jvcir.2025.104469\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>Human activity recognition (HAR) has received significant research attention lately due to its numerous applications in automated systems such as human-behavior assessment, visual surveillance, healthcare, and entertainment. The objective of a vision-based HAR system is to understand human behavior in video data and determine the action being performed. This paper presents two end-to-end deep networks for human activity recognition, one based on the Convolutional Long Short Term Memory (ConvLSTM) and the other based on Long-term Recurrent Convolution Network (LRCN). 
The ConvLSTM (Shi et al., 2015) network exploits convolutions that help to extract spatial features considering their temporal correlations (i.e., spatiotemporal prediction). The LRCN (Donahue et al., 2015) fuses the advantages of simple convolution layers and LSTM layers into a single model to adequately encode the spatiotemporal data. Usually, the CNN and LSTM models are used independently: the CNN is used to separate the spatial information from the frames in the first phase. The characteristics gathered by CNN can later be used by the LSTM model to anticipate the video’s action. Rather than building two separate networks and making the whole process computationally inexpensive, we proposed a single LRCN-based network that binds CNN and LSTM layers together into a single model. Additionally, the TimeDistributed layer was introduced in the network which plays a vital role in the encoding of action videos and achieving the highest recognition accuracy. A side contribution of the paper is the evaluation of different convolutional neural network variants including 2D-CNN, and 3D-CNN, for human action recognition. An extensive experimental evaluation of the proposed deep network is carried out on three large benchmark action datasets: UCF50, HMDB51, and UCF-101 action datasets. The results reveal the effectiveness of the proposed algorithms; particularly, our LRCN-based algorithm outperformed the current state-of-the-art, achieving the highest recognition accuracy of 97.42% on UCF50, 73.63% on HMDB51, and 95.70% UCF101 datasets.</div></div>\",\"PeriodicalId\":54755,\"journal\":{\"name\":\"Journal of Visual Communication and Image Representation\",\"volume\":\"110 \",\"pages\":\"Article 104469\"},\"PeriodicalIF\":2.6000,\"publicationDate\":\"2025-04-28\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Journal of Visual Communication and Image Representation\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S1047320325000835\",\"RegionNum\":4,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"COMPUTER SCIENCE, INFORMATION SYSTEMS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Visual Communication and Image Representation","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S1047320325000835","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
Citations: 0

Abstract

Human activity recognition (HAR) has received significant research attention lately due to its numerous applications in automated systems such as human-behavior assessment, visual surveillance, healthcare, and entertainment. The objective of a vision-based HAR system is to understand human behavior in video data and determine the action being performed. This paper presents two end-to-end deep networks for human activity recognition, one based on the Convolutional Long Short-Term Memory (ConvLSTM) and the other based on the Long-term Recurrent Convolutional Network (LRCN). The ConvLSTM (Shi et al., 2015) network exploits convolutions that extract spatial features while accounting for their temporal correlations (i.e., spatiotemporal prediction). The LRCN (Donahue et al., 2015) fuses the advantages of simple convolution layers and LSTM layers into a single model to adequately encode the spatiotemporal data. Usually, the CNN and LSTM models are used independently: the CNN first extracts the spatial information from the frames, and the features it gathers are then used by the LSTM model to predict the video's action. Rather than building two separate networks, we propose a single LRCN-based network that binds the CNN and LSTM layers together into one model, making the whole process computationally inexpensive. Additionally, a TimeDistributed layer is introduced in the network, which plays a vital role in encoding the action videos and achieving the highest recognition accuracy. A side contribution of the paper is the evaluation of different convolutional neural network variants, including 2D-CNN and 3D-CNN, for human action recognition. An extensive experimental evaluation of the proposed deep networks is carried out on three large benchmark action datasets: UCF50, HMDB51, and UCF-101. The results reveal the effectiveness of the proposed algorithms; in particular, our LRCN-based algorithm outperforms the current state of the art, achieving recognition accuracies of 97.42% on UCF50, 73.63% on HMDB51, and 95.70% on UCF-101.
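To make the two architectures concrete, here is a minimal TensorFlow/Keras sketch of both designs as the abstract describes them. The clip length, frame size, filter counts, and layer widths are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of the two architectures named in the abstract (assumed
# hyperparameters throughout; the paper's exact configuration may differ).
import tensorflow as tf
from tensorflow.keras import layers, models

SEQ_LEN, H, W, C = 20, 64, 64, 3   # assumed: 20 frames of 64x64 RGB per clip
NUM_CLASSES = 50                   # e.g., UCF50 has 50 action classes

def build_convlstm(num_classes=NUM_CLASSES):
    """ConvLSTM variant: convolutional gates capture spatial structure
    while the recurrence models temporal correlations."""
    return models.Sequential([
        layers.Input(shape=(SEQ_LEN, H, W, C)),
        layers.ConvLSTM2D(16, (3, 3), activation="tanh", return_sequences=True),
        layers.MaxPooling3D(pool_size=(1, 2, 2)),  # pool space, keep time
        layers.ConvLSTM2D(32, (3, 3), activation="tanh", return_sequences=False),
        layers.Flatten(),
        layers.Dense(num_classes, activation="softmax"),
    ])

def build_lrcn(num_classes=NUM_CLASSES):
    """LRCN variant: TimeDistributed applies the same CNN to every frame,
    and an LSTM aggregates the per-frame features over time."""
    return models.Sequential([
        layers.Input(shape=(SEQ_LEN, H, W, C)),
        layers.TimeDistributed(layers.Conv2D(16, (3, 3), padding="same", activation="relu")),
        layers.TimeDistributed(layers.MaxPooling2D((4, 4))),
        layers.TimeDistributed(layers.Conv2D(32, (3, 3), padding="same", activation="relu")),
        layers.TimeDistributed(layers.MaxPooling2D((4, 4))),
        layers.TimeDistributed(layers.Flatten()),  # (batch, time, features)
        layers.LSTM(64),                           # temporal aggregation
        layers.Dense(num_classes, activation="softmax"),
    ])

model = build_lrcn()
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()
```

The LRCN builder illustrates the TimeDistributed pattern the abstract highlights: one convolutional feature extractor is shared across all frames and a single LSTM aggregates the per-frame features, so spatial and temporal encoding are trained jointly in one model rather than as two separate networks.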
Source journal
Journal of Visual Communication and Image Representation
Category: Engineering & Technology – Computer Science: Software Engineering
CiteScore: 5.40
Self-citation rate: 11.50%
Articles per year: 188
Review time: 9.9 months
Journal description: The Journal of Visual Communication and Image Representation publishes papers on state-of-the-art visual communication and image representation, with emphasis on novel technologies and theoretical work in this multidisciplinary area of pure and applied research. The field of visual communication and image representation is considered in its broadest sense and covers both digital and analog aspects as well as processing and communication in biological visual systems.