Two-Stream Modality-Based Deep Learning Approach for Enhanced Two-Person Human Interaction Recognition in Videos.

IF 3.4, Zone 3 (multidisciplinary journals), JCR Q2 CHEMISTRY, ANALYTICAL

Sensors Pub Date : 2024-11-03 DOI:10.3390/s24217077

Hemel Sharker Akash, Md Abdur Rahim, Abu Saleh Musa Miah, Hyoun-Sup Lee, Si-Woong Jang, Jungpil Shin

Sensors, volume 24, issue 21. Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11548307/pdf/

Citations: 0

Abstract

Human interaction recognition (HIR) between two people in videos is an important problem in computer vision and pattern recognition. It aims to identify and understand human interactions and actions for applications such as healthcare, surveillance, and human-computer interaction. Despite its significance, video-based HIR struggles to achieve satisfactory performance because of the complexity of human actions, variations in motion, differing viewpoints, and environmental factors. In this study, we propose a two-stream deep learning-based HIR system to address these challenges and improve the accuracy and reliability of HIR. The two streams extract hierarchical features from skeleton and RGB information, respectively. In the first stream, we use YOLOv8-Pose for human pose extraction, extract features with three stacked LSTM modules, and enhance them with a dense layer whose output is the final feature of the first stream. In the second stream, we apply the Segment Anything Model (SAM) to the input videos and, after filtering the SAM features, use an integrated LSTM and GRU to extract long-range dependency features, which are then enhanced with a dense layer whose output is the final feature of the second stream. Here, SAM is used for segmented mesh generation, and ImageNet is used for feature extraction from images or meshes, focusing on extracting relevant features from sequential image data. We also created a custom filter function to improve computational efficiency and eliminate irrelevant keypoints and mesh components from the dataset. We concatenate the two stream features to produce the final feature, which is fed into the classification module. In extensive experiments on two benchmark datasets, the proposed model achieved 96.56% and 96.16% accuracy, respectively. These high accuracies indicate the superiority of the proposed model.
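The filtering and fusion steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the confidence-threshold filter, the function names (`filter_keypoints`, `fuse_and_classify`), the feature sizes, and the random weights are all assumptions made for illustration, and the per-stream features here are random stand-ins for what the stacked LSTM and LSTM-GRU stages would produce.

```python
import math
import random

def filter_keypoints(frame, min_conf=0.5):
    """Hypothetical filter: keep (x, y) for keypoints with confidence >= min_conf.
    The paper's actual filter criteria are not given in the abstract."""
    return [(x, y) for (x, y, conf) in frame if conf >= min_conf]

def linear(x, weights, bias):
    """Affine map producing one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_and_classify(skeleton_feat, rgb_feat, weights, bias):
    """Concatenate the two stream features, then apply a linear + softmax classifier."""
    fused = list(skeleton_feat) + list(rgb_feat)
    return fused, softmax(linear(fused, weights, bias))

# Stand-in values (assumed sizes: 4-dim skeleton feature, 6-dim RGB feature, 3 classes).
rng = random.Random(0)
skeleton_feat = [rng.random() for _ in range(4)]
rgb_feat = [rng.random() for _ in range(6)]
num_classes = 3
weights = [[rng.uniform(-1, 1) for _ in range(10)] for _ in range(num_classes)]
bias = [0.0] * num_classes

fused, probs = fuse_and_classify(skeleton_feat, rgb_feat, weights, bias)
predicted = max(range(num_classes), key=probs.__getitem__)
print(len(fused), predicted)
```

Concatenation keeps each stream's feature intact and lets the classifier weigh skeleton and RGB evidence jointly; in a trained system the weights would be learned rather than sampled.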


Source journal: Sensors (Engineering & Technology: Electrochemistry)

CiteScore: 7.30
Self-citation rate: 12.80%
Articles published: 8430
Review time: 1.7 months

About the journal: Sensors (ISSN 1424-8220) provides an advanced forum for the science and technology of sensors and biosensors. It publishes reviews (including comprehensive reviews on the complete sensors products), regular research papers and short notes. Our aim is to encourage scientists to publish their experimental and theoretical results in as much detail as possible. There is no restriction on the length of the papers. The full experimental details must be provided so that the results can be reproduced.