GSE: A global-local storage enhanced video object recognition model

IF 6.0 · CAS Tier 1 (Computer Science) · JCR Q1 · COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE
Yuhong Shi, Hongguang Pan, Ze Jiang, Libin Zhang, Rui Miao, Zheng Wang, Xinyu Lei
Journal: Neural Networks, Volume 184, Article 107109
DOI: 10.1016/j.neunet.2024.107109
Published: 2025-01-02 (Journal Article)
Citations: 0

Abstract


The presence of substantial similarities and redundant information within video data limits the performance of video object recognition models. To address this issue, a Global-Local Storage Enhanced video object recognition model (GSE) is proposed in this paper. First, the model incorporates a two-stage dynamic multi-frame aggregation module to aggregate shallow frame features. This module aggregates features in batches from each input video using feature extraction, dynamic multi-frame aggregation, and centralized concatenation, significantly reducing the model's computational burden while retaining key information. In addition, a Global-Local Storage (GS) module is constructed to retain and utilize the information in the frame sequence effectively. This module classifies features using a temporal difference threshold method and employs a processing approach of inheritance, storage, and output to filter and retain features. By integrating global, local, and key features, the model can accurately capture important temporal features in complex video scenes. Subsequently, a Cascaded Multi-head Attention (CMA) mechanism is designed. The multi-head cascade structure in this mechanism progressively focuses on object features and explores the correlations between key features and the global and local features. A differential step attention calculation is used to ensure computational efficiency. Finally, we optimize the model structure, tune its parameters, and verify the GSE model's performance through comprehensive experiments. Experimental results on the ImageNet 2015 and NPS-Drones datasets show that the GSE model achieves the highest mAP, 0.8352 and 0.8617 respectively. Compared with other models, the GSE model achieves a commendable balance across metrics such as precision, efficiency, and power consumption.
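The abstract does not give the details of the temporal difference threshold method used by the GS module, but the general idea it describes, retaining a frame's features only when they differ sufficiently from what is already stored, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the normalized L2 difference measure, the threshold value `tau`, and the function name `filter_frames` are all assumptions for the example.

```python
import numpy as np

def filter_frames(features, tau=0.5):
    """Illustrative temporal-difference filtering: keep a frame's feature
    vector only when its normalized L2 distance from the last stored
    frame exceeds the threshold tau, discarding near-redundant frames.

    features: array of shape (num_frames, feature_dim)
    returns:  array of the retained frame features
    """
    stored = [features[0]]  # always keep the first frame
    for f in features[1:]:
        prev = stored[-1]
        # Normalized temporal difference between consecutive kept frames
        diff = np.linalg.norm(f - prev) / (np.linalg.norm(prev) + 1e-8)
        if diff > tau:
            stored.append(f)
    return np.stack(stored)
```

With this scheme, long runs of visually similar frames collapse to a single stored representative, which matches the abstract's stated goal of reducing redundancy while retaining key information.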

Source journal: Neural Networks (Engineering & Technology – Computer Science: Artificial Intelligence)
CiteScore: 13.90
Self-citation rate: 7.70%
Annual article count: 425
Review time: 67 days
Journal overview: Neural Networks is a platform that aims to foster an international community of scholars and practitioners interested in neural networks, deep learning, and other approaches to artificial intelligence and machine learning. The journal invites submissions covering various aspects of neural networks research, from computational neuroscience and cognitive modeling to mathematical analyses and engineering applications. By providing a forum for interdisciplinary discussions between biology and technology, it aims to encourage the development of biologically inspired artificial intelligence.