Aerial video classification with Window Semantic Enhanced Video Transformers
Feng Yang, Xi Liu, Botong Zhou, Xuehua Guan, Anyong Qin, Tiecheng Song, Yue Zhao, Xiaohua Wang, Chenqiang Gao
{"title":"基于窗口语义增强视频转换器的航空视频分类","authors":"Feng Yang , Xi Liu , Botong Zhou , Xuehua Guan , Anyong Qin , Tiecheng Song , Yue Zhao , Xiaohua Wang , Chenqiang Gao","doi":"10.1016/j.eswa.2025.127883","DOIUrl":null,"url":null,"abstract":"<div><div>With their exceptional flexibility and cost-effectiveness, unmanned aerial vehicles can capture vast amounts of high-quality aerial videos. Consequently, the research on unmanned aerial vehicle video classification, aiming to analyze the spatio-temporal patterns embedded in these videos automatically, is currently flourishing. Compared to conventional ground videos, aerial videos offer a broader perspective, introducing complex visual patterns of both global scenes and local motions. Although current Transformer-based methods have achieved impressive results in video classification, they struggle to capture small key subject movements from the large backgrounds of aerial videos due to a fixed global receptive field. To address these issues, we propose <em>Window Semantic Enhanced Aerial Video Transformers</em> that explicitly enhance local semantics and learn spatio-temporal features through self-attention design. We introduce a <em>Window Semantic Enhanced Transformer Block</em>, comprising a <em>Window Localization</em> module to identify crucial local regions in aerial videos and then enhance local semantics through <em>Window-based Time Attention</em>. Furthermore, we devise a <em>Video Class Attention Transformer Block</em> that directly learns video-level features by late class embedding of video semantic tokens, preventing intermediate frame-level representation that may lead to information loss. To validate the effectiveness of our approach, we conduct extensive experiments on two aerial video classification datasets, ERA and MOD20, demonstrating superior performance with accuracies of 73.9% and 97.0%, respectively.</div></div>","PeriodicalId":50461,"journal":{"name":"Expert Systems with Applications","volume":"285 ","pages":"Article 127883"},"PeriodicalIF":7.5000,"publicationDate":"2025-05-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Aerial video classification with Window Semantic Enhanced Video Transformers\",\"authors\":\"Feng Yang , Xi Liu , Botong Zhou , Xuehua Guan , Anyong Qin , Tiecheng Song , Yue Zhao , Xiaohua Wang , Chenqiang Gao\",\"doi\":\"10.1016/j.eswa.2025.127883\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>With their exceptional flexibility and cost-effectiveness, unmanned aerial vehicles can capture vast amounts of high-quality aerial videos. Consequently, the research on unmanned aerial vehicle video classification, aiming to analyze the spatio-temporal patterns embedded in these videos automatically, is currently flourishing. Compared to conventional ground videos, aerial videos offer a broader perspective, introducing complex visual patterns of both global scenes and local motions. Although current Transformer-based methods have achieved impressive results in video classification, they struggle to capture small key subject movements from the large backgrounds of aerial videos due to a fixed global receptive field. To address these issues, we propose <em>Window Semantic Enhanced Aerial Video Transformers</em> that explicitly enhance local semantics and learn spatio-temporal features through self-attention design. 
We introduce a <em>Window Semantic Enhanced Transformer Block</em>, comprising a <em>Window Localization</em> module to identify crucial local regions in aerial videos and then enhance local semantics through <em>Window-based Time Attention</em>. Furthermore, we devise a <em>Video Class Attention Transformer Block</em> that directly learns video-level features by late class embedding of video semantic tokens, preventing intermediate frame-level representation that may lead to information loss. To validate the effectiveness of our approach, we conduct extensive experiments on two aerial video classification datasets, ERA and MOD20, demonstrating superior performance with accuracies of 73.9% and 97.0%, respectively.</div></div>\",\"PeriodicalId\":50461,\"journal\":{\"name\":\"Expert Systems with Applications\",\"volume\":\"285 \",\"pages\":\"Article 127883\"},\"PeriodicalIF\":7.5000,\"publicationDate\":\"2025-05-09\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Expert Systems with Applications\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0957417425015052\",\"RegionNum\":1,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Expert Systems with Applications","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0957417425015052","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
With their exceptional flexibility and cost-effectiveness, unmanned aerial vehicles can capture vast amounts of high-quality aerial video. Consequently, research on unmanned aerial vehicle video classification, which aims to automatically analyze the spatio-temporal patterns embedded in these videos, is flourishing. Compared with conventional ground videos, aerial videos offer a broader perspective, introducing complex visual patterns of both global scenes and local motions. Although current Transformer-based methods have achieved impressive results in video classification, their fixed global receptive field makes it difficult to capture the small but key subject movements within the large backgrounds of aerial videos. To address this issue, we propose Window Semantic Enhanced Aerial Video Transformers, which explicitly enhance local semantics and learn spatio-temporal features through self-attention design. We introduce a Window Semantic Enhanced Transformer Block, comprising a Window Localization module that identifies crucial local regions in aerial videos, whose semantics are then enhanced through Window-based Time Attention. Furthermore, we devise a Video Class Attention Transformer Block that learns video-level features directly, via late class embedding of video semantic tokens, avoiding intermediate frame-level representations that may cause information loss. To validate the effectiveness of our approach, we conduct extensive experiments on two aerial video classification datasets, ERA and MOD20, achieving superior performance with accuracies of 73.9% and 97.0%, respectively.
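
The abstract describes the Window Semantic Enhanced Transformer Block only at a high level, so a minimal PyTorch sketch may help fix the idea. Everything beyond the abstract is an assumption here: the class name WindowTimeAttention, the mean-feature-norm scoring standing in for the Window Localization module, and the keep_ratio parameter are illustrative placeholders, not the authors' implementation. The abstract also does not say whether attention runs along time only or jointly over a window's space-time tokens; this sketch does the latter.

```python
import torch
import torch.nn as nn


class WindowTimeAttention(nn.Module):
    """Hypothetical sketch of window localization + window-based time attention."""

    def __init__(self, dim: int, window: int, heads: int = 4,
                 keep_ratio: float = 0.25):
        super().__init__()
        self.window = window
        self.keep_ratio = keep_ratio          # fraction of windows to enhance
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, H, W, C) video features; H and W divisible by the window size.
        B, T, H, W, C = x.shape
        w = self.window
        nH, nW, L = H // w, W // w, w * w * T
        # Partition frames into non-overlapping (w x w) spatial windows and
        # gather each window's tokens across all T frames into one sequence.
        seqs = (x.view(B, T, nH, w, nW, w, C)
                 .permute(0, 2, 4, 3, 5, 1, 6)       # (B, nH, nW, w, w, T, C)
                 .reshape(B, nH * nW, L, C))
        # Stand-in for Window Localization: score each window by its mean
        # feature norm and treat the top-k windows as the crucial regions.
        scores = seqs.norm(dim=-1).mean(dim=-1)      # (B, nH * nW)
        k = max(1, int(self.keep_ratio * nH * nW))
        idx = scores.topk(k, dim=1).indices          # (B, k)
        gather_idx = idx[:, :, None, None].expand(-1, -1, L, C)
        sel = torch.gather(seqs, 1, gather_idx).reshape(B * k, L, C)
        # Window-based time attention: self-attention over each selected
        # window's space-time tokens, with a residual connection.
        h = self.norm(sel)
        sel = sel + self.attn(h, h, h, need_weights=False)[0]
        # Scatter enhanced windows back; unselected windows pass through.
        seqs = seqs.scatter(1, gather_idx, sel.view(B, k, L, C))
        # Restore the (B, T, H, W, C) layout.
        return (seqs.view(B, nH, nW, w, w, T, C)
                    .permute(0, 5, 1, 3, 2, 4, 6)
                    .reshape(B, T, H, W, C))
```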
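The Video Class Attention Transformer Block can likewise be sketched under the assumption that it follows the familiar class-attention pattern (a learnable class token that queries the other tokens, as in CaiT-style class attention). The abstract states only that the class embedding happens late and operates on video semantic tokens, so the block name and every internal choice below are illustrative.

```python
import torch
import torch.nn as nn


class VideoClassAttention(nn.Module):
    """Hypothetical sketch of late class embedding over video semantic tokens."""

    def __init__(self, dim: int, num_classes: int, heads: int = 4):
        super().__init__()
        # Late class embedding: the class token enters only at this block,
        # so no intermediate frame-level class representation is formed.
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, C) semantic tokens drawn from the whole video.
        B = tokens.shape[0]
        cls = self.cls.expand(B, -1, -1)             # (B, 1, C)
        # The class token queries all video tokens in a single step,
        # learning a video-level feature directly.
        ctx = self.norm(torch.cat([cls, tokens], dim=1))
        cls = cls + self.attn(self.norm(cls), ctx, ctx, need_weights=False)[0]
        return self.head(self.norm(cls).squeeze(1))  # (B, num_classes) logits
```

A full model would presumably stack several window-semantic-enhanced blocks over patch embeddings before this final class-attention stage; the exact stacking and token-pooling choices are not given in the abstract.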
About the journal:
Expert Systems With Applications is an international journal dedicated to the exchange of information on expert and intelligent systems used globally in industry, government, and universities. The journal emphasizes original papers covering the design, development, testing, implementation, and management of these systems, offering practical guidelines. It spans various sectors such as finance, engineering, marketing, law, project management, information management, medicine, and more. The journal also welcomes papers on multi-agent systems, knowledge management, neural networks, knowledge discovery, data mining, and other related areas, excluding applications to military/defense systems.