Title: A Low-Resolution Video Action Recognition Approach Based on Multi-Scale Reconstruction and Multi-Modal Fusion
Authors: Hui Zheng; Yesheng Zhao; Bo Zhang; Guoqiang Shang; Mohammad H. Yahya Al-Shamri; Haya Aldossary
Journal: IEEE Transactions on Consumer Electronics, vol. 71, no. 1, pp. 970-983
DOI: 10.1109/TCE.2024.3521512
Publication date: 2024-12-23
URL: https://ieeexplore.ieee.org/document/10812810/
Citations: 0
Abstract
The challenge of the low-resolution video action recognition task lies in recovering and extracting feature representations that can effectively capture action characteristics from limited semantic information. In this paper, we propose an approach to address this challenge, comprising a multi-scale reconstruction module and a multi-modal fusion module. In the multi-scale reconstruction module, we introduce a frequency-adaptive reconstruction model to reconstruct lost information at multiple scales. For the crucial high-frequency sub-band images, we propose a wavelet-based super-resolution generative adversarial network to recover detailed information. In the multi-modal fusion module, we propose a two-stream Transformer-based network to mine joint spatial-temporal feature representations from the reconstructed video. Additionally, we utilize another Transformer model to fuse features from different modalities, capturing both consistent and complementary representations. Finally, the fused features are fed into a classifier for recognition. Experimental results show that our proposed model outperforms other models for low-quality action recognition on the HMDB51 ($16\times12$: 58.70%, $14\times14$: 62.25%, $80\times60$: 68.94%), UCF101 ($14\times14$: 76.74%, $28\times28$: 84.15%, $80\times60$: 92.78%), and Tiny-VIRAT (35.63%) datasets.
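The abstract's reconstruction module operates on wavelet sub-bands, with high-frequency sub-bands carrying the detail that super-resolution must recover. As a minimal illustrative sketch (not the paper's actual model), the following pure-Python code performs one level of a 2-D Haar wavelet transform, splitting a frame into a low-frequency approximation (LL) and the high-frequency sub-bands (LH, HL, HH):

```python
def haar2d(frame):
    """One-level 2-D Haar transform of a 2-D list with even dimensions.

    Returns (LL, LH, HL, HH): the low-frequency approximation and the
    horizontal/vertical/diagonal high-frequency detail sub-bands.
    """
    cols = len(frame[0])
    # Row pass: pairwise averages (low-pass) and differences (high-pass).
    lo = [[(r[2 * j] + r[2 * j + 1]) / 2 for j in range(cols // 2)] for r in frame]
    hi = [[(r[2 * j] - r[2 * j + 1]) / 2 for j in range(cols // 2)] for r in frame]

    def col_pass(mat):
        # Column pass: same averaging/differencing applied down each column.
        w, h = len(mat[0]), len(mat)
        low = [[(mat[2 * i][j] + mat[2 * i + 1][j]) / 2 for j in range(w)]
               for i in range(h // 2)]
        high = [[(mat[2 * i][j] - mat[2 * i + 1][j]) / 2 for j in range(w)]
                for i in range(h // 2)]
        return low, high

    ll, hl = col_pass(lo)
    lh, hh = col_pass(hi)
    return ll, lh, hl, hh

# A flat 4x4 test "frame" made of constant 2x2 blocks.
frame = [[1, 1, 2, 2],
         [1, 1, 2, 2],
         [3, 3, 4, 4],
         [3, 3, 4, 4]]
ll, lh, hl, hh = haar2d(frame)
print(ll)  # [[1.0, 2.0], [3.0, 4.0]] -- smooth content survives in LL
print(hh)  # [[0.0, 0.0], [0.0, 0.0]] -- no diagonal detail in this flat frame
```

In the paper's pipeline, a learned model (e.g., the wavelet-based super-resolution GAN) would operate on the LH/HL/HH sub-bands, since downsampling destroys exactly this high-frequency content.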
Journal description:
The main focus of the IEEE Transactions on Consumer Electronics is the engineering and research aspects of the theory, design, construction, manufacture, or end use of mass-market electronics, systems, software, and services for consumers.