{"title":"Video Question Answering: A survey of the state-of-the-art","authors":"Jeshmol P.J., Binsu C. Kovoor","doi":"10.1016/j.jvcir.2024.104320","DOIUrl":"10.1016/j.jvcir.2024.104320","url":null,"abstract":"<div><div>Video Question Answering (VideoQA) emerges as a prominent trend in the domain of Artificial Intelligence, Computer Vision, and Natural Language Processing. It involves developing systems capable of understanding, analyzing, and responding to questions about the content of videos. The Proposed survey presents an in-depth overview of the current landscape of Question Answering, shedding light on the challenges, methodologies, datasets, and innovative approaches in the domain. The key components of the Video Question Answering (VideoQA) framework include video feature extraction, question processing, reasoning, and response generation. It underscores the importance of datasets in shaping VideoQA research and the diversity of question types, from factual inquiries to spatial and temporal reasoning. The survey highlights the ongoing research directions and future prospects for VideoQA. Finally, the proposed survey gives a road map for future explorations at the intersection of multiple disciplines, emphasizing the ultimate objective of pushing the boundaries of knowledge and innovation.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"105 ","pages":"Article 104320"},"PeriodicalIF":2.6,"publicationDate":"2024-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142572371","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Consistent prototype contrastive learning for weakly supervised person search","authors":"Huadong Lin , Xiaohan Yu , Pengcheng Zhang , Xiao Bai , Jin Zheng","doi":"10.1016/j.jvcir.2024.104321","DOIUrl":"10.1016/j.jvcir.2024.104321","url":null,"abstract":"<div><div>Weakly supervised person search simultaneously addresses detection and re-identification tasks without relying on person identity labels. Prototype-based contrastive learning is commonly used to address unsupervised person re-identification. We argue that prototypes suffer from spatial, temporal, and label inconsistencies, which result in their inaccurate representation. In this paper, we propose a novel Consistent Prototype Contrastive Learning (CPCL) framework to address prototype inconsistency. For spatial inconsistency, a greedy update strategy is developed to introduce ground truth proposals in the training process and update the memory bank only with the ground truth features. To improve temporal consistency, CPCL employs a local window strategy to calculate the prototype within a specific temporal domain window. To tackle label inconsistency, CPCL adopts a prototype nearest neighbor consistency method that leverages the intrinsic information of the prototypes to rectify the pseudo-labels. Experimentally, the proposed method exhibits remarkable performance improvements on both the CUHK-SYSU and PRW datasets, achieving an mAP of 90.2% and 29.3% respectively. Moreover, it achieves state-of-the-art performance on the CUHK-SYSU dataset. The code will be available on the project website: <span><span>https://github.com/JackFlying/cpcl</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"105 ","pages":"Article 104321"},"PeriodicalIF":2.6,"publicationDate":"2024-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142586124","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MT-Net: Single image dehazing based on meta learning, knowledge transfer and contrastive learning","authors":"Jianlei Liu, Bingqing Yang, Shilong Wang, Maoli Wang","doi":"10.1016/j.jvcir.2024.104325","DOIUrl":"10.1016/j.jvcir.2024.104325","url":null,"abstract":"<div><div>Single image dehazing is becoming increasingly important as its results impact the efficiency of subsequent computer vision tasks. While many methods have been proposed to address this challenge, existing dehazing approaches often exhibit limited adaptability to different types of images and lack future learnability. In light of this, we propose a dehazing network based on meta-learning, knowledge transfer, and contrastive learning, abbreviated as MT-Net. In our approach, we combine knowledge transfer with meta-learning to tackle these challenges, thus enhancing the network’s generalization performance. We refine the structure of knowledge transfer by introducing a two-phases approach to facilitate learning under the guidance of teacher networks and learning committee networks. We also optimize the negative examples of contrastive learning to reduce the contrast space. Extensive experiments conducted on synthetic and real datasets demonstrate the remarkable performance of our method in both quantitative and qualitative comparisons. The code has been released on <span><span>https://github.com/71717171fan/MT-Net</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"105 ","pages":"Article 104325"},"PeriodicalIF":2.6,"publicationDate":"2024-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142572543","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Human gait recognition using joint spatiotemporal modulation in deep convolutional neural networks","authors":"Mohammad Iman Junaid , Allam Jaya Prakash , Samit Ari","doi":"10.1016/j.jvcir.2024.104322","DOIUrl":"10.1016/j.jvcir.2024.104322","url":null,"abstract":"<div><div>Gait, a person’s distinctive walking pattern, offers a promising biometric modality for surveillance applications. Unlike fingerprints or iris scans, gait can be captured from a distance without the subject’s direct cooperation or awareness. This makes it ideal for surveillance and security applications. Traditional convolutional neural networks (CNNs) often struggle with the inherent variations within video data, limiting their effectiveness in gait recognition. The proposed technique in this work introduces a unique joint spatial–temporal modulation network designed to overcome this limitation. By extracting discriminative feature representations across varying frame levels, the network effectively leverages both spatial and temporal variations within video sequences. The proposed architecture integrates attention-based CNNs for spatial feature extraction and a Bidirectional Long Short-Term Memory (Bi-LSTM) network with a temporal attention module to analyse temporal dynamics. The use of attention in spatial and temporal blocks enhances the network’s capability of focusing on the most relevant segments of the video data. This can improve efficiency since the combined approach enhances learning capabilities when processing complex gait videos. We evaluated the effectiveness of the proposed network using two major datasets, namely CASIA-B and OUMVLP. Experimental analysis on CASIA B demonstrates that the proposed network achieves an average rank-1 accuracy of 98.20% for normal walking, 94.50% for walking with a bag and 80.40% for clothing scenarios. The proposed network also achieved an accuracy of 89.10% for OU-MVLP. These results show the proposed method‘s ability to generalize to large-scale data and consistently outperform current state-of-the-art gait recognition techniques.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"105 ","pages":"Article 104322"},"PeriodicalIF":2.6,"publicationDate":"2024-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142572372","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Crowd counting network based on attention feature fusion and multi-column feature enhancement","authors":"Qian Liu, Yixiong Zhong, Jiongtao Fang","doi":"10.1016/j.jvcir.2024.104323","DOIUrl":"10.1016/j.jvcir.2024.104323","url":null,"abstract":"<div><div>Density map estimation is commonly used for crowd counting. However, using it alone may make some individuals difficult to recognize, due to the problems of target occlusions, scale variations, complex background and heterogeneous distribution. To alleviate these problems, we propose a two-stage crowd counting network based on attention feature fusion and multi-column feature enhancement (AFF-MFE-TNet). In the first stage, AFF-MFE-TNet transforms the input image into a probability map. In the second stage, a multi-column feature enhancement module is constructed to enhance features by expanding the receptive fields, a dual attention feature fusion module is designed to adaptively fuse the features of different scales through attention mechanisms, and a triple counting loss is presented for AFF-MFE-TNet, which can fit the ground truth probability maps and density maps better, and improve the counting performance. Experimental results show that AFF-MFE-TNet can effectively improve the accuracy of crowd counting, as compared with the state-of-the-art.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"105 ","pages":"Article 104323"},"PeriodicalIF":2.6,"publicationDate":"2024-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142552674","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MVP-HOT: A Moderate Visual Prompt for Hyperspectral Object Tracking","authors":"Lin Zhao, Shaoxiong Xie, Jia Li, Ping Tan, Wenjin Hu","doi":"10.1016/j.jvcir.2024.104326","DOIUrl":"10.1016/j.jvcir.2024.104326","url":null,"abstract":"<div><div>The growing attention to hyperspectral object tracking (HOT) can be attributed to the extended spectral information available in hyperspectral images (HSIs), especially in complex scenarios. This potential makes it a promising alternative to traditional RGB-based tracking methods. However, the scarcity of large hyperspectral datasets poses a challenge for training robust hyperspectral trackers using deep learning methods. Prompt learning, a new paradigm emerging in large language models, involves adapting or fine-tuning a pre-trained model for a specific downstream task by providing task-specific inputs. Inspired by the recent success of prompt learning in language and visual tasks, we propose a novel and efficient prompt learning method for HOT tasks, termed Moderate Visual Prompt for HOT (MVP-HOT). Specifically, MVP-HOT freezes the parameters of the pre-trained model and employs HSIs as visual prompts to leverage the knowledge of the underlying RGB model. Additionally, we develop a moderate and effective strategy to incrementally adapt the HSI prompt information. Our proposed method uses only a few (1.7M) learnable parameters and demonstrates its effectiveness through extensive experiments, MVP-HOT can achieve state-of-the-art performance on three hyperspectral datasets.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"105 ","pages":"Article 104326"},"PeriodicalIF":2.6,"publicationDate":"2024-10-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142552677","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Exploring the rate-distortion-complexity optimization in neural image compression","authors":"Yixin Gao, Runsen Feng, Zongyu Guo, Zhibo Chen","doi":"10.1016/j.jvcir.2024.104294","DOIUrl":"10.1016/j.jvcir.2024.104294","url":null,"abstract":"<div><div>Despite a short history, neural image codecs have been shown to surpass classical image codecs in terms of rate–distortion performance. However, most of them suffer from significantly longer decoding times, which hinders the practical applications of neural image codecs. This issue is especially pronounced when employing an effective yet time-consuming autoregressive context model since it would increase entropy decoding time by orders of magnitude. In this paper, unlike most previous works that pursue optimal RD performance while temporally overlooking the coding complexity, we make a systematical investigation on the rate–distortion-complexity (RDC) optimization in neural image compression. By quantifying the decoding complexity as a factor in the optimization goal, we are now able to precisely control the RDC trade-off and then demonstrate how the rate–distortion performance of neural image codecs could adapt to various complexity demands. Going beyond the investigation of RDC optimization, a variable-complexity neural codec is designed to leverage the spatial dependencies adaptively according to industrial demands, which supports fine-grained complexity adjustment by balancing the RDC tradeoff. By implementing this scheme in a powerful base model, we demonstrate the feasibility and flexibility of RDC optimization for neural image codecs.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"105 ","pages":"Article 104294"},"PeriodicalIF":2.6,"publicationDate":"2024-10-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142659506","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"3D human model guided pose transfer via progressive flow prediction network","authors":"Furong Ma , Guiyu Xia , Qingshan Liu","doi":"10.1016/j.jvcir.2024.104327","DOIUrl":"10.1016/j.jvcir.2024.104327","url":null,"abstract":"<div><div>Human pose transfer is to transfer a conditional person image to a new target pose. The difficulty lies in modeling the large-scale spatial deformation from the conditional pose to the target one. However, the commonly used 2D data representations and one-step flow prediction scheme lead to unreliable deformation prediction because of the lack of 3D information guidance and the great changes in the pose transfer. Therefore, to bring the original 3D motion information into human pose transfer, we propose to simulate the generation process of real person image. We drive the 3D human model reconstructed from the conditional person image with the target pose and project it to the 2D plane. The 2D projection thereby inherits the 3D information of the poses which can guide the flow prediction. Furthermore, we propose a progressive flow prediction network consisting of two streams. One stream is to predict the flow by decomposing the complex pose transformation into multiple sub-transformations. The other is to generate the features of the target image according to the predicted flow. Besides, to enhance the reliability of the generated invisible regions, we use the target pose information which contains structural information from the flow prediction stream as the supplementary information to the feature generation. The synthesized images with accurate depth information and sharp details demonstrate the effectiveness of the proposed method.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"105 ","pages":"Article 104327"},"PeriodicalIF":2.6,"publicationDate":"2024-10-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142552676","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"GLIC: Underwater target detection based on global–local information coupling and multi-scale feature fusion","authors":"Huipu Xu , Meixiang Zhang , Yongzhi Li","doi":"10.1016/j.jvcir.2024.104330","DOIUrl":"10.1016/j.jvcir.2024.104330","url":null,"abstract":"<div><div>With the rapid development of object detection technology, underwater object detection has attracted widespread attention. Most of the existing underwater target detection methods are built based on convolutional neural networks (CNNs), which still have some limitations in the utilization of global information and cannot fully capture the key information in the images. To overcome the challenge of insufficient global–local feature extraction, an underwater target detector (namely GLIC) based on global–local information coupling and multi-scale feature fusion is proposed in this paper. Our GLIC consists of three main components: spatial pyramid pooling, global–local information coupling, and multi-scale feature fusion. Firstly, we embed spatial pyramid pooling, which improves the robustness of the model while retaining more spatial information. Secondly, we design the feature pyramid network with global–local information coupling. The global context of the transformer branch and the local features of the CNN branch interact with each other to enhance the feature representation. Finally, we construct a Multi-scale Feature Fusion (MFF) module that utilizes balanced semantic features integrated at the same depth for multi-scale feature fusion. In this way, each resolution in the pyramid receives equal information from others, thus balancing the information flow and making the features more discriminative. As demonstrated in comprehensive experiments, our GLIC, respectively, achieves 88.46%, 87.51%, and 74.94% mAP on the URPC2019, URPC2020, and UDD datasets.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"105 ","pages":"Article 104330"},"PeriodicalIF":2.6,"publicationDate":"2024-10-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142659505","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Scene-aware classifier and re-detector for thermal infrared tracking","authors":"Qingbo Ji , Pengfei Zhang , Kuicheng Chen , Lei Zhang , Changbo Hou","doi":"10.1016/j.jvcir.2024.104319","DOIUrl":"10.1016/j.jvcir.2024.104319","url":null,"abstract":"<div><div>Compared with common visible light scenes, the target of infrared scenes lacks information such as the color, texture. Infrared images have low contrast, which not only lead to interference between targets, but also interference between the target and the background. In addition, most infrared tracking algorithms lack a redetection mechanism after lost target, resulting in poor tracking effect after occlusion or blurring. To solve these problems, we propose a scene-aware classifier to dynamically adjust low, middle, and high level features, improving the ability to utilize features in different infrared scenes. Besides, we designed an infrared target re-detector based on multi-domain convolutional network to learn from the tracked target samples and background samples, improving the ability to identify the differences between the target and the background. The experimental results on <em>VOT-TIR2015</em>, <em>VOT-TIR2017</em> and <em>LSOTB-TIR</em> show that the proposed algorithm achieves the most advanced results in the three infrared object tracking benchmark.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"105 ","pages":"Article 104319"},"PeriodicalIF":2.6,"publicationDate":"2024-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142552673","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}