Person re-identification via deep compound eye network and pose repair module
Hongjian Gu, Wenxuan Zou, Keyang Cheng, Bin Wu, Humaira Abdul Ghafoor, Yongzhao Zhan
IET Computer Vision, vol. 18, no. 6, pp. 826-841, published 2024-04-04. DOI: 10.1049/cvi2.12282. Open-access PDF: https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12282

Abstract: Person re-identification aims to find specific target pedestrians across non-overlapping cameras. In real, complex scenes, however, pedestrians are easily occluded, which makes the search task time-consuming and challenging. To address pedestrians' susceptibility to occlusion, a person re-identification method based on a deep compound eye network (CEN) and a pose repair module is proposed. It comprises (1) a deep CEN built on a multi-camera logical topology, which adopts graph convolution and a Gated Recurrent Unit to capture the temporal and spatial information of pedestrian movement and performs global pedestrian matching through a Siamese network; (2) an integrated spatial-temporal information aggregation network for pose repair, in which target pedestrian features from cameras in the multi-level logical topology serve as auxiliary information to repair occluded target pedestrian images, reducing mismatches caused by pose changes; and (3) a joint optimisation mechanism for the CEN and the pose repair network, in which multi-camera logical topology inference supplies auxiliary information and the retrieval order for the pose repair network. Experiments on Occluded-DukeMTMC, CUHK-SYSU, PRW, SLP and UJS-reID show strong performance across all datasets; on CUHK-SYSU the model achieves a top-1 accuracy of 89.1% and a mean Average Precision of 83.1% in the recognition of occluded individuals.
{"title":"Video frame interpolation via spatial multi-scale modelling","authors":"Zhe Qu, Weijing Liu, Lizhen Cui, Xiaohui Yang","doi":"10.1049/cvi2.12281","DOIUrl":"10.1049/cvi2.12281","url":null,"abstract":"<p>Video frame interpolation (VFI) is a technique that synthesises intermediate frames between adjacent original video frames to enhance the temporal super-resolution of the video. However, existing methods usually rely on heavy model architectures with a large number of parameters. The authors introduce an efficient VFI network based on multiple lightweight convolutional units and a Local three-scale encoding (LTSE) structure. In particular, the authors introduce a LTSE structure with two-level attention cascades. This design is tailored to enhance the efficient capture of details and contextual information across diverse scales in images. Secondly, the authors introduce recurrent convolutional layers (RCL) and residual operations, designing the recurrent residual convolutional unit to optimise the LTSE structure. Additionally, a lightweight convolutional unit named separable recurrent residual convolutional unit is introduced to reduce the model parameters. Finally, the authors obtain the three-scale decoding features from the decoder and warp them for a set of three-scale pre-warped maps. The authors fuse them into the synthesis network to generate high-quality interpolated frames. The experimental results indicate that the proposed approach achieves superior performance with fewer model parameters.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"18 4","pages":"458-472"},"PeriodicalIF":1.7,"publicationDate":"2024-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12281","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140746884","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Continuous-dilated temporal and inter-frame motion excitation feature learning for gait recognition","authors":"Chunsheng Hua, Hao Zhang, Jia Li, Yingjie Pan","doi":"10.1049/cvi2.12278","DOIUrl":"10.1049/cvi2.12278","url":null,"abstract":"<p>The authors present global-interval and local-continuous feature extraction networks for gait recognition. Unlike conventional gait recognition methods focussing on the full gait cycle, the authors introduce a novel global- continuous-dilated temporal feature extraction (<i>TFE</i>) to extract continuous and interval motion features from the silhouette frames globally. Simultaneously, an inter-frame motion excitation (<i>IME</i>) module is proposed to enhance the unique motion expression of an individual, which remains unchanged regardless of clothing variations. The spatio-temporal features extracted from the <i>TFE</i> and <i>IME</i> modules are then weighted and concatenated by an adaptive aggregator network for recognition. Through the experiments over CASIA-B and mini-OUMVLP datasets, the proposed method has shown the comparable performance (as 98%, 95%, and 84.9% in the normal walking, carrying a bag or packbag, and wearing coats or jackets categories in CASIA-B, and 89% in mini-OUMVLP) to the other state-of-the-art approaches. Extensive experiments conducted on the CASIA-B and mini-OUMVLP datasets have demonstrated the comparable performance of our proposed method compared to other state-of-the-art approaches.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"18 6","pages":"788-800"},"PeriodicalIF":1.5,"publicationDate":"2024-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12278","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140781350","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pruning-guided feature distillation for an efficient transformer-based pose estimation model
Dong-hwi Kim, Dong-hun Lee, Aro Kim, Jinwoo Jeong, Jong Taek Lee, Sungjei Kim, Sang-hyo Park
IET Computer Vision, vol. 18, no. 6, pp. 745-758, published 2024-03-31. DOI: 10.1049/cvi2.12277. Open-access PDF: https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12277

Abstract: The authors propose a compression strategy for a transformer-based 3D human pose estimation model, which achieves high accuracy but at the cost of a large model size. The approach uses pruning to guide the determination of the search range, enabling lightweight pose estimation under limited training time and identifying the optimal model size. In addition, the authors propose a transformer-based feature distillation (TFD) method that leverages the characteristics of the transformer architecture to exploit the pose estimation model efficiently in terms of both size and accuracy. Pruning-guided TFD is the first such approach for 3D human pose estimation that employs a transformer architecture. Tested on various extensive datasets, the approach reduces model size by 30% compared with the state of the art while maintaining high accuracy.
{"title":"Prompt guidance query with cascaded constraint decoders for human–object interaction detection","authors":"Sheng Liu, Bingnan Guo, Feng Zhang, Junhao Chen, Ruixiang Chen","doi":"10.1049/cvi2.12276","DOIUrl":"10.1049/cvi2.12276","url":null,"abstract":"<p>Human–object interaction (HOI) detection, which localises and recognises interactions between human and object, requires high-level image and scene understanding. Recent methods for HOI detection typically utilise transformer-based architecture to build unified future representation. However, these methods use random initial queries to predict interactive human–object pairs, leading to a lack of prior knowledge. Furthermore, most methods provide unified features to forecast interactions using conventional decoder structures, but they lack the ability to build efficient multi-task representations. To address these problems, we propose a novel two-stage HOI detector called PGCD, mainly consisting of prompt guidance query and cascaded constraint decoders. Firstly, the authors propose a novel prompt guidance query generation module (PGQ) to introduce the guidance-semantic features. In PGQ, the authors build visual-semantic transfer to obtain fuller semantic representations. In addition, a cascaded constraint decoder architecture (CD) with random masks is designed to build fine-grained interaction features and improve the model's generalisation performance. Experimental results demonstrate that the authors’ proposed approach obtains significant performance on the two widely used benchmarks, that is, HICO-DET and V-COCO.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"18 6","pages":"772-787"},"PeriodicalIF":1.5,"publicationDate":"2024-03-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12276","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140366408","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Joint image restoration for object detection in snowy weather
Jing Wang, Meimei Xu, Huazhu Xue, Zhanqiang Huo, Fen Luo
IET Computer Vision, vol. 18, no. 6, pp. 759-771, published 2024-03-27. DOI: 10.1049/cvi2.12274. Open-access PDF: https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12274

Abstract: Although existing object detectors achieve encouraging detection and localisation performance under ideal conditions, their performance in adverse (snowy) weather is far poorer and insufficient for detection in such conditions. Existing methods do not handle the effect of snow on the identity of object features well, and they usually ignore or even discard latent information that could improve detection. To this end, the authors propose a novel end-to-end object detection network with joint image restoration. To counter the identity degradation caused by snow, a restoration-detection dual-branch structure combined with a Multi-Integrated Attention module mitigates the effect of snow on object features and thereby improves the detector's performance. To use detection-relevant features more effectively, a Self-Adaptive Feature Fusion module helps the network learn latent features that benefit detection and, through a dedicated feature fusion, removes the effect of heavy or locally extensive snow in the object region, improving detection in snowy scenes. In addition, the authors construct a large-scale, multi-size snowy dataset, the Synthetic and Real Snowy Dataset (SRSD), as a necessary complement to existing snow-related benchmarks. Extensive experiments on a public snowy dataset (Snowy-weather Datasets) and on SRSD show that the method outperforms existing state-of-the-art object detectors.
{"title":"Tag-inferring and tag-guided Transformer for image captioning","authors":"Yaohua Yi, Yinkai Liang, Dezhu Kong, Ziwei Tang, Jibing Peng","doi":"10.1049/cvi2.12280","DOIUrl":"10.1049/cvi2.12280","url":null,"abstract":"<p>Image captioning is an important task for understanding images. Recently, many studies have used tags to build alignments between image information and language information. However, existing methods ignore the problem that simple semantic tags have difficulty expressing the detailed semantics for different image contents. Therefore, the authors propose a tag-inferring and tag-guided Transformer for image captioning to generate fine-grained captions. First, a tag-inferring encoder is proposed, which uses the tags extracted by the scene graph model to infer tags with deeper semantic information. Then, with the obtained deep tag information, a tag-guided decoder that includes short-term attention to improve the features of words in the sentence and gated cross-modal attention to combine image features, tag features and language features to produce informative semantic features is proposed. Finally, the word probability distribution of all positions in the sequence is calculated to generate descriptions for the image. The experiments demonstrate that the authors’ method can combine tags to obtain precise captions and that it achieves competitive performance with a 40.6% BLEU-4 score and 135.3% CIDEr score on the MSCOCO data set.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"18 6","pages":"801-812"},"PeriodicalIF":1.5,"publicationDate":"2024-03-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12280","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140218940","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Learnable fusion mechanisms for multimodal object detection in autonomous vehicles","authors":"Yahya Massoud, Robert Laganiere","doi":"10.1049/cvi2.12259","DOIUrl":"10.1049/cvi2.12259","url":null,"abstract":"<p>Perception systems in autonomous vehicles need to accurately detect and classify objects within their surrounding environments. Numerous types of sensors are deployed on these vehicles, and the combination of such multimodal data streams can significantly boost performance. The authors introduce a novel sensor fusion framework using deep convolutional neural networks. The framework employs both camera and LiDAR sensors in a multimodal, multiview configuration. The authors leverage both data types by introducing two new innovative fusion mechanisms: element-wise multiplication and multimodal factorised bilinear pooling. The methods improve the bird's eye view moderate average precision score by +4.97% and +8.35% on the KITTI dataset when compared to traditional fusion operators like element-wise addition and feature map concatenation. An in-depth analysis of key design choices impacting performance, such as data augmentation, multi-task learning, and convolutional architecture design is offered. The study aims to pave the way for the development of more robust multimodal machine vision systems. The authors conclude the paper with qualitative results, discussing both successful and problematic cases, along with potential ways to mitigate the latter.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"18 4","pages":"499-511"},"PeriodicalIF":1.7,"publicationDate":"2024-03-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12259","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140237870","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Attentional bias for hands: Cascade dual-decoder transformer for sign language production
Xiaohan Ma, Rize Jin, Jianming Wang, Tae-Sun Chung
IET Computer Vision, vol. 18, no. 5, pp. 696-708, published 2024-03-08. DOI: 10.1049/cvi2.12273. Open-access PDF: https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12273

Abstract: Sign Language Production (SLP) is the task of translating textual forms of spoken language into corresponding sign language expressions. Sign languages convey meaning through multiple asynchronous articulators, including manual and non-manual information channels. Recent deep learning-based SLP models generate the full articulatory sign sequence directly from the text input in an end-to-end manner, but they largely downweight subtle differences in manual articulation because of the effect of regression to the mean. To recover these neglected aspects, an efficient cascade dual-decoder Transformer (CasDual-Transformer) for SLP is proposed, which successively learns two mappings, SLP_hand: Text → Hand pose and SLP_sign: Text → Sign pose, using an attention-based alignment module that fuses the hand and sign features from previous time steps to predict a more expressive sign pose at the current time step. In addition, to provide more effective guidance, a novel spatio-temporal loss is introduced to penalise shape dissimilarity and temporal distortions in the produced sequences. Experiments on two benchmark sign language datasets from distinct cultures show that the model is competitive with state-of-the-art models and, in some cases, achieves considerable improvements over them.
ASDNet: A robust involution-based architecture for diagnosis of autism spectrum disorder utilising eye-tracking technology
Nasirul Mumenin, Mohammad Abu Yousuf, Md Asif Nashiry, A. K. M. Azad, Salem A. Alyami, Pietro Lio', Mohammad Ali Moni
IET Computer Vision, vol. 18, no. 5, pp. 666-681, published 2024-02-12. DOI: 10.1049/cvi2.12271. Open-access PDF: https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12271

Abstract: Autism Spectrum Disorder (ASD) is a chronic condition characterised by impairments in social interaction and communication. Early detection of ASD is desirable, and there is demand for diagnostic aids that facilitate it. A lightweight Involutional Neural Network (INN) architecture is developed to diagnose ASD. The model has a simpler architectural design and fewer parameters than state-of-the-art (SOTA) image classification models, and therefore requires fewer computational resources. It is trained to detect ASD from eye-tracking scanpath (SP), heatmap (HM) and fixation map (FM) images. Monte Carlo Dropout is applied to the model to perform an uncertainty analysis and confirm the reliability of the INN's outputs. Trained and evaluated on two publicly accessible datasets, the model achieves 98.12% accuracy on SP, 96.83% on FM and 97.61% on HM images, outperforming current SOTA image classification models and other existing work on this topic.