{"title":"CLaSP: Cross-view 6-DoF localisation assisted by synthetic panorama","authors":"Juelin Zhu, Shen Yan, Xiaoya Cheng, Rouwan Wu, Yuxiang Liu, Maojun Zhang","doi":"10.1049/cvi2.12285","DOIUrl":"10.1049/cvi2.12285","url":null,"abstract":"<p>Despite the impressive progress in visual localisation, 6-DoF cross-view localisation is still a challenging task in the computer vision community due to the huge appearance changes. To address this issue, the authors propose the CLaSP, a coarse-to-fine framework, which leverages a synthetic panorama to facilitate cross-view 6-DoF localisation in a large-scale scene. The authors first leverage a segmentation map to correct the prior pose, followed by a synthetic panorama on the ground to enable coarse pose estimation combined with a template matching method. The authors finally formulate the refine localisation process as feature matching and pose refinement to obtain the final result. The authors evaluate the performance of the CLaSP and several state-of-the-art baselines on the <i>Airloc</i> dataset, which demonstrates the effectiveness of our proposed framework.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"18 7","pages":"859-874"},"PeriodicalIF":1.5,"publicationDate":"2024-05-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12285","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140986129","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Guest Editorial: Advanced image restoration and enhancement in the wild","authors":"Longguang Wang, Juncheng Li, Naoto Yokoya, Radu Timofte, Yulan Guo","doi":"10.1049/cvi2.12283","DOIUrl":"https://doi.org/10.1049/cvi2.12283","url":null,"abstract":"<p>Image restoration and enhancement has always been a fundamental task in computer vision and is widely used in numerous applications, such as surveillance imaging, remote sensing, and medical imaging. In recent years, remarkable progress has been witnessed with deep learning techniques. Despite the promising performance achieved on synthetic data, compelling research challenges remain to be addressed in the wild. These include: (i) degradation models for low-quality images in the real world are complicated and unknown, (ii) paired low-quality and high-quality data are difficult to acquire in the real world, and a large quantity of real data are provided in an unpaired form, (iii) it is challenging to incorporate cross-modal information provided by advanced imaging techniques (e.g. RGB-D camera) for image restoration, (iv) real-time inference on edge devices is important for image restoration and enhancement methods, and (v) it is difficult to provide the confidence or performance bounds of a learning-based method on different images/regions. This special issue invites original contributions in datasets, innovative architectures, and training methods for image restoration and enhancement to address these and other challenges.</p><p>In this Special Issue, we have received 17 papers, of which 8 papers underwent the peer review process, while the rest were desk-rejected. Among these reviewed papers, 5 papers have been accepted and 3 papers have been rejected as they did not meet the criteria of IET Computer Vision. Thus, the overall submissions were of high quality, which marks the success of this Special Issue.</p><p>The five eventually accepted papers can be clustered into two categories, namely video reconstruction and image super-resolution. The first category of papers aims at reconstructing high-quality videos. The papers in this category are of Zhang et al., Gu et al., and Xu et al. The second category of papers studies the task of image super-resolution. The papers in this category are of Dou et al. and Yang et al. A brief presentation of each of the paper in this special issue is as follows.</p><p>Zhang et al. propose a point-image fusion network for event-based frame interpolation. Temporal information in event streams plays a critical role in this task as it provides temporal context cues complementary to images. Previous approaches commonly transform the unstructured event data to structured data formats through voxelisation and then employ advanced CNNs to extract temporal information. However, the voxelisation operation inevitably leads to information loss and introduces redundant computation. To address these limitations, the proposed method directly extracts temporal information from the events at the point level without relying on any voxelisation operation. Afterwards, a fusion module is adopted to aggregate complementary cues from both points and images for frame interpolation. 
Experiments on both synthetic and real-world dataset","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"18 4","pages":"435-438"},"PeriodicalIF":1.7,"publicationDate":"2024-04-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12283","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141246088","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
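As a rough illustration of the point-level processing summarised above, here is a hypothetical sketch of extracting a temporal descriptor from raw events with a shared MLP and fusing it with image features; the class name, layer widths, and max-pooling choice are assumptions and do not reproduce the published network.

```python
# Minimal sketch: point-level event feature extraction (no voxelisation) fused
# with image features via a 1x1 convolution.
import torch
import torch.nn as nn

class PointImageFusion(nn.Module):
    def __init__(self, img_channels=64, point_feats=64):
        super().__init__()
        # shared MLP over raw events (x, y, t, polarity)
        self.point_mlp = nn.Sequential(
            nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, point_feats), nn.ReLU())
        self.fuse = nn.Conv2d(img_channels + point_feats, img_channels, 1)

    def forward(self, img_feat, events):
        # img_feat: (B, C, H, W); events: (B, N, 4) raw event points
        p = self.point_mlp(events)                 # (B, N, point_feats)
        g = p.max(dim=1).values                    # global temporal descriptor
        g = g[:, :, None, None].expand(-1, -1, *img_feat.shape[-2:])
        return self.fuse(torch.cat([img_feat, g], dim=1))

# out = PointImageFusion()(torch.rand(2, 64, 32, 32), torch.rand(2, 1024, 4))
```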
{"title":"Temporal channel reconfiguration multi-graph convolution network for skeleton-based action recognition","authors":"Siyue Lei, Bin Tang, Yanhua Chen, Mingfu Zhao, Yifei Xu, Zourong Long","doi":"10.1049/cvi2.12279","DOIUrl":"10.1049/cvi2.12279","url":null,"abstract":"<p>Skeleton-based action recognition has received much attention and achieved remarkable achievements in the field of human action recognition. In time series action prediction for different scales, existing methods mainly focus on attention mechanisms to enhance modelling capabilities in spatial dimensions. However, this approach strongly depends on the local information of a single input feature and fails to facilitate the flow of information between channels. To address these issues, the authors propose a novel Temporal Channel Reconfiguration Multi-Graph Convolution Network (TRMGCN). In the temporal convolution part, the authors designed a module called Temporal Channel Fusion with Guidance (TCFG) to capture important temporal information within channels at different scales and avoid ignoring cross-spatio-temporal dependencies among joints. In the graph convolution part, the authors propose Top-Down Attention Multi-graph Independent Convolution (TD-MIG), which uses multi-graph independent convolution to learn the topological graph feature for different length time series. Top-down attention is introduced for spatial and channel modulation to facilitate information flow in channels that do not establish topological relationships. Experimental results on the large-scale datasets NTU-RGB + D60 and 120, as well as UAV-Human, demonstrate that TRMGCN exhibits advanced performance and capabilities. Furthermore, experiments on the smaller dataset NW-UCLA have indicated that the authors’ model possesses strong generalisation abilities.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"18 6","pages":"813-825"},"PeriodicalIF":1.5,"publicationDate":"2024-04-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12279","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140693975","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Instance segmentation by blend U-Net and VOLO network","authors":"Hongfei Deng, Bin Wen, Rui Wang, Zuwei Feng","doi":"10.1049/cvi2.12275","DOIUrl":"10.1049/cvi2.12275","url":null,"abstract":"<p>Instance segmentation is still challengeable to correctly distinguish different instances on overlapping, dense and large number of target objects. To address this, the authors simplify the instance segmentation problem to an instance classification problem and propose a novel end-to-end trained instance segmentation algorithm CotuNet. Firstly, the algorithm combines convolutional neural networks (CNN), Outlooker and Transformer to design a new hybrid Encoder (COT) to further feature extraction. It consists of extracting low-level features of the image using CNN, which is passed through the Outlooker to extract more refined local data representations. Then global contextual information is generated by aggregating the data representations in local space using Transformer. Finally, the combination of cascaded upsampling and skip connection modules is used as Decoders (C-UP) to enable the blend of multiple different scales of high-resolution information to generate accurate masks. By validating on the CVPPP 2017 dataset and comparing with previous state-of-the-art methods, CotuNet shows superior competitiveness and segmentation performance.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"18 6","pages":"735-744"},"PeriodicalIF":1.5,"publicationDate":"2024-04-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12275","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140726439","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Person re-identification via deep compound eye network and pose repair module","authors":"Hongjian Gu, Wenxuan Zou, Keyang Cheng, Bin Wu, Humaira Abdul Ghafoor, Yongzhao Zhan","doi":"10.1049/cvi2.12282","DOIUrl":"10.1049/cvi2.12282","url":null,"abstract":"<p>Person re-identification is aimed at searching for specific target pedestrians from non-intersecting cameras. However, in real complex scenes, pedestrians are easily obscured, which makes the target pedestrian search task time-consuming and challenging. To address the problem of pedestrians' susceptibility to occlusion, a person re-identification via deep compound eye network (CEN) and pose repair module is proposed, which includes (1) A deep CEN based on multi-camera logical topology is proposed, which adopts graph convolution and a Gated Recurrent Unit to capture the temporal and spatial information of pedestrian walking and finally carries out pedestrian global matching through the Siamese network; (2) An integrated spatial-temporal information aggregation network is designed to facilitate pose repair. The target pedestrian features under the multi-level logic topology camera are utilised as auxiliary information to repair the occluded target pedestrian image, so as to reduce the impact of pedestrian mismatch due to pose changes; (3) A joint optimisation mechanism of CEN and pose repair network is introduced, where multi-camera logical topology inference provides auxiliary information and retrieval order for the pose repair network. The authors conducted experiments on multiple datasets, including Occluded-DukeMTMC, CUHK-SYSU, PRW, SLP, and UJS-reID. The results indicate that the authors’ method achieved significant performance across these datasets. Specifically, on the CUHK-SYSU dataset, the authors’ model achieved a top-1 accuracy of 89.1% and a mean Average Precision accuracy of 83.1% in the recognition of occluded individuals.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"18 6","pages":"826-841"},"PeriodicalIF":1.5,"publicationDate":"2024-04-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12282","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140741587","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Video frame interpolation via spatial multi-scale modelling","authors":"Zhe Qu, Weijing Liu, Lizhen Cui, Xiaohui Yang","doi":"10.1049/cvi2.12281","DOIUrl":"10.1049/cvi2.12281","url":null,"abstract":"<p>Video frame interpolation (VFI) is a technique that synthesises intermediate frames between adjacent original video frames to enhance the temporal super-resolution of the video. However, existing methods usually rely on heavy model architectures with a large number of parameters. The authors introduce an efficient VFI network based on multiple lightweight convolutional units and a Local three-scale encoding (LTSE) structure. In particular, the authors introduce a LTSE structure with two-level attention cascades. This design is tailored to enhance the efficient capture of details and contextual information across diverse scales in images. Secondly, the authors introduce recurrent convolutional layers (RCL) and residual operations, designing the recurrent residual convolutional unit to optimise the LTSE structure. Additionally, a lightweight convolutional unit named separable recurrent residual convolutional unit is introduced to reduce the model parameters. Finally, the authors obtain the three-scale decoding features from the decoder and warp them for a set of three-scale pre-warped maps. The authors fuse them into the synthesis network to generate high-quality interpolated frames. The experimental results indicate that the proposed approach achieves superior performance with fewer model parameters.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"18 4","pages":"458-472"},"PeriodicalIF":1.7,"publicationDate":"2024-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12281","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140746884","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Continuous-dilated temporal and inter-frame motion excitation feature learning for gait recognition","authors":"Chunsheng Hua, Hao Zhang, Jia Li, Yingjie Pan","doi":"10.1049/cvi2.12278","DOIUrl":"10.1049/cvi2.12278","url":null,"abstract":"<p>The authors present global-interval and local-continuous feature extraction networks for gait recognition. Unlike conventional gait recognition methods focussing on the full gait cycle, the authors introduce a novel global- continuous-dilated temporal feature extraction (<i>TFE</i>) to extract continuous and interval motion features from the silhouette frames globally. Simultaneously, an inter-frame motion excitation (<i>IME</i>) module is proposed to enhance the unique motion expression of an individual, which remains unchanged regardless of clothing variations. The spatio-temporal features extracted from the <i>TFE</i> and <i>IME</i> modules are then weighted and concatenated by an adaptive aggregator network for recognition. Through the experiments over CASIA-B and mini-OUMVLP datasets, the proposed method has shown the comparable performance (as 98%, 95%, and 84.9% in the normal walking, carrying a bag or packbag, and wearing coats or jackets categories in CASIA-B, and 89% in mini-OUMVLP) to the other state-of-the-art approaches. Extensive experiments conducted on the CASIA-B and mini-OUMVLP datasets have demonstrated the comparable performance of our proposed method compared to other state-of-the-art approaches.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"18 6","pages":"788-800"},"PeriodicalIF":1.5,"publicationDate":"2024-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12278","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140781350","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Pruning-guided feature distillation for an efficient transformer-based pose estimation model","authors":"Dong-hwi Kim, Dong-hun Lee, Aro Kim, Jinwoo Jeong, Jong Taek Lee, Sungjei Kim, Sang-hyo Park","doi":"10.1049/cvi2.12277","DOIUrl":"https://doi.org/10.1049/cvi2.12277","url":null,"abstract":"<p>The authors propose a compression strategy for a 3D human pose estimation model based on a transformer which yields high accuracy but increases the model size. This approach involves a pruning-guided determination of the search range to achieve lightweight pose estimation under limited training time and to identify the optimal model size. In addition, the authors propose a transformer-based feature distillation (TFD) method, which efficiently exploits the pose estimation model in terms of both model size and accuracy by leveraging transformer architecture characteristics. Pruning-guided TFD is the first approach for 3D human pose estimation that employs transformer architecture. The proposed approach was tested on various extensive data sets, and the results show that it can reduce the model size by 30% compared to the state-of-the-art while ensuring high accuracy.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"18 6","pages":"745-758"},"PeriodicalIF":1.5,"publicationDate":"2024-03-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12277","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142158672","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Prompt guidance query with cascaded constraint decoders for human–object interaction detection","authors":"Sheng Liu, Bingnan Guo, Feng Zhang, Junhao Chen, Ruixiang Chen","doi":"10.1049/cvi2.12276","DOIUrl":"10.1049/cvi2.12276","url":null,"abstract":"<p>Human–object interaction (HOI) detection, which localises and recognises interactions between human and object, requires high-level image and scene understanding. Recent methods for HOI detection typically utilise transformer-based architecture to build unified future representation. However, these methods use random initial queries to predict interactive human–object pairs, leading to a lack of prior knowledge. Furthermore, most methods provide unified features to forecast interactions using conventional decoder structures, but they lack the ability to build efficient multi-task representations. To address these problems, we propose a novel two-stage HOI detector called PGCD, mainly consisting of prompt guidance query and cascaded constraint decoders. Firstly, the authors propose a novel prompt guidance query generation module (PGQ) to introduce the guidance-semantic features. In PGQ, the authors build visual-semantic transfer to obtain fuller semantic representations. In addition, a cascaded constraint decoder architecture (CD) with random masks is designed to build fine-grained interaction features and improve the model's generalisation performance. Experimental results demonstrate that the authors’ proposed approach obtains significant performance on the two widely used benchmarks, that is, HICO-DET and V-COCO.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"18 6","pages":"772-787"},"PeriodicalIF":1.5,"publicationDate":"2024-03-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12276","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140366408","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Joint image restoration for object detection in snowy weather","authors":"Jing Wang, Meimei Xu, Huazhu Xue, Zhanqiang Huo, Fen Luo","doi":"10.1049/cvi2.12274","DOIUrl":"10.1049/cvi2.12274","url":null,"abstract":"<p>Although existing object detectors achieve encouraging performance of object detection and localisation under real ideal conditions, the detection performance in adverse weather conditions (snowy) is very poor and not enough to cope with the detection task in adverse weather conditions. Existing methods do not deal well with the effect of snow on the identity of object features or usually ignore or even discard potential information that can help improve the detection performance. To this end, the authors propose a novel and improved end-to-end object detection network joint image restoration. Specifically, in order to address the problem of identity degradation of object detection due to snow, an ingenious restoration-detection dual branch network structure combined with a Multi-Integrated Attention module is proposed, which can well mitigate the effect of snow on the identity of object features, thus improving the detection performance of the detector. In order to make more effective use of the features that are beneficial to the detection task, a Self-Adaptive Feature Fusion module is introduced, which can help the network better learn the potential features that are beneficial to the detection and eliminate the effect of heavy or large local snow in the object area on detection by a special feature fusion, thus improving the network's detection capability in snowy. In addition, the authors construct a large-scale, multi-size snowy dataset called Synthetic and Real Snowy Dataset (SRSD), and it is a good and necessary complement and improvement to the existing snowy-related tasks. Extensive experiments on a public snowy dataset (Snowy-weather Datasets) and SRSD indicate that our method outperforms the existing state-of-the-art object detectors.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"18 6","pages":"759-771"},"PeriodicalIF":1.5,"publicationDate":"2024-03-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12274","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140376973","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}