{"title":"FastFaceCLIP: A lightweight text-driven high-quality face image manipulation","authors":"Jiaqi Ren, Junping Qin, Qianli Ma, Yin Cao","doi":"10.1049/cvi2.12295","DOIUrl":"10.1049/cvi2.12295","url":null,"abstract":"<p>Although many new methods have emerged in text-driven images, the large computational power required for model training causes these methods to have a slow training process. Additionally, these methods consume a considerable amount of video random access memory (VRAM) resources during training. When generating high-resolution images, the VRAM resources are often insufficient, which results in the inability to generate high-resolution images. Nevertheless, recent Vision Transformers (ViTs) advancements have demonstrated their image classification and recognition capabilities. Unlike the traditional Convolutional Neural Networks based methods, ViTs have a Transformer-based architecture, leverage attention mechanisms to capture comprehensive global information, moreover enabling enhanced global understanding of images through inherent long-range dependencies, thus extracting more robust features and achieving comparable results with reduced computational load. The adaptability of ViTs to text-driven image manipulation was investigated. Specifically, existing image generation methods were refined and the FastFaceCLIP method was proposed by combining the image-text semantic alignment function of the pre-trained CLIP model with the high-resolution image generation function of the proposed FastFace. Additionally, the Multi-Axis Nested Transformer module was incorporated for advanced feature extraction from the latent space, generating higher-resolution images that are further enhanced using the Real-ESRGAN algorithm. Eventually, extensive face manipulation-related tests on the CelebA-HQ dataset challenge the proposed method and other related schemes, demonstrating that FastFaceCLIP effectively generates semantically accurate, visually realistic, and clear images using fewer parameters and less time.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"18 7","pages":"950-967"},"PeriodicalIF":1.5,"publicationDate":"2024-07-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12295","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141687557","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"PSANet: Automatic colourisation using position-spatial attention for natural images","authors":"Peng-Jie Zhu, Yuan-Yuan Pu, Qiuxia Yang, Siqi Li, Zheng-Peng Zhao, Hao Wu, Dan Xu","doi":"10.1049/cvi2.12291","DOIUrl":"https://doi.org/10.1049/cvi2.12291","url":null,"abstract":"<p>Due to the richness of natural image semantics, natural image colourisation is a challenging problem. Existing methods often suffer from semantic confusion due to insufficient semantic understanding, resulting in unreasonable colour assignments, especially at the edges of objects. This phenomenon is referred to as colour bleeding. The authors have found that using the self-attention mechanism benefits the model's understanding and recognition of object semantics. However, this leads to another problem in colourisation, namely dull colour. With this in mind, a Position-Spatial Attention Network(PSANet) is proposed to address the colour bleeding and the dull colour. Firstly, a novel new attention module called position-spatial attention module (PSAM) is introduced. Through the proposed PSAM module, the model enhances the semantic understanding of images while solving the dull colour problem caused by self-attention. Then, in order to further prevent colour bleeding on object boundaries, a gradient-aware loss is proposed. Lastly, the colour bleeding phenomenon is further improved by the combined effect of gradient-aware loss and edge-aware loss. Experimental results show that this method can reduce colour bleeding largely while maintaining good perceptual quality.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"18 7","pages":"922-934"},"PeriodicalIF":1.5,"publicationDate":"2024-06-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12291","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142563038","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Knowledge distillation of face recognition via attention cosine similarity review","authors":"Zhuo Wang, SuWen Zhao, WanYi Guo","doi":"10.1049/cvi2.12288","DOIUrl":"https://doi.org/10.1049/cvi2.12288","url":null,"abstract":"<p>Deep learning-based face recognition models have demonstrated remarkable performance in benchmark tests, and knowledge distillation technology has been frequently accustomed to obtain high-precision real-time face recognition models specifically designed for mobile and embedded devices. However, in recent years, the knowledge distillation methods for face recognition, which mainly focus on feature or logit knowledge distillation techniques, neglect the attention mechanism that play an important role in the domain of neural networks. An innovation cross-stage connection review path of the attention cosine similarity knowledge distillation method that unites the attention mechanism with review knowledge distillation method is proposed. This method transfers the attention map obtained from the teacher network to the student through a cross-stage connection path. The efficacy and excellence of the proposed algorithm are demonstrated in popular benchmark tests.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"18 7","pages":"875-887"},"PeriodicalIF":1.5,"publicationDate":"2024-05-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12288","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142563031","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SkatingVerse: A large-scale benchmark for comprehensive evaluation on human action understanding","authors":"Ziliang Gan, Lei Jin, Yi Cheng, Yu Cheng, Yinglei Teng, Zun Li, Yawen Li, Wenhan Yang, Zheng Zhu, Junliang Xing, Jian Zhao","doi":"10.1049/cvi2.12287","DOIUrl":"https://doi.org/10.1049/cvi2.12287","url":null,"abstract":"<p>Human action understanding (HAU) is a broad topic that involves specific tasks, such as action localisation, recognition, and assessment. However, most popular HAU datasets are bound to one task based on particular actions. Combining different but relevant HAU tasks to establish a unified action understanding system is challenging due to the disparate actions across datasets. A large-scale and comprehensive benchmark, namely <b>SkatingVerse</b> is constructed for action recognition, segmentation, proposal, and assessment. SkatingVerse focus on fine-grained sport action, hence figure skating is chosen as the task object, which eliminates the biases of the object, scene, and space that exist in most previous datasets. In addition, skating actions have inherent complexity and similarity, which is an enormous challenge for current algorithms. A total of 1687 official figure skating competition videos was collected with a total of 184.4 h, exceeding four times over other datasets with a similar topic. SkatingVerse enables to formulate a unified task to output fine-grained human action classification and assessment results from a raw figure skating competition video. In addition, <i>SkatingVerse</i> can facilitate the study of HAU foundation model due to its large scale and abundant categories. Moreover, image modality is incorporated for human pose estimation task into <i>SkatingVerse</i>. Extensive experimental results show that (1) SkatingVerse significantly helps the training and evaluation of HAU methods, (2) the performance of existing HAU methods has much room to improve, and SkatingVerse helps to reduce such gaps, and (3) unifying relevant tasks in HAU through a uniform dataset can facilitate more practical applications. SkatingVerse will be publicly available to facilitate further studies on relevant problems.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"18 7","pages":"888-906"},"PeriodicalIF":1.5,"publicationDate":"2024-05-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12287","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142563030","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Federated finger vein presentation attack detection for various clients","authors":"Hengyu Mu, Jian Guo, Xingli Liu, Chong Han, Lijuan Sun","doi":"10.1049/cvi2.12292","DOIUrl":"https://doi.org/10.1049/cvi2.12292","url":null,"abstract":"<p>Recently, the application of finger vein recognition has become popular. Studies have shown finger vein presentation attacks increasingly threaten these recognition devices. As a result, research on finger vein presentation attack detection (fvPAD) methods has received much attention. However, the current fvPAD methods have two limitations. (1) Most terminal devices cannot train fvPAD models independently due to a lack of data. (2) Several research institutes can train fvPAD models; however, these models perform poorly when applied to terminal devices due to inadequate generalisation. Consequently, it is difficult for threatened terminal devices to obtain an effective fvPAD model. To address this problem, the method of federated finger vein presentation attack detection for various clients is proposed, which is the first study that introduces federated learning (FL) to fvPAD. In the proposed method, the differences in data volume and computing power between clients are considered. Traditional FL clients are expanded into two categories: institutional and terminal clients. For institutional clients, an improved triplet training mode with FL is designed to enhance model generalisation. For terminal clients, their inability is solved to obtain effective fvPAD models. Finally, extensive experiments are conducted on three datasets, which demonstrate the superiority of our method.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"18 7","pages":"935-949"},"PeriodicalIF":1.5,"publicationDate":"2024-05-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12292","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142563029","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Eigenspectrum regularisation reverse neighbourhood discriminative learning","authors":"Ming Xie, Hengliang Tan, Jiao Du, Shuo Yang, Guofeng Yan, Wangwang Li, Jianwei Feng","doi":"10.1049/cvi2.12284","DOIUrl":"10.1049/cvi2.12284","url":null,"abstract":"<p>Linear discriminant analysis is a classical method for solving problems of dimensional reduction and pattern classification. Although it has been extensively developed, however, it still suffers from various common problems, such as the Small Sample Size (SSS) and the multimodal problem. Neighbourhood linear discriminant analysis (nLDA) was recently proposed to solve the problem of multimodal class caused by the contravention of independently and identically distributed samples. However, due to the existence of many small-scale practical applications, nLDA still has to face the SSS problem, which leads to instability and poor generalisation caused by the singularity of the within-neighbourhood scatter matrix. The authors exploit the eigenspectrum regularisation techniques to circumvent the singularity of the within-neighbourhood scatter matrix of nLDA, which is called Eigenspectrum Regularisation Reverse Neighbourhood Discriminative Learning (ERRNDL). The algorithm of nLDA is reformulated as a framework by searching two projection matrices. Three eigenspectrum regularisation models are introduced to our framework to evaluate the performance. Experiments are conducted on the University of California, Irvine machine learning repository and six image classification datasets. The proposed ERRNDL-based methods achieve considerable performance.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"18 6","pages":"842-858"},"PeriodicalIF":1.5,"publicationDate":"2024-05-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12284","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140980457","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CLaSP: Cross-view 6-DoF localisation assisted by synthetic panorama","authors":"Juelin Zhu, Shen Yan, Xiaoya Cheng, Rouwan Wu, Yuxiang Liu, Maojun Zhang","doi":"10.1049/cvi2.12285","DOIUrl":"10.1049/cvi2.12285","url":null,"abstract":"<p>Despite the impressive progress in visual localisation, 6-DoF cross-view localisation is still a challenging task in the computer vision community due to the huge appearance changes. To address this issue, the authors propose the CLaSP, a coarse-to-fine framework, which leverages a synthetic panorama to facilitate cross-view 6-DoF localisation in a large-scale scene. The authors first leverage a segmentation map to correct the prior pose, followed by a synthetic panorama on the ground to enable coarse pose estimation combined with a template matching method. The authors finally formulate the refine localisation process as feature matching and pose refinement to obtain the final result. The authors evaluate the performance of the CLaSP and several state-of-the-art baselines on the <i>Airloc</i> dataset, which demonstrates the effectiveness of our proposed framework.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"18 7","pages":"859-874"},"PeriodicalIF":1.5,"publicationDate":"2024-05-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12285","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140986129","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Guest Editorial: Advanced image restoration and enhancement in the wild","authors":"Longguang Wang, Juncheng Li, Naoto Yokoya, Radu Timofte, Yulan Guo","doi":"10.1049/cvi2.12283","DOIUrl":"https://doi.org/10.1049/cvi2.12283","url":null,"abstract":"<p>Image restoration and enhancement has always been a fundamental task in computer vision and is widely used in numerous applications, such as surveillance imaging, remote sensing, and medical imaging. In recent years, remarkable progress has been witnessed with deep learning techniques. Despite the promising performance achieved on synthetic data, compelling research challenges remain to be addressed in the wild. These include: (i) degradation models for low-quality images in the real world are complicated and unknown, (ii) paired low-quality and high-quality data are difficult to acquire in the real world, and a large quantity of real data are provided in an unpaired form, (iii) it is challenging to incorporate cross-modal information provided by advanced imaging techniques (e.g. RGB-D camera) for image restoration, (iv) real-time inference on edge devices is important for image restoration and enhancement methods, and (v) it is difficult to provide the confidence or performance bounds of a learning-based method on different images/regions. This special issue invites original contributions in datasets, innovative architectures, and training methods for image restoration and enhancement to address these and other challenges.</p><p>In this Special Issue, we have received 17 papers, of which 8 papers underwent the peer review process, while the rest were desk-rejected. Among these reviewed papers, 5 papers have been accepted and 3 papers have been rejected as they did not meet the criteria of IET Computer Vision. Thus, the overall submissions were of high quality, which marks the success of this Special Issue.</p><p>The five eventually accepted papers can be clustered into two categories, namely video reconstruction and image super-resolution. The first category of papers aims at reconstructing high-quality videos. The papers in this category are of Zhang et al., Gu et al., and Xu et al. The second category of papers studies the task of image super-resolution. The papers in this category are of Dou et al. and Yang et al. A brief presentation of each of the paper in this special issue is as follows.</p><p>Zhang et al. propose a point-image fusion network for event-based frame interpolation. Temporal information in event streams plays a critical role in this task as it provides temporal context cues complementary to images. Previous approaches commonly transform the unstructured event data to structured data formats through voxelisation and then employ advanced CNNs to extract temporal information. However, the voxelisation operation inevitably leads to information loss and introduces redundant computation. To address these limitations, the proposed method directly extracts temporal information from the events at the point level without relying on any voxelisation operation. Afterwards, a fusion module is adopted to aggregate complementary cues from both points and images for frame interpolation. 
Experiments on both synthetic and real-world dataset","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"18 4","pages":"435-438"},"PeriodicalIF":1.7,"publicationDate":"2024-04-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12283","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141246088","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Temporal channel reconfiguration multi-graph convolution network for skeleton-based action recognition","authors":"Siyue Lei, Bin Tang, Yanhua Chen, Mingfu Zhao, Yifei Xu, Zourong Long","doi":"10.1049/cvi2.12279","DOIUrl":"10.1049/cvi2.12279","url":null,"abstract":"<p>Skeleton-based action recognition has received much attention and achieved remarkable achievements in the field of human action recognition. In time series action prediction for different scales, existing methods mainly focus on attention mechanisms to enhance modelling capabilities in spatial dimensions. However, this approach strongly depends on the local information of a single input feature and fails to facilitate the flow of information between channels. To address these issues, the authors propose a novel Temporal Channel Reconfiguration Multi-Graph Convolution Network (TRMGCN). In the temporal convolution part, the authors designed a module called Temporal Channel Fusion with Guidance (TCFG) to capture important temporal information within channels at different scales and avoid ignoring cross-spatio-temporal dependencies among joints. In the graph convolution part, the authors propose Top-Down Attention Multi-graph Independent Convolution (TD-MIG), which uses multi-graph independent convolution to learn the topological graph feature for different length time series. Top-down attention is introduced for spatial and channel modulation to facilitate information flow in channels that do not establish topological relationships. Experimental results on the large-scale datasets NTU-RGB + D60 and 120, as well as UAV-Human, demonstrate that TRMGCN exhibits advanced performance and capabilities. Furthermore, experiments on the smaller dataset NW-UCLA have indicated that the authors’ model possesses strong generalisation abilities.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"18 6","pages":"813-825"},"PeriodicalIF":1.5,"publicationDate":"2024-04-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12279","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140693975","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Instance segmentation by blend U-Net and VOLO network","authors":"Hongfei Deng, Bin Wen, Rui Wang, Zuwei Feng","doi":"10.1049/cvi2.12275","DOIUrl":"10.1049/cvi2.12275","url":null,"abstract":"<p>Instance segmentation is still challengeable to correctly distinguish different instances on overlapping, dense and large number of target objects. To address this, the authors simplify the instance segmentation problem to an instance classification problem and propose a novel end-to-end trained instance segmentation algorithm CotuNet. Firstly, the algorithm combines convolutional neural networks (CNN), Outlooker and Transformer to design a new hybrid Encoder (COT) to further feature extraction. It consists of extracting low-level features of the image using CNN, which is passed through the Outlooker to extract more refined local data representations. Then global contextual information is generated by aggregating the data representations in local space using Transformer. Finally, the combination of cascaded upsampling and skip connection modules is used as Decoders (C-UP) to enable the blend of multiple different scales of high-resolution information to generate accurate masks. By validating on the CVPPP 2017 dataset and comparing with previous state-of-the-art methods, CotuNet shows superior competitiveness and segmentation performance.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"18 6","pages":"735-744"},"PeriodicalIF":1.5,"publicationDate":"2024-04-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12275","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140726439","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}