{"title":"GR-Former: Graph-reinforcement transformer for skeleton-based driver action recognition","authors":"Zhuoyan Xu, Jingke Xu","doi":"10.1049/cvi2.12298","DOIUrl":"10.1049/cvi2.12298","url":null,"abstract":"<p>In in-vehicle driving scenarios, composite action recognition is crucial for improving safety and understanding the driver's intention. Due to spatial constraints and occlusion factors, the driver's range of motion is limited, thus resulting in similar action patterns that are difficult to differentiate. Additionally, collecting skeleton data that characterise the full human posture is difficult, posing additional challenges for action recognition. To address the problems, a novel Graph-Reinforcement Transformer (GR-Former) model is proposed. Using limited skeleton data as inputs, by introducing graph structure information to directionally reinforce the effect of the self-attention mechanism, dynamically learning and aggregating features between joints at multiple levels, the authors’ model constructs a richer feature vector space, enhancing its expressiveness and recognition accuracy. Based on the Drive & Act dataset for composite action recognition, the authors’ work only applies human upper-body skeleton data to achieve state-of-the-art performance compared to existing methods. Using complete human skeleton data also has excellent recognition accuracy on the NTU RGB + D- and NTU RGB + D 120 dataset, demonstrating the great generalisability of the GR-Former. 
Generally, the authors’ work provides a new and effective solution for driver action recognition in in-vehicle scenarios.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"18 7","pages":"982-991"},"PeriodicalIF":1.5,"publicationDate":"2024-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12298","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141659905","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multi-scale skeleton simplification graph convolutional network for skeleton-based action recognition","authors":"Fan Zhang, Ding Chongyang, Kai Liu, Liu Hongjin","doi":"10.1049/cvi2.12300","DOIUrl":"10.1049/cvi2.12300","url":null,"abstract":"<p>Human action recognition based on graph convolutional networks (GCNs) is one of the hotspots in computer vision. However, previous methods generally rely on handcrafted graph, which limits the effectiveness of the model in characterising the connections between indirectly connected joints. The limitation leads to weakened connections when joints are separated by long distances. To address the above issue, the authors propose a skeleton simplification method which aims to reduce the number of joints and the distance between joints by merging adjacent joints into simplified joints. Group convolutional block is devised to extract the internal features of the simplified joints. Additionally, the authors enhance the method by introducing multi-scale modelling, which maps inputs into sequences across various levels of simplification. Combining with spatial temporal graph convolution, a multi-scale skeleton simplification GCN for skeleton-based action recognition (M3S-GCN) is proposed for fusing multi-scale skeleton sequences and modelling the connections between joints. Finally, M3S-GCN is evaluated on five benchmarks of NTU RGB+D 60 (C-Sub, C-View), NTU RGB+D 120 (X-Sub, X-Set) and NW-UCLA datasets. 
Experimental results show that the authors’ M3S-GCN achieves state-of-the-art performance with accuracies of 93.0%, 97.0% and 91.2% on the C-Sub, C-View and X-Set benchmarks, which validates the effectiveness of the method.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"18 7","pages":"992-1003"},"PeriodicalIF":1.5,"publicationDate":"2024-07-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12300","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141668289","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Self-supervised multi-view clustering in computer vision: A survey","authors":"Jiatai Wang, Zhiwei Xu, Xuewen Yang, Hailong Li, Bo Li, Xuying Meng","doi":"10.1049/cvi2.12299","DOIUrl":"https://doi.org/10.1049/cvi2.12299","url":null,"abstract":"<p>In recent years, multi-view clustering (MVC) has had significant implications in the fields of cross-modal representation learning and data-driven decision-making. Its main objective is to cluster samples into distinct groups by leveraging consistency and complementary information among multiple views. However, the field of computer vision has witnessed the evolution of contrastive learning, and self-supervised learning has made substantial research progress. Consequently, self-supervised learning is progressively becoming dominant in MVC methods. It involves designing proxy tasks to extract supervisory information from image and video data, thereby guiding the clustering process. Despite the rapid development of self-supervised MVC, there is currently no comprehensive survey analysing and summarising the current state of research progress. Hence, the authors aim to explore the emergence of self-supervised MVC by discussing the reasons and advantages behind it. Additionally, the internal connections and classifications of common datasets, data issues, representation learning methods, and self-supervised learning methods are investigated. The authors not only introduce the mechanisms for each category of methods, but also provide illustrative examples of their applications. 
Finally, some open problems are identified for further investigation and development.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"18 6","pages":"709-734"},"PeriodicalIF":1.5,"publicationDate":"2024-07-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12299","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142158626","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FastFaceCLIP: A lightweight text-driven high-quality face image manipulation","authors":"Jiaqi Ren, Junping Qin, Qianli Ma, Yin Cao","doi":"10.1049/cvi2.12295","DOIUrl":"10.1049/cvi2.12295","url":null,"abstract":"<p>Although many new methods have emerged in text-driven images, the large computational power required for model training causes these methods to have a slow training process. Additionally, these methods consume a considerable amount of video random access memory (VRAM) resources during training. When generating high-resolution images, the VRAM resources are often insufficient, which results in the inability to generate high-resolution images. Nevertheless, recent Vision Transformers (ViTs) advancements have demonstrated their image classification and recognition capabilities. Unlike the traditional Convolutional Neural Networks based methods, ViTs have a Transformer-based architecture, leverage attention mechanisms to capture comprehensive global information, moreover enabling enhanced global understanding of images through inherent long-range dependencies, thus extracting more robust features and achieving comparable results with reduced computational load. The adaptability of ViTs to text-driven image manipulation was investigated. Specifically, existing image generation methods were refined and the FastFaceCLIP method was proposed by combining the image-text semantic alignment function of the pre-trained CLIP model with the high-resolution image generation function of the proposed FastFace. Additionally, the Multi-Axis Nested Transformer module was incorporated for advanced feature extraction from the latent space, generating higher-resolution images that are further enhanced using the Real-ESRGAN algorithm. 
Finally, extensive face-manipulation tests on the CelebA-HQ dataset compare the proposed method with other related schemes, demonstrating that FastFaceCLIP effectively generates semantically accurate, visually realistic, and clear images using fewer parameters and less time.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"18 7","pages":"950-967"},"PeriodicalIF":1.5,"publicationDate":"2024-07-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12295","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141687557","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"PSANet: Automatic colourisation using position-spatial attention for natural images","authors":"Peng-Jie Zhu, Yuan-Yuan Pu, Qiuxia Yang, Siqi Li, Zheng-Peng Zhao, Hao Wu, Dan Xu","doi":"10.1049/cvi2.12291","DOIUrl":"https://doi.org/10.1049/cvi2.12291","url":null,"abstract":"<p>Due to the richness of natural image semantics, natural image colourisation is a challenging problem. Existing methods often suffer from semantic confusion due to insufficient semantic understanding, resulting in unreasonable colour assignments, especially at the edges of objects. This phenomenon is referred to as colour bleeding. The authors have found that using the self-attention mechanism benefits the model's understanding and recognition of object semantics. However, this leads to another problem in colourisation, namely dull colour. With this in mind, a Position-Spatial Attention Network(PSANet) is proposed to address the colour bleeding and the dull colour. Firstly, a novel new attention module called position-spatial attention module (PSAM) is introduced. Through the proposed PSAM module, the model enhances the semantic understanding of images while solving the dull colour problem caused by self-attention. Then, in order to further prevent colour bleeding on object boundaries, a gradient-aware loss is proposed. Lastly, the colour bleeding phenomenon is further improved by the combined effect of gradient-aware loss and edge-aware loss. 
Experimental results show that this method can largely reduce colour bleeding while maintaining good perceptual quality.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"18 7","pages":"922-934"},"PeriodicalIF":1.5,"publicationDate":"2024-06-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12291","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142563038","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Knowledge distillation of face recognition via attention cosine similarity review","authors":"Zhuo Wang, SuWen Zhao, WanYi Guo","doi":"10.1049/cvi2.12288","DOIUrl":"https://doi.org/10.1049/cvi2.12288","url":null,"abstract":"<p>Deep learning-based face recognition models have demonstrated remarkable performance in benchmark tests, and knowledge distillation technology has been frequently accustomed to obtain high-precision real-time face recognition models specifically designed for mobile and embedded devices. However, in recent years, the knowledge distillation methods for face recognition, which mainly focus on feature or logit knowledge distillation techniques, neglect the attention mechanism that play an important role in the domain of neural networks. An innovation cross-stage connection review path of the attention cosine similarity knowledge distillation method that unites the attention mechanism with review knowledge distillation method is proposed. This method transfers the attention map obtained from the teacher network to the student through a cross-stage connection path. The efficacy and excellence of the proposed algorithm are demonstrated in popular benchmark tests.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"18 7","pages":"875-887"},"PeriodicalIF":1.5,"publicationDate":"2024-05-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12288","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142563031","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SkatingVerse: A large-scale benchmark for comprehensive evaluation on human action understanding","authors":"Ziliang Gan, Lei Jin, Yi Cheng, Yu Cheng, Yinglei Teng, Zun Li, Yawen Li, Wenhan Yang, Zheng Zhu, Junliang Xing, Jian Zhao","doi":"10.1049/cvi2.12287","DOIUrl":"https://doi.org/10.1049/cvi2.12287","url":null,"abstract":"<p>Human action understanding (HAU) is a broad topic that involves specific tasks, such as action localisation, recognition, and assessment. However, most popular HAU datasets are bound to one task based on particular actions. Combining different but relevant HAU tasks to establish a unified action understanding system is challenging due to the disparate actions across datasets. A large-scale and comprehensive benchmark, namely <b>SkatingVerse</b> is constructed for action recognition, segmentation, proposal, and assessment. SkatingVerse focus on fine-grained sport action, hence figure skating is chosen as the task object, which eliminates the biases of the object, scene, and space that exist in most previous datasets. In addition, skating actions have inherent complexity and similarity, which is an enormous challenge for current algorithms. A total of 1687 official figure skating competition videos was collected with a total of 184.4 h, exceeding four times over other datasets with a similar topic. SkatingVerse enables to formulate a unified task to output fine-grained human action classification and assessment results from a raw figure skating competition video. In addition, <i>SkatingVerse</i> can facilitate the study of HAU foundation model due to its large scale and abundant categories. Moreover, image modality is incorporated for human pose estimation task into <i>SkatingVerse</i>. 
Extensive experimental results show that (1) SkatingVerse significantly helps the training and evaluation of HAU methods, (2) the performance of existing HAU methods has much room for improvement, and SkatingVerse helps to reduce such gaps, and (3) unifying relevant tasks in HAU through a uniform dataset can facilitate more practical applications. SkatingVerse will be publicly available to facilitate further studies on relevant problems.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"18 7","pages":"888-906"},"PeriodicalIF":1.5,"publicationDate":"2024-05-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12287","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142563030","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Federated finger vein presentation attack detection for various clients","authors":"Hengyu Mu, Jian Guo, Xingli Liu, Chong Han, Lijuan Sun","doi":"10.1049/cvi2.12292","DOIUrl":"https://doi.org/10.1049/cvi2.12292","url":null,"abstract":"<p>Recently, the application of finger vein recognition has become popular. Studies have shown finger vein presentation attacks increasingly threaten these recognition devices. As a result, research on finger vein presentation attack detection (fvPAD) methods has received much attention. However, the current fvPAD methods have two limitations. (1) Most terminal devices cannot train fvPAD models independently due to a lack of data. (2) Several research institutes can train fvPAD models; however, these models perform poorly when applied to terminal devices due to inadequate generalisation. Consequently, it is difficult for threatened terminal devices to obtain an effective fvPAD model. To address this problem, the method of federated finger vein presentation attack detection for various clients is proposed, which is the first study that introduces federated learning (FL) to fvPAD. In the proposed method, the differences in data volume and computing power between clients are considered. Traditional FL clients are expanded into two categories: institutional and terminal clients. For institutional clients, an improved triplet training mode with FL is designed to enhance model generalisation. For terminal clients, their inability is solved to obtain effective fvPAD models. 
Finally, extensive experiments are conducted on three datasets, which demonstrate the superiority of our method.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"18 7","pages":"935-949"},"PeriodicalIF":1.5,"publicationDate":"2024-05-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12292","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142563029","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Eigenspectrum regularisation reverse neighbourhood discriminative learning","authors":"Ming Xie, Hengliang Tan, Jiao Du, Shuo Yang, Guofeng Yan, Wangwang Li, Jianwei Feng","doi":"10.1049/cvi2.12284","DOIUrl":"10.1049/cvi2.12284","url":null,"abstract":"<p>Linear discriminant analysis is a classical method for solving problems of dimensional reduction and pattern classification. Although it has been extensively developed, however, it still suffers from various common problems, such as the Small Sample Size (SSS) and the multimodal problem. Neighbourhood linear discriminant analysis (nLDA) was recently proposed to solve the problem of multimodal class caused by the contravention of independently and identically distributed samples. However, due to the existence of many small-scale practical applications, nLDA still has to face the SSS problem, which leads to instability and poor generalisation caused by the singularity of the within-neighbourhood scatter matrix. The authors exploit the eigenspectrum regularisation techniques to circumvent the singularity of the within-neighbourhood scatter matrix of nLDA, which is called Eigenspectrum Regularisation Reverse Neighbourhood Discriminative Learning (ERRNDL). The algorithm of nLDA is reformulated as a framework by searching two projection matrices. Three eigenspectrum regularisation models are introduced to our framework to evaluate the performance. Experiments are conducted on the University of California, Irvine machine learning repository and six image classification datasets. 
The proposed ERRNDL-based methods achieve competitive performance.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"18 6","pages":"842-858"},"PeriodicalIF":1.5,"publicationDate":"2024-05-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12284","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140980457","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CLaSP: Cross-view 6-DoF localisation assisted by synthetic panorama","authors":"Juelin Zhu, Shen Yan, Xiaoya Cheng, Rouwan Wu, Yuxiang Liu, Maojun Zhang","doi":"10.1049/cvi2.12285","DOIUrl":"10.1049/cvi2.12285","url":null,"abstract":"<p>Despite the impressive progress in visual localisation, 6-DoF cross-view localisation is still a challenging task in the computer vision community due to the huge appearance changes. To address this issue, the authors propose the CLaSP, a coarse-to-fine framework, which leverages a synthetic panorama to facilitate cross-view 6-DoF localisation in a large-scale scene. The authors first leverage a segmentation map to correct the prior pose, followed by a synthetic panorama on the ground to enable coarse pose estimation combined with a template matching method. The authors finally formulate the refine localisation process as feature matching and pose refinement to obtain the final result. The authors evaluate the performance of the CLaSP and several state-of-the-art baselines on the <i>Airloc</i> dataset, which demonstrates the effectiveness of our proposed framework.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"18 7","pages":"859-874"},"PeriodicalIF":1.5,"publicationDate":"2024-05-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12285","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140986129","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}