LiteCCLKNet: A lightweight criss-cross large kernel convolutional neural network for hyperspectral image classification
Chengcheng Zhong, Na Gong, Zitong Zhang, Yanan Jiang, Kai Zhang
IET Computer Vision, 17(7), pp. 763–776. Published 2023-07-24. DOI: 10.1049/cvi2.12218. Open access PDF: https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12218

Abstract: High-performance convolutional neural networks (CNNs) stack many convolutional layers to obtain powerful feature extraction capability, which leads to huge storage and computational costs. The authors focus on lightweight models for hyperspectral image (HSI) classification and propose a novel lightweight criss-cross large kernel convolutional neural network (LiteCCLKNet). Specifically, a lightweight module containing two 1D convolutions with self-attention mechanisms in orthogonal directions is presented. By setting large kernels within the 1D convolutional layers, the proposed module can efficiently aggregate long-range contextual features. In addition, the authors effectively obtain a global receptive field by stacking only two of the proposed modules. Compared with traditional lightweight CNNs, LiteCCLKNet reduces the number of parameters for easy deployment to resource-limited platforms. Experimental results on three HSI datasets demonstrate that the proposed LiteCCLKNet outperforms previous lightweight CNNs and has higher storage efficiency.
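
To make the criss-cross idea concrete, below is a minimal PyTorch sketch of such a block, not the authors' implementation: the depthwise 1×K and K×1 convolution pair, the 21-tap kernel, and the sigmoid gate standing in for the orthogonal self-attention are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CrissCrossLargeKernelBlock(nn.Module):
    """Aggregates long-range context with two orthogonal 1D large-kernel convolutions."""
    def __init__(self, channels: int, kernel_size: int = 21):
        super().__init__()
        pad = kernel_size // 2
        # Horizontal branch: 1xK depthwise convolution (context along the width axis).
        self.conv_h = nn.Conv2d(channels, channels, (1, kernel_size),
                                padding=(0, pad), groups=channels)
        # Vertical branch: Kx1 depthwise convolution (context along the height axis).
        self.conv_v = nn.Conv2d(channels, channels, (kernel_size, 1),
                                padding=(pad, 0), groups=channels)
        # Simple gating used here as a stand-in for the per-direction attention.
        self.gate = nn.Sequential(nn.Conv2d(2 * channels, channels, 1), nn.Sigmoid())
        self.proj = nn.Conv2d(channels, channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h, v = self.conv_h(x), self.conv_v(x)
        attn = self.gate(torch.cat([h, v], dim=1))
        return x + self.proj(attn * (h + v))

# Toy usage on HSI features: batch of 2, 64 spectral-spatial channels, 15x15 patch.
x = torch.randn(2, 64, 15, 15)
print(CrissCrossLargeKernelBlock(64)(x).shape)  # torch.Size([2, 64, 15, 15])
```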

Dynamic facial expression recognition with pseudo-label guided multi-modal pre-training
Bing Yin, Shi Yin, Cong Liu, Yanyong Zhang, Changfeng Xi, Baocai Yin, Zhenhua Ling
IET Computer Vision, 18(1), pp. 33–45. Published 2023-07-21. DOI: 10.1049/cvi2.12217. Open access PDF: https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12217

Abstract: Due to the huge cost of manual annotation, the labelled data may not be sufficient to train a dynamic facial expression recognition (DFR) model with good performance. To address this, the authors propose a multi-modal pre-training method with a pseudo-label guidance mechanism to make full use of unlabelled video data for learning informative representations of facial expressions. First, the authors build a pre-training dataset of videos with aligned vision and audio modalities. Second, the vision and audio feature encoders are trained through an instance discrimination strategy and a cross-modal alignment strategy on the pre-training data. Third, the vision feature encoder is extended into a dynamic expression recogniser and fine-tuned on the labelled training data. Fourth, the fine-tuned expression recogniser is adopted to predict pseudo-labels for the pre-training data, and a new pre-training phase is then started with the guidance of the pseudo-labels to alleviate the long-tail distribution problem and the instance-class conflict. Fifth, since the representations learnt with the guidance of pseudo-labels are more informative, a new fine-tuning phase is added to further boost the generalisation performance on the DFR task. Experimental results on the Dynamic Facial Expression in the Wild dataset demonstrate the superiority of the proposed method.
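
As an illustration of the cross-modal alignment strategy mentioned in the second step, the following is a minimal sketch of a symmetric InfoNCE loss between vision and audio clip embeddings; the temperature value and embedding size are assumptions, and the paper's exact alignment objective may differ.

```python
import torch
import torch.nn.functional as F

def cross_modal_alignment_loss(vis_emb: torch.Tensor,
                               aud_emb: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE: the vision and audio clips from the same video are positives."""
    vis = F.normalize(vis_emb, dim=-1)
    aud = F.normalize(aud_emb, dim=-1)
    logits = vis @ aud.t() / temperature            # (B, B) similarity matrix
    targets = torch.arange(vis.size(0), device=vis.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Toy usage: 8 video clips with 256-D vision and audio embeddings.
loss = cross_modal_alignment_loss(torch.randn(8, 256), torch.randn(8, 256))
```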

Position-aware spatio-temporal graph convolutional networks for skeleton-based action recognition
Ping Yang, Qin Wang, Hao Chen, Zizhao Wu
IET Computer Vision, 17(7), pp. 844–854. Published 2023-07-13. DOI: 10.1049/cvi2.12223. Open access PDF: https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12223

Abstract: Graph Convolutional Networks (GCNs) have been widely used in skeleton-based action recognition. Although significant performance has been achieved, it remains challenging to effectively model the complex dynamics of skeleton sequences. A novel position-aware spatio-temporal GCN for skeleton-based action recognition is proposed, in which positional encoding is investigated to enhance the capacity of typical baselines to comprehend the dynamic characteristics of action sequences. Specifically, the authors' method systematically investigates temporal position encoding and spatial position embedding, in favour of explicitly capturing the sequence-ordering information and the identity information of the nodes used in the graphs. Additionally, to alleviate the redundancy and over-smoothing problems of typical GCNs, the authors' method further investigates a subgraph mask, which is geared to mine the prominent subgraph patterns over the underlying graph, making the model robust against the impact of irrelevant joints. Extensive experiments on three large-scale datasets demonstrate that the model achieves competitive results compared with previous state-of-the-art methods.
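
The temporal position encoding can be illustrated with a standard sinusoidal scheme added to skeleton features shaped (batch, channels, frames, joints); this is a generic sketch, not the authors' exact formulation.

```python
import math
import torch

def temporal_positional_encoding(num_frames: int, channels: int) -> torch.Tensor:
    """Standard sinusoidal encoding over the time axis, shaped (1, C, T, 1)."""
    pe = torch.zeros(num_frames, channels)
    pos = torch.arange(num_frames, dtype=torch.float32).unsqueeze(1)
    div = torch.exp(torch.arange(0, channels, 2, dtype=torch.float32)
                    * (-math.log(10000.0) / channels))
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe.t().unsqueeze(0).unsqueeze(-1)   # broadcasts over batch and joints

# Skeleton features: batch 4, 64 channels, 300 frames, 25 joints.
x = torch.randn(4, 64, 300, 25)
x = x + temporal_positional_encoding(300, 64)   # inject frame-order information
```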

A point-image fusion network for event-based frame interpolation
Chushu Zhang, Wei An, Ye Zhang, Miao Li
IET Computer Vision, 18(4), pp. 439–447. Published 2023-07-10. DOI: 10.1049/cvi2.12220. Open access PDF: https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12220

Abstract: Temporal information in event streams plays a critical role in event-based video frame interpolation, as it provides temporal context cues complementary to images. Most previous event-based methods first transform the unstructured event data into structured data formats through voxelisation and then employ advanced CNNs to extract temporal information. However, voxelisation inevitably leads to information loss, and processing the sparse voxels introduces severe computational redundancy. To address these limitations, this study proposes a point-image fusion network (PIFNet). In our PIFNet, rich temporal information from the events can be extracted directly at the point level. A fusion module is then designed to fuse complementary cues from both points and images for frame interpolation. Extensive experiments on both synthetic and real datasets demonstrate that our PIFNet achieves state-of-the-art performance with high efficiency.
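
A rough sketch of point-level event encoding fused with image features is given below; the shared-MLP encoder, the max pooling, and the 1×1 fusion convolution are assumptions for illustration rather than the published PIFNet design.

```python
import torch
import torch.nn as nn

class PointImageFusion(nn.Module):
    """Encodes raw events (x, y, t, polarity) with a shared MLP and fuses the pooled
    temporal descriptor with per-pixel image features."""
    def __init__(self, img_channels: int = 64, point_feat: int = 64):
        super().__init__()
        self.point_mlp = nn.Sequential(
            nn.Linear(4, 64), nn.ReLU(inplace=True), nn.Linear(64, point_feat))
        self.fuse = nn.Conv2d(img_channels + point_feat, img_channels, 1)

    def forward(self, img_feat: torch.Tensor, events: torch.Tensor) -> torch.Tensor:
        # events: (B, N, 4); max-pool over the N events to obtain a global temporal cue.
        cue = self.point_mlp(events).max(dim=1).values           # (B, point_feat)
        cue = cue[:, :, None, None].expand(-1, -1, *img_feat.shape[-2:])
        return self.fuse(torch.cat([img_feat, cue], dim=1))

# Toy usage: 2 samples, 64-channel image features at 32x32, 1000 events each.
out = PointImageFusion()(torch.randn(2, 64, 32, 32), torch.rand(2, 1000, 4))
```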

Enhancing human parsing with region-level learning
Yanghong Zhou, P. Y. Mok
IET Computer Vision, 18(1), pp. 60–71. Published 2023-07-05. DOI: 10.1049/cvi2.12222. Open access PDF: https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12222

Abstract: Human parsing is very important in a diverse range of industrial applications. Despite the considerable progress that has been achieved, the performance of existing methods is still less than satisfactory, since these methods learn the shared features of various parsing labels at the image level. This limits the representativeness of the learnt features, especially when the distribution of parsing labels is imbalanced or the scale of different labels differs substantially. To address this limitation, a Region-level Parsing Refiner (RPR) is proposed to enhance parsing performance by introducing region-level parsing learning. Region-level parsing focuses specifically on small regions of the body, for example, the head. The proposed RPR is an adaptive module that can be integrated with different existing human parsing models to improve their performance. Extensive experiments are conducted on two benchmark datasets, and the results demonstrate the effectiveness of the RPR model in improving overall parsing performance as well as in parsing rare labels. The method has been successfully applied in a commercial application for the extraction of human body measurements and is used by various online shopping platforms for clothing size recommendations. The code and dataset are released at https://github.com/applezhouyp/PRP.
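
The region-level refinement idea can be sketched as cropping one body region from the coarse parsing logits, refining it at a higher resolution, and pasting the result back; the crop size, the residual formulation and the tiny refinement head below are hypothetical, not the released RPR code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RegionRefiner(nn.Module):
    """Crops one body region from the coarse logits, refines it at a higher
    resolution with a small conv head, and pastes the result back."""
    def __init__(self, num_classes: int, crop_size: int = 96):
        super().__init__()
        self.crop_size = crop_size
        self.head = nn.Sequential(
            nn.Conv2d(num_classes, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, num_classes, 3, padding=1))

    def forward(self, logits: torch.Tensor, box: tuple) -> torch.Tensor:
        y0, y1, x0, x1 = box                                    # region of interest
        region = logits[:, :, y0:y1, x0:x1]
        refined = self.head(F.interpolate(region, self.crop_size, mode='bilinear',
                                          align_corners=False))
        refined = F.interpolate(refined, region.shape[-2:], mode='bilinear',
                                align_corners=False)
        out = logits.clone()
        out[:, :, y0:y1, x0:x1] = region + refined              # residual refinement
        return out

# Toy usage: 20-class parsing logits at 128x128, refine the head region.
out = RegionRefiner(20)(torch.randn(1, 20, 128, 128), (0, 40, 40, 88))
```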

CAGAN: Classifier-augmented generative adversarial networks for weakly-supervised COVID-19 lung lesion localisation
Xiaojie Li, Xin Fei, Zhe Yan, Hongping Ren, Canghong Shi, Xian Zhang, Imran Mumtaz, Yong Luo, Xi Wu
IET Computer Vision, 18(1), pp. 1–14. Published 2023-07-03. DOI: 10.1049/cvi2.12216. Open access PDF: https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12216

Abstract: The Coronavirus Disease 2019 (COVID-19) epidemic has constituted a Public Health Emergency of International Concern. Chest computed tomography (CT) can help reveal abnormalities indicative of lung disease at an early stage. Accurate and automatic localisation of lung lesions is therefore particularly important for assisting physicians in the rapid diagnosis of COVID-19 patients. The authors propose a classifier-augmented generative adversarial network framework for weakly supervised COVID-19 lung lesion localisation. It consists of an abnormality map generator, a discriminator and a classifier. The generator produces an abnormality feature map M to locate lesion regions and then constructs images of pseudo-healthy subjects by adding M to the input patient images. Besides constraining the generated healthy-subject images to the real distribution through the discriminator, a pre-trained classifier is introduced to encourage the generated healthy-subject images to possess feature representations similar to those of real healthy people in terms of high-level semantics. Moreover, an attention gate is employed in the generator to reduce the noise effect in the irrelevant regions of M. Experimental results on the COVID-19 CT dataset show that the method is effective in capturing more lesion areas while generating less noise in unrelated areas, and that it has significant advantages in terms of quantitative and qualitative results over existing methods.
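
A minimal sketch of the generator's additive formulation is shown below, assuming the pseudo-healthy image is obtained as the input plus a gated abnormality map; the backbone depth, gate design and channel widths are placeholders, not the paper's architecture.

```python
import torch
import torch.nn as nn

class AbnormalityGenerator(nn.Module):
    """Predicts an additive abnormality map M and an attention gate A; the
    pseudo-healthy image is x + A * M, so lesion regions can be read off the map."""
    def __init__(self, channels: int = 1, width: int = 32):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(channels, width, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(width, width, 3, padding=1), nn.ReLU(inplace=True))
        self.to_map = nn.Conv2d(width, channels, 3, padding=1)    # abnormality map M
        self.to_gate = nn.Sequential(nn.Conv2d(width, 1, 1), nn.Sigmoid())

    def forward(self, x: torch.Tensor):
        feat = self.backbone(x)
        m, gate = self.to_map(feat), self.to_gate(feat)
        pseudo_healthy = x + gate * m        # fed to the discriminator and classifier
        return pseudo_healthy, gate * m

# Toy usage on a batch of 1-channel CT slices.
healthy, abnormality = AbnormalityGenerator()(torch.randn(2, 1, 256, 256))
```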

Mirror complementary transformer network for RGB-thermal salient object detection
Xiurong Jiang, Yifan Hou, Hui Tian, Lin Zhu
IET Computer Vision, 18(1), pp. 15–32. Published 2023-06-28. DOI: 10.1049/cvi2.12221. Open access PDF: https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12221

Abstract: Conventional RGB-T salient object detection (SOD) treats the RGB and thermal modalities equally to locate the common salient regions. However, the authors observe that the rich colour and texture information of the RGB modality makes objects more prominent against the background, whereas the thermal modality records the temperature differences of the scene, so objects usually contain clear and continuous edge information. In this work, a novel mirror-complementary Transformer network (MCNet) is proposed for RGB-T SOD, which supervises the two modalities separately with a complementary set of saliency labels under a symmetrical structure. Moreover, attention-based feature interaction and serial multiscale dilated convolution (SDC)-based feature fusion modules are introduced to let the two modalities complement and adjust each other flexibly. When one modality fails, the proposed model can still accurately segment the salient regions. To demonstrate the robustness of the proposed model under challenging real-world scenes, the authors build a novel RGB-T SOD dataset, VT723, based on a large public semantic segmentation RGB-T dataset used in the autonomous driving domain. Extensive experiments on the benchmark datasets and VT723 show that the proposed method outperforms state-of-the-art approaches, including CNN-based and Transformer-based methods. The code and dataset can be found at https://github.com/jxr326/SwinMCNet.
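
One possible reading of the mirror-complementary supervision is sketched below, where one branch is trained on the saliency mask and the mirrored branch on its complement; this interpretation and the plain BCE losses are assumptions, not the authors' exact objective.

```python
import torch
import torch.nn.functional as F

def mirror_complementary_loss(rgb_pred: torch.Tensor,
                              thermal_pred: torch.Tensor,
                              saliency_gt: torch.Tensor) -> torch.Tensor:
    """One branch is supervised with the saliency mask, the mirrored branch with its
    complement (1 - mask), so the two modalities learn complementary evidence."""
    loss_rgb = F.binary_cross_entropy_with_logits(rgb_pred, saliency_gt)
    loss_thermal = F.binary_cross_entropy_with_logits(thermal_pred, 1.0 - saliency_gt)
    return loss_rgb + loss_thermal

# Toy usage: logits from both branches and a binary ground-truth mask at 64x64.
gt = (torch.rand(2, 1, 64, 64) > 0.5).float()
loss = mirror_complementary_loss(torch.randn(2, 1, 64, 64),
                                 torch.randn(2, 1, 64, 64), gt)
```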

Robust object tracking via ensembling semantic-aware network and redetection
Peiqiang Liu, Qifeng Liang, Zhiyong An, Jingyi Fu, Yanyan Mao
IET Computer Vision, 18(1), pp. 46–59. Published 2023-06-24. DOI: 10.1049/cvi2.12219. Open access PDF: https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12219

Abstract: Most Siamese-based trackers use classification and regression to determine the target bounding box, which can be formulated as a linear matching process between the template and the search region. However, this only takes into account the similarity of features while ignoring semantic object information, resulting in cases in which the regression box with the highest classification score is not the most accurate. To address this lack of semantic information, an object tracking approach based on an ensemble semantic-aware network and redetection (ESART) is proposed. A DarkNet53 network with transfer learning is used as the semantic-aware model, adapting the detection task to extract semantic information. In addition, a semantic tag redetection method is proposed to re-evaluate the bounding box and overcome inaccurate scaling issues. Extensive experiments on OTB2015, UAV123, UAV20L and GOT-10k show that the proposed tracker is superior to other state-of-the-art trackers. It is noteworthy that the semantic-aware ensemble method can be embedded into any tracker with classification and regression tasks.
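
The ensembling step can be illustrated as a weighted fusion of the Siamese classification map with a semantic-aware similarity map; the min-max normalisation and the weight alpha below are assumptions for illustration.

```python
import torch

def ensemble_scores(cls_score: torch.Tensor,
                    semantic_score: torch.Tensor,
                    alpha: float = 0.6) -> torch.Tensor:
    """Weighted ensemble of the Siamese classification map and a semantic-aware
    similarity map; the box at the location with the highest fused score is kept."""
    cls = (cls_score - cls_score.min()) / (cls_score.max() - cls_score.min() + 1e-6)
    sem = (semantic_score - semantic_score.min()) / (semantic_score.max()
                                                     - semantic_score.min() + 1e-6)
    return alpha * cls + (1.0 - alpha) * sem

# Toy usage on 25x25 response maps.
fused = ensemble_scores(torch.rand(25, 25), torch.rand(25, 25))
best = torch.argmax(fused)
```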

Attribute-guided transformer for robust person re-identification
Zhe Wang, Jun Wang, Junliang Xing
IET Computer Vision, 17(8), pp. 977–992. Published 2023-06-23. DOI: 10.1049/cvi2.12215. Open access PDF: https://ietresearch.onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12215

Abstract: Recent studies reveal the crucial role of local features in learning robust and discriminative representations for person re-identification (Re-ID). Existing approaches typically rely on external tasks, for example, semantic segmentation or pose estimation, to locate identifiable parts of given images. However, they heuristically utilise the predictions from off-the-shelf models, which may be sub-optimal in terms of both local partition and computational efficiency. They also ignore the mutual information with other inputs, which weakens the representation capabilities of local features. In this study, the authors put forward a novel Attribute-guided Transformer (AiT), which explicitly exploits pedestrian attributes as semantic priors for discriminative representation learning. Specifically, the authors first introduce an attribute learning process, which generates a set of attention maps highlighting the informative parts of pedestrian images. Then, the authors design a Feature Diffusion Module (FDM) to iteratively inject attribute information into global feature maps, aiming at suppressing unnecessary noise and inferring attribute-aware representations. Last, the authors propose a Feature Aggregation Module (FAM) to exploit mutual information for aggregating attribute characteristics from different images, enhancing the representation capabilities of feature embedding. Extensive experiments demonstrate the superiority of AiT in learning robust and discriminative representations. As a result, the authors achieve competitive performance with state-of-the-art methods on several challenging benchmarks without any bells and whistles.
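
A minimal sketch of iterative attribute injection in the spirit of the FDM is given below; the number of diffusion steps, the sigmoid gating and the residual 1×1 mixing are assumptions, not the published module.

```python
import torch
import torch.nn as nn

class FeatureDiffusion(nn.Module):
    """Iteratively injects attribute attention maps into the global feature map:
    at each step the features are re-weighted by one attribute map and mixed back
    through a 1x1 convolution with a residual connection."""
    def __init__(self, channels: int, num_steps: int = 3):
        super().__init__()
        self.num_steps = num_steps
        self.mix = nn.ModuleList([nn.Conv2d(channels, channels, 1)
                                  for _ in range(num_steps)])

    def forward(self, feat: torch.Tensor, attr_maps: torch.Tensor) -> torch.Tensor:
        # feat: (B, C, H, W); attr_maps: (B, K, H, W) attention maps, K >= num_steps.
        for i in range(self.num_steps):
            attn = attr_maps[:, i:i + 1].sigmoid()          # (B, 1, H, W)
            feat = feat + self.mix[i](feat * attn)          # residual injection
        return feat

# Toy usage: 256-channel person features with 4 attribute attention maps.
out = FeatureDiffusion(256)(torch.randn(2, 256, 24, 8), torch.randn(2, 4, 24, 8))
```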

DASTSiam: Spatio-temporal fusion and discriminative enhancement for Siamese visual tracking
Yucheng Huang, Eksan Firkat, Jinlai Zhang, Lijuan Zhu, Bin Zhu, Jihong Zhu, Askar Hamdulla
IET Computer Vision, 17(8), pp. 1017–1033. Published 2023-06-19. DOI: 10.1049/cvi2.12213. Open access PDF: https://ietresearch.onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12213

Abstract: The use of deep neural networks has revolutionised object tracking, and Siamese trackers have emerged as a prominent technique for this purpose. Existing Siamese trackers use a fixed template or a template-updating technique, which is prone to overfitting, lacks the capacity to exploit global temporal sequences, and cannot utilise multi-layer features. As a result, it is challenging to deal with dramatic appearance changes in complicated scenarios. Siamese trackers also struggle to learn background information, which impairs their discriminative ability. Hence, two transformer-based modules, the Spatio-Temporal Fusion (ST) module and the Discriminative Enhancement (DE) module, are proposed to improve the performance of Siamese trackers. The ST module leverages cross-attention to accumulate global temporal cues and generates an attention matrix with spatio-temporal similarity to enhance the template's adaptability to changes in target appearance. The DE module associates semantically similar points from the template and the search area, thereby generating a learnable discriminative mask to enhance the discriminative ability of Siamese trackers. In addition, a Multi-Layer ST module (ST + ML) is constructed, which can be integrated into Siamese trackers based on multi-layer cross-correlation for further improvement. The authors evaluate the proposed modules on four public datasets and show competitive performance compared with existing Siamese trackers.
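
The ST module's use of cross-attention over accumulated templates can be sketched with a single multi-head attention layer in which the current template queries a buffer of historical template tokens; the token counts and layer sizes below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SpatioTemporalFusion(nn.Module):
    """Cross-attention between the current template (query) and a buffer of
    historical templates (keys/values), producing a temporally adapted template."""
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, template: torch.Tensor, history: torch.Tensor) -> torch.Tensor:
        # template: (B, N, C) tokens of the current template;
        # history: (B, T*N, C) tokens accumulated from earlier frames.
        fused, _ = self.attn(query=template, key=history, value=history)
        return self.norm(template + fused)

# Toy usage: 7x7 template tokens (49) and 3 historical templates (147 tokens).
out = SpatioTemporalFusion()(torch.randn(1, 49, 256), torch.randn(1, 147, 256))
```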