Background debiased class incremental learning for video action recognition
Le Quan Nguyen, Jinwoo Choi, L. Minh Dang, Hyeonjoon Moon
Image and Vision Computing, vol. 151, Article 105295. DOI: 10.1016/j.imavis.2024.105295. Published 2024-10-06.

Abstract: In this work, we tackle class incremental learning (CIL) for video action recognition, a relatively under-explored problem despite its practical importance. Directly applying image-based CIL methods does not work well in the video action recognition setting. We hypothesize that the major reason is the spurious correlation between action and background in video action recognition datasets and models. Recent literature shows that this spurious correlation hampers the generalization of models in the conventional action recognition setting. The problem is even more severe in the CIL setting due to the limited exemplars available in the rehearsal memory. We empirically show that mitigating the spurious correlation between action and background is crucial for CIL in video action recognition. We propose to learn background-invariant action representations in the CIL setting by providing training videos with diverse backgrounds generated by background augmentation techniques. We validate the proposed method on public benchmarks: HMDB-51, UCF-101, and Something-Something-v2.
MATNet: Multilevel attention-based transformers for change detection in remote sensing images
Zhongyu Zhang, Shujun Liu, Yingxiang Qin, Huajun Wang
Image and Vision Computing, vol. 151, Article 105294. DOI: 10.1016/j.imavis.2024.105294. Published 2024-10-05.

Abstract: Remote sensing image change detection is crucial for natural disaster monitoring and land-use change analysis. As resolution increases, the scenes covered by remote sensing images become more complex, and traditional methods struggle to extract detailed information. With the development of deep learning, the field of change detection has new opportunities. However, existing algorithms mainly focus on difference analysis between bi-temporal images while ignoring the semantic information between them, so global and local information cannot interact effectively. In this paper, we introduce a new transformer-based multilevel attention network (MATNet) that extracts multilevel features of global and local information, enables their interaction and fusion, and thus models the global context more effectively. Specifically, we extract multilevel semantic features through a Transformer encoder and use the Feature Enhancement Module (FEM) to perform feature summing and differencing on the multilevel features, better extracting local detail and thereby better detecting changes in small regions. In addition, we employ a multilevel attention decoder (MAD) to obtain information in the spatial and spectral dimensions, effectively fusing global and local information. In experiments, our method performs excellently on the CDD, DSIFN-CD, LEVIR-CD, and SYSU-CD datasets, with F1 scores of 95.67%/87.75%/90.94%/86.82% and OA of 98.95%/95.93%/99.11%/90.53%, respectively.
Knowledge graph construction in hyperbolic space for automatic image annotation
Fariba Lotfi, Mansour Jamzad, Hamid Beigy, Helia Farhood, Quan Z. Sheng, Amin Beheshti
Image and Vision Computing, vol. 151, Article 105293. DOI: 10.1016/j.imavis.2024.105293. Published 2024-09-30.

Abstract: Automatic image annotation (AIA) is a fundamental and challenging task in computer vision. Considering the correlations between tags can lead to more accurate image understanding, benefiting various applications, including image retrieval and visual search. While many attempts have been made to incorporate tag correlations in annotation models, the method of constructing a knowledge graph based on external knowledge sources and hyperbolic space has not been explored. In this paper, we create an attributed knowledge graph based on vocabulary, integrate external knowledge sources such as WordNet, and utilize hyperbolic word embeddings for the tag representations. These embeddings provide a sophisticated tag representation that captures hierarchical and complex correlations more effectively, enhancing the image annotation results. In addition, leveraging external knowledge sources enhances contextuality and significantly enriches existing AIA datasets. We exploit two deep learning-based models, the Relational Graph Convolutional Network (R-GCN) and the Vision Transformer (ViT), to extract the input features. We apply two R-GCN operations to obtain word descriptors and fuse them with the extracted visual features. We evaluate the proposed approach using three public benchmark datasets. Our experimental results demonstrate that the proposed architecture achieves state-of-the-art performance across most metrics on Corel5k, ESP Game, and IAPRTC-12.
High-performance mitosis detection using single-level feature and hybrid label assignment
Jiangxiao Han, Shikang Wang, Xianbo Deng, Wenyu Liu
Image and Vision Computing, vol. 151, Article 105291. DOI: 10.1016/j.imavis.2024.105291. Published 2024-09-29.

Abstract: Mitosis detection poses a significant challenge in medical image analysis, primarily due to the substantial variability in the appearance and shape of mitotic targets. This paper introduces an efficient and accurate mitosis detection framework, which stands apart from previous mitosis detection techniques with its two key features: Single-Level Feature (SLF) for bounding box prediction and Dense-Sparse Hybrid Label Assignment (HLA) for bounding box matching. The SLF component of our method employs a multi-scale Transformer backbone to capture the global context and morphological characteristics of both mitotic and non-mitotic cells. This information is then consolidated into a single-scale feature map, thereby enhancing the model's receptive field and reducing redundant detection across various feature maps. In the HLA component, we propose a hybrid label assignment strategy to facilitate the model's adaptation to mitotic cells of different shapes and positions during training, thereby improving the model's adaptability to diverse cell morphologies. Our method has been tested on the largest mitosis detection datasets and achieves state-of-the-art (SOTA) performance, with an F1 score of 0.782 on the TUPAC 16 benchmark, and 0.792 with test time augmentation (TTA). Our method also exhibits superior accuracy and faster processing speed compared to previous methods. The source code and pretrained models will be released to facilitate related research.
{"title":"Diabetic retinopathy data augmentation and vessel segmentation through deep learning based three fully convolution neural networks","authors":"Jainy Sachdeva PhD , Puneet Mishra , Deeksha Katoch","doi":"10.1016/j.imavis.2024.105284","DOIUrl":"10.1016/j.imavis.2024.105284","url":null,"abstract":"<div><h3>Problem</h3><div>The eye fundus imaging is used for early diagnosis of most damaging concerns such as diabetic retinopathy, retinal detachments and vascular occlusions. However, the presence of noise, low contrast between background and vasculature during imaging, and vessel morphology lead to uncertain vessel segmentation.</div></div><div><h3>Aim</h3><div>This paper proposes a novel retinalblood vessel segmentation method for fundus imaging using a Difference of Gaussian (DoG) filter and an ensemble of three fully convolutional neural network (FCNN) models.</div></div><div><h3>Methods</h3><div>A Gaussian filter with standard deviation <span><math><msub><mi>σ</mi><mn>1</mn></msub></math></span> is applied on the preprocessed grayscale fundus image and is subtracted from a similarly applied Gaussian filter with standard deviation <span><math><msub><mi>σ</mi><mn>2</mn></msub></math></span> on the same image. The resultant image is then fed into each of the three fully convolutional neural networks as the input. The FCNN models' output is then passed through a voting classifier, and a final segmented vessel structure is obtained.The Difference of Gaussian filter played an essential part in removing the high frequency details (noise) and thus finely extracted the blood vessels from the retinal fundus with underlying artifacts.</div></div><div><h3>Results</h3><div>The total dataset consists of 3832 augmented images transformed from 479 fundus images. The result shows that the proposed method has performed extremely well by achieving an accuracy of 96.50%, 97.69%, and 95.78% on DRIVE, CHASE,and real-time clinical datasets respectively.</div></div><div><h3>Conclusion</h3><div>The FCNN ensemble model has demonstrated efficacy in precisely detecting retinal vessels and in the presence of various pathologies and vasculatures.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"151 ","pages":"Article 105284"},"PeriodicalIF":4.2,"publicationDate":"2024-09-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142442942","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Learning weakly supervised audio-visual violence detection in hyperbolic space
Xiao Zhou, Xiaogang Peng, Hao Wen, Yikai Luo, Keyang Yu, Ping Yang, Zizhao Wu
Image and Vision Computing, vol. 151, Article 105286. DOI: 10.1016/j.imavis.2024.105286. Published 2024-09-28.

Abstract: In recent years, the task of weakly supervised audio-visual violence detection has gained considerable attention. The goal of this task is to identify violent segments within multimodal data based on video-level labels. Despite advances in this field, traditional Euclidean neural networks, which have been used in prior research, encounter difficulties in capturing highly discriminative representations due to limitations of the feature space. To overcome this, we propose HyperVD, a novel framework that learns snippet embeddings in hyperbolic space to improve model discrimination. We contribute two branches of fully hyperbolic graph convolutional networks that excavate feature similarities and temporal relationships among snippets in hyperbolic space. By learning snippet representations in this space, the framework effectively learns semantic discrepancies between violent snippets and normal ones. Extensive experiments on the XD-Violence benchmark demonstrate that our method achieves 85.67% AP, outperforming the state-of-the-art methods by a sizable margin.
{"title":"Automated grading of diabetic retinopathy and Radiomics analysis on ultra-wide optical coherence tomography angiography scans","authors":"Vivek Noel Soren, H.S. Prajwal, Vaanathi Sundaresan","doi":"10.1016/j.imavis.2024.105292","DOIUrl":"10.1016/j.imavis.2024.105292","url":null,"abstract":"<div><div>Diabetic retinopathy (DR), a progressive condition due to diabetes that can lead to blindness, is typically characterized by a number of stages, including non-proliferative (mild, moderate and severe) and proliferative DR. These stages are marked by various vascular abnormalities, such as intraretinal microvascular abnormalities (IRMA), neovascularization (NV), and non-perfusion areas (NPA). Automated detection of these abnormalities and grading the severity of DR are crucial for computer-aided diagnosis. Ultra-wide optical coherence tomography angiography (UW-OCTA) images, a type of retinal imaging, are particularly well-suited for analyzing vascular abnormalities due to their prominence on these images. However, accurate detection of abnormalities and subsequent grading of DR is quite challenging due to noisy data, presence of artifacts, poor contrast and subtle nature of abnormalities. In this work, we aim to develop an automated method for accurate grading of DR severity on UW-OCTA images. Our method consists of various components such as UW-OCTA scan quality assessment, segmentation of vascular abnormalities and grading the scans for DR severity. Applied on publicly available data from Diabetic retinopathy analysis challenge (DRAC 2022), our method shows promising results with a Dice overlap metric and recall values of 0.88 for abnormality segmentation, and the coefficient-of-agreement (<span><math><mi>κ</mi></math></span>) value of 0.873 for DR grading. We also performed a radiomics analysis, and observed that the radiomics features are significantly different for increasing levels of DR severity. This suggests that radiomics could be used for multimodal grading and further analysis of DR, indicating its potential scope in this area.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"151 ","pages":"Article 105292"},"PeriodicalIF":4.2,"publicationDate":"2024-09-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142423325","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Utilizing Inherent Bias for Memory Efficient Continual Learning: A Simple and Robust Baseline","authors":"Neela Rahimi, Ming Shao","doi":"10.1016/j.imavis.2024.105288","DOIUrl":"10.1016/j.imavis.2024.105288","url":null,"abstract":"<div><div>Learning from continuously evolving data is critical in real-world applications. This type of learning, known as Continual Learning (CL), aims to assimilate new information without compromising performance on prior knowledge. However, learning new information leads to a bias in the network towards recent observations, resulting in a phenomenon known as catastrophic forgetting. The complexity increases in Online Continual Learning (OCL) scenarios where models are allowed only a single pass over data. Existing OCL approaches that rely on replaying exemplar sets are not only memory-intensive when it comes to large-scale datasets but also raise security concerns. While recent dynamic network models address memory concerns, they often present computationally demanding, over-parameterized solutions with limited generalizability. To address this longstanding problem, we propose a novel OCL approach termed “Bias Robust online Continual Learning (BRCL).” BRCL retains all intermediate models generated. These models inherently exhibit a preference for recently learned classes. To leverage this property for enhanced performance, we devise a strategy we describe as ‘utilizing bias to counteract bias.’ This method involves the development of an Inference function that capitalizes on the inherent biases of each model towards the recent tasks. Furthermore, we integrate a model consolidation technique that aligns the first layers of these models, particularly focusing on similar feature representations. This process effectively reduces the memory requirement, ensuring a low memory footprint. Despite the simplicity of the methodology to guarantee expandability to various frameworks, extensive experiments reveal a notable performance edge over leading methods on key benchmarks, getting continual learning closer to matching offline training. (Source code will be made publicly available upon the publication of this paper.)</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"151 ","pages":"Article 105288"},"PeriodicalIF":4.2,"publicationDate":"2024-09-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142422777","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A dual-channel network based on occlusion feature compensation for human pose estimation","authors":"Jiahong Jiang, Nan Xia","doi":"10.1016/j.imavis.2024.105290","DOIUrl":"10.1016/j.imavis.2024.105290","url":null,"abstract":"<div><div>Human pose estimation is an important technique in computer vision. Existing methods perform well in ideal environments, but there is room for improvement in occluded environments. The specific reasons are that the ambiguity of the features in the occlusion area makes the network pay insufficient attention to it, and the inadequate expressive ability of the features in the occlusion part cannot describe the true keypoint features. To address the occlusion issue, we propose a dual-channel network based on occlusion feature compensation. The dual channels are occlusion area enhancement channel based on convolution and occlusion feature compensation channel based on graph convolution, respectively. In the convolution channel, we propose an occlusion handling enhanced attention mechanism (OHE-attention) to improve the attention to the occlusion area. In the graph convolution channel, we propose a node feature compensation module that eliminates the obstacle features and integrates the shared and private attributes of the keypoints to improve the expressive ability of the node features. We conduct experiments on the COCO2017 dataset, COCO-Wholebody dataset, and CrowdPose dataset, achieving accuracy of 78.7%, 66.4%, and 77.9%, respectively. In addition, a series of ablation experiments and visualization demonstrations verify the performance of the dual-channel network in occluded environments.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"151 ","pages":"Article 105290"},"PeriodicalIF":4.2,"publicationDate":"2024-09-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142422778","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Landmark-in-facial-component: Towards occlusion-robust facial landmark localization","authors":"Xiaoqiang Li , Kaiyuan Wu , Shaohua Zhang","doi":"10.1016/j.imavis.2024.105289","DOIUrl":"10.1016/j.imavis.2024.105289","url":null,"abstract":"<div><div>Despite great efforts in recent years to research robust facial landmark localization methods, occlusion remains a challenge. To tackle this challenge, we propose a model called the Landmark-in-Facial-Component Network (LFCNet). Unlike mainstream models that focus on boundary information, LFCNet utilizes the strong structural constraints inherent in facial anatomy to address occlusion. Specifically, two key modules are designed, a component localization module and an offset localization module. After grouping landmarks based on facial components, the component localization module accomplishes coarse localization of facial components. Offset localization module performs fine localization of landmarks based on the coarse localization results, which can also be seen as delineating the shape of facial components. These two modules form a coarse-to-fine localization pipeline and can also enable LFCNet to better learn the shape constraint of human faces, thereby enhancing LFCNet's robustness to occlusion. LFCNet achieves 4.82% normalized mean error on occlusion subset of WFLW dataset and 6.33% normalized mean error on Masked 300W dataset. The results demonstrate that LFCNet achieves excellent performance in comparison to state-of-the-art methods, especially on occlusion datasets.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"151 ","pages":"Article 105289"},"PeriodicalIF":4.2,"publicationDate":"2024-09-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142422780","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}