{"title":"GDM-depth: Leveraging global dependency modelling for self-supervised indoor depth estimation","authors":"Chen Lv , Chenggong Han , Jochen Lang , He Jiang , Deqiang Cheng , Jiansheng Qian","doi":"10.1016/j.imavis.2024.105160","DOIUrl":"https://doi.org/10.1016/j.imavis.2024.105160","url":null,"abstract":"<div><p>Self-supervised depth estimation algorithms eschew depth ground truth and employ a convolutional U-Net with a fixed receptive field, which confines its focus primarily to nearby spatial distances. These factors preclude adequate supervision during image reconstruction, consequently hindering accurate depth estimation, particularly in complex indoor scenes. A pure transformer framework can perform global modelling to provide more semantic information; however, its computational cost is significant. To tackle these challenges, we introduce GDM-Depth, which utilizes global dependency modelling to offer more precise depth guidance from the network itself. Initially, we propose integrating learnable tree filters with unary terms, leveraging the structural properties of spanning trees to facilitate efficient long-range interactions. Subsequently, instead of replacing the convolutional framework entirely, we employ the transformer to design a scale-aware global feature extractor, establishing global relationships among local features at various scales and achieving both efficiency and cost-effectiveness. Furthermore, inter-class disparities between global and local depth features are observed. To address this issue, we introduce a global feature injector to further enhance the representation. 
GDM-Depth's effectiveness is demonstrated on the NYUv2, ScanNet, and InteriorNet depth datasets, achieving impressive test set performances of 87.2%, 83.1%, and 76.1% in key indicators <span><math><mi>δ</mi><mo><</mo><mn>0.125</mn></math></span>, respectively.</p></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":null,"pages":null},"PeriodicalIF":4.2,"publicationDate":"2024-07-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141605867","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
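The δ indicator reported above can be illustrated with a minimal sketch. The standard form of this depth metric is the fraction of pixels whose ratio max(pred/gt, gt/pred) falls below a threshold; the abstract's δ < 0.125 variant may be defined differently in the paper, so the function name and the default threshold of 1.25 below are assumptions, not the authors' exact definition:

```python
import numpy as np

def threshold_accuracy(pred, gt, threshold=1.25):
    """Fraction of pixels whose depth ratio max(pred/gt, gt/pred) is below threshold."""
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    ratio = np.maximum(pred / gt, gt / pred)  # >= 1 everywhere; 1 means a perfect pixel
    return float(np.mean(ratio < threshold))
```

A perfect prediction scores 1.0; pixels off by more than the threshold factor drop the score proportionally.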
{"title":"Hybrid attention transformer with re-parameterized large kernel convolution for image super-resolution","authors":"Zhicheng Ma , Zhaoxiang Liu , Kai Wang , Shiguo Lian","doi":"10.1016/j.imavis.2024.105162","DOIUrl":"https://doi.org/10.1016/j.imavis.2024.105162","url":null,"abstract":"<div><p>Single image super-resolution (SISR) is a well-established low-level vision task that aims to reconstruct high-resolution images from low-resolution images. Methods based on the Transformer have shown remarkable success and achieved outstanding performance in SISR tasks. While the Transformer effectively models global information, it is less effective at capturing high frequencies, such as stripes, that primarily provide local information, and its capture of global information can be enhanced further. To tackle this, we propose a novel Large Kernel Hybrid Attention Transformer using re-parameterization. It combines re-parameterized convolution layers of different kernel sizes and strides with the Transformer to capture global and local information and learn comprehensive features with both low-frequency and high-frequency content. Moreover, to address the artifacts introduced by batch normalization layers in SISR, we propose a new training strategy that fuses the convolution and batch normalization layers after a certain number of training epochs. This strategy benefits from the accelerated convergence that batch normalization provides during training while effectively eliminating artifacts at the inference stage. For re-parameterization of multiple parallel convolution branches, this strategy also further reduces the training computation. 
By coupling these core improvements, our LKHAT achieves state-of-the-art performance for the single image super-resolution task.</p></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":null,"pages":null},"PeriodicalIF":4.2,"publicationDate":"2024-07-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141582165","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
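The conv-BN fusion that this training strategy relies on is standard algebra: a batch normalization layer applied after a convolution is an affine per-channel transform, so it can be folded into the convolution's weights and bias. A minimal numpy sketch (the function name and array layout are illustrative, not the paper's implementation):

```python
import numpy as np

def fuse_conv_bn(w, b, gamma, beta, mean, var, eps=1e-5):
    """Fold a BatchNorm layer into the preceding convolution.

    w: conv weights (C_out, C_in, k, k);  b: conv bias (C_out,)
    gamma, beta, mean, var: BN scale, shift, running mean, running variance (C_out,)
    """
    g = gamma / np.sqrt(var + eps)              # per-channel BN scale
    w_fused = w * g[:, None, None, None]        # scale each output filter
    b_fused = (b - mean) * g + beta             # absorb mean shift into the bias
    return w_fused, b_fused
```

After fusion, one convolution reproduces conv-then-BN exactly, which is why the artifacts tied to BN statistics disappear at inference while its convergence benefit is kept during training.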
{"title":"AI-powered trustable and explainable fall detection system using transfer learning","authors":"Aryan Nikul Patel , Ramalingam Murugan , Praveen Kumar Reddy Maddikunta , Gokul Yenduri , Rutvij H. Jhaveri , Yaodong Zhu , Thippa Reddy Gadekallu","doi":"10.1016/j.imavis.2024.105164","DOIUrl":"https://doi.org/10.1016/j.imavis.2024.105164","url":null,"abstract":"<div><p>Accidental falls pose a significant public health challenge, especially among vulnerable populations. To address this issue, comprehensive research on fall detection and rescue systems is essential. Vision-based technologies, with their promising potential, offer an effective means to detect falls. This research paper presents a cutting-edge fall detection methodology aimed at enhancing individual safety and well-being. The proposed methodology utilizes deep neural networks, leveraging their capabilities to drive advancements in fall detection. To overcome data limitations and computational efficiency concerns, this study employs transfer learning by fine-tuning pre-trained models on large-scale image datasets for fall detection. This approach significantly enhances model performance, enabling better generalization and accuracy, especially in real-time applications with constrained resources. Notably, the methodology achieved an impressive test accuracy of 98.15%. Additionally, Explainable Artificial Intelligence (XAI) techniques are incorporated to ensure transparent and trustworthy decision-making in fall detection using deep learning models, especially in critical healthcare contexts for vulnerable individuals. XAI provides valuable insights into complex model architectures and parameters, enabling a deeper understanding of fall identification patterns. To evaluate the effectiveness of this approach, rigorous experimentation was conducted using a diverse dataset containing real-world fall and non-fall scenarios. 
The results demonstrate substantial improvements in both accuracy and interpretability, confirming the superiority of this method over conventional fall detection approaches.</p></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":null,"pages":null},"PeriodicalIF":4.2,"publicationDate":"2024-07-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141582083","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
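The transfer-learning recipe described here, in its simplest form, keeps a pre-trained backbone frozen and trains only a small classification head on its features. The sketch below is a hypothetical stand-in: a logistic-regression "fall / no-fall" head fitted by gradient descent on fixed feature vectors, not the paper's actual architecture or training code:

```python
import numpy as np

def train_head(features, labels, lr=1.0, epochs=500):
    """Fit a binary logistic-regression head on frozen backbone features."""
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels, dtype=float)
    n, d = features.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(features @ w + b)))  # sigmoid predictions
        grad = (p - labels) / n                         # cross-entropy gradient w.r.t. logits
        w -= lr * features.T @ grad
        b -= lr * grad.sum()
    return w, b

def predict(features, w, b):
    """Threshold the head's probability at 0.5 -> 0 (no fall) / 1 (fall)."""
    z = np.asarray(features, dtype=float) @ w + b
    return (1.0 / (1.0 + np.exp(-z)) > 0.5).astype(int)
```

In an actual pipeline the `features` would come from the frozen pre-trained model's penultimate layer; only the head's parameters are updated, which is what keeps the approach cheap on constrained hardware.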
{"title":"PMANet: Progressive multi-stage attention networks for skin disease classification","authors":"Guangzhe Zhao, Chen Zhang, Xueping Wang, Benwang Lin, Feihu Yan","doi":"10.1016/j.imavis.2024.105166","DOIUrl":"https://doi.org/10.1016/j.imavis.2024.105166","url":null,"abstract":"<div><p>Automated skin disease classification is crucial for the timely diagnosis of skin lesions. However, accurate skin disease classification presents a challenge, given the significant intra-class variation and inter-class similarity among different kinds of skin diseases. Previous studies have attempted to address this issue by identifying the most discriminative part of a lesion, but they tend to overlook the interactions between multi-scale features. In this paper, we propose a Progressive Multi-stage Attention Network (PMANet) to enhance the learning of multi-scale discriminative features, so that the model can gradually localize from stable fine-grained to coarse-grained regions in order to improve the accuracy of disease classification. Specifically, we utilize a progressive multi-stage network to supervise feature learning and classification, thereby fostering multi-scale information exchange and improving the model's ability to learn intra-class consistent information. Additionally, we propose an enhanced region proposal block that highlights key discriminative features and suppresses background noise of lesions, reinforcing the learning of inter-class discriminative features. Furthermore, we propose a multi-branch feature fusion block that effectively fuses multi-scale lesion features from different stages. 
Comprehensive experiments conducted on two datasets substantiate the effectiveness and superiority of the proposed method in accurately classifying skin disease.</p></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":null,"pages":null},"PeriodicalIF":4.2,"publicationDate":"2024-07-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141595170","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
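The multi-branch fusion idea, stripped to its core, is to bring stage features of different spatial sizes to a common resolution and combine them. A minimal sketch under assumed conventions (square `(C, H, W)` maps whose sizes evenly divide the finest one; nearest-neighbour upsampling and summation stand in for whatever learned fusion PMANet actually uses):

```python
import numpy as np

def upsample_nearest(f, factor):
    """Nearest-neighbour upsampling of a (C, H, W) feature map by an integer factor."""
    return f.repeat(factor, axis=1).repeat(factor, axis=2)

def fuse_multiscale(features):
    """Sum feature maps from different stages at the finest spatial resolution."""
    target = max(f.shape[1] for f in features)
    return sum(upsample_nearest(f, target // f.shape[1]) for f in features)
```

Coarse-stage maps contribute context everywhere while fine-stage maps keep lesion detail, which is the point of fusing across stages.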
{"title":"A semi-parallel CNN-transformer fusion network for semantic change detection","authors":"Changzhong Zou, Ziyuan Wang","doi":"10.1016/j.imavis.2024.105157","DOIUrl":"https://doi.org/10.1016/j.imavis.2024.105157","url":null,"abstract":"<div><p>Semantic change detection (SCD) can recognize the region and the type of changes in remote sensing images. Existing methods are based on either the transformer or a convolutional neural network (CNN), but because ground objects vary in size, both global modeling ability and local information extraction ability are needed at the same time. Therefore, in this paper we propose a fusion semantic change detection network (FSCD) with both global modeling ability and local information extraction ability by fusing the transformer and CNN. A semi-parallel fusion block has also been proposed to construct FSCD. It can not only keep global and local features in parallel, but also fuse them as deeply as a serial design. To better adaptively decide which mechanism is applied to which pixel, we design a self-attention and convolution selection module (ACSM). ACSM is a self-attention mechanism used to selectively combine the transformer and CNN. Specifically, the importance of each mechanism is automatically obtained by learning, and the mechanism suited to a pixel is selected according to that importance, which is better than using either mechanism alone. 
We evaluate the proposed FSCD on two datasets, where it achieves a significant improvement over state-of-the-art networks.</p></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":null,"pages":null},"PeriodicalIF":4.2,"publicationDate":"2024-07-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141605868","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
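The per-pixel selection that ACSM performs can be sketched as a learned soft gate: a softmax over two importance maps weights the CNN and attention branch features at every pixel. The function below is an illustrative reduction of that idea, not the module's actual implementation:

```python
import numpy as np

def select_mechanism(f_conv, f_attn, gate_logits):
    """Per-pixel soft selection between a CNN branch and an attention branch.

    f_conv, f_attn: (H, W) branch features; gate_logits: (2, H, W) learned scores.
    """
    e = np.exp(gate_logits - gate_logits.max(axis=0, keepdims=True))
    w = e / e.sum(axis=0, keepdims=True)      # softmax over the two mechanisms
    return w[0] * f_conv + w[1] * f_attn
```

With extreme logits the gate collapses to a hard choice of one mechanism per pixel; with balanced logits it blends both, which is what makes the selection differentiable and learnable.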
{"title":"FTAN: Frame-to-frame temporal alignment network with contrastive learning for few-shot action recognition","authors":"Bin Yu , Yonghong Hou , Zihui Guo , Zhiyi Gao , Yueyang Li","doi":"10.1016/j.imavis.2024.105159","DOIUrl":"https://doi.org/10.1016/j.imavis.2024.105159","url":null,"abstract":"<div><p>Most current few-shot action recognition approaches follow the metric learning paradigm, measuring the distance of any sub-sequences (frames, any frame combinations or clips) between different actions for classification. However, this disordered distance metric between action sub-sequences ignores the long-term temporal relations of actions, which may result in significant metric deviations. What's more, the distance metric suffers from the distinctive temporal distribution of different actions, including intra-class temporal offsets and inter-class local similarity. In this paper, a novel few-shot action recognition framework, Frame-to-frame Temporal Alignment Network (<strong>FTAN</strong>), is proposed to address the above challenges. Specifically, an attention-based temporal alignment (<strong>ATA</strong>) module is devised to calculate the distance between corresponding frames of different actions along the temporal dimension to achieve frame-to-frame temporal alignment. Meanwhile, the Temporal Context module (<strong>TCM</strong>) is proposed to increase inter-class diversity by enriching the frame-level feature representation, and the Frames Cyclic Shift Module (<strong>FCSM</strong>) performs frame-level temporal cyclic shift to reduce intra-class inconsistency. In addition, we present temporal and global contrastive objectives to assist in learning discriminative and class-agnostic visual features. 
Experimental results show that the proposed architecture achieves state-of-the-art performance on the HMDB51, UCF101, Something-Something V2 and Kinetics-100 datasets.</p></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":null,"pages":null},"PeriodicalIF":4.2,"publicationDate":"2024-07-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141582081","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
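The contrast between the disordered sub-sequence metrics criticized above and FTAN's ordered, frame-to-frame comparison, plus the cyclic shift used by FCSM, can be sketched in a few lines (function names and the `(T, D)` sequence layout are assumptions for illustration):

```python
import numpy as np

def frame_to_frame_distance(a, b):
    """Mean Euclidean distance between temporally corresponding frames.

    a, b: (T, D) frame-feature sequences; the metric respects temporal order,
    unlike distances over arbitrary frame combinations.
    """
    return float(np.linalg.norm(np.asarray(a) - np.asarray(b), axis=1).mean())

def cyclic_shift(seq, offset):
    """Roll a (T, D) sequence along the temporal axis, as in a frames cyclic shift."""
    return np.roll(np.asarray(seq), offset, axis=0)
```

Comparing a sequence against cyclically shifted versions of itself exposes intra-class temporal offsets, which is the kind of inconsistency the shift module is meant to reduce.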
{"title":"Two-dimensional hybrid incremental learning (2DHIL) framework for semantic segmentation of skin tissues","authors":"","doi":"10.1016/j.imavis.2024.105147","DOIUrl":"10.1016/j.imavis.2024.105147","url":null,"abstract":"<div><p>This study aims to enhance the robustness and generalization capability of a deep learning transformer model used for segmenting skin carcinomas and tissues through the introduction of incremental learning. Deep learning AI models demonstrate their claimed performance only for tasks and data types for which they are specifically trained. Their performance is severely challenged for the test cases which are not similar to training data thus questioning their robustness and ability to generalize. Moreover, these models require an enormous amount of annotated data for training to achieve desired performance. The availability of large annotated data, particularly for medical applications, is itself a challenge. Despite efforts to alleviate this limitation through techniques like data augmentation, transfer learning, and few-shot training, the challenge persists. To address this, we propose refining the models incrementally as new classes are discovered and more data becomes available, emulating the human learning process. However, deep learning models face the challenge of catastrophic forgetting during incremental training. Therefore, we introduce a two-dimensional hybrid incremental learning framework for segmenting non-melanoma skin cancers and tissues from histopathology images. Our approach involves progressively adding new classes and introducing data of varying specifications to introduce adaptability in the models. We also employ a combination of loss functions to facilitate new learning and mitigate catastrophic forgetting. Our extended experiments demonstrate significant improvements, with an F1 score reaching 91.78, mIoU of 93.00, and an average accuracy of 95%. 
These findings highlight the effectiveness of our incremental learning strategy in enhancing the robustness and generalization of deep learning segmentation models while mitigating catastrophic forgetting.</p></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":null,"pages":null},"PeriodicalIF":4.2,"publicationDate":"2024-07-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0262885624002518/pdfft?md5=d44cd642beec8e071716f174c3ad2a5f&pid=1-s2.0-S0262885624002518-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141623389","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
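The "combination of loss functions" that balances new learning against catastrophic forgetting is commonly realized as cross-entropy on the current task plus a distillation term toward the previous model's predictions. The sketch below shows that generic recipe; the weighting, temperature, and function names are assumptions, not this paper's exact losses:

```python
import numpy as np

def _softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def incremental_loss(logits, target, old_logits, alpha=0.5, temperature=2.0):
    """Cross-entropy on the new task plus a KL distillation term that keeps the
    updated model close to the previous model's softened predictions."""
    logits = np.asarray(logits, dtype=float)
    old_logits = np.asarray(old_logits, dtype=float)
    ce = -np.log(_softmax(logits)[target])                    # new-task loss
    q_old = _softmax(old_logits / temperature)                # teacher distribution
    q_new = _softmax(logits / temperature)                    # student distribution
    kd = float(np.sum(q_old * (np.log(q_old) - np.log(q_new))))  # KL(teacher || student)
    return (1.0 - alpha) * ce + alpha * kd
```

When the new model agrees with the old one the distillation term vanishes, so gradient pressure comes only from the new classes; disagreement is penalized, which is what mitigates forgetting.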
{"title":"A novel infrared and visible image fusion algorithm based on global information-enhanced attention network","authors":"Jia Tian, Dong Sun, Qingwei Gao, Yixiang Lu, Muxi Bao, De Zhu, Dawei Zhao","doi":"10.1016/j.imavis.2024.105161","DOIUrl":"https://doi.org/10.1016/j.imavis.2024.105161","url":null,"abstract":"<div><p>The fusion of infrared and visible images aims to extract and fuse thermal target information and texture details to the fullest extent possible, enhancing the visual understanding capabilities of images for both humans and computers in complex scenes. However, existing methods have difficulties in preserving the comprehensiveness of source image feature information and enhancing the saliency of image texture information. Therefore, we put forward a novel infrared and visible image fusion algorithm based on global information-enhanced attention network (GIEA). Specifically, we develop an attention-guided Transformer module (AGTM) to make sure the fused images have enough global information. This module combines the convolutional neural network and Transformer to perform adequate feature extraction from shallow to deep layers, and utilize the attention network for multi-level feature-guided learning. Then, we build the contrast enhancement module (CENM), which enhances the feature representation and contrast of the image so that the fused image contains significant texture information. Furthermore, our network is driven to fully preserve the texture and structure details of the source images with a loss function that consists of content loss and total variance loss. 
Numerous experiments demonstrate that our fusion approach outperforms other fusion approaches in both subjective and objective assessments.</p></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":null,"pages":null},"PeriodicalIF":4.2,"publicationDate":"2024-07-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141582166","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
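The total variance loss named in the abstract is usually the anisotropic total-variation penalty: the sum of absolute differences between neighbouring pixels, which discourages noise while permitting sharp edges. A minimal sketch (the exact variant GIEA uses may differ):

```python
import numpy as np

def total_variation(img):
    """Anisotropic total-variation loss for a 2-D image: sum of absolute
    vertical and horizontal neighbour differences."""
    img = np.asarray(img, dtype=float)
    dh = np.abs(img[1:, :] - img[:-1, :]).sum()   # vertical differences
    dw = np.abs(img[:, 1:] - img[:, :-1]).sum()   # horizontal differences
    return float(dh + dw)
```

A flat image scores zero, so minimizing this term alongside a content loss smooths fusion artifacts without erasing the texture the content loss preserves.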
{"title":"Artificial immune systems for data augmentation","authors":"","doi":"10.1016/j.imavis.2024.105163","DOIUrl":"10.1016/j.imavis.2024.105163","url":null,"abstract":"<div><p>We study object detection models and observe that their respective architectures are vulnerable to image distortions such as noise, compression, blur, or snow. We propose alleviating this problem by training the models with antibodies generated using Artificial Immune Systems (AIS) from original training samples (antigens). These antibodies are AIS-distorted antigens at the pixel level through cycles of “select, clone, mutate, select” until an affinity to the antigen is achieved. We then add the antibodies to the antigens, train the models, validate and test them under 15 distortions, and show that our data augmentation approach (AISbod) significantly improved their accuracy without altering their architecture or inference speed. For example, the DINO object detector under the COCO dataset improves by 4% under clean samples, by 6.50% on average over all 15 distortions, by 2.15% under snow, and by 27.60% under impulse noise. Our simulations show that our method performs better under distortions and clean samples than related defense methods and is more consistent across datasets and object detection models. For instance, our method is, on average, 70% better than the closest related method across 15 distortions for the evaluated models under COCO. Moreover, we show that our approach to image classification and object tracking models significantly improves accuracy under distortions. 
We provide the code of our method and the DINO model trained using our method at <span><span><span>https://github.com/moforio/AISbod</span></span><svg><path></path></svg></span>.</p></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":null,"pages":null},"PeriodicalIF":4.2,"publicationDate":"2024-07-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141699521","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
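The "select, clone, mutate, select" cycle can be sketched as a clonal-selection loop: clone the current candidate, mutate each clone at the pixel level, keep the fittest, and stop once the affinity threshold is met. Everything below (affinity definition, thresholds, clone counts) is an illustrative assumption; the released AISbod code linked above is the authoritative implementation:

```python
import numpy as np

def affinity(antibody, antigen):
    """Affinity of a distorted sample to its source image (higher = closer)."""
    return -float(np.abs(antibody - antigen).mean())

def generate_antibody(antigen, target_affinity=-0.1, clones=8, sigma=0.05,
                      max_rounds=20, seed=None):
    """'Select, clone, mutate, select' loop producing a pixel-level distorted
    copy of the antigen that still satisfies the affinity threshold."""
    rng = np.random.default_rng(seed)
    best = antigen.copy()
    for _ in range(max_rounds):
        pool = best + rng.normal(0.0, sigma, (clones,) + antigen.shape)  # clone + mutate
        pool = np.clip(pool, 0.0, 1.0)                                   # keep valid pixels
        scores = np.array([affinity(c, antigen) for c in pool])
        best = pool[int(scores.argmax())]                                # select the fittest
        if scores.max() >= target_affinity:
            break
    return best
```

The resulting antibodies are then added to the training set alongside the antigens, which is how the augmentation improves robustness to distortions without touching the model architecture.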
{"title":"Video object segmentation based on dynamic perception update and feature fusion","authors":"","doi":"10.1016/j.imavis.2024.105156","DOIUrl":"10.1016/j.imavis.2024.105156","url":null,"abstract":"<div><p>Popular video object segmentation algorithms based on memory networks indiscriminately update frame information to the memory pool and fail to make reasonable use of historical frame information, causing redundancy in the memory pool and increasing the computation cost. At the same time, the mask refinement method is relatively rough, resulting in blurred edges of the generated mask. To solve these problems, this paper proposes a video object segmentation algorithm based on dynamic perception update and feature fusion. To reasonably utilize historical frame information, a dynamic perception update module is proposed to selectively update the segmentation frame mask. Meanwhile, a mask refinement module is established to enhance the detail information in the shallow features of the backbone network. This module uses a double kernels fusion block to fuse information from the features at different scales, and finally uses the Laplacian operator to sharpen the edges of the mask. The experimental results show that on the public datasets DAVIS2016, DAVIS2017 and YouTube-VOS<sub>18</sub>, the comprehensive performance of the proposed algorithm reaches 86.9%, 79.3% and 71.6%, respectively, and the segmentation speed reaches 15 FPS on the DAVIS2016 dataset. 
Compared with many mainstream algorithms from recent years, it shows clear performance advantages.</p></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":null,"pages":null},"PeriodicalIF":4.2,"publicationDate":"2024-07-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141715214","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
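The Laplacian edge sharpening used in the mask refinement step has a compact classical form: subtract the 4-neighbour Laplacian response from the mask, which boosts contrast exactly at the boundaries. A minimal numpy sketch (edge padding and the `alpha` gain are assumptions, not the paper's exact settings):

```python
import numpy as np

def laplacian_sharpen(mask, alpha=1.0):
    """Sharpen mask edges by subtracting the 4-neighbour Laplacian response."""
    mask = np.asarray(mask, dtype=float)
    padded = np.pad(mask, 1, mode="edge")
    lap = (padded[:-2, 1:-1] + padded[2:, 1:-1] +        # up + down neighbours
           padded[1:-1, :-2] + padded[1:-1, 2:] -        # left + right neighbours
           4.0 * mask)                                   # centre term
    return np.clip(mask - alpha * lap, 0.0, 1.0)
```

Flat regions are untouched (the Laplacian is zero there), while values on either side of a boundary are pushed toward 0 and 1, crisping the mask edge.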