{"title":"Diff-STAR: Exploring student-teacher adaptive reconstruction through diffusion-based generation for image harmonization","authors":"An Cao , Gang Shen","doi":"10.1016/j.imavis.2024.105254","DOIUrl":"10.1016/j.imavis.2024.105254","url":null,"abstract":"<div><p>Image harmonization aims to seamlessly integrate foreground and background elements from distinct photos into a visually realistic composite. However, achieving high-quality image composition remains challenging in adjusting color balance, retaining fine details, and ensuring perceptual consistency. This article introduces a novel approach named Diffusion-based Student-Teacher Adaptive Reconstruction (Diff-STAR) to address foreground adjustment by framing it as an image reconstruction task. Leveraging natural photographs for model pretraining eliminates the need for data augmentation within Diff-STAR's framework. Employing the pre-trained Denoising Diffusion Implicit Model (DDIM) enhances photorealism and fidelity in generating high-quality outputs from reconstructed latent representations. By effectively identifying similarities in low-frequency style and semantic relationships across various regions within latent images, we develop a student-teacher architecture combining Transformer encoders and decoders to predict adaptively masked patches derived through diffusion processes. Evaluated on the public datasets, including iHarmony4 and RealHM, the experiment results confirm Diff-STAR's superiority over other state-of-the-art approaches based on metrics including Mean Squared Error (MSE) and Peak Signal-to-noise ratio (PSNR).</p></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"151 ","pages":"Article 105254"},"PeriodicalIF":4.2,"publicationDate":"2024-09-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142242399","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Few-shot class incremental learning via prompt transfer and knowledge distillation","authors":"Feidu Akmel , Fanman Meng , Mingyu Liu , Runtong Zhang , Asebe Teka , Elias Lemuye","doi":"10.1016/j.imavis.2024.105251","DOIUrl":"10.1016/j.imavis.2024.105251","url":null,"abstract":"<div><p>The ability of a model to learn incrementally from very limited data while still retaining knowledge about previously seen classes is called few-shot incremental learning. The challenge of the few-shot learning model is data overfitting while the challenge of incremental learning models is catastrophic forgetting. To address these problems, we propose a distillation algorithm coupled with prompting, which effectively addresses the problem encountered in few-shot class-incremental learning by facilitating the transfer of distilled knowledge from a source to a target prompt. Furthermore, we employ a feature embedding module that monitors the semantic similarity between the input labels and the semantic vectors. This enables the learners to receive additional guidance, thereby mitigating the occurrence of catastrophic forgetting and overfitting. As our third contribution, we introduce an attention-based knowledge distillation method that learns relative similarities between features by creating effective links between teacher and student. This enables the regulation of the distillation intensities of all potential pairs between teacher and student. To validate the effectiveness of our proposed method, we conducted extensive experiments on diverse datasets, including miniImageNet, CIFAR100, and CUB200. The results of these experiments demonstrated that our method achieves state-of-the-art performance.</p></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"151 ","pages":"Article 105251"},"PeriodicalIF":4.2,"publicationDate":"2024-09-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142172528","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Dual-branch underwater image enhancement network via multiscale neighborhood interaction attention learning","authors":"Xun Ji , Xu Wang , Na Leng , Li-Ying Hao , Hui Guo","doi":"10.1016/j.imavis.2024.105256","DOIUrl":"10.1016/j.imavis.2024.105256","url":null,"abstract":"<div><p>Due to the light scattering and absorption, underwater images inevitably suffer from diverse quality degradation, including color distortion, low contrast, and blurred details. To address the problems, we present a dual-branch convolutional neural network via multiscale neighborhood interaction attention learning for underwater image enhancement. Specifically, the proposed network is trained by an ensemble of locally-aware and globally-aware branches processed in parallel, where the locally-aware branch with stronger representation ability aims to recover high-frequency local details sufficiently, and the globally-aware branch with weaker learning ability aims to prevent information loss in low-frequency global structure effectively. On the other hand, we develop a plug-and-play multiscale neighborhood interaction attention module, which further enhances image quality through appropriate cross-channel interactions with inputs from different receptive fields. Compared with the well-received methods, extensive experiments on both real-world and synthetic underwater images reveal that our proposed network can achieve superior color and contrast enhancement in terms of subjective visual perception and objective evaluation metrics. Ablation study is also conducted to demonstrate the effectiveness of each component in the network.</p></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"151 ","pages":"Article 105256"},"PeriodicalIF":4.2,"publicationDate":"2024-09-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142158381","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Adaptive graph reasoning network for object detection","authors":"Xinfang Zhong , Wenlan Kuang , Zhixin Li","doi":"10.1016/j.imavis.2024.105248","DOIUrl":"10.1016/j.imavis.2024.105248","url":null,"abstract":"<div><p>In recent years, Transformer-based object detection has achieved leaps and bounds in performance. Nevertheless, these methods still face some problems such as difficulty in detecting heavy occluded objects and tiny objects. Besides, the mainstream object detection paradigms usually deal with region proposals alone, without considering contextual information and the relationships between objects, which results in limited improvement. In this paper, we propose an Adaptive Graph Reasoning Network (AGRN) that explores the relationships between specific objects in an image and mines high-level semantic information via GCN to enrich visual features. Firstly, to enhance the semantic correlation between objects, a cross-scale semantic-aware module is proposed to realize the semantic interaction between feature maps of different scales so as to obtain a cross-scale semantic feature. Secondly, we activate the instance features in the image and combine the cross-scale semantic feature to create a dynamic graph. Finally, guided by the specific semantics, the attention mechanism is introduced to focus on the corresponding critical regions. On the MS-COCO 2017 dataset, our method improves the performance by 3.9% box AP and 3.6% mask AP in object detection and instance segmentation respectively relative to baseline. Additionally, our model has demonstrated exceptional performance on the PASCAL VOC dataset.</p></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"151 ","pages":"Article 105248"},"PeriodicalIF":4.2,"publicationDate":"2024-09-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0262885624003536/pdfft?md5=c327d5634e930b5455fb578d65af5bcf&pid=1-s2.0-S0262885624003536-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142164197","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient masked feature and group attention network for stereo image super-resolution","authors":"Jianwen Song , Arcot Sowmya , Jien Kato , Changming Sun","doi":"10.1016/j.imavis.2024.105252","DOIUrl":"10.1016/j.imavis.2024.105252","url":null,"abstract":"<div><p>Current stereo image super-resolution methods do not fully exploit cross-view and intra-view information, resulting in limited performance. While vision transformers have shown great potential in super-resolution, their application in stereo image super-resolution is hindered by high computational demands and insufficient channel interaction. This paper introduces an efficient masked feature and group attention network for stereo image super-resolution (EMGSSR) designed to integrate the strengths of transformers into stereo super-resolution while addressing their inherent limitations. Specifically, an efficient masked feature block is proposed to extract local features from critical areas within images, guided by sparse masks. A group-weighted cross-attention module consisting of group-weighted cross-view feature interactions along epipolar lines is proposed to fully extract cross-view information from stereo images. Additionally, a group-weighted self-attention module consisting of group-weighted self-attention feature extractions with different local windows is proposed to effectively extract intra-view information from stereo images. Experimental results demonstrate that the proposed EMGSSR outperforms state-of-the-art methods at relatively low computational costs. The proposed EMGSSR offers a robust solution that effectively extracts cross-view and intra-view information for stereo image super-resolution, bringing a promising direction for future research in high-fidelity stereo image super-resolution. Source codes will be released at <span><span>https://github.com/jianwensong/EMGSSR</span><svg><path></path></svg></span>.</p></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"151 ","pages":"Article 105252"},"PeriodicalIF":4.2,"publicationDate":"2024-09-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0262885624003573/pdfft?md5=f16b8e31aca64b2993c5abd2e28251d5&pid=1-s2.0-S0262885624003573-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142228982","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A lightweight hash-directed global perception and self-calibrated multiscale fusion network for image super-resolution","authors":"Zhisheng Cui , Yibing Yao , Shilong Li , Yongcan Zhao , Ming Xin","doi":"10.1016/j.imavis.2024.105255","DOIUrl":"10.1016/j.imavis.2024.105255","url":null,"abstract":"<div><p>In recent years, with the increase in the depth and width of convolutional neural networks, single image super-resolution (SISR) algorithms have made significant breakthroughs in objective quantitative metrics and subjective visual quality. However, these operations have inevitably caused model inference time to surge. In order to find a balance between model speed and accuracy, we propose a lightweight hash-directed global perception and self-calibrated multiscale fusion network for image Super-Resolution (HSNet) in this paper. The HSNet makes the following two main improvements: first, the Hash-Directed Global Perception module (HDGP) designed in this paper is able to capture the dependencies between features in a global perspective by using the hash encoding to direct the attention mechanism. Second, the Self-Calibrated Multiscale Fusion module (SCMF) proposed in this paper has two independent task branches: the upper branch of the SCMF utilizes the feature fusion module to capture multiscale contextual information, while the lower branch focuses on local details through a small convolutional kernel. These two branches are fused with each other to effectively enhance the network's multiscale understanding capability. Extensive experimental results demonstrate the remarkable superiority of our approach over other state-of-the-art methods in both subjective visual effects and objective evaluation metrics, including PSNR, SSIM, and computational complexity.</p></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"151 ","pages":"Article 105255"},"PeriodicalIF":4.2,"publicationDate":"2024-09-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142158379","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"3D face alignment through fusion of head pose information and features","authors":"Jaehyun So , Youngjoon Han","doi":"10.1016/j.imavis.2024.105253","DOIUrl":"10.1016/j.imavis.2024.105253","url":null,"abstract":"<div><p>The ability of humans to infer head poses from face shapes, and vice versa, indicates a strong correlation between them. Recent studies on face alignment used head pose information to predict facial landmarks in computer vision tasks. However, many studies have been limited to using head pose information primarily to initialize mean landmarks, as it cannot represent detailed face shapes. To enhance face alignment performance through effective utilization, we introduce a novel approach that integrates head pose information into the feature maps of a face alignment network, rather than simply using it to initialize facial landmarks. Furthermore, the proposed network structure achieves reliable face alignment through a dual-dimensional network. This structure uses multidimensional features such as 2D feature maps and a 3D heatmap to reduce reliance on a single type of feature map and enrich the feature information. We also propose a dense face alignment method through an appended fully connected layer at the end of a dual-dimensional network, trained with sparse face alignment. This method easily trains dense face alignment by directly using predicted keypoints as knowledge and indirectly using semantic information. We experimentally assessed the correlation between the predicted facial landmarks and head pose information, as well as variations in the accuracy of facial landmarks with respect to the quality of head pose information. In addition, we demonstrated the effectiveness of the proposed method through a competitive performance comparison with state-of-the-art methods on the AFLW2000-3D, AFLW, and BIWI datasets. In the evaluation of the face alignment task, we achieved an NME of 3.21 for the AFLW2000-3D and 3.68 for the AFLW dataset.</p></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"151 ","pages":"Article 105253"},"PeriodicalIF":4.2,"publicationDate":"2024-09-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0262885624003585/pdfft?md5=9951cd09c51d4f1ecd2222839b6c8209&pid=1-s2.0-S0262885624003585-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142164198","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Distilling OCT cervical dataset with evidential uncertainty proxy","authors":"Yuxuan Xiong , Yongchao Xu , Yan Zhang , Bo Du","doi":"10.1016/j.imavis.2024.105250","DOIUrl":"10.1016/j.imavis.2024.105250","url":null,"abstract":"<div><p>Deep learning-based OCT image classification method is of paramount importance for early screening of cervical cancer. For the sake of efficiency and privacy, the emerging data distillation technique becomes a promising way to condense the large-scale original OCT dataset into a much smaller synthetic dataset, without losing much information for network training. However, OCT cervical images often suffer from redundancy, mis-operation and noise, <em>etc.</em> These challenges make it hard to compress as much valuable information as possible into extremely small synthesized dataset. To this end, we design an uncertainty-aware distribution matching based dataset distillation framework (UDM). Precisely, we adopt a pre-trained plug-and-play uncertainty estimation proxy to compute classification uncertainty for each data point in the original and synthetic dataset. The estimated uncertainty allows us to adaptively calculate class-wise feature centers of the original and synthetic data, thereby increasing the importance of typical patterns and reducing the impact of redundancy, mis-operation, and noise, <em>etc.</em> Extensive experiments show that our UDM effectively improves distribution-matching-based dataset distillation under both homogeneous and heterogeneous training scenarios.</p></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"151 ","pages":"Article 105250"},"PeriodicalIF":4.2,"publicationDate":"2024-09-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142172529","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Exploring the synergy between textual identity and visual signals in human-object interaction","authors":"Pinzhu An, Zhi Tan","doi":"10.1016/j.imavis.2024.105249","DOIUrl":"10.1016/j.imavis.2024.105249","url":null,"abstract":"<div><p>Human-Object Interaction (HOI) detection task aims to recognize and understand interactions between humans and objects depicted in images. Unlike instance recognition tasks, which focus on isolated objects, HOI detection requires considering various explanatory factors, such as instance identity, spatial relationships, and scene context. However, previous HOI detection methods have primarily relied on local visual cues, often overlooking the vital role of instance identity and thus limiting the performance of models. In this paper, we introduce textual features to expand the definition of HOI representations, incorporating instance identity into the HOI reasoning process. Drawing inspiration from the human activity perception process, we explore the synergy between textual identity and visual signals to leverage various explanatory factors more effectively and enhance HOI detection performance. Specifically, our method extracts HOI explanatory factors using both modal representations. Visual features capture interactive cues, while textual features explicitly denote instance identities within human-object pairs, delineating relevant interaction categories. Additionally, we utilize Contrastive Language-Image Pre-training (CLIP) to enhance the semantic alignment between visual and textual features and design a cross-modal learning module for integrating HOI multimodal information. Extensive experiments on several benchmarks demonstrate that our proposed framework surpasses most existing methods, achieving outstanding performance with a mean average precision (mAP) of 33.95 on the HICO-DET dataset and 63.2 mAP on the V-COCO dataset.</p></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"151 ","pages":"Article 105249"},"PeriodicalIF":4.2,"publicationDate":"2024-09-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142158382","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Privacy-SF: An encoding-based privacy-preserving segmentation framework for medical images","authors":"Long Chen , Li Song , Haiyu Feng , Rediet Tesfaye Zeru , Senchun Chai , Enjun Zhu","doi":"10.1016/j.imavis.2024.105246","DOIUrl":"10.1016/j.imavis.2024.105246","url":null,"abstract":"<div><p>Deep learning is becoming increasingly popular and is being extensively used in the field of medical image analysis. However, the privacy sensitivity of medical data limits the availability of data, which constrains the advancement of medical image analysis and impedes collaboration across multiple centers. To address this problem, we propose a novel encoding-based framework, named Privacy-SF, aimed at implementing privacy-preserving segmentation for medical images. Our proposed segmentation framework consists of three CNN networks: 1) two encoding networks on the client side that encode medical images and their corresponding segmentation masks individually to remove the privacy features, 2) a unique mapping network that analyzes the content of encoded data and learns the mapping from the encoded image to the encoded mask. By sequentially encoding data and optimizing the mapping network, our approach ensures privacy protection for images and masks during both the training and inference phases of medical image analysis. Additionally, to further improve the segmentation performance, we carefully design augmentation strategies specifically for encoded data based on its sequence nature. Extensive experiments conducted on five datasets with different modalities demonstrate excellent performance in privacy-preserving segmentation and multi-center collaboration. Furthermore, the analysis of encoded data and the experiment of model inversion attacks validate the privacy-preserving capability of our approach.</p></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"151 ","pages":"Article 105246"},"PeriodicalIF":4.2,"publicationDate":"2024-09-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142158376","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}