MFET: Multi-frequency enhancement transformer for single-image super-resolution
Yunlei Sun, Pengxiao Shi, Tiancheng Chen, Danning Qi, Ke Xu
Image and Vision Computing, vol. 163, Article 105751, published 2025-09-28. DOI: 10.1016/j.imavis.2025.105751

Abstract: Single-Image Super-Resolution (SISR) aims to reconstruct a high-resolution image from a low-resolution input while effectively preserving structural integrity and fine details. However, (i) low-frequency structural cues progressively fade during deep-layer propagation, and (ii) existing upsampling modules either ignore multi-scale context or incur excessive computation, leading to unsatisfactory high-frequency texture recovery. To address these limitations, we propose the Multi-Frequency Enhancement Transformer (MFET), a novel Transformer-based network tailored for efficient SISR. MFET seamlessly integrates low-frequency structural preservation with high-frequency detail recovery through its Multi-Frequency Block (MFB). The MFB employs a Residual Attention Mechanism (RAM) to propagate fine-grained features across layers, ensuring robust retention of low-level details, and an Efficient Upscale Module (EUM) with a pyramidal structure and depthwise separable convolutions to enhance high-frequency components with minimal computational cost. Extensive experiments on benchmark datasets demonstrate that MFET achieves superior performance in PSNR and SSIM, particularly at ×3 and ×4 scales, excelling in texture and edge reconstruction. MFET strikes an optimal balance between quality and efficiency, offering a promising solution for high-quality super-resolution. Our code is available at https://github.com/snh4/MFET.
Understanding adversarial robustness of deep neural networks via decision reliance
Soyoun Won, Hyeon Bae Kim, Yong Hyun Ahn, Hong Joo Lee, Seong Tae Kim
Image and Vision Computing, vol. 163, Article 105743, published 2025-09-27. DOI: 10.1016/j.imavis.2025.105743

Abstract: Adversarial robustness has become a major concern as machine learning models are increasingly deployed in high-risk and high-impact applications. Accordingly, various adversarial training strategies have been proposed to make models more robust under adversarial attack. However, like deep neural networks (DNNs) themselves, the mechanisms through which adversarial training strategies improve model robustness remain opaque. In this paper, we reveal how adversarial training alters the internal workings of deep neural networks by conducting neuron-wise decision reliance analysis. We find that adversarially vulnerable models predominantly rely on a small subset of predictive neurons, while adversarially robust models tend to distribute their reliance across a broader range of neurons. We validate the relationship between decision reliance and adversarial robustness through comprehensive experiments across various models, training objectives, and attack scenarios. We observe that this relationship also holds for standard trained models, including those trained with Mixup or CutMix, which demonstrate improved performance against one-step adversarial attacks. Furthermore, we show that minimizing decision reliance leads to improved adversarial robustness. Our findings enrich the understanding of adversarially trained models and offer an interpretable and efficient approach to analyzing their internal mechanisms.
MBT-Polyp: A new Multi-Branch Memory-augmented Transformer for polyp segmentation
Tao Wang, Weijie Wang, Fausto Giunchiglia, Fengzhi Zhao, Ye Zhang, Duo Yu, Guixia Liu
Image and Vision Computing, vol. 163, Article 105747, published 2025-09-27. DOI: 10.1016/j.imavis.2025.105747

Abstract: Polyp segmentation plays a critical role in the early diagnosis and precise clinical intervention of colorectal cancer (CRC). Despite significant advancements in deep learning for medical image segmentation, accurate localization of small polyps and precise delineation of polyp boundaries remain challenges in colorectal polyp segmentation. In this study, we introduce MBT-Polyp, a Multi-branch Memory-augmented Transformer architecture designed to improve segmentation sensitivity for small polyps and enhance the delineation accuracy of ambiguous polyp boundaries. At the core of our framework is MemoryFormer, a Transformer-based U-shaped architecture that incorporates three key components: a Dynamic Focal Attention block (DFA) for efficient small target enhancement and edge refinement, a High-Level Memory Attention Module (HMAM) for preserving boundary details via cross-resolution fusion, and a Multi-View Channel Memory Attention Module (MCMAM) for suppressing background noise and modeling local spatial context. To guide specialized learning, we derive small polyp and edge labels alongside ground truth, enabling MemoryFormer to process them through dedicated branches. The outputs are fused using a Small Polyp Fusion Strategy (SPFS) and an Edge Correction Strategy (ECS) to alleviate over- and under-segmentation. The quantitative results on Kvasir-SEG, CVC-ColonDB, CVC-ClinicDB, CVC-300, and ETIS-Larib yield mean Dice scores of 0.930, 0.818, 0.943, 0.912, and 0.763, respectively, demonstrating strong generalization across diverse polyp segmentation scenarios. Code and datasets are available at: https://github.com/taojlu/PolypSeg.
{"title":"PixTention: Dynamic pixel-level adapter using attention maps","authors":"Dooho Choi, Yunsick Sung","doi":"10.1016/j.imavis.2025.105746","DOIUrl":"10.1016/j.imavis.2025.105746","url":null,"abstract":"<div><div>Recent advances in image generation have popularized adapter-based fine-tuning, where Low-Rank Adaptation (LoRA) modules enable efficient personalization with minimal storage costs. However, current approaches often suffer from two key limitations: (1) manually selecting suitable LoRA adapters is time-consuming and requires expert knowledge, and (2) applying multiple adapters globally can introduce style interference and reduce image fidelity, especially for prompts with multiple distinct concepts. We propose <strong>PixTention</strong>, a framework that addresses these challenges via a novel three-stage process: <em>Curator</em>, <em>Selector</em>, and <em>Integrator</em>. The Curator uses a vision-language model to generate enriched semantic descriptions of LoRA adapters and clusters their embeddings based on shared visual themes, enabling efficient hierarchical retrieval. The Selector embeds user prompts and first selects the most relevant adapter clusters, then identifies top-K adapters within them via cosine similarity. The Integrator leverages cross-attention maps from diffusion models to assign each retrieved adapter to specific semantic regions in the output image, ensuring localized, prompt-aligned transformations without global style overwriting. Through experiments on COCO-Multi and a custom StyleCompose dataset, PixTention achieves higher CLIP scores, IoU and lower FID than baseline retrieval and reranking methods, demonstrating superior text-image alignment and image realism. Our results highlight the importance of semantic clustering, region-specific adapter composition, and cross-modal alignment in advancing controllable, high-fidelity image generation.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"163 ","pages":"Article 105746"},"PeriodicalIF":4.2,"publicationDate":"2025-09-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145227532","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DECF-FGVC: A discriminative enhancement and complementary fusion approach for fine-grained bird visual classification","authors":"ShuaiShuai Deng , Tianhua Chen , Qinghua Qiao","doi":"10.1016/j.imavis.2025.105744","DOIUrl":"10.1016/j.imavis.2025.105744","url":null,"abstract":"<div><div>Fine-grained bird image recognition plays a critical role in species conservation. However, existing approaches are constrained by complex background interference, insufficient extraction of discriminative features, and limited integration of hierarchical information. While Vision Transformers (ViTs) demonstrate superior performance over CNNs in fine-grained classification tasks, they remain vulnerable to background noise, with class tokens often failing to capture key regions - overlooking the complementarity between low-level details and high-level semantics. This study proposes DECF-FGVC, a novel model incorporating three modules: Patch Contrast Enhancement (PCE), Contrast Token Refiner (CTR), and Hierarchical Token Synthesizer (HTS). These modules synergistically suppress background noise, emphasize key regions, and integrate multi-layer features through attention-weighted image reconstruction, counterfactual learning-based token refinement, and hierarchical token fusion. Extensive experiments on CUB-200-2011, NABirds, and iNaturalist2017 datasets achieve classification accuracies of 91.9%, 91.4%, and 77.92% respectively, consistently outperforming state-of-the-art methods.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"163 ","pages":"Article 105744"},"PeriodicalIF":4.2,"publicationDate":"2025-09-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145227535","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
PR-DETR: Extracting and utilizing prior knowledge for improved end-to-end object detection
Yukang Huo, Mingyuan Yao, Tonghao Wang, Qingbin Tian, Jiayin Zhao, Xiao Liu, Haihua Wang
Image and Vision Computing, vol. 163, Article 105745, published 2025-09-26. DOI: 10.1016/j.imavis.2025.105745

Abstract: Query initialization in Transformer-based object detectors is static, which limits the model's ability to flexibly adjust its attention to different image features during learning. In addition, without the guidance of global spatial semantic information, the model relies on local features and disregards the relationship between a target and its surrounding environment, leading to false or missed detections. To address these problems, this paper proposes PR-DETR, a query-optimized object detection model guided by feature maps. PR-DETR introduces an Aggregating Global Spatial Semantic Information (AGSSI) module to extract and enhance global spatial semantics. The queries then take part in the interaction of local and global spatial semantic information in the encoder in advance, acquiring sufficient prior knowledge and providing more accurate and efficient queries for the subsequent decoding of feature maps. Experimental results show that PR-DETR significantly improves detection accuracy on the MS COCO dataset compared with existing related work, with mAP 3.5, 2.3, and 2.0 points higher than Conditional-DETR, Anchor-DETR, and DAB-DETR, respectively.
{"title":"DeepDCT-VO: 3D directional coordinate transformation for low-complexity monocular visual odometry using deep learning","authors":"E. Simsek , B. Ozyer","doi":"10.1016/j.imavis.2025.105742","DOIUrl":"10.1016/j.imavis.2025.105742","url":null,"abstract":"<div><div>Deep learning-based monocular visual odometry has gained importance in robotics and autonomous navigation due to its robustness in visually challenging environments and minimal sensor requirements. However, many existing deep learning-based MVO methods suffer from high computational costs and large model sizes, making them less suitable for real-time applications in resource-limited systems. In this study, we propose DeepDCT-VO, a lightweight visual odometry method that combines three-dimensional directional coordinate transformation with a compact deep learning architecture. Unlike traditional approaches that estimate translation in a global coordinate system and are prone to drift accumulation, DeepDCT-VO uses local directional motion derived from composite rotations. This approach avoids global trajectory reconstruction, thereby improving the method’s stability and reliability. The proposed model operates on input images at multiple resolutions (120 × 120, 240 × 240, 360 × 360, and 480 × 480), leveraging attention-guided residual learning to extract robust features. Additionally, it incorporates multi-modal information—specifically depth and semantic maps—to further improve the accuracy of pose estimation. Evaluations on the KITTI odometry benchmark demonstrate that DeepDCT-VO achieves competitive trajectory estimation accuracy while maintaining real-time performance—8 ms per frame on GPU and 12 ms on CPU. Compared to the existing method with the lowest translational drift (<span><math><msub><mrow><mi>t</mi></mrow><mrow><mtext>rel</mtext></mrow></msub></math></span>), DeepDCT-VO reduces model size by approximately 96.3% (from 37.5 million to 1.4 million parameters). Conversely, when compared to the lightest model in terms of parameter count, DeepDCT-VO reduces <span><math><msub><mrow><mi>t</mi></mrow><mrow><mtext>rel</mtext></mrow></msub></math></span> from 8.57% to 1.69%, achieving an 80.3% reduction in translational drift. These results underscore the effectiveness of DeepDCT-VO in delivering accurate and efficient monocular visual odometry, particularly suited for embedded and resource-limited applications, while the proposed transformation method offers an auxiliary function in reducing translational complexity.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"163 ","pages":"Article 105742"},"PeriodicalIF":4.2,"publicationDate":"2025-09-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145227529","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Fall detection using deep learning with features computed from recursive quadratic splits of video frames","authors":"Zahra Solatidehkordi, Tamer Shanableh","doi":"10.1016/j.imavis.2025.105749","DOIUrl":"10.1016/j.imavis.2025.105749","url":null,"abstract":"<div><div>Accidental falls are a leading cause of injury and death worldwide, particularly among the elderly. Despite extensive research on fall detection, many existing systems remain limited by reliance on wearable sensors that are inconvenient for continuous use, or vision-based approaches that require full video decoding, human pose estimation, or simplified datasets that fail to capture the complexity of real-life environments. As a result, their accuracy often deteriorates in realistic scenarios such as nursing homes or crowded public spaces. In this paper, we introduce a novel fall detection framework that leverages information embedded in the High Efficiency Video Coding (HEVC) standard. Unlike traditional vision-based methods, our approach extracts spatio-temporal features directly from recursive block splits and other HEVC coding information. This includes creating a sequence of four RGB input images which capture block sizes and splits of the video frames in a visual manner. The block sizes in video coding are determined based on the spatio-temporal activities in the frames, hence the suitability of using them as features. Other features are also derived from the coded videos, including compression modes, motion vectors, and prediction residuals. To enhance robustness, we integrate these features into deep learning models and employ fusion strategies that combine complementary representations. Extensive evaluations on two challenging datasets: the Real-World Fall Dataset (RFDS) and the High-Quality Fall Simulation Dataset (HQFSD), demonstrate that our method achieves superior accuracy and robustness compared to prior work. In addition, our method requires only around 23 GFLOPs per video because the deep learning network is executed on just four fixed-frame representations, whereas traditional pipelines process every frame individually, often amounting to hundreds of frames per video and orders of magnitude higher FLOPs.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"163 ","pages":"Article 105749"},"PeriodicalIF":4.2,"publicationDate":"2025-09-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145227534","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Dynamic sparse and weight allocation-based text-driven person retrieval","authors":"Shuren Zhou , Qihang Zhou , Jiao Liu","doi":"10.1016/j.imavis.2025.105737","DOIUrl":"10.1016/j.imavis.2025.105737","url":null,"abstract":"<div><div>Text-to-image person retrieval aims to find the most matching personimages in a large-scale persondataset through textual descriptions. However, most of the existing methods have the following problems: (1) There are still some inaccurate matching pairs in the retrieval system, and the errors of these matching pairs negatively affect the performance of the whole retrieval system. (2) In the whole training process of the model, the whole text is used directly, but there are still non-important parts of the text that are not important for recognizing the images, and how to process the text effectively is still a hot topic in current research. These critical issues significantly degrade the retrieval performance. To this end, we propose a new alignment optimization framework for text-based person retrieval. Precisely, our framework consists of three key components: (1) progressive enhancement for a multimodal integration, which not only simulates coarse-grained alignment through mathematical modeling, but also appropriately combines coarse-grained and fine-grained alignment through progressive learning; (2) global bidirectional match filtering, which utilizes subjective logic to effectively mitigate the interference of incorrectly matched pairs of image text, and at the same time utilizes a bidirectional KL match filtering algorithm so as to select the matching pairs with high degree of image text matching for training; (3) fine-grained dynamic sparse mask modeling, which uses mask language modeling and constructs a dynamic spatial sparsification module, which not only applies more expressive modules to important positions but also mines the relationship between image text pairs at a fine-grained level, thus improving retrieval performance. Extensive experiments show that the method achieves state-of-the-art results on three benchmark datasets and performs well on domain generalization tasks.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"163 ","pages":"Article 105737"},"PeriodicalIF":4.2,"publicationDate":"2025-09-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145159770","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Test-time adaptation for object detection via Dynamic Dual Teaching","authors":"Siqi Zhang , Lu Zhang , Zhiyong Liu","doi":"10.1016/j.imavis.2025.105740","DOIUrl":"10.1016/j.imavis.2025.105740","url":null,"abstract":"<div><div>Test-Time Adaptation (TTA) is a practical setting in real-world applications, which aims to adapt a source-trained model to target domains with online unlabeled test data streams. Current approaches often rely on self-training, utilizing supervision signals from the source-trained model, suffering from poor adaptation due to diverse domain shifts. In this paper, we propose a novel test-time adaptation method for object detection guided by dual teachers, termed <strong>D</strong>ynamic <strong>D</strong>ual <strong>T</strong>eaching (<strong>DDT</strong>). Inspired by the generalization potentials of Vision-Language Models (VLMs), we integrate the VLM as an additional language-driven instructor. This integration exploits the domain-robustness of language prompts to mitigate domain shifts, collaborating with the teacher of source information within the teacher–student framework. Firstly, we utilize an ensemble prompt to guide the prediction process of the language-driven instructor. Secondly, a dynamic fusion strategy of the dual teachers is designed to generate high-quality pseudo-labels for student learning. Moreover, we incorporate a dual prediction consistency regularization to further mitigate the sensitivity of the adapted detector to domain shifts. Experiments on diverse domain adaptation benchmarks demonstrate that the proposed DDT method achieves state-of-the-art performance on both online and offline domain adaptation settings.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"163 ","pages":"Article 105740"},"PeriodicalIF":4.2,"publicationDate":"2025-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145159769","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}