{"title":"STIFormer: RGB-T tracking via Spatial–Temporal Interaction Transformer","authors":"Boyue Xu, Yaqun Fang, Ruichao Hou, Tongwei Ren","doi":"10.1016/j.imavis.2026.105929","DOIUrl":"10.1016/j.imavis.2026.105929","url":null,"abstract":"<div><div>Existing RGB-Thermal (RGB-T) trackers integrate the RGB and thermal modalities by using cross-attention and estimate the object position by computing the correlation between a single template and the search region. However, many trackers yield unsatisfactory performance due to their disregard for inter-frame cues between modalities and dynamic changes in the dominant modality. To address this issue, we propose a novel <strong>S</strong>patial-<strong>T</strong>emporal <strong>I</strong>nteraction Trans<strong>former</strong>, called <strong>STIFormer</strong>, which effectively merges multi-modal features from both spatial and temporal domains, enhancing the robustness of RGB-T tracking. In particular, a spatial–temporal feature representation module is proposed to facilitate inter-frame information exchange through token propagation, which encodes features from multi-frames and a temporal token. In addition, a token-guided mixed attention fusion module is proposed to fuse the frame features and token features from different modalities. Extensive experiments demonstrate that our proposed method achieves state-of-the-art performance on public RGB-T benchmarks. The project page is available at: <span><span>https://github.com/xuboyue1999/STIFormer</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"168 ","pages":"Article 105929"},"PeriodicalIF":4.2,"publicationDate":"2026-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146174937","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Mamba-Driven Topology Fusion for monocular 3D human pose estimation
Zenghao Zheng, Lianping Yang, Jinshan Pan, Hegui Zhu
Image and Vision Computing, vol. 168 (April 2026), Article 105927. DOI: 10.1016/j.imavis.2026.105927

The Mamba model has gradually garnered widespread attention in 3D human pose estimation due to its linear time scaling and excellent expressive power. However, Mamba exhibits deficiencies in handling human body topology: its internal state space model and one-dimensional causal convolutional network have inherent design limitations in processing global topological sequences and local structures. To address these issues, we propose the Mamba-Driven Topology Fusion framework. For global topological guidance, we design a Bone Aware Module that delivers directional and length guidance for human skeletons in the spherical coordinate system. To capture dependencies between local joints, we enhance the convolutional structure within Mamba by integrating forward and backward graph convolutional networks. Additionally, a Bone-Joint Fusion Embedding and a Spatiotemporal Refinement Module are proposed to fuse global skeletal and keypoint information and to extract spatiotemporal features, respectively. The proposed framework effectively alleviates Mamba's incompatibility with the topological structure of human keypoints. Extensive experiments on the Human3.6M and MPI-INF-3DHP datasets demonstrate that the proposed method significantly reduces computational cost while achieving higher accuracy. Our model and code are available at https://github.com/ZenghaoZheng/MDTF-3DHPE.
{"title":"Non-target information also matters: InverseFormer tracker for single object tracking","authors":"Qiuhang Gu , Baopeng Zhang , Zhu Teng , Hongwei Xu","doi":"10.1016/j.imavis.2026.105922","DOIUrl":"10.1016/j.imavis.2026.105922","url":null,"abstract":"<div><div>Visual object tracking has been significantly improved by Transformer-based methods. However, most existing trackers perform target-oriented inference, which enhances target-relevant features while ignoring non-target features. We argue that non-target information also contains abundant clues that can provide significant guidance for tracking inference. In this work, we propose a novel InverseFormer tracker constructed by stacking multiple InverseFormer blocks. The proposed InverseFormer block consists of a context aggregation unit and an inverse enhancement unit. The former aggregates local context correlation information while boosting tracking efficiency. The latter enhances the template-search image pair by using non-target information in the search region, which significantly suppresses background-relevant features while preserving target details, leading to more accurate tracking. Extensive experiments conducted on seven benchmarks demonstrate that our tracker outperforms state-of-the-art methods at a real-time speed of 45 FPS.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"168 ","pages":"Article 105922"},"PeriodicalIF":4.2,"publicationDate":"2026-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146174938","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Efficient ultra-lightweight convolutional attention network for embedded identity document recognition system
Yehu Shen, Jikun Wei, Xuemei Niu, Guizhong Fu, Zihe Cao
Image and Vision Computing, vol. 168 (April 2026), Article 105930. DOI: 10.1016/j.imavis.2026.105930

With the rapid development of IoT, identity document recognition has been widely applied in various fields. Efficient recognition systems are crucial for deployment on resource-constrained embedded devices, but many deep learning models suffer from high computational complexity. We propose an efficient character recognition system with a two-stage framework: a document number detection network and an ultra-lightweight attention-based recognition network named EULCAN (Efficient Ultra-Lightweight Convolutional Attention Network). EULCAN's feature extraction module employs a novel Dense Simplified Convolutional Attention Module (DSCAM) and a Dual Dimensionality Reduction Block (DDRB) to capture discriminative features efficiently. DSCAM combines an Efficient Bottleneck Convolution Block and a Simplified Channel Attention Block, significantly reducing computational costs while maintaining accuracy. For sequence transcription, a simple fully connected layer coupled with a Connectionist Temporal Classification (CTC) layer is used for robust recognition. Evaluated on the BDCI benchmark and a real-world SUST dataset, EULCAN achieves competitive accuracies of 97.1% and 95.3%, respectively, while maintaining only 2.8 M parameters and 0.497 GFLOPs. Compared to MobileNetV3, the second most lightweight deployment-ready model, EULCAN improves accuracy by 11.7%, while its parameter size is only 0.6% of OmniParser, the most accurate model. Furthermore, the proposed identity document recognition system has been successfully deployed in real-world scenarios. On the RK3588S2 development board, EULCAN achieves an inference speed of 65 FPS, demonstrating its practicality for embedded IoT applications. The source code is publicly available at https://github.com/ymxb1/EULCAN.
DSAC-Hash: Distribution-Similarity-Aware Cross-modal Hashing
Mutaz Ibrahim Mohammed Ahmed Ibrahim, Dejiao Niu, Tao Cai, Lei Li, Bilal Ahmad
Image and Vision Computing, vol. 168 (April 2026), Article 105926. DOI: 10.1016/j.imavis.2026.105926

The rapid growth of online multimedia data has made cross-modal hashing crucial for efficient retrieval. Existing methods often fail to handle the heterogeneity of image and text data and lack sufficient semantic interaction, resulting in reduced retrieval accuracy. To address these issues, we introduce the DSAC-Hash framework, which includes a novel Semantic Interaction Aggregator (SIA) to refine inter- and intra-modal relationships, reducing semantic discrepancies and enhancing retrieval performance. Additionally, we present a unified weighted loss framework that optimizes cross-modal similarity by incorporating weighted triplet, contrastive, and semantic loss functions, improving the quality of binary hash codes. These enhancements significantly boost image-to-text (I2T) and text-to-image (T2I) retrieval performance. Experiments on MS COCO, MIRFlickr-25K, and NUS-WIDE show that DSAC-Hash achieves state-of-the-art performance, with MAP improvements of at least 4.59∼10.45% (I2T) and 7.39∼12.96% (T2I) on MS COCO, 1.52∼8.81% (I2T) and 2.75∼7.34% (T2I) on MIRFlickr-25K, and 4.78∼7.74% (I2T) and 7.03∼9.42% (T2I) on NUS-WIDE, confirming its robustness, scalability, and effectiveness in large-scale multimedia retrieval scenarios.
{"title":"A lightweight shallow convolution neural network for automatic identification of Diabetic Foot Ulcers","authors":"Sujit Kumar Das , Parag Bhuyan , Nageswara Rao Moparthi , Suyel Namasudra","doi":"10.1016/j.imavis.2026.105925","DOIUrl":"10.1016/j.imavis.2026.105925","url":null,"abstract":"<div><div>In standard clinical practices, disease diagnosis demands expensive tests and time-consuming procedures. Additionally, manual inspection by clinicians may sometimes lead to incorrect diagnostic results. Accurate identification of Diabetic Foot Ulcers (DFUs) is essential for early intervention and reducing the risk of serious complications. The evolution of deep learning techniques in image analysis has made significant contributions over the last decade. However, designing a computationally efficient and cost-effective deep learning network remains a challenge. This study proposes a lightweight and computationally efficient Convolutional Neural Network (CNN) architecture for automatic DFU classification. The proposed model primarily consists of varying-sized convolution kernels connected in a parallel manner, positional encoding (PE), and aggregated pooling (AGP) to enhance both global and local feature representation while maintaining a shallow and resource-efficient design. The proposed network is evaluated on publicly available DFU datasets and benchmarked against widely used deep learning models. Experimental results demonstrate that the proposed model outperforms state-of-the-art works with the highest average F1-Score of 94.83%, 94.63%, and 99.49% for DFU, infection, and ischaemia identification, respectively. The results also indicate that the proposed CNN achieves superior performance with significantly reduced computational cost, making it suitable for deployment on low-power and IoT-enabled medical devices.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"168 ","pages":"Article 105925"},"PeriodicalIF":4.2,"publicationDate":"2026-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146174939","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SRformer: A hybrid semantic-regional transformer for indoor 3D object detection","authors":"Kunpeng Bi, Shuang Wang, Xiangyang Jiang, Miaohui Zhang","doi":"10.1016/j.imavis.2026.105919","DOIUrl":"10.1016/j.imavis.2026.105919","url":null,"abstract":"<div><div>Detection transformer has been widely applied to 3D object detection, achieving impressive results in various scenarios. However, effectively fusing regional and semantic features in query selection and cross-attention remains a challenge. This paper systematically analyzes detection transformers and proposes SRformer, a novel two-stage 3D object detector with several key designs. First, SRformer introduces a Hybrid Query Selector (HQS), which splits the first stage into a prediction branch and a sampling branch. The sampling branch is supervised by a novel hybrid query loss based on regional and semantic features, thereby filtering out high-quality initial query boxes. Next, a Regional Reinforcement Attention (RRA) is introduced to enhance instance-level attention. The RRA learns a set of key points and maps their regional differences to a relative coordinate table to construct explicit instance-level regional context feature constraints, thereby modulating the cross-attention map. Additionally, a Top-K Bipartite Graph Matching (KBM) is introduced to increase the number of positive samples and enhance training stability, along with a Residual-based Bounding Box Decoder (RBBD) that parameterizes the bounding box into residual components relative to predefined base sizes for more robust and precise regression. Extensive experiments on the challenging ScanNetV2 and SUN RGB-D datasets demonstrate the effectiveness and robustness of SRformer, achieving a new state-of-the-art result on ScanNetV2, with 76.8 and 64.8 in mAP25 and mAP50, respectively.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"168 ","pages":"Article 105919"},"PeriodicalIF":4.2,"publicationDate":"2026-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146102558","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
RelPose-TTA: Energy-based relative pose correction for test-time adaptation of category-level object pose estimation
Yue Zhan, Xin Wang, Zhaoxiang Liu, Shiguo Lian, Tangwen Yang
Image and Vision Computing, vol. 168 (April 2026), Article 105928. DOI: 10.1016/j.imavis.2026.105928

Category-level object pose estimation is fundamental for robotic grasping and manipulation, yet models trained on synthetic data often generalize poorly to real-world environments due to substantial domain gaps. Test-time adaptation (TTA) offers a promising solution to address this challenge, but existing methods frequently depend on noisy pseudo-labels or complex optimization, which can lead to performance degradation and error accumulation over time. In this paper, we propose RelPose-TTA, a test-time adaptation framework that improves generalization and long-term stability for category-level object pose estimation in previously unseen real-world environments. The core idea is to exploit the relative motion between consecutive frames, which is typically more stable and reliable than single-frame absolute pose estimation, and to use it as a self-supervisory signal during inference. Concretely, RelPose-TTA introduces an energy-based relative pose corrector to model inter-frame motion and mitigate ambiguities induced by occlusions, object symmetries, and large viewpoint changes. During test-time adaptation, the corrector is updated online via contrastive learning and is tightly coupled with point cloud registration, so that refined relative pose estimates can effectively guide absolute pose refinement. Extensive experiments demonstrate that RelPose-TTA consistently outperforms prior TTA methods in unseen real-world settings, while substantially reducing long-term drift and maintaining stable pose predictions.
{"title":"DR-TrustNet: Enhancing diabetic retinopathy detection using reliable efficient networks and uncertainty quantification","authors":"Preeti Verma , Sivasankar Elango , Kunwar Singh","doi":"10.1016/j.imavis.2026.105921","DOIUrl":"10.1016/j.imavis.2026.105921","url":null,"abstract":"<div><div>Diabetic retinopathy (DR) is one of the main reasons people lose their vision, and catching it early is key to stopping permanent damage. Right now, doctors rely on manual screening, which takes a lot of time and it is not always consistent. The introduction of deep neural networks (DNNs) is a revolutionary step in analyzing high-precision DR detection, but there are concerns: these models can be over-confident in their prediction, leading to mistakes, especially in critical health care. Another problem is that the current method of deep learning does not respond well to uncertainties, which makes it difficult to trust them in the real medical environment. To address these challenges, we have developed a new system of three components. First, we improved the quality of retinal images using the Adaptive Fundus Enhancement Pipeline (AFEP). Then we will extract more useful features from the image using a modified version of EfficientNet-B0. Finally, we add steps to calibrate the model's prediction to ensure that its level of confidence is actually accurate. This step reduces the chances of incorrect diagnosis by utilizing a test time data augmentation and temperature scaling. The results of the IDRiD dataset test were promising. The model achieved 96% accuracy and showed a much better uncertainty calibration, with an expected calibration error of only 0.030. In other words, it is not only accurate, but also more reliable in the real world. Overall, our methodology can make AI-based DR screening more practical and reliable for both doctors and patients.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"168 ","pages":"Article 105921"},"PeriodicalIF":4.2,"publicationDate":"2026-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146174934","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
HOPE: Histopathological image Organization and Processing Environment
Daniel Riccio, Mara Sangiovanni, Francesco Longobardi, Andrea Francesco Scalella, Vincenzo Manfredi
Image and Vision Computing, vol. 168 (April 2026), Article 105924. DOI: 10.1016/j.imavis.2026.105924

In disciplines such as digital pathology, the management of vast amounts of data, primarily ultra-high-resolution images, remains a significant barrier to the widespread adoption and seamless sharing of knowledge. Current research efforts are heavily focused on image encoding, often overlooking equally critical aspects such as indexing and efficient content transmission. Traditional compression methods, such as JPEG2000, prioritize reconstruction quality but do not inherently support direct retrieval or progressive transmission, both of which are essential for applications like telemedicine and large-scale digital pathology archives. To bridge this gap, we introduce a novel framework that integrates fractal compression, deep learning-based retrieval, and adaptive transmission, optimizing not only storage efficiency but also accessibility and scalability in histopathological imaging.

The proposed Histopathological image Organization and Processing Environment (HOPE) framework exploits Partitioned Iterated Function Systems for image compression, achieving high compression ratios while preserving essential structural details. To mitigate the inherent artifacts of fractal compression, a U-Net autoencoder is integrated, refining decompressed images and enhancing visual quality. Additionally, a residual encoding mechanism is employed, allowing lossless reconstruction when necessary. Unlike conventional methods, this framework enables direct retrieval from the compressed domain by extracting discriminative features from the fractal encoding coefficients. Another key innovation is its progressive transmission capability, which allows an initial low-bitrate preview to be sent, followed by incremental quality refinements based on diagnostic needs. This significantly reduces network load and enables real-time access to high-resolution histopathological images on resource-limited devices. Experimental results demonstrate that the proposed framework achieves compression performance comparable to JPEG2000, while simultaneously enabling efficient indexing, high-accuracy retrieval, and scalable transmission.