{"title":"TopicFM+: Boosting Accuracy and Efficiency of Topic-Assisted Feature Matching","authors":"Khang Truong Giang;Soohwan Song;Sungho Jo","doi":"10.1109/TIP.2024.3473301","DOIUrl":"10.1109/TIP.2024.3473301","url":null,"abstract":"This study tackles image matching in difficult scenarios, such as scenes with significant variations or limited texture, with a strong emphasis on computational efficiency. Previous studies have attempted to address this challenge by encoding global scene contexts using Transformers. However, these approaches have high computational costs and may not capture sufficient high-level contextual information, such as spatial structures or semantic shapes. To overcome these limitations, we propose a novel image-matching method that leverages a topic-modeling strategy to capture high-level contexts in images. Our method represents each image as a multinomial distribution over topics, where each topic represents semantic structures. By incorporating these topics, we can effectively capture comprehensive context information and obtain discriminative and high-quality features. Notably, our coarse-level matching network enhances efficiency by employing attention layers only to fixed-sized topics and small-sized features. Finally, we design a dynamic feature refinement network for precise results at a finer matching stage. Through extensive experiments, we have demonstrated the superiority of our method in challenging scenarios. Specifically, our method ranks in the top 9% in the Image Matching Challenge 2023 without using ensemble techniques. Additionally, we achieve an approximately 50% reduction in computational costs compared to other Transformer-based methods. Code is available at \u0000<uri>https://github.com/TruongKhang/TopicFM</uri>\u0000.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"33 ","pages":"6016-6028"},"PeriodicalIF":0.0,"publicationDate":"2024-10-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142448478","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Searching Discriminative Regions for Convolutional Neural Networks in Fundus Image Classification With Genetic Algorithms","authors":"Yibiao Rong;Tian Lin;Haoyu Chen;Zhun Fan;Xinjian Chen","doi":"10.1109/TIP.2024.3477932","DOIUrl":"10.1109/TIP.2024.3477932","url":null,"abstract":"Deep convolutional neural networks (CNNs) have been widely used for fundus image classification and have achieved very impressive performance. However, the explainability of CNNs is poor because of their black-box nature, which limits their application in clinical practice. In this paper, we propose a novel method to search for discriminative regions to increase the confidence of CNNs in the classification of features in specific category, thereby helping users understand which regions in an image are important for a CNN to make a particular prediction. In the proposed method, a set of superpixels is selected in an evolutionary process, such that discriminative regions can be found automatically. Many experiments are conducted to verify the effectiveness of the proposed method. The average drop and average increase obtained with the proposed method are 0 and 77.8%, respectively, in fundus image classification, indicating that the proposed method is very effective in identifying discriminative regions. Additionally, several interesting findings are reported: 1) Some superpixels, which contain the evidence used by humans to make a certain decision in practice, can be identified as discriminative regions via the proposed method; 2) The superpixels identified as discriminative regions are distributed in different locations in an image rather than focusing on regions with a specific instance; and 3) The number of discriminative superpixels obtained via the proposed method is relatively small. In other words, a CNN model can employ a small portion of the pixels in an image to increase the confidence for a specific category.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"33 ","pages":"5949-5958"},"PeriodicalIF":0.0,"publicationDate":"2024-10-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142443822","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Improved MRF Reconstruction via Structure-Preserved Graph Embedding Framework","authors":"Peng Li;Yuping Ji;Yue Hu","doi":"10.1109/TIP.2024.3477980","DOIUrl":"10.1109/TIP.2024.3477980","url":null,"abstract":"Highly undersampled schemes in magnetic resonance fingerprinting (MRF) typically lead to aliasing artifacts in reconstructed images, thereby reducing quantitative imaging accuracy. Existing studies mainly focus on improving the reconstruction quality by incorporating temporal or spatial data priors. However, these methods seldom exploit the underlying MRF data structure driven by imaging physics and usually suffer from high computational complexity due to the high-dimensional nature of MRF data. In addition, data priors constructed in a pixel-wise manner struggle to incorporate non-local and non-linear correlations. To address these issues, we introduce a novel MRF reconstruction framework based on the graph embedding framework, exploiting non-linear and non-local redundancies in MRF data. Our work remodels MRF data and parameter maps as graph nodes, redefining the MRF reconstruction problem as a structure-preserved graph embedding problem. Furthermore, we propose a novel scheme for accurately estimating the underlying graph structure, demonstrating that the parameter nodes inherently form a low-dimensional representation of the high-dimensional MRF data nodes. The reconstruction framework is then built by preserving the intrinsic graph structure between MRF data nodes and parameter nodes and extended to exploiting the globality of graph structure. Our approach integrates the MRF data recovery and parameter map estimation into a single optimization problem, facilitating reconstructions geared toward quantitative accuracy. Moreover, by introducing graph representation, our methods substantially reduce the computational complexity, with the computational cost showing a minimal increase as the data acquisition length grows. Experiments show that the proposed method can reconstruct high-quality MRF data and multiple parameter maps within reduced computational time.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"33 ","pages":"5989-6001"},"PeriodicalIF":0.0,"publicationDate":"2024-10-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142443820","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Shunxing Chen;Guobao Xiao;Junwen Guo;Qiangqiang Wu;Jiayi Ma
{"title":"DHM-Net: Deep Hypergraph Modeling for Robust Feature Matching","authors":"Shunxing Chen;Guobao Xiao;Junwen Guo;Qiangqiang Wu;Jiayi Ma","doi":"10.1109/TIP.2024.3477916","DOIUrl":"10.1109/TIP.2024.3477916","url":null,"abstract":"We present a novel deep hypergraph modeling architecture (called DHM-Net) for feature matching in this paper. Our network focuses on learning reliable correspondences between two sets of initial feature points by establishing a dynamic hypergraph structure that models group-wise relationships and assigns weights to each node. Compared to existing feature matching methods that only consider pair-wise relationships via a simple graph, our dynamic hypergraph is capable of modeling nonlinear higher-order group-wise relationships among correspondences in an interaction capturing and attention representation learning fashion. Specifically, we propose a novel Deep Hypergraph Modeling block, which initializes an overall hypergraph by utilizing neighbor information, and then adopts node-to-hyperedge and hyperedge-to-node strategies to propagate interaction information among correspondences while assigning weights based on hypergraph attention. In addition, we propose a Differentiation Correspondence-Aware Attention mechanism to optimize the hypergraph for promoting representation learning. The proposed mechanism is able to effectively locate the exact position of the object of importance via the correspondence aware encoding and simple feature gating mechanism to distinguish candidates of inliers. In short, we learn such a dynamic hypergraph format that embeds deep group-wise interactions to explicitly infer categories of correspondences. To demonstrate the effectiveness of DHM-Net, we perform extensive experiments on both real-world outdoor and indoor datasets. Particularly, experimental results show that DHM-Net surpasses the state-of-the-art method by a sizable margin. Our approach obtains an 11.65% improvement under error threshold of 5° for relative pose estimation task on YFCC100M dataset. Code will be released at \u0000<uri>https://github.com/CSX777/DHM-Net</uri>\u0000.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"33 ","pages":"6002-6015"},"PeriodicalIF":0.0,"publicationDate":"2024-10-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142443821","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Siyuan Yao;Hao Sun;Tian-Zhu Xiang;Xiao Wang;Xiaochun Cao
{"title":"Hierarchical Graph Interaction Transformer With Dynamic Token Clustering for Camouflaged Object Detection","authors":"Siyuan Yao;Hao Sun;Tian-Zhu Xiang;Xiao Wang;Xiaochun Cao","doi":"10.1109/TIP.2024.3475219","DOIUrl":"10.1109/TIP.2024.3475219","url":null,"abstract":"Camouflaged object detection (COD) aims to identify the objects that seamlessly blend into the surrounding backgrounds. Due to the intrinsic similarity between the camouflaged objects and the background region, it is extremely challenging to precisely distinguish the camouflaged objects by existing approaches. In this paper, we propose a hierarchical graph interaction network termed HGINet for camouflaged object detection, which is capable of discovering imperceptible objects via effective graph interaction among the hierarchical tokenized features. Specifically, we first design a region-aware token focusing attention (RTFA) with dynamic token clustering to excavate the potentially distinguishable tokens in the local region. Afterwards, a hierarchical graph interaction transformer (HGIT) is proposed to construct bi-directional aligned communication between hierarchical features in the latent interaction space for visual semantics enhancement. Furthermore, we propose a decoder network with confidence aggregated feature fusion (CAFF) modules, which progressively fuses the hierarchical interacted features to refine the local detail in ambiguous regions. Extensive experiments conducted on the prevalent datasets, i.e. COD10K, CAMO, NC4K and CHAMELEON demonstrate the superior performance of HGINet compared to existing state-of-the-art methods. Our code is available at \u0000<uri>https://github.com/Garyson1204/HGINet</uri>\u0000.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"33 ","pages":"5936-5948"},"PeriodicalIF":0.0,"publicationDate":"2024-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142439744","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jianqun Zhou;Pengwen Dai;Yang Li;Manjiang Hu;Xiaochun Cao
{"title":"Explicitly-Decoupled Text Transfer With Minimized Background Reconstruction for Scene Text Editing","authors":"Jianqun Zhou;Pengwen Dai;Yang Li;Manjiang Hu;Xiaochun Cao","doi":"10.1109/TIP.2024.3477355","DOIUrl":"10.1109/TIP.2024.3477355","url":null,"abstract":"Scene text editing aims to replace the source text with the target text while preserving the original background. Its practical applications span various domains, such as data generation and privacy protection, highlighting its increasing importance in recent years. In this study, we propose a novel Scene Text Editing network with Explicitly-decoupled text transfer and Minimized background reconstruction, called STEEM. Unlike existing methods that usually fuse text style, text content, and background, our approach focuses on decoupling text style and content from the background and utilizes the minimized background reconstruction to reduce the impact of text replacement on the background. Specifically, the text-background separation module predicts the text mask of the scene text image, separating the source text from the background. Subsequently, the style-guided text transfer decoding module transfers the geometric and stylistic attributes of the source text to the content text, resulting in the target text. Next, the background and target text are combined to determine the minimal reconstruction area. Finally, the context-focused background reconstruction module is applied to the reconstruction area, producing the editing result. Furthermore, to ensure stable joint optimization of the four modules, a task-adaptive training optimization strategy has been devised. Experimental evaluations conducted on two popular datasets demonstrate the effectiveness of our approach. STEEM outperforms state-of-the-art methods, as evidenced by a reduction in the FID index from 29.48 to 24.67 and an increase in text recognition accuracy from 76.8% to 78.8%.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"33 ","pages":"5921-5935"},"PeriodicalIF":0.0,"publicationDate":"2024-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142440011","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Injecting Text Clues for Improving Anomalous Event Detection From Weakly Labeled Videos","authors":"Tianshan Liu;Kin-Man Lam;Bing-Kun Bao","doi":"10.1109/TIP.2024.3477351","DOIUrl":"10.1109/TIP.2024.3477351","url":null,"abstract":"Video anomaly detection (VAD) aims at localizing the snippets containing anomalous events in long unconstrained videos. The weakly supervised (WS) setting, where solely video-level labels are available during training, has attracted considerable attention, owing to its satisfactory trade-off between the detection performance and annotation cost. However, due to lack of snippet-level dense labels, the existing WS-VAD methods still get easily stuck on the detection errors, caused by false alarms and incomplete localization. To address this dilemma, in this paper, we propose to inject text clues of anomaly-event categories for improving WS-VAD, via a dedicated dual-branch framework. For suppressing the response of confusing normal contexts, we first present a text-guided anomaly discovering (TAG) branch based on a hierarchical matching scheme, which utilizes the label-text queries to search the discriminative anomalous snippets in a global-to-local fashion. To facilitate the completeness of anomaly-instance localization, an anomaly-conditioned text completion (ATC) branch is further designed to perform an auxiliary generative task, which intrinsically forces the model to gather sufficient event semantics from all the relevant anomalous snippets for completely reconstructing the masked description sentence. Furthermore, to encourage the cross-branch knowledge sharing, a mutual learning strategy is introduced by imposing a consistency constraint on the anomaly scores of these two branches. Extensive experimental results on two public benchmarks validate that the proposed method achieves superior performance over the competing methods.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"33 ","pages":"5907-5920"},"PeriodicalIF":0.0,"publicationDate":"2024-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142440012","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Cross-Scope Spatial-Spectral Information Aggregation for Hyperspectral Image Super-Resolution","authors":"Shi Chen;Lefei Zhang;Liangpei Zhang","doi":"10.1109/TIP.2024.3468905","DOIUrl":"10.1109/TIP.2024.3468905","url":null,"abstract":"Hyperspectral image super-resolution has attained widespread prominence to enhance the spatial resolution of hyperspectral images. However, convolution-based methods have encountered challenges in harnessing the global spatial-spectral information. The prevailing transformer-based methods have not adequately captured the long-range dependencies in both spectral and spatial dimensions. To alleviate this issue, we propose a novel cross-scope spatial-spectral Transformer (CST) to efficiently investigate long-range spatial and spectral similarities for single hyperspectral image super-resolution. Specifically, we devise cross-attention mechanisms in spatial and spectral dimensions to comprehensively model the long-range spatial-spectral characteristics. By integrating global information into the rectangle-window self-attention, we first design a cross-scope spatial self-attention to facilitate long-range spatial interactions. Then, by leveraging appropriately characteristic spatial-spectral features, we construct a cross-scope spectral self-attention to effectively capture the intrinsic correlations among global spectral bands. Finally, we elaborate a concise feed-forward neural network to enhance the feature representation capacity in the Transformer structure. Extensive experiments over three hyperspectral datasets demonstrate that the proposed CST is superior to other state-of-the-art methods both quantitatively and visually. The code is available at \u0000<uri>https://github.com/Tomchenshi/CST.git</uri>\u0000.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"33 ","pages":"5878-5891"},"PeriodicalIF":0.0,"publicationDate":"2024-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142440010","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"GRiD: Guided Refinement for Detector-Free Multimodal Image Matching","authors":"Yuyan Liu;Wei He;Hongyan Zhang","doi":"10.1109/TIP.2024.3472491","DOIUrl":"10.1109/TIP.2024.3472491","url":null,"abstract":"Multimodal image matching is essential in image stitching, image fusion, change detection, and land cover mapping. However, the severe nonlinear radiometric distortion (NRD) and geometric distortions in multimodal images severely limit the accuracy of multimodal image matching, posing significant challenges to existing methods. Additionally, detector-based methods are prone to feature point offset issues in regions with substantial modal differences, which also hinder the subsequent fine registration and fusion of images. To address these challenges, we propose a guided refinement for detector-free multimodal image matching (GRiD) method, which weakens feature point offset issues by establishing pixel-level correspondences and utilizes reference points to guide and correct matches affected by NRD and geometric distortions. Specifically, we first introduce a detector-free framework to alleviate the feature point offset problem by directly finding corresponding pixels between images. Subsequently, to tackle NRD and geometric distortion in multimodal images, we design a guided correction module that establishes robust reference points (RPs) to guide the search for corresponding pixels in regions with significant modality differences. Moreover, to enhance RPs reliability, we incorporate a phase congruency module during the RPs confirmation stage to concentrate RPs around image edge structures. Finally, we perform finer localization on highly correlated corresponding pixels to obtain the optimized matches. We conduct extensive experiments on four multimodal image datasets to validate the effectiveness of the proposed approach. Experimental results demonstrate that our method can achieve sufficient and robust matches across various modality images and effectively suppress the feature point offset problem.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"33 ","pages":"5892-5906"},"PeriodicalIF":0.0,"publicationDate":"2024-10-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142407393","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yabo Liu;Jinghua Wang;Chao Huang;Yiling Wu;Yong Xu;Xiaochun Cao
{"title":"MLFA: Toward Realistic Test Time Adaptive Object Detection by Multi-Level Feature Alignment","authors":"Yabo Liu;Jinghua Wang;Chao Huang;Yiling Wu;Yong Xu;Xiaochun Cao","doi":"10.1109/TIP.2024.3473532","DOIUrl":"10.1109/TIP.2024.3473532","url":null,"abstract":"Object detection methods have achieved remarkable performances when the training and testing data satisfy the assumption of i.i.d. However, the training and testing data may be collected from different domains, and the gap between the domains can significantly degrade the detectors. Test Time Adaptive Object Detection (TTA-OD) is a novel online approach that aims to adapt detectors quickly and make predictions during the testing procedure. TTA-OD is more realistic than the existing unsupervised domain adaptation and source-free unsupervised domain adaptation approaches. For example, self-driving cars need to improve their perception of new environments in the TTA-OD paradigm during driving. To address this, we propose a multi-level feature alignment (MLFA) method for TTA-OD, which is able to adapt the model online based on the steaming target domain data. For a more straightforward adaptation, we select informative foreground and background features from image feature maps and capture their distributions using probabilistic models. Our approach includes: i) global-level feature alignment to align all informative feature distributions, thereby encouraging detectors to extract domain-invariant features, and ii) cluster-level feature alignment to match feature distributions for each category cluster across different domains. Through the multi-level alignment, we can prompt detectors to extract domain-invariant features, as well as align the category-specific components of image features from distinct domains. We conduct extensive experiments to verify the effectiveness of our proposed method. Our code is accessible at \u0000<uri>https://github.com/yaboliudotug/MLFA</uri>\u0000.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"33 ","pages":"5837-5848"},"PeriodicalIF":0.0,"publicationDate":"2024-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142396289","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}