IEEE Transactions on Image Processing (a publication of the IEEE Signal Processing Society): Latest Publications

A Simple Yet Effective Network Based on Vision Transformer for Camouflaged Object and Salient Object Detection
Chao Hao;Zitong Yu;Xin Liu;Jun Xu;Huanjing Yue;Jingyu Yang
DOI: 10.1109/TIP.2025.3528347
Abstract: Camouflaged object detection (COD) and salient object detection (SOD) are two distinct yet closely related computer vision tasks that have been widely studied over the past decades. Although both aim to segment an image into binary foreground and background regions, COD focuses on concealed objects hidden in the image, while SOD concentrates on the most prominent objects in the image. Building universal segmentation models is currently a hot topic in the community. Previous works achieved good performance on a particular task by stacking various hand-designed modules and multi-scale features, but these careful task-specific designs also cost them their potential as general-purpose architectures. We therefore aim to build general architectures that can be applied to both tasks. In this work, we propose a simple yet effective network (SENet) based on the vision Transformer (ViT): with a simple asymmetric ViT-based encoder-decoder design, it yields competitive results on both tasks and exhibits greater versatility than meticulously crafted models. To further enhance the performance of universal architectures on both tasks, we propose general methods that target difficulties common to the two tasks. First, we use image reconstruction as an auxiliary task during training to increase the training difficulty, forcing the network to build a better perception of the image as a whole, which helps the segmentation tasks. In addition, we propose a local information capture module (LICM) to make up for the limitations of the patch-level attention mechanism in pixel-level COD and SOD, and a dynamic weighted loss (DW loss) to address the fact that small targets are more difficult to locate and segment in both tasks. Finally, we also conduct a preliminary exploration of joint training, attempting to complete both tasks simultaneously with one model. Extensive experiments on multiple benchmark datasets demonstrate the effectiveness of our method. The code is available at https://github.com/linuxsino/SENet.
Pages: 608-622 | Published: 2025-01-16
Citations: 0
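The abstract does not give the exact form of the DW loss; the sketch below only illustrates the general idea of weighting each sample's segmentation loss by the inverse of its foreground area so that small targets contribute more. The function name and the specific weighting scheme are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def size_weighted_bce(pred_logits: torch.Tensor, target: torch.Tensor,
                      eps: float = 1e-6) -> torch.Tensor:
    """Illustrative size-aware segmentation loss (not the paper's exact DW loss).

    pred_logits, target: (B, 1, H, W); target is a binary foreground mask.
    Samples whose foreground covers only a small area receive a larger weight,
    so the network is pushed harder to locate and segment small objects.
    """
    # Per-sample binary cross-entropy, averaged over pixels.
    bce = F.binary_cross_entropy_with_logits(
        pred_logits, target, reduction="none").flatten(1).mean(dim=1)   # (B,)
    # Fraction of foreground pixels per sample, clamped away from zero.
    fg_ratio = target.flatten(1).mean(dim=1).clamp(min=eps)             # (B,)
    # Inverse-area weight, normalized so the batch-average weight is 1.
    w = 1.0 / fg_ratio
    w = w / w.mean()
    return (w * bce).mean()

# Random tensors standing in for network output and ground truth.
logits = torch.randn(4, 1, 64, 64)
mask = (torch.rand(4, 1, 64, 64) > 0.9).float()
print(size_weighted_bce(logits, mask))
```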
STanH: Parametric Quantization for Variable Rate Learned Image Compression
Alberto Presta;Enzo Tartaglione;Attilio Fiandrotti;Marco Grangetto
DOI: 10.1109/TIP.2025.3527883
Abstract: In end-to-end learned image compression, the encoder and decoder are jointly trained to minimize a $R + \lambda D$ cost function, where $\lambda$ controls the trade-off between the rate of the quantized latent representation and image quality. Unfortunately, a distinct encoder-decoder pair with millions of parameters must be trained for each $\lambda$, hence the need to switch encoders and to store multiple encoders and decoders on the user device for every target rate. This paper proposes to exploit a differentiable quantizer designed around a parametric sum of hyperbolic tangents, called STanH, that relaxes the step-wise quantization function. STanH is implemented as a differentiable activation layer with learnable quantization parameters that can be plugged into a pre-trained fixed-rate model and refined to achieve different target bitrates. Experimental results show that our method enables variable rate coding with efficiency comparable to the state of the art, yet with significant savings in terms of ease of deployment, training time, and storage costs.
Pages: 639-651 | Published: 2025-01-15
Citations: 0
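A soft quantizer built from a parametric sum of hyperbolic tangents, as described in the abstract, can be sketched as a small learnable layer. The class name, the initialization of the centers and amplitudes, and the shared sharpness parameter below are assumptions for illustration; the released code may parameterize STanH differently.

```python
import torch
import torch.nn as nn

class STanHSketch(nn.Module):
    """Illustrative soft quantizer made of a sum of scaled tanh steps.

    A hard quantizer with L levels is approximated by
        q(x) ~ sum_k a_k * tanh(beta * (x - c_k)),
    which is smooth and differentiable; larger beta gives a sharper staircase.
    """
    def __init__(self, num_levels: int = 16, init_beta: float = 10.0):
        super().__init__()
        # One tanh step between each pair of consecutive quantization levels.
        centers = torch.arange(num_levels - 1, dtype=torch.float32) - (num_levels - 2) / 2.0
        self.centers = nn.Parameter(centers)
        # Each step contributes half a level on each side (amplitude 0.5).
        self.amplitudes = nn.Parameter(torch.full((num_levels - 1,), 0.5))
        # Shared sharpness (temperature), learnable like the other parameters.
        self.beta = nn.Parameter(torch.tensor(init_beta))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Broadcast the latent against the step positions and sum the steps.
        steps = self.amplitudes * torch.tanh(self.beta * (x.unsqueeze(-1) - self.centers))
        return steps.sum(dim=-1)

latent = torch.randn(2, 8, 4, 4) * 3
soft_q = STanHSketch(num_levels=16)
print(soft_q(latent).shape)  # same shape as the input latent
```

Because the layer is differentiable, it can be refined on top of a pre-trained fixed-rate model by simply including its parameters in the optimizer, which matches the deployment story described above.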
Enhancing Text-Video Retrieval Performance With Low-Salient but Discriminative Objects
Yanwei Zheng;Bowen Huang;Zekai Chen;Dongxiao Yu
DOI: 10.1109/TIP.2025.3527369
Abstract: Text-video retrieval aims to establish a matching relationship between a video and its corresponding text. However, previous works have primarily focused on salient video subjects, such as humans or animals, often overlooking Low-Salient but Discriminative Objects (LSDOs) that play a critical role in understanding content. To address this limitation, we propose a novel model that enhances retrieval performance by emphasizing these overlooked elements across the video and text modalities. In the video modality, our model first incorporates a feature selection module to gather video-level LSDO features, and applies cross-modal attention to assign frame-specific weights based on relevance, yielding frame-level LSDO features. In the text modality, text-level LSDO features are captured by generating multiple object prototypes in a sparse aggregation manner. Extensive experiments on benchmark datasets, including MSR-VTT, MSVD, LSMDC, and DiDeMo, demonstrate that our model achieves state-of-the-art results across various evaluation metrics.
Pages: 581-593 | Published: 2025-01-15
Citations: 0
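The cross-modal step that assigns frame-specific weights from text relevance can be illustrated with a generic single-head scaled dot-product attention. The function name, dimensions, and the single-head simplification are assumptions; the paper's module is likely more elaborate.

```python
import torch
import torch.nn.functional as F

def frame_weighted_features(text_emb: torch.Tensor, frame_feats: torch.Tensor):
    """Weight per-frame features by their relevance to a text query.

    text_emb:    (B, D)    sentence embedding acting as the attention query.
    frame_feats: (B, T, D) per-frame features acting as keys and values.
    Returns the relevance weights (B, T) and the weighted video feature (B, D).
    """
    d = frame_feats.shape[-1]
    # Relevance score of each frame with respect to the text query.
    scores = torch.einsum("bd,btd->bt", text_emb, frame_feats) / d ** 0.5
    weights = F.softmax(scores, dim=-1)                        # (B, T)
    pooled = torch.einsum("bt,btd->bd", weights, frame_feats)  # (B, D)
    return weights, pooled

text = torch.randn(2, 512)
frames = torch.randn(2, 12, 512)
w, video_feat = frame_weighted_features(text, frames)
print(w.shape, video_feat.shape)
```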
A Pyramid Fusion MLP for Dense Prediction
Qiuyu Huang;Zequn Jie;Lin Ma;Li Shen;Shenqi Lai
DOI: 10.1109/TIP.2025.3526054
Abstract: Recently, MLP-based architectures have achieved performance competitive with convolutional neural networks (CNNs) and vision transformers (ViTs) across various vision tasks. However, most MLP-based methods introduce local feature interactions to facilitate direct adaptation to downstream tasks and therefore lack the ability to capture global visual dependencies and multi-scale context, ultimately resulting in unsatisfactory performance on dense prediction. This paper proposes a competitive and effective MLP-based architecture called Pyramid Fusion MLP (PFMLP) to address this limitation. Specifically, each block in PFMLP introduces multi-scale pooling and fully connected layers to generate feature pyramids, which are subsequently fused using up-sample layers and an additional fully connected layer. Employing different down-sample rates yields diverse receptive fields, enabling the model to simultaneously capture long-range dependencies and fine-grained cues, thereby exploiting global context information and enhancing the spatial representation power of the model. PFMLP is the first lightweight MLP to obtain results comparable to state-of-the-art CNNs and ViTs on the ImageNet-1K benchmark, and with larger FLOPs it exceeds state-of-the-art CNNs, ViTs, and MLPs of similar computational complexity. Furthermore, experiments on object detection, instance segmentation, and semantic segmentation demonstrate that the visual representation acquired by PFMLP can be seamlessly transferred to downstream tasks, producing competitive results. All materials, including the training code and logs, are released at https://github.com/huangqiuyu/PFMLP.
Pages: 455-467 | Published: 2025-01-14
Citations: 0
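The block structure described above (multi-scale pooling, fully connected mixing, up-sampling, and a final fully connected fusion) can be sketched loosely as follows. The pooling rates, the 1x1-convolution realization of the FC layers, and the residual connection are assumptions for illustration, not the released PFMLP block.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidFusionSketch(nn.Module):
    """Illustrative multi-scale pooling + FC + up-sample fusion block."""
    def __init__(self, channels: int, pool_rates=(1, 2, 4)):
        super().__init__()
        self.pool_rates = pool_rates
        # A channel-mixing FC per pyramid level (implemented as a 1x1 conv).
        self.level_fc = nn.ModuleList(
            nn.Conv2d(channels, channels, kernel_size=1) for _ in pool_rates)
        # Final FC fusing the summed pyramid back into `channels` features.
        self.fuse_fc = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h, w = x.shape[-2:]
        fused = 0
        for rate, fc in zip(self.pool_rates, self.level_fc):
            y = F.avg_pool2d(x, kernel_size=rate) if rate > 1 else x  # down-sample
            y = fc(y)                                                 # channel mixing
            if rate > 1:  # up-sample coarse levels back to the input size
                y = F.interpolate(y, size=(h, w), mode="bilinear", align_corners=False)
            fused = fused + y
        return self.fuse_fc(fused) + x  # residual connection

block = PyramidFusionSketch(channels=64)
print(block(torch.randn(1, 64, 32, 32)).shape)
```

The larger pooling rates supply coarse, long-range context while the unpooled branch keeps fine-grained detail, which is the intuition behind the pyramid fusion design.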
IFENet: Interaction, Fusion, and Enhancement Network for V-D-T Salient Object Detection
Liuxin Bao;Xiaofei Zhou;Bolun Zheng;Runmin Cong;Haibing Yin;Jiyong Zhang;Chenggang Yan
DOI: 10.1109/TIP.2025.3527372
Abstract: Visible-depth-thermal (VDT) salient object detection (SOD) aims to highlight the most visually attractive object by utilizing triple-modal cues. However, existing models do not sufficiently explore the correlations and differences among the modalities, which leads to unsatisfactory detection performance. In this paper, we propose an interaction, fusion, and enhancement network (IFENet) for the VDT SOD task, comprising three key steps: multi-modal interaction, multi-modal fusion, and spatial enhancement. Built on a Transformer backbone, IFENet acquires multi-scale multi-modal features. First, the inter-modal and intra-modal graph-based interaction (IIGI) module is deployed to explore inter-modal channel correlation and intra-modal long-term spatial dependency. Second, the gated attention-based fusion (GAF) module is employed to purify and aggregate the triple-modal features, where multi-modal features are filtered along the spatial, channel, and modality dimensions, respectively. Lastly, the frequency split-based enhancement (FSE) module separates the fused feature into high-frequency and low-frequency components to enhance spatial information (i.e., boundary details and object location) of the salient object. Extensive experiments on the VDT-2048 dataset show that our saliency model consistently outperforms 13 state-of-the-art models. Our code and results are available at https://github.com/Lx-Bao/IFENet.
Pages: 483-494 | Published: 2025-01-14
Citations: 0
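The frequency split idea (separating a fused feature into low-frequency content that mostly carries object location and high-frequency content that mostly carries boundary detail) can be illustrated with a simple local-average decomposition. The filter choice and kernel size are assumptions; the paper's FSE module may use a different decomposition.

```python
import torch
import torch.nn.functional as F

def frequency_split(feat: torch.Tensor, kernel_size: int = 5):
    """Split a feature map into low- and high-frequency parts.

    The low-frequency component is a local average (a cheap low-pass filter);
    the high-frequency component is the residual, which mostly carries edges
    and fine boundary detail. By construction, low + high == feat.
    """
    pad = kernel_size // 2
    low = F.avg_pool2d(feat, kernel_size, stride=1, padding=pad)
    high = feat - low
    return low, high

fused = torch.randn(2, 32, 64, 64)
low, high = frequency_split(fused)
print(low.shape, high.shape)              # both (2, 32, 64, 64)
print(torch.allclose(low + high, fused))  # the split is exactly invertible
```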
Breaking Boundaries: Unifying Imaging and Compression for HDR Image Compression
Xuelin Shen;Linfeng Pan;Zhangkai Ni;Yulin He;Wenhan Yang;Shiqi Wang;Sam Kwong
DOI: 10.1109/TIP.2025.3527365
Abstract: High Dynamic Range (HDR) images present unique challenges for Learned Image Compression (LIC) due to their complex domain distribution compared to Low Dynamic Range (LDR) images. In coding practice, HDR-oriented LIC typically adopts preprocessing steps (e.g., perceptual quantization and tone mapping) to align the distributions of LDR and HDR images, which inevitably comes at the expense of perceptual quality. To address this challenge, we rethink the HDR imaging process, which fuses multiple-exposure LDR images to create an HDR image, and propose a novel HDR image compression paradigm, Unifying Imaging and Compression (HDR-UIC). The key innovation lies in establishing a seamless pipeline from image capture to delivery and enabling end-to-end training and optimization. Specifically, a Mixture-ATtention (MAT)-based compression backbone merges LDR features while simultaneously generating a compact representation. Meanwhile, the Reference-guided Misalignment-aware feature Enhancement (RME) module mitigates ghosting artifacts caused by misalignment in the LDR branches, maintaining fidelity without introducing additional information. Furthermore, we introduce an Appearance Redundancy Removal (ARR) module to optimize coding resource allocation among LDR features, thereby enhancing the final HDR compression performance. Extensive experimental results demonstrate the efficacy of our approach, showing significant improvements over existing state-of-the-art HDR compression schemes. Our code is available at https://github.com/plf1999/HDR-UIC.
Pages: 510-521 | Published: 2025-01-14
Citations: 0
DATR: Unsupervised Domain Adaptive Detection Transformer With Dataset-Level Adaptation and Prototypical Alignment
Liang Chen;Jianhong Han;Yupei Wang
DOI: 10.1109/TIP.2025.3527370
Abstract: With the success of the DEtection TRansformer (DETR), numerous researchers have explored its effectiveness in addressing unsupervised domain adaptation tasks. Existing methods leverage carefully designed feature alignment techniques to align the backbone or encoder, yielding promising results. However, effectively aligning instance-level features within the unique decoder structure of the detector has largely been neglected. Related techniques primarily align instance-level features in a class-agnostic manner, overlooking distinctions between features from different categories, which results in only limited improvements. Furthermore, the scope of current alignment modules in the decoder is often restricted to a limited batch of images, failing to capture dataset-level cues and thereby severely constraining the detector's generalization to the target domain. To this end, we introduce a strong DETR-based detector named Domain Adaptive detection TRansformer (DATR) for unsupervised domain adaptation of object detection. First, we propose the Class-wise Prototypes Alignment (CPA) module, which effectively aligns cross-domain features in a class-aware manner by bridging the gap between the object detection task and the domain adaptation task. Then, the designed Dataset-level Alignment Scheme (DAS) explicitly guides the detector to achieve a global representation and enhance the inter-class distinguishability of instance-level features across the entire dataset, spanning both domains, by leveraging contrastive learning. Moreover, DATR incorporates a mean-teacher-based self-training framework, utilizing pseudo-labels generated by the teacher model to further mitigate domain bias. Extensive experimental results demonstrate the superior performance and generalization capability of the proposed DATR in multiple domain adaptation scenarios. Code is released at https://github.com/h751410234/DATR.
Pages: 982-994 | Published: 2025-01-14
Citations: 0
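Class-wise prototype alignment, in its simplest form, means computing a per-class mean of instance features in each domain and pulling same-class prototypes together, with target classes coming from teacher pseudo-labels. The sketch below shows only that general idea; the function names, the cosine-similarity objective, and the handling of absent classes are assumptions, not the paper's CPA module.

```python
import torch
import torch.nn.functional as F

def class_prototypes(features: torch.Tensor, labels: torch.Tensor,
                     num_classes: int) -> torch.Tensor:
    """Mean feature (prototype) per class; zeros for classes absent in the batch."""
    protos = features.new_zeros(num_classes, features.shape[-1])
    for c in range(num_classes):
        mask = labels == c
        if mask.any():
            protos[c] = features[mask].mean(dim=0)
    return protos

def prototype_alignment_loss(src_feats, src_labels, tgt_feats, tgt_pseudo, num_classes):
    """Pull same-class prototypes of the source and target domains together.

    Target labels are assumed to be pseudo-labels from a teacher model.
    """
    p_src = class_prototypes(src_feats, src_labels, num_classes)
    p_tgt = class_prototypes(tgt_feats, tgt_pseudo, num_classes)
    # Only align classes that appear in both domains in this batch.
    present = (p_src.abs().sum(-1) > 0) & (p_tgt.abs().sum(-1) > 0)
    if not present.any():
        return src_feats.new_tensor(0.0)
    return (1 - F.cosine_similarity(p_src[present], p_tgt[present], dim=-1)).mean()

src = torch.randn(20, 256); src_y = torch.randint(0, 8, (20,))
tgt = torch.randn(20, 256); tgt_y = torch.randint(0, 8, (20,))
print(prototype_alignment_loss(src, src_y, tgt, tgt_y, num_classes=8))
```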
Acoustic Resolution Photoacoustic Microscopy Imaging Enhancement: Integration of Group Sparsity With Deep Denoiser Prior
Zhengyuan Zhang;Zuozhou Pan;Zhuoyi Lin;Arunima Sharma;Chia-Wen Lin;Manojit Pramanik;Yuanjin Zheng
DOI: 10.1109/TIP.2025.3526065
Abstract: Acoustic resolution photoacoustic microscopy (AR-PAM) is a novel medical imaging modality that can be used for both structural and functional imaging in deep bio-tissue. However, its imaging resolution is degraded and structural details are lost because of its dependence on acoustic focusing, which significantly constrains its scope of applications in medical and clinical scenarios. To address this issue, model-based approaches incorporating traditional analytical prior terms have been employed, but these struggle to capture the finer details of anatomical bio-structures. In this paper, we propose an innovative prior for simultaneous reconstruction, named the group sparsity prior, which exploits the non-local structural similarity between patches extracted from internal AR-PAM images. It improves local image detail and resolution but also introduces artifacts. To mitigate the artifacts introduced by patch-based reconstruction, we further integrate an external image dataset as an extra information provider and consolidate the group sparsity prior with a deep denoiser prior, so that complementary information can be exploited to improve the reconstruction results. Extensive experiments are conducted on both simulated and in vivo AR-PAM imaging results. In the simulated images, the mean peak signal-to-noise ratio (PSNR) and structural similarity index measure (SSIM) increase from 16.36 dB and 0.46 to 27.62 dB and 0.92, respectively. The in vivo reconstructions also show superior local and global perceptual quality: the signal-to-noise ratio (SNR) and contrast-to-noise ratio (CNR) increase from 10.59 and 8.61 to 30.83 and 27.54, respectively. Additionally, reconstruction fidelity is validated against optical resolution photoacoustic microscopy (OR-PAM) data used as the reference image.
Pages: 522-537 | Published: 2025-01-10
Citations: 0
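For readers checking the reported numbers, the PSNR and CNR metrics quoted above follow standard definitions; a minimal implementation is sketched below. Which pixels form the signal and background regions for CNR is chosen per experiment, so the example regions here are purely illustrative.

```python
import numpy as np

def psnr(reference: np.ndarray, reconstructed: np.ndarray, data_range: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB (standard definition)."""
    mse = np.mean((reference.astype(np.float64) - reconstructed.astype(np.float64)) ** 2)
    return float(10.0 * np.log10(data_range ** 2 / mse))

def cnr(signal_roi: np.ndarray, background_roi: np.ndarray) -> float:
    """Contrast-to-noise ratio: |mean(signal) - mean(background)| / std(background).

    One common definition; the choice of regions of interest is application-specific.
    """
    return float(abs(signal_roi.mean() - background_roi.mean()) / background_roi.std())

ref = np.random.rand(128, 128)
rec = np.clip(ref + 0.02 * np.random.randn(128, 128), 0, 1)
print(psnr(ref, rec), cnr(rec[40:60, 40:60], rec[0:20, 0:20]))
```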
Difference-Complementary Learning and Label Reassignment for Multimodal Semi-Supervised Semantic Segmentation of Remote Sensing Images
Wenqi Han;Wen Jiang;Jie Geng;Wang Miao
DOI: 10.1109/TIP.2025.3526064
Abstract: The feature fusion of optical and Synthetic Aperture Radar (SAR) images is widely used for semantic segmentation of multimodal remote sensing images, leveraging information from two different sensors to enhance the analysis of land cover. However, the imaging characteristics of optical and SAR data differ vastly, and noise interference makes the fusion of multimodal information challenging. Furthermore, practical remote sensing applications typically offer only a limited number of labeled samples, with most pixels left unlabeled. Semi-supervised learning has the potential to improve model performance in such scenarios, but in remote sensing applications the quality of pseudo-labels is frequently compromised, particularly in challenging regions such as blurred edges and areas with class confusion, which degrades the model's overall performance. In this paper, we introduce the Difference-complementary Learning and Label Reassignment (DLLR) network for multimodal semi-supervised semantic segmentation of remote sensing images. The proposed DLLR framework leverages asymmetric masking to create information discrepancies between the optical and SAR modalities and employs a difference-guided complementary learning strategy to enable mutual learning. We then introduce a multi-level label reassignment strategy that treats label assignment as an optimal transport optimization problem, allocating unlabeled pixels to classes with higher precision and thereby enhancing the quality of pseudo-label annotations. Finally, we introduce a multimodal consistency cross pseudo-supervision strategy to improve pseudo-label utilization. We evaluate our method on two multimodal remote sensing datasets, WHU-OPT-SAR and EErDS-OPT-SAR. Experimental results demonstrate that the proposed DLLR model outperforms other relevant deep networks in multimodal semantic segmentation accuracy.
Pages: 566-580 | Published: 2025-01-10
Citations: 0
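Casting pseudo-label assignment as entropic optimal transport is commonly solved with Sinkhorn iterations; a generic sketch follows, assuming uniform class marginals for simplicity. The function name, the balanced-marginal assumption, and the temperature value are illustrative choices, not the paper's exact formulation.

```python
import torch

def sinkhorn_assignment(scores: torch.Tensor, n_iters: int = 50,
                        eps: float = 0.05) -> torch.Tensor:
    """Balanced soft assignment of N pixels to C classes via Sinkhorn iterations.

    `scores` (N, C) are class scores for unlabeled pixels (e.g. logits from the
    segmentation head). The iterations enforce doubly-constrained marginals
    (rows sum to 1/N, columns to 1/C), after which each pixel is reassigned to
    its highest-mass class.
    """
    n, c = scores.shape
    q = torch.exp((scores - scores.max()) / eps)  # stabilized exponentiation
    q = q / q.sum()
    for _ in range(n_iters):
        q = q / q.sum(dim=0, keepdim=True) / c    # normalize class marginals
        q = q / q.sum(dim=1, keepdim=True) / n    # normalize pixel marginals
    return q.argmax(dim=1)                        # hard reassigned pseudo-labels

scores = torch.randn(1000, 6)
print(sinkhorn_assignment(scores).bincount(minlength=6))  # roughly balanced counts
```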
Attention Guidance by Cross-Domain Supervision Signals for Scene Text Recognition
Fanfu Xue;Jiande Sun;Yaqi Xue;Qiang Wu;Lei Zhu;Xiaojun Chang;Sen-Ching Cheung
DOI: 10.1109/TIP.2024.3523799
Abstract: Despite recent advances, scene text recognition remains a challenging problem due to the significant variability, irregularity, and distortion in text appearance and localization. Attention-based methods have become the mainstream thanks to their superior vocabulary learning and observation ability. Nonetheless, they are susceptible to attention drift, which can lead to word recognition errors. Most works focus on correcting attention drift during decoding but completely ignore the error accumulated during the encoding process. In this paper, we propose a novel scheme, Attention Guidance by Cross-Domain Supervision Signals for Scene Text Recognition (ACDS-STR), which mitigates attention drift at the feature encoding stage. At the heart of the proposed scheme is the cross-domain attention guidance and feature encoding fusion module (CAFM), which uses the core areas of characters to recursively guide attention learning during encoding. With the precise attention information from CAFM, we propose a non-attention-based adaptive transformation decoder (ATD) to guarantee decoding performance and improve decoding speed. In the training stage, we fuse manual guidance and subjective learning to learn the core areas of characters, which notably improves the recognition performance of the model. Experiments on public benchmarks show state-of-the-art performance. The source code will be available at https://github.com/xuefanfu/ACDS-STR.
Pages: 717-728 | Published: 2025-01-10
Citations: 0