Image and Vision Computing: Latest Articles

ADPNet: Attention-Driven Dual-Path Network for automated polyp segmentation in colonoscopy
IF 4.2, CAS Tier 3 (Computer Science)
Image and Vision Computing, Pub Date: 2025-07-14, DOI: 10.1016/j.imavis.2025.105648 (Vol. 162, Article 105648)
Mukhtiar Khan, Inam Ullah, Nadeem Khan, Sumaira Hussain, Muhammad ILyas Khattak
Abstract: Accurate automated polyp segmentation in colonoscopy images is crucial for the early detection and treatment of colorectal cancer, a major global health concern. Effective segmentation aids clinical decision-making and surgical planning. Leveraging advances in deep learning, we propose an Attention-Driven Dual-Path Network (ADPNet) for precise polyp segmentation. ADPNet features a novel architecture with a specialized bridge integrating an Atrous Self-Attention Pyramid Module (ASAPM) and a Dilated Convolution-Transformer Module (DCTM) between the encoder and decoder, enabling efficient feature extraction, long-range dependency capture, and enriched semantic representation. The decoder employs pixel shuffle, gated attention mechanisms, and residual blocks to refine contextual and spatial features, ensuring precise boundary delineation and noise suppression. Comprehensive evaluations on public polyp datasets show that ADPNet outperforms state-of-the-art models, demonstrating superior accuracy and robustness, particularly in challenging scenarios such as small or concealed polyps. ADPNet offers a robust solution for automated polyp segmentation, with the potential to improve early colorectal cancer detection and clinical outcomes. The code and results of this article are publicly available at https://github.com/Mkhan143/ADPNet.
Citations: 0
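The authors' implementation is linked above. Purely as an illustration of the decoder ingredients the abstract names (pixel-shuffle upsampling, gated attention over skip connections, residual refinement), here is a minimal PyTorch sketch; the module layout, channel sizes, and gating form are assumptions, not ADPNet's actual code.

```python
import torch
import torch.nn as nn

class GatedAttentionUpBlock(nn.Module):
    """Minimal sketch of a decoder block combining pixel-shuffle upsampling,
    a gating signal over encoder skip features, and residual refinement."""
    def __init__(self, in_ch: int, skip_ch: int, out_ch: int):
        super().__init__()
        # Pixel shuffle doubles spatial resolution; channels shrink by 4x.
        self.up = nn.Sequential(
            nn.Conv2d(in_ch, out_ch * 4, kernel_size=1),
            nn.PixelShuffle(upscale_factor=2),
        )
        # Gate: a sigmoid map from decoder + skip features, used to suppress
        # noisy regions of the encoder skip connection.
        self.gate = nn.Sequential(
            nn.Conv2d(out_ch + skip_ch, skip_ch, kernel_size=1),
            nn.Sigmoid(),
        )
        self.fuse = nn.Conv2d(out_ch + skip_ch, out_ch, kernel_size=3, padding=1)
        # Residual refinement block.
        self.refine = nn.Sequential(
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
        )
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor, skip: torch.Tensor) -> torch.Tensor:
        x = self.up(x)                                  # upsample decoder features
        g = self.gate(torch.cat([x, skip], dim=1))      # per-pixel gate for the skip path
        fused = self.fuse(torch.cat([x, g * skip], dim=1))
        return self.act(fused + self.refine(fused))     # residual refinement

# Usage: upsample 14x14 decoder features and fuse with a 28x28 encoder skip.
dec = torch.randn(1, 256, 14, 14)
skip = torch.randn(1, 128, 28, 28)
block = GatedAttentionUpBlock(in_ch=256, skip_ch=128, out_ch=128)
print(block(dec, skip).shape)  # torch.Size([1, 128, 28, 28])
```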
Invariant prompting with classifier rectification for continual learning
IF 4.2, CAS Tier 3 (Computer Science)
Image and Vision Computing, Pub Date: 2025-07-11, DOI: 10.1016/j.imavis.2025.105641 (Vol. 162, Article 105641)
Chunsing Lo, Hao Zhang, Andy J. Ma
Abstract: Continual learning aims to train a model capable of continuously learning and retaining knowledge from a sequence of tasks. Recently, prompt-based continual learning has been proposed to leverage the generalization ability of a pre-trained model with task-specific prompts for instruction. Prompt component training is a promising approach to enhancing plasticity in prompt-based continual learning. Nevertheless, it makes the instructed features noisy for query samples from old tasks. Additionally, scale misalignment of classifier logits between different tasks leads to misclassification. To address these issues, we propose an Invariant Prompting with Classifier Rectification (iPrompt-CR) method for prompt-based continual learning. In our method, the learnable keys corresponding to each new-task component are constrained to be orthogonal to the query prototypes of the old tasks for invariant prompting, which reduces feature representation noise. After prompt learning, instructed features are sampled from Gaussian-distributed prototypes for classifier rectification with a unified logit scale, yielding more accurate predictions. Extensive experimental results on four benchmark datasets demonstrate that our method outperforms the state of the art in both class-incremental learning and more realistic general incremental learning scenarios.
Citations: 0
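As a rough sketch of one ingredient described above, the snippet below penalizes cosine similarity between learnable new-task prompt keys and stored old-task query prototypes, which is one straightforward way to realize an orthogonality constraint; the squared-cosine loss form and all shapes are assumptions rather than the authors' exact formulation.

```python
import torch
import torch.nn.functional as F

def orthogonality_penalty(new_keys: torch.Tensor,
                          old_prototypes: torch.Tensor) -> torch.Tensor:
    """Penalize cosine similarity between learnable new-task prompt keys
    (K_new x D) and frozen query prototypes of old tasks (P_old x D).
    Driving these similarities toward zero keeps new prompt components from
    being selected for (and thus perturbing) old-task queries."""
    new_keys = F.normalize(new_keys, dim=-1)
    old_prototypes = F.normalize(old_prototypes, dim=-1)
    sim = new_keys @ old_prototypes.t()        # (K_new, P_old) cosine similarities
    return (sim ** 2).mean()                   # squared so the sign does not matter

# Usage inside a training step (shapes are illustrative):
keys = torch.randn(10, 768, requires_grad=True)      # new-task prompt keys
protos = torch.randn(50, 768)                         # stored old-task prototypes
loss = orthogonality_penalty(keys, protos)
loss.backward()
print(float(loss))
```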
DFF-Net: Deep Feature Fusion Network for low-light image enhancement
IF 4.2, CAS Tier 3 (Computer Science)
Image and Vision Computing, Pub Date: 2025-07-09, DOI: 10.1016/j.imavis.2025.105645 (Vol. 161, Article 105645)
Hongchang Zhang, Longtao Wang, Qizhan Zou, Juan Zeng
Abstract: Low-light image enhancement methods are designed to improve brightness, recover texture details, restore color fidelity, and suppress noise in images captured in low-light environments. Although many low-light image enhancement methods have been proposed, existing methods still face two limitations: (1) the inability to achieve all of these objectives at the same time; and (2) heavy reliance on supervised training, which limits practical applicability in real-world scenarios. To overcome these challenges, we propose a Deep Feature Fusion Network (DFF-Net) for low-light image enhancement that builds upon Zero-DCE's light-enhancement curve. The network is trained without any paired datasets through a set of carefully designed non-reference loss functions. Furthermore, we develop a Fast Deep-level Residual Block (FDRB) to strengthen DFF-Net, which demonstrates superior performance in both feature extraction and computational efficiency. Comprehensive quantitative and qualitative experiments demonstrate that DFF-Net achieves superior performance in both subjective visual quality and downstream computer vision tasks. In low-light image enhancement experiments, DFF-Net achieves either optimal or sub-optimal metrics across all six public datasets compared to other unsupervised methods. In low-light object detection experiments, DFF-Net achieves the highest scores on four key metrics on the ExDark dataset: P of 83.3%, F1 of 72.8%, mAP50 of 74.9%, and mAP50-95 of 48.9%. Code is available at https://github.com/WangL0ngTa0/DFF-Net.
Citations: 0
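DFF-Net builds on Zero-DCE's light-enhancement curve; as a reminder of that building block (not the authors' full network), the sketch below applies the iterative quadratic curve LE(I) = I + alpha * I * (1 - I) with per-pixel, per-iteration parameter maps. The eight-iteration setting and the random alpha maps standing in for a CNN's output are illustrative assumptions.

```python
import torch

def apply_light_enhancement_curves(image: torch.Tensor,
                                   alphas: torch.Tensor) -> torch.Tensor:
    """Iteratively apply the Zero-DCE quadratic light-enhancement curve
    LE(I) = I + alpha * I * (1 - I), with a separate per-pixel alpha map
    for each iteration. `image` is (B, 3, H, W) in [0, 1]; `alphas` is
    (B, 3 * n_iter, H, W) with values in [-1, 1], normally predicted by a CNN."""
    n_iter = alphas.shape[1] // image.shape[1]
    out = image
    for i in range(n_iter):
        a = alphas[:, 3 * i: 3 * (i + 1)]      # curve parameters for this iteration
        out = out + a * out * (1.0 - out)      # monotonic, range-preserving curve
    return out.clamp(0.0, 1.0)

# Usage: 8 curve iterations, random parameter maps standing in for CNN output.
img = torch.rand(1, 3, 64, 64)
alpha_maps = torch.rand(1, 24, 64, 64) * 2 - 1
enhanced = apply_light_enhancement_curves(img, alpha_maps)
print(enhanced.shape, float(enhanced.min()), float(enhanced.max()))
```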
ACMC: Adaptive cross-modal multi-grained contrastive learning for continuous sign language recognition
IF 4.2, CAS Tier 3 (Computer Science)
Image and Vision Computing, Pub Date: 2025-07-09, DOI: 10.1016/j.imavis.2025.105622 (Vol. 161, Article 105622)
Xu-Hua Yang, Hong-Xiang Hu, XuanYu Lin
Abstract: Continuous sign language recognition (CSLR) helps the hearing-impaired community participate in social communication by recognizing the semantics of sign language videos. However, existing CSLR methods usually implement cross-modal alignment only at the sentence level or frame level, and do not fully consider the potential impact of redundant frames and semantically independent gloss identifiers on recognition results. To address these limitations, we propose adaptive cross-modal multi-grained contrastive learning (ACMC) for continuous sign language recognition, which achieves more accurate cross-modal semantic alignment through a multi-grained contrast mechanism. First, ACMC uses a frame extractor and a temporal modeling module to obtain the fine-grained and coarse-grained features of the visual modality in turn, and extracts the fine-grained and coarse-grained features of the text modality through the CLIP text encoder. Then, ACMC adopts coarse-grained and fine-grained contrast to align the features of the visual and text modalities from global and local perspectives, and alleviates the semantic interference caused by redundant frames and semantically independent gloss identifiers through cross-grained contrast. In addition, in the video frame extraction stage, we design an adaptive learning module that strengthens the features of key regions of video frames through a computed discrete spatial feature decision matrix and adaptively fuses the convolutional features of key frames with the trajectory information between adjacent frames, thereby reducing the computational cost. Experimental results show that the proposed ACMC model achieves very competitive recognition results on sign language datasets such as PHOENIX14, PHOENIX14-T, and CSL-Daily.
Citations: 0
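For the coarse-grained (video-level vs. sentence-level) contrast described above, a standard symmetric InfoNCE objective is one plausible realization; the sketch below shows that generic form, with the temperature and feature dimensions as assumptions rather than ACMC's actual settings.

```python
import torch
import torch.nn.functional as F

def symmetric_contrastive_loss(video_emb: torch.Tensor,
                               text_emb: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """CLIP-style symmetric InfoNCE over a batch of paired video/text
    embeddings (B x D). Matched pairs sit on the diagonal of the similarity
    matrix and are pulled together; all other pairs are pushed apart."""
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.t() / temperature                 # (B, B) similarity matrix
    targets = torch.arange(v.shape[0], device=v.device)
    loss_v2t = F.cross_entropy(logits, targets)      # video-to-text direction
    loss_t2v = F.cross_entropy(logits.t(), targets)  # text-to-video direction
    return 0.5 * (loss_v2t + loss_t2v)

# Usage with dummy coarse-grained (video-level / sentence-level) features.
video = torch.randn(8, 512)
text = torch.randn(8, 512)
print(float(symmetric_contrastive_loss(video, text)))
```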
BSMEF: Optimized multi-exposure image fusion using B-splines and Mamba
IF 4.2, CAS Tier 3 (Computer Science)
Image and Vision Computing, Pub Date: 2025-07-08, DOI: 10.1016/j.imavis.2025.105660 (Vol. 161, Article 105660)
Jinyong Cheng, Qinghao Cui, Guohua Lv
Abstract: In recent years, multi-exposure image fusion has been widely applied to process overexposed or underexposed images due to its simplicity, effectiveness, and low cost. With the development of deep learning techniques, related fusion methods have been continuously optimized. However, retaining global information from source images while preserving fine local details remains challenging, especially when fusing images with extreme exposure differences, where boundary transitions often exhibit shadows and noise. To address this, we propose a multi-exposure image fusion network, BSMEF, based on B-spline basis functions and Mamba. The B-spline basis function, known for its smoothness, reduces edge artifacts and enables smooth transitions between images with varying exposure levels. In BSMEF, the feature extraction module, combining B-splines and deformable convolutions, preserves global features while effectively extracting fine-grained local details. Additionally, we design a feature enhancement module based on Mamba blocks, leveraging their powerful global perception ability to capture contextual information. Furthermore, the fusion module integrates three feature enhancement methods, B-spline basis functions, attention mechanisms, and Fourier transforms, addressing shadow and noise issues at fusion boundaries and enhancing the focus on important features. Experimental results demonstrate that BSMEF outperforms existing methods across multiple public datasets.
Citations: 0
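The sketch below is not the BSMEF network; it only illustrates, on a toy two-exposure fusion, how a sampled cubic B-spline kernel can smooth per-pixel fusion weights to soften boundary transitions, the role the abstract attributes to B-spline basis functions. The well-exposedness weighting (borrowed from classical exposure fusion) and all parameters are assumptions.

```python
import torch
import torch.nn.functional as F

# Samples of the centered cubic B-spline at integer offsets:
# B3(-1) = 1/6, B3(0) = 2/3, B3(1) = 1/6.
BSPLINE3 = torch.tensor([1.0, 4.0, 1.0]) / 6.0

def bspline_smooth(weight: torch.Tensor) -> torch.Tensor:
    """Separable smoothing of a (B, 1, H, W) weight map with the sampled
    cubic B-spline kernel, giving smooth transitions at fusion boundaries."""
    kx = BSPLINE3.view(1, 1, 1, 3)
    ky = BSPLINE3.view(1, 1, 3, 1)
    weight = F.conv2d(weight, kx, padding=(0, 1))
    return F.conv2d(weight, ky, padding=(1, 0))

def fuse_two_exposures(under: torch.Tensor, over: torch.Tensor) -> torch.Tensor:
    """Toy two-image exposure fusion: per-pixel well-exposedness weights
    (peaked at mid-gray), smoothed with the B-spline kernel, then a
    normalized weighted blend. Inputs are (B, 3, H, W) in [0, 1]."""
    def well_exposedness(img):
        luma = img.mean(dim=1, keepdim=True)
        return torch.exp(-((luma - 0.5) ** 2) / (2 * 0.2 ** 2))
    w_under = bspline_smooth(well_exposedness(under))
    w_over = bspline_smooth(well_exposedness(over))
    total = w_under + w_over + 1e-6
    return (w_under * under + w_over * over) / total

# Usage with dummy under- and over-exposed frames.
u = torch.rand(1, 3, 64, 64) * 0.3
o = 0.7 + torch.rand(1, 3, 64, 64) * 0.3
print(fuse_two_exposures(u, o).shape)  # torch.Size([1, 3, 64, 64])
```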
BCDPose: Diffusion-based 3D Human Pose Estimation with bone-chain prior knowledge
IF 4.2, CAS Tier 3 (Computer Science)
Image and Vision Computing, Pub Date: 2025-07-08, DOI: 10.1016/j.imavis.2025.105636 (Vol. 162, Article 105636)
Xing Liu, Hao Tang
Abstract: Recently, diffusion-based methods have emerged as the gold standard in 3D human pose estimation, largely thanks to their exceptional generative capabilities. Researchers have made concerted efforts to develop spatial and temporal denoisers built from transformer blocks in diffusion-based methods. However, existing transformer-based denoisers in diffusion models often overlook the implicit structural and kinematic supervision available from prior knowledge of human biomechanics, including the bone-chain structure and joint kinematics, which could otherwise enhance performance. We hold the view that joint movements are intrinsically constrained by neighboring joints within the bone chain and by kinematic hierarchies. We therefore propose a Bone-Chain enhanced Diffusion 3D pose estimation method, BCDPose. In this method, we introduce novel bone-chain prior-knowledge-enhanced transformer blocks within the denoiser to reconstruct contaminated 3D pose data. Additionally, we propose the Joint-DoF Hierarchical Temporal Embedding framework, which incorporates prior knowledge of joint kinematics. By integrating body hierarchy and temporal dependencies, this framework effectively captures the complexity of human motion, enabling accurate and robust pose estimation. This provides a comprehensive framework for 3D human pose estimation that explicitly models joint kinematics, overcoming the limitations of prior methods that fail to capture the intrinsic dynamics of human motion. We conduct extensive experiments on various open benchmarks to evaluate the effectiveness of BCDPose. The results demonstrate that BCDPose achieves highly competitive results compared with other state-of-the-art models, underscoring its promising potential and practical applicability in 2D–3D human pose estimation tasks. We plan to release our code publicly for further research.
Citations: 0
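One simple way to encode a bone-chain prior for a transformer denoiser is to bias attention toward joints that are close on the kinematic tree; the sketch below builds such a mask from a parent array. The 17-joint Human3.6M-style skeleton, the two-hop radius, and the additive-bias usage are assumptions, not BCDPose's actual design.

```python
import torch

# Parent of each joint in a 17-joint Human3.6M-style skeleton (an assumption;
# BCDPose's exact joint set is not specified in the abstract). -1 marks the root.
PARENTS = [-1, 0, 1, 2, 0, 4, 5, 0, 7, 8, 9, 8, 11, 12, 8, 14, 15]

def bone_chain_mask(parents, max_hops: int = 2) -> torch.Tensor:
    """Boolean (J, J) mask that is True where two joints are within
    `max_hops` steps of each other along the kinematic tree. Such a mask
    can bias a transformer denoiser toward bone-chain neighbors."""
    j = len(parents)
    adj = torch.eye(j, dtype=torch.bool)          # self-loops
    for child, parent in enumerate(parents):
        if parent >= 0:
            adj[child, parent] = adj[parent, child] = True
    reach = adj.clone()
    for _ in range(max_hops - 1):                 # expand reachability hop by hop
        reach = reach | (reach.float() @ adj.float() > 0)
    return reach

mask = bone_chain_mask(PARENTS, max_hops=2)
# Convert to an additive attention bias: 0 inside the chain, -inf outside.
bias = torch.where(mask, torch.zeros(()), torch.full((), float("-inf")))
print(mask.shape, int(mask.sum()))
```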
Multi-scale feature fusion with task-specific data synthesis for pneumonia pathogen classification
IF 4.2, CAS Tier 3 (Computer Science)
Image and Vision Computing, Pub Date: 2025-07-07, DOI: 10.1016/j.imavis.2025.105662 (Vol. 162, Article 105662)
Yinzhe Cui, Jing Liu, Ze Teng, Shuangfeng Yang, Hongfeng Li, Pingkang Li, Jiabin Lu, Yajuan Gao, Yun Peng, Hongbin Han, Wanyi Fu
Abstract: Pneumonia pathogen diagnosis from chest X-rays (CXR) is essential for timely and effective treatment of pediatric patients. However, the radiographic manifestations of pediatric pneumonia are often less distinct than those in adults, making pathogen diagnosis challenging even for experienced clinicians. In this work, we propose a novel framework that integrates an adaptive hierarchical fusion network (AHFF) with task-specific diffusion-based data synthesis for pediatric pneumonia pathogen classification on clinical CXR. AHFF consists of dual branches that extract global and local features, and an adaptive feature fusion module that hierarchically integrates semantic information using cross-attention mechanisms. Further, we develop a classifier-guided diffusion model that uses the task-specific AHFF classifier to generate class-consistent chest X-ray images for data augmentation. Experiments on one private and two public datasets demonstrate that the proposed classification model achieves superior performance, with accuracies of 78.00%, 84.43%, and 91.73%, respectively. Diffusion-based augmentation further improves accuracy to 84.37% on the private dataset. These results highlight the potential of feature fusion and data synthesis for improving automated pathogen-specific pneumonia diagnosis in clinical settings.
Citations: 0
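The abstract describes an adaptive fusion module that integrates the two branches with cross-attention; the sketch below shows a generic cross-attention fusion block in PyTorch in which global-branch tokens query local-branch tokens. Token counts, dimensions, and the residual/MLP layout are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Minimal sketch of cross-attention fusion between a global-branch
    token sequence and a local-branch token sequence: global tokens act
    as queries and gather complementary detail from local tokens."""
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim))

    def forward(self, global_tokens: torch.Tensor, local_tokens: torch.Tensor) -> torch.Tensor:
        attended, _ = self.attn(query=global_tokens, key=local_tokens, value=local_tokens)
        x = self.norm(global_tokens + attended)   # residual connection + norm
        return x + self.mlp(x)                    # lightweight feed-forward refinement

# Usage: fuse 49 global tokens with 196 local tokens for one image.
g = torch.randn(1, 49, 256)
l = torch.randn(1, 196, 256)
fused = CrossAttentionFusion()(g, l)
print(fused.shape)  # torch.Size([1, 49, 256])
```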
MaxSwap-Enhanced Knowledge Consistency Learning for long-tailed recognition
IF 4.2, CAS Tier 3 (Computer Science)
Image and Vision Computing, Pub Date: 2025-07-07, DOI: 10.1016/j.imavis.2025.105643 (Vol. 161, Article 105643)
Shengnan Fan, Zhilei Chai, Zhijun Fang, Yuying Pan, Hui Shen, Xiangyu Cheng, Qin Wu
Abstract: Deep learning has made significant progress in image classification. However, real-world datasets often exhibit a long-tailed distribution, where a few head classes dominate while many tail classes have very few samples. This imbalance leads to poor performance on tail classes. To address this issue, we propose MaxSwap-Enhanced Knowledge Consistency Learning, which includes two core components: Knowledge Consistency Learning and MaxSwap for Confusion Suppression. Knowledge Consistency Learning leverages the outputs from different augmented views as soft labels to capture inter-class similarities and introduces a consistency constraint to enforce output consistency across different perturbations, which enables tail classes to effectively learn from head classes with similar features. To alleviate the bias towards head classes, we further propose MaxSwap for Confusion Suppression, which adaptively adjusts the soft labels when the model makes incorrect predictions, mitigating overconfidence in those predictions. Experimental results demonstrate that our method achieves significant improvements on long-tailed datasets such as CIFAR10-LT, CIFAR100-LT, ImageNet-LT, and Places-LT, validating the effectiveness of our approach.
Citations: 0
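The abstract does not spell out the MaxSwap operation; one plausible reading is that, for mispredicted samples, the soft-label mass at the wrongly predicted class is swapped with the mass at the true class so the target always dominates. The sketch below implements that reading and should be taken as a hypothetical interpretation, not the authors' definition.

```python
import torch

def max_swap(soft_labels: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """For each sample whose soft label's argmax disagrees with the ground
    truth, swap the probability mass at the (wrong) argmax position with the
    mass at the true class, so the true class always dominates while the
    inter-class similarity structure of the remaining entries is preserved."""
    adjusted = soft_labels.clone()
    pred = soft_labels.argmax(dim=1)
    wrong = pred != targets
    idx = wrong.nonzero(as_tuple=True)[0]      # indices of mispredicted samples
    p, t = pred[idx], targets[idx]
    top_vals = adjusted[idx, p].clone()
    true_vals = adjusted[idx, t].clone()
    adjusted[idx, p] = true_vals
    adjusted[idx, t] = top_vals
    return adjusted

# Usage: sample 0 is mispredicted (class 2 instead of 0) and gets swapped.
soft = torch.tensor([[0.2, 0.1, 0.7], [0.6, 0.3, 0.1]])
targets = torch.tensor([0, 0])
print(max_swap(soft, targets))
# tensor([[0.7000, 0.1000, 0.2000],
#         [0.6000, 0.3000, 0.1000]])
```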
MG-KG: Unsupervised video anomaly detection based on motion guidance and knowledge graph
IF 4.2, CAS Tier 3 (Computer Science)
Image and Vision Computing, Pub Date: 2025-07-07, DOI: 10.1016/j.imavis.2025.105644 (Vol. 162, Article 105644)
Qiyue Sun, Yang Yang, Haoxuan Xu, Zezhou Li, Yunxia Liu, Hongjun Wang
Abstract: Unsupervised video anomaly detection (VAD) is a challenging and valuable research task in which models are trained with only normal samples to detect anomalous samples. However, current solutions face two key issues: (1) a lack of spatio-temporal linkage in video data, and (2) limited interpretability of VAD results. To address these, we propose a new method named Motion Guidance-Knowledge Graph (MG-KG), inspired by video saliency detection and video understanding methods. Specifically, MG-KG has two components: the Motion Guidance Network (MGNet) and the Knowledge Graph retrieval for VAD (VAD-KG). MGNet emphasizes motion in the video foreground, which is crucial for real-time surveillance, while VAD-KG builds a knowledge graph to store structured video information and retrieves it during testing, enhancing interpretability. This combination improves both generalization and understanding in VAD for smart surveillance systems. Additionally, since the training data contain only normal samples, we propose a training baseline strategy, a tabu search strategy, and a score rectification strategy to enhance MG-KG for video anomaly detection, which further exploit the potential of MG-KG and significantly improve VAD performance. Extensive experiments demonstrate that MG-KG achieves competitive results in VAD for intelligent video surveillance.
Citations: 0
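As a stand-in for the motion-guidance idea (emphasizing moving foreground when scoring anomalies), the toy sketch below derives per-pixel weights from blurred temporal frame differences and uses them to reweight a reconstruction-error map; MGNet itself is a learned network, so everything here, including the box blur and normalization, is an illustrative assumption.

```python
import torch
import torch.nn.functional as F

def motion_guidance_weights(frames: torch.Tensor, blur_ks: int = 5) -> torch.Tensor:
    """Toy motion-guidance map: absolute temporal differences between
    consecutive grayscale frames, blurred and normalized to [0, 1].
    `frames` is (T, 1, H, W); the result (T-1, 1, H, W) can up-weight
    moving foreground regions in a per-frame anomaly score."""
    diff = (frames[1:] - frames[:-1]).abs()                     # temporal differences
    kernel = torch.ones(1, 1, blur_ks, blur_ks) / (blur_ks * blur_ks)
    smoothed = F.conv2d(diff, kernel, padding=blur_ks // 2)     # box blur
    maxval = smoothed.amax(dim=(2, 3), keepdim=True).clamp_min(1e-6)
    return smoothed / maxval                                    # per-frame normalization

# Usage: weight a per-pixel reconstruction error map by motion saliency.
clip = torch.rand(8, 1, 64, 64)
weights = motion_guidance_weights(clip)
recon_error = torch.rand(7, 1, 64, 64)
frame_scores = (weights * recon_error).mean(dim=(1, 2, 3))
print(frame_scores.shape)  # torch.Size([7])
```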
Composed image retrieval by Multimodal Mixture-of-Expert Synergy
IF 4.2, CAS Tier 3 (Computer Science)
Image and Vision Computing, Pub Date: 2025-07-07, DOI: 10.1016/j.imavis.2025.105634 (Vol. 161, Article 105634)
Wenzhe Zhai, Mingliang Gao, Gwanggil Jeon, Qiang Zhou, David Camacho
Abstract: Composed image retrieval (CIR) is essential in security surveillance, e-commerce, and social media analysis, providing precise information retrieval and intelligent analysis solutions for various industries. The majority of existing CIR models create a pseudo-word token from the reference image, which is subsequently incorporated into the corresponding caption for the image retrieval task. However, these pseudo-word-based prompting approaches are limited when the target image entails complex modifications to the reference image, such as object removal and attribute changes. To address this issue, we propose a Multimodal Mixture-of-Expert Synergy (MMES) model for effective composed image retrieval. The MMES model first utilizes multiple Mixture-of-Experts (MoE) modules, through the mixture expert unit, to process various types of multimodal input data. The outputs of these expert models are then fused through a cross-modal integration module. Furthermore, the fused features generate implicit text embedding prompts, which are concatenated with the relative descriptions. Finally, retrieval is conducted using a text encoder and an image encoder. Experiments demonstrate that the proposed method outperforms state-of-the-art CIR methods on the CIRR and Fashion-IQ datasets.
Citations: 0
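A soft-gated mixture-of-experts layer is the generic mechanism behind the expert modules mentioned above; the sketch below mixes several small MLP experts over concatenated image/text features into one fused embedding that could serve as an implicit prompt token. Expert count, gating, and dimensions are assumptions, not MMES's configuration.

```python
import torch
import torch.nn as nn

class SoftMoEFusion(nn.Module):
    """Minimal soft-gated mixture-of-experts: each expert is a small MLP
    over the concatenated image/text features, and a softmax gate mixes
    the expert outputs into a single fused embedding (e.g., a pseudo-token
    to prepend to the relative caption)."""
    def __init__(self, img_dim: int, txt_dim: int, out_dim: int, n_experts: int = 4):
        super().__init__()
        in_dim = img_dim + txt_dim
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(in_dim, out_dim), nn.GELU(), nn.Linear(out_dim, out_dim))
            for _ in range(n_experts)
        )
        self.gate = nn.Linear(in_dim, n_experts)

    def forward(self, img_feat: torch.Tensor, txt_feat: torch.Tensor) -> torch.Tensor:
        x = torch.cat([img_feat, txt_feat], dim=-1)
        weights = self.gate(x).softmax(dim=-1)                   # (B, E) gating weights
        outs = torch.stack([e(x) for e in self.experts], dim=1)  # (B, E, D) expert outputs
        return (weights.unsqueeze(-1) * outs).sum(dim=1)         # gated mixture

# Usage: fuse CLIP-sized image/text features into a 512-d pseudo-token embedding.
fusion = SoftMoEFusion(img_dim=512, txt_dim=512, out_dim=512)
print(fusion(torch.randn(2, 512), torch.randn(2, 512)).shape)  # torch.Size([2, 512])
```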