Improving vision-language models through intra-modal contrastive learning-based hard sample mining

Impact factor: 5.5 | Zone 2 (Computer Science) | JCR Q1 (Computer Science, Artificial Intelligence)
Chaojun Lin, Ying Shi, Gang Wang, Shijian Liu
Journal: Neurocomputing, Volume 652, Article 131047
Publication date: 2025-07-24
Publication type: Journal Article
DOI: 10.1016/j.neucom.2025.131047
Open access: No
Full text: https://www.sciencedirect.com/science/article/pii/S0925231225017199
Citations: 0

Abstract

Driving environment perception is a core component of autonomous driving systems. Vision-language detectors, known for their superior detection accuracy, have gradually replaced traditional detectors and are increasingly applied in open-world driving scenarios. However, these detectors still miss hard positive samples. This study finds that a primary cause is that hard samples are rejected because of their low cross-modal consistency. To address this, the work proposes a contrastive learning strategy based on a hard sample prototype memory bank to recall potential positive samples. Additionally, to enhance the representational capacity of the detection network, an instance-level contrastive learning loss is introduced. This loss aligns the feature representations of the same instance across deep and shallow network layers, thereby improving the ability of shallow layers to extract features from hard samples. Experimental results demonstrate that the proposed method achieves outstanding detection accuracy and is highly effective in complex urban road scenarios. The code and trained models are available at https://github.com/unbelieboomboom/HSMG_DINO.
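
The two ideas named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (their code is in the repository linked above); it assumes an InfoNCE-style loss for the instance-level alignment between shallow and deep features, and a per-class prototype bank with an exponential moving average update and a cosine-similarity threshold for recall. The names `instance_info_nce` and `PrototypeMemoryBank`, and the `temperature`, `momentum`, and `threshold` parameters, are hypothetical choices made for illustration only.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(l2_normalize(a), l2_normalize(b)))


def instance_info_nce(shallow, deep, temperature=0.1):
    """Mean InfoNCE loss (assumed form, not taken from the paper).

    shallow[i] and deep[i] are features of the same instance (the positive
    pair); every deep[j] with j != i acts as a negative for shallow[i].
    """
    if not shallow or len(shallow) != len(deep):
        raise ValueError("need two non-empty feature lists of equal length")
    total = 0.0
    for i, anchor in enumerate(shallow):
        logits = [cosine(anchor, d) / temperature for d in deep]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_denominator = peak + math.log(sum(math.exp(z - peak) for z in logits))
        total += log_denominator - logits[i]
    return total / len(shallow)


class PrototypeMemoryBank:
    """Hypothetical bank holding one hard-sample prototype per class."""

    def __init__(self, momentum=0.9):
        self.momentum = momentum
        self.prototypes = {}

    def update(self, label, feature):
        """Blend a new hard-sample feature into the class prototype (EMA)."""
        feature = l2_normalize(feature)
        old = self.prototypes.get(label)
        if old is None:
            self.prototypes[label] = feature
            return
        m = self.momentum
        mixed = [m * o + (1.0 - m) * f for o, f in zip(old, feature)]
        self.prototypes[label] = l2_normalize(mixed)

    def recall(self, feature, threshold=0.8):
        """Return the closest class label if it is similar enough, else None."""
        best_label, best_sim = None, -1.0
        for label, proto in self.prototypes.items():
            sim = cosine(feature, proto)
            if sim > best_sim:
                best_label, best_sim = label, sim
        return best_label if best_sim >= threshold else None
```

In this sketch the loss is small when each shallow feature is closest to the deep feature of its own instance, and large when the pairing is mismatched. The bank compares a candidate only against stored visual prototypes, so a candidate can be recalled without passing a cross-modal (image-text) consistency check.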
Source journal: Neurocomputing
Category: Engineering Technology - Computer Science: Artificial Intelligence
CiteScore: 13.10
Self-citation rate: 10.00%
Articles published: 1382
Review time: 70 days
About the journal: Neurocomputing publishes articles describing recent fundamental contributions in the field of neurocomputing. It covers neurocomputing theory, practice, and applications.