Improving vision-language models through intra-modal contrastive learning-based hard sample mining

Impact factor 5.5 | Region 2: Computer Science | JCR Q1: Computer Science, Artificial Intelligence

Neurocomputing | Pub Date: 2025-07-24 | DOI: 10.1016/j.neucom.2025.131047

Chaojun Lin, Ying Shi, Gang Wang, Shijian Liu

Neurocomputing, Volume 652, Article 131047. Full text: https://www.sciencedirect.com/science/article/pii/S0925231225017199

Citations: 0

Abstract

Improving vision-language models through intra-modal contrastive learning-based hard sample mining

Driving environmental perception is a core component of autonomous driving systems. Recently, emerging vision-language detectors, known for their superior detection accuracy, have gradually replaced traditional detectors and are increasingly applied in open-world driving scenarios. However, these detectors still miss hard positive samples. This study identifies a primary cause of this problem: hard samples are rejected because of their low cross-modal consistency. To address this challenge, this work proposes a contrastive learning strategy based on a hard sample prototype memory bank to recall potential positive samples. Additionally, to enhance the representational capacity of the detection network, an instance-level contrastive learning loss is introduced. This loss aligns the feature representations of the same instance across the deep and shallow network layers, thereby improving the ability of shallow layers to extract features from hard samples. Experimental results demonstrate that the proposed method achieves outstanding detection accuracy and is highly effective in complex urban road scenarios. The code and trained models are available at https://github.com/unbelieboomboom/HSMG_DINO.
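The two ideas in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation (the released repository is the authoritative source); it only assumes a standard InfoNCE loss for the shallow/deep instance alignment and a running-mean prototype per class for the memory bank. The names `info_nce` and `PrototypeBank`, and the values of `temperature`, `momentum`, and `threshold`, are hypothetical choices made for illustration.

```python
import math


def normalize(v):
    """Scale a vector to unit L2 norm (returned as a copy if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))


def info_nce(shallow, deep, temperature=0.1):
    """Instance-level InfoNCE (assumed form): shallow[i] should match deep[i]
    and be pushed away from deep[j] for j != i."""
    total = 0.0
    for i, anchor in enumerate(shallow):
        logits = [cosine(anchor, d) / temperature for d in deep]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(shallow)


class PrototypeBank:
    """Hypothetical memory bank: one running-mean prototype per class,
    built from features of hard samples."""

    def __init__(self, momentum=0.9):
        self.momentum = momentum  # assumed update rule, not from the paper
        self.prototypes = {}

    def update(self, label, feature):
        """Blend a new hard-sample feature into the class prototype."""
        feature = normalize(feature)
        old = self.prototypes.get(label)
        if old is None:
            self.prototypes[label] = feature
        else:
            mixed = [self.momentum * o + (1 - self.momentum) * f
                     for o, f in zip(old, feature)]
            self.prototypes[label] = normalize(mixed)

    def recall(self, feature, threshold=0.8):
        """Return the best-matching class if its similarity clears the
        threshold, otherwise None (the candidate stays rejected)."""
        best_label, best_sim = None, -1.0
        for label, proto in self.prototypes.items():
            sim = cosine(feature, proto)
            if sim > best_sim:
                best_label, best_sim = label, sim
        return best_label if best_sim >= threshold else None


# Toy usage: two instances with 2-D features.
shallow = [[1.0, 0.0], [0.0, 1.0]]
aligned = info_nce(shallow, [[0.9, 0.1], [0.1, 0.9]])
misaligned = info_nce(shallow, [[0.1, 0.9], [0.9, 0.1]])

bank = PrototypeBank()
bank.update("pedestrian", [1.0, 0.2])
recalled = bank.recall([0.9, 0.3])
rejected = bank.recall([-1.0, 0.0])
```

In this toy setup the loss is lower when each shallow feature points the same way as its own deep feature, and a candidate is recalled only when it lies close to a stored prototype in the visual feature space, which is the intra-modal check the abstract contrasts with cross-modal consistency.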


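Source journal

Neurocomputing (Engineering & Technology - Computer Science: Artificial Intelligence)

CiteScore: 13.10

Self-citation rate: 10.00%

Articles published: 1382

Review time: 70 days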

About the journal: Neurocomputing publishes articles describing recent fundamental contributions in the field of neurocomputing. Neurocomputing theory, practice and applications are the essential topics covered.