Improving vision-language models through intra-modal contrastive learning-based hard sample mining
Authors: Chaojun Lin, Ying Shi, Gang Wang, Shijian Liu
Journal: Neurocomputing, volume 652, Article 131047 (published 2025-07-24)
DOI: 10.1016/j.neucom.2025.131047
URL: https://www.sciencedirect.com/science/article/pii/S0925231225017199
Environmental perception is a core component of autonomous driving systems. Vision-language detectors, known for their superior detection accuracy, have gradually replaced traditional detectors and are increasingly applied in open-world driving scenarios. However, these detectors still miss hard positive samples. This study identifies a primary cause of the problem: hard samples are rejected because of their low cross-modal consistency. To address this, the work proposes a contrastive learning strategy based on a hard sample prototype memory bank to recall potential positive samples. Additionally, to enhance the representational capacity of the detection network, an instance-level contrastive learning loss is introduced. This loss aligns the feature representations of the same instance across the deep and shallow network layers, thereby improving the ability of shallow layers to extract features from hard samples. Experimental results demonstrate that the proposed method achieves outstanding detection accuracy and is highly effective in complex urban road scenarios. The code and trained models are available at https://github.com/unbelieboomboom/HSMG_DINO.
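The abstract does not give the form of the instance-level contrastive loss. As a rough illustration only, the sketch below assumes an InfoNCE-style objective in which the shallow-layer and deep-layer features of the same instance form a positive pair and all other instances in the batch act as negatives. The function names, the temperature value, and the use of plain Python lists are hypothetical choices for this example and are not taken from the paper or its repository.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def info_nce(shallow, deep, temperature=0.1):
    """Mean InfoNCE loss over a batch of instances.

    Assumption for this sketch: shallow[i] and deep[i] are features of the
    same instance (positive pair); deep[j] with j != i are negatives.
    """
    s = [l2_normalize(v) for v in shallow]
    d = [l2_normalize(v) for v in deep]
    total = 0.0
    for i, anchor in enumerate(s):
        # Cosine similarity of the anchor to every deep feature, scaled by temperature.
        logits = [sum(a * b for a, b in zip(anchor, other)) / temperature for other in d]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(z - m) for z in logits))
        # Negative log-probability assigned to the matching instance.
        total += log_denom - logits[i]
    return total / len(s)


# Two instances with 2-D features: aligned pairs versus deliberately swapped pairs.
aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)
```

Under this assumed objective, the loss is small when each shallow feature is closest to the deep feature of its own instance and large when it is closer to another instance, which is the alignment behaviour the abstract describes.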
About the journal:
Neurocomputing publishes articles describing recent fundamental contributions in the field of neurocomputing. Neurocomputing theory, practice and applications are the essential topics being covered.