Think Twice Before Recognizing: Large Multimodal Models for General Fine-grained Traffic Sign Recognition

Yaozong Gan, Guang Li, Ren Togo, Keisuke Maeda, Takahiro Ogawa, Miki Haseyama
arXiv - CS - Multimedia. Published 2024-09-03. DOI: arxiv-2409.01534 (https://doi.org/arxiv-2409.01534)
Citations: 0

Abstract

We propose a new strategy, "think twice before recognizing", to improve fine-grained traffic sign recognition (TSR). Fine-grained TSR in the wild is difficult because of complex road conditions, and existing approaches struggle in particular with cross-country TSR when data is lacking. Our strategy achieves effective fine-grained TSR by stimulating the multiple-thinking capability of large multimodal models (LMMs). We introduce context, characteristic, and differential descriptions to design multiple thinking processes for the LMM. The context descriptions, with center coordinate prompt optimization, help the LMM locate the target traffic sign in original road images that contain multiple traffic signs, and filter out irrelevant answers through the proposed prior traffic sign hypothesis. The characteristic description is based on few-shot in-context learning of template traffic signs, which decreases the cross-domain difference and enhances the fine-grained recognition capability of the LMM. The differential descriptions of similar traffic signs optimize the multimodal thinking capability of the LMM. The proposed method is independent of training data and requires only simple and uniform instructions. We conducted extensive experiments on three benchmark datasets and two real-world datasets from different countries, and the proposed method achieves state-of-the-art TSR results on all five datasets.
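The abstract does not give the wording of the prompts, so the following is only a rough sketch of how a center coordinate, a restricted candidate list, and characteristic and differential descriptions might be combined into text for an LMM. Every name here (`SignCandidate`, `normalized_center`, `build_prompt`), the prompt wording, and the choice of normalized coordinates are assumptions made for illustration, not the authors' implementation. For brevity it collapses everything into one prompt, whereas the abstract speaks of multiple thinking processes, and it takes the descriptions as given inputs rather than showing how they are obtained.

```python
from dataclasses import dataclass


@dataclass
class SignCandidate:
    """A candidate class with hypothetical text descriptions."""
    name: str
    characteristic: str      # what the template sign looks like
    differential: str = ""   # how it differs from similar-looking signs


def normalized_center(box, image_size):
    """Return the box center as (x, y) fractions of image width and height."""
    x_min, y_min, x_max, y_max = box
    width, height = image_size
    return ((x_min + x_max) / 2 / width, (y_min + y_max) / 2 / height)


def build_prompt(box, image_size, candidates):
    """Assemble one text prompt from the three kinds of description."""
    cx, cy = normalized_center(box, image_size)
    lines = [
        # Context: point the model at the target sign and restrict the answers.
        f"The target traffic sign is centered at ({cx:.2f}, {cy:.2f}) in the image.",
        "It is one of: " + ", ".join(c.name for c in candidates) + ".",
        # Characteristic: one description per template sign.
        "Candidate descriptions:",
    ]
    for c in candidates:
        lines.append(f"- {c.name}: {c.characteristic}")
    # Differential: only listed for candidates that have a look-alike note.
    similar = [c for c in candidates if c.differential]
    if similar:
        lines.append("Differences between similar signs:")
        lines.extend(f"- {c.name}: {c.differential}" for c in similar)
    lines.append("Answer with exactly one candidate name.")
    return "\n".join(lines)
```

The resulting string would be sent to an LMM together with the road image. Restricting the answer to a fixed candidate list is one simple way to filter irrelevant answers, in the spirit of the prior traffic sign hypothesis the abstract mentions.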