Visual Matching is Enough for Scene Text Retrieval

L. Wen, Yingrong Wang, Dongxiang Zhang, Gang Chen
{"title":"视觉匹配足以实现场景文本检索","authors":"L. Wen, Yingrong Wang, Dongxiang Zhang, Gang Chen","doi":"10.1145/3539597.3570428","DOIUrl":null,"url":null,"abstract":"Given a text query, the task of scene text retrieval aims at searching and localizing all the text instances that are contained in an image gallery. The state-of-the-art method learns a cross-modal similarity between the query text and the detected text regions in natural images to facilitate retrieval. However, this cross-modal approach still cannot well bridge the heterogeneity gap between the text and image modalities. In this paper, we propose a new paradigm that converts the task into a single-modality retrieval problem. Unlike previous works that rely on character recognition or embedding, we directly leverage pictorial information by rendering query text into images to learn the glyph feature of each character, which can be utilized to capture the similarity between query and scene text images. With the extracted visual features, we devise a synthetic label image guided feature alignment mechanism that is robust to different scene text styles and layouts. The modules of glyph feature learning, text instance detection, and visual matching are jointly trained in an end-to-end framework. Experimental results show that our proposed paradigm achieves the best performance in multiple benchmark datasets. As a side product, our method can also be easily generalized to support text queries with unseen characters or languages in a zero-shot manner.","PeriodicalId":227804,"journal":{"name":"Proceedings of the Sixteenth ACM International Conference on Web Search and Data Mining","volume":"8 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-02-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":"{\"title\":\"Visual Matching is Enough for Scene Text Retrieval\",\"authors\":\"L. Wen, Yingrong Wang, Dongxiang Zhang, Gang Chen\",\"doi\":\"10.1145/3539597.3570428\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Given a text query, the task of scene text retrieval aims at searching and localizing all the text instances that are contained in an image gallery. The state-of-the-art method learns a cross-modal similarity between the query text and the detected text regions in natural images to facilitate retrieval. However, this cross-modal approach still cannot well bridge the heterogeneity gap between the text and image modalities. In this paper, we propose a new paradigm that converts the task into a single-modality retrieval problem. Unlike previous works that rely on character recognition or embedding, we directly leverage pictorial information by rendering query text into images to learn the glyph feature of each character, which can be utilized to capture the similarity between query and scene text images. With the extracted visual features, we devise a synthetic label image guided feature alignment mechanism that is robust to different scene text styles and layouts. The modules of glyph feature learning, text instance detection, and visual matching are jointly trained in an end-to-end framework. Experimental results show that our proposed paradigm achieves the best performance in multiple benchmark datasets. 
As a side product, our method can also be easily generalized to support text queries with unseen characters or languages in a zero-shot manner.\",\"PeriodicalId\":227804,\"journal\":{\"name\":\"Proceedings of the Sixteenth ACM International Conference on Web Search and Data Mining\",\"volume\":\"8 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2023-02-27\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"2\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the Sixteenth ACM International Conference on Web Search and Data Mining\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3539597.3570428\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the Sixteenth ACM International Conference on Web Search and Data Mining","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3539597.3570428","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 2

Abstract

Given a text query, scene text retrieval aims to search for and localize all text instances contained in an image gallery. The state-of-the-art method learns a cross-modal similarity between the query text and text regions detected in natural images to facilitate retrieval. However, this cross-modal approach still cannot fully bridge the heterogeneity gap between the text and image modalities. In this paper, we propose a new paradigm that converts the task into a single-modality retrieval problem. Unlike previous works that rely on character recognition or character embedding, we directly exploit pictorial information: the query text is rendered into images to learn the glyph feature of each character, and these glyph features capture the similarity between the query and scene text images. On top of the extracted visual features, we devise a synthetic label image guided feature alignment mechanism that is robust to different scene text styles and layouts. The modules of glyph feature learning, text instance detection, and visual matching are jointly trained in an end-to-end framework. Experimental results show that the proposed paradigm achieves the best performance on multiple benchmark datasets. As a by-product, our method also generalizes easily to text queries with unseen characters or languages in a zero-shot manner.
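To make the single-modality idea concrete, below is a minimal PyTorch sketch of the retrieval step the abstract describes. It is an illustration under stated assumptions rather than the authors' implementation: render_query, GlyphEncoder, and rank_crops are hypothetical names, the toy CNN stands in for the paper's glyph feature network, and random tensors stand in for detected text-region crops.

```python
# A minimal sketch of single-modality scene text retrieval, NOT the paper's
# implementation: render the query string into a synthetic label image, embed
# both the rendered query and each detected scene-text crop with one shared
# visual encoder, and rank crops by cosine similarity. Encoder architecture,
# image size, and font are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision.transforms.functional as TF
from PIL import Image, ImageDraw, ImageFont


def render_query(text: str, size=(32, 128)) -> torch.Tensor:
    """Render a query string onto a blank canvas (the 'label image')."""
    canvas = Image.new("L", (size[1], size[0]), color=255)  # white background
    draw = ImageDraw.Draw(canvas)
    draw.text((4, 4), text, fill=0, font=ImageFont.load_default())
    return TF.to_tensor(canvas)  # shape (1, 32, 128), values in [0, 1]


class GlyphEncoder(nn.Module):
    """Toy shared encoder; the paper's glyph feature network is far richer."""

    def __init__(self, dim: int = 128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.proj = nn.Linear(64, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feat = self.conv(x).flatten(1)
        return F.normalize(self.proj(feat), dim=-1)  # unit-length embeddings


@torch.no_grad()
def rank_crops(query: str, crops: list, encoder: GlyphEncoder):
    """Score detected text-region crops against the rendered query."""
    q = encoder(render_query(query).unsqueeze(0))  # (1, dim)
    c = encoder(torch.stack(crops))                # (N, dim)
    scores = (c @ q.T).squeeze(1)                  # cosine similarity
    return scores.argsort(descending=True), scores


if __name__ == "__main__":
    encoder = GlyphEncoder().eval()
    fake_crops = [torch.rand(1, 32, 128) for _ in range(5)]  # stand-in detections
    order, scores = rank_crops("coffee", fake_crops, encoder)
    print(order.tolist(), scores.tolist())
```

Because the query side is just rendered pixels, supporting an unseen character set only requires a font that covers it, with no retrained text embedding; this matches the intuition behind the zero-shot generalization noted in the abstract.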