视觉匹配足以实现场景文本检索

Proceedings of the Sixteenth ACM International Conference on Web Search and Data Mining Pub Date : 2023-02-27 DOI:10.1145/3539597.3570428

L. Wen, Yingrong Wang, Dongxiang Zhang, Gang Chen

{"title":"视觉匹配足以实现场景文本检索","authors":"L. Wen, Yingrong Wang, Dongxiang Zhang, Gang Chen","doi":"10.1145/3539597.3570428","DOIUrl":null,"url":null,"abstract":"Given a text query, the task of scene text retrieval aims at searching and localizing all the text instances that are contained in an image gallery. The state-of-the-art method learns a cross-modal similarity between the query text and the detected text regions in natural images to facilitate retrieval. However, this cross-modal approach still cannot well bridge the heterogeneity gap between the text and image modalities. In this paper, we propose a new paradigm that converts the task into a single-modality retrieval problem. Unlike previous works that rely on character recognition or embedding, we directly leverage pictorial information by rendering query text into images to learn the glyph feature of each character, which can be utilized to capture the similarity between query and scene text images. With the extracted visual features, we devise a synthetic label image guided feature alignment mechanism that is robust to different scene text styles and layouts. The modules of glyph feature learning, text instance detection, and visual matching are jointly trained in an end-to-end framework. Experimental results show that our proposed paradigm achieves the best performance in multiple benchmark datasets. As a side product, our method can also be easily generalized to support text queries with unseen characters or languages in a zero-shot manner.","PeriodicalId":227804,"journal":{"name":"Proceedings of the Sixteenth ACM International Conference on Web Search and Data Mining","volume":"8 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-02-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":"{\"title\":\"Visual Matching is Enough for Scene Text Retrieval\",\"authors\":\"L. Wen, Yingrong Wang, Dongxiang Zhang, Gang Chen\",\"doi\":\"10.1145/3539597.3570428\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Given a text query, the task of scene text retrieval aims at searching and localizing all the text instances that are contained in an image gallery. The state-of-the-art method learns a cross-modal similarity between the query text and the detected text regions in natural images to facilitate retrieval. However, this cross-modal approach still cannot well bridge the heterogeneity gap between the text and image modalities. In this paper, we propose a new paradigm that converts the task into a single-modality retrieval problem. Unlike previous works that rely on character recognition or embedding, we directly leverage pictorial information by rendering query text into images to learn the glyph feature of each character, which can be utilized to capture the similarity between query and scene text images. With the extracted visual features, we devise a synthetic label image guided feature alignment mechanism that is robust to different scene text styles and layouts. The modules of glyph feature learning, text instance detection, and visual matching are jointly trained in an end-to-end framework. Experimental results show that our proposed paradigm achieves the best performance in multiple benchmark datasets. As a side product, our method can also be easily generalized to support text queries with unseen characters or languages in a zero-shot manner.\",\"PeriodicalId\":227804,\"journal\":{\"name\":\"Proceedings of the Sixteenth ACM International Conference on Web Search and Data Mining\",\"volume\":\"8 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2023-02-27\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"2\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the Sixteenth ACM International Conference on Web Search and Data Mining\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3539597.3570428\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the Sixteenth ACM International Conference on Web Search and Data Mining","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3539597.3570428","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 2

摘要

给定文本查询，场景文本检索任务的目标是搜索和定位图像库中包含的所有文本实例。最先进的方法学习查询文本和自然图像中检测到的文本区域之间的跨模态相似性，以方便检索。然而，这种跨模态方法仍然不能很好地弥合文本和图像模态之间的异质性差距。在本文中，我们提出了一个新的范式，将任务转换为单模态检索问题。与以往依赖字符识别或嵌入的工作不同，我们直接利用图像信息，将查询文本渲染成图像，学习每个字符的字形特征，利用这些特征可以捕获查询和场景文本图像之间的相似度。利用提取的视觉特征，设计了一种对不同场景文本样式和布局具有鲁棒性的合成标签图像引导特征对齐机制。在端到端框架中，对字形特征学习、文本实例检测和视觉匹配模块进行联合训练。实验结果表明，我们提出的范式在多个基准数据集上达到了最佳性能。作为副产品，我们的方法还可以很容易地推广到以零射击方式支持包含未见过的字符或语言的文本查询。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Visual Matching is Enough for Scene Text Retrieval

Given a text query, the task of scene text retrieval aims at searching and localizing all the text instances that are contained in an image gallery. The state-of-the-art method learns a cross-modal similarity between the query text and the detected text regions in natural images to facilitate retrieval. However, this cross-modal approach still cannot well bridge the heterogeneity gap between the text and image modalities. In this paper, we propose a new paradigm that converts the task into a single-modality retrieval problem. Unlike previous works that rely on character recognition or embedding, we directly leverage pictorial information by rendering query text into images to learn the glyph feature of each character, which can be utilized to capture the similarity between query and scene text images. With the extracted visual features, we devise a synthetic label image guided feature alignment mechanism that is robust to different scene text styles and layouts. The modules of glyph feature learning, text instance detection, and visual matching are jointly trained in an end-to-end framework. Experimental results show that our proposed paradigm achieves the best performance in multiple benchmark datasets. As a side product, our method can also be easily generalized to support text queries with unseen characters or languages in a zero-shot manner.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Proceedings of the Sixteenth ACM International Conference on Web Search and Data Mining

自引率

0.00%

发文量