Cps-STS: Bridging the Gap Between Content and Position for Coarse-Point-Supervised Scene Text Spotter

IF 9.7 | Zone 1 (Computer Science) | JCR Q1: COMPUTER SCIENCE, INFORMATION SYSTEMS

IEEE Transactions on Multimedia | Pub Date: 2025-04-04 | DOI: 10.1109/TMM.2024.3521756

Weida Chen; Jie Jiang; Linfei Wang; Huafeng Li; Yibing Zhan; Dapeng Tao

Volume 27, pages 1652-1664 | Journal Article | Not open access | https://ieeexplore.ieee.org/document/10949660/

Citations: 0


Cps-STS: Bridging the Gap Between Content and Position for Coarse-Point-Supervised Scene Text Spotter

Weakly supervised methods for scene text spotting have recently become popular with researchers because they can significantly reduce dataset annotation effort. The latest progress in this field is text spotters based on single-point or multi-point annotations. However, these methods are sensitive to the precise location of the annotation and fail to capture the relative positions and shapes of characters, which impairs recognition of text with extensive rotations and flips. To address these challenges, this paper develops a novel method named Coarse-point-supervised Scene Text Spotter (Cps-STS). Cps-STS first uses a few approximate points as text location labels and introduces a learnable position modulation mechanism, which eases the accuracy requirements for annotations and enhances model robustness. Additionally, we incorporate a Spatial Compatibility Attention (SCA) module for text decoding to make effective use of spatial information such as position and shape. This module fuses compound queries with global feature maps, and the fused features serve as a bias in the SCA module to express the spatial morphology of the text. To locate and decode text content accurately, we introduce features containing spatial morphology information and text content into the input features of the text decoder. Ablation experiments demonstrate that introducing the spatial morphology features as bias terms in the text decoder enables the model to identify and exploit the relationship between text content and position, which improves recognition performance. One significant advantage of Cps-STS is that it achieves performance on par with full supervision using only a few imprecise coarse points, at a low cost. Extensive experiments validate the effectiveness of Cps-STS and its superiority over existing approaches.
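The abstract says that spatial features enter the decoder's attention as a bias but does not give the formulation. As a rough illustration only, the sketch below shows generic scaled dot-product attention with an additive bias on the logits. It is not the authors' SCA module: the function names, the shapes, and the hand-written bias values are all assumptions made for this example.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def biased_attention(queries, keys, values, bias):
    """Scaled dot-product attention with an additive bias on the logits.

    queries: list of n_q vectors of length d
    keys:    list of n_k vectors of length d
    values:  list of n_k vectors of length d_v
    bias:    n_q x n_k matrix; a hypothetical stand-in for a spatial
             term (assumption, not taken from the paper)
    Returns (outputs, weights).
    """
    d = len(queries[0])
    scale = 1.0 / math.sqrt(d)
    outputs, weights = [], []
    for i, q in enumerate(queries):
        # Content term (dot product) plus the additive bias term.
        logits = [
            scale * sum(qc * kc for qc, kc in zip(q, k)) + bias[i][j]
            for j, k in enumerate(keys)
        ]
        w = softmax(logits)
        out = [
            sum(w[j] * values[j][c] for j in range(len(values)))
            for c in range(len(values[0]))
        ]
        outputs.append(out)
        weights.append(w)
    return outputs, weights


# Toy example: one query and three identical keys, so the content term
# alone cannot distinguish the three positions.
queries = [[1.0, 0.0]]
keys = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

_, w_plain = biased_attention(queries, keys, values, [[0.0, 0.0, 0.0]])
_, w_biased = biased_attention(queries, keys, values, [[0.0, 4.0, 0.0]])
print(w_plain[0])   # uniform weights: content alone is ambiguous
print(w_biased[0])  # most weight moves to the position favoured by the bias
```

With a zero bias the three positions receive equal weight. A positive bias on the second position shifts most of the weight there, which is the general way an additive term can inject position or shape cues that the content features do not carry.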


Source journal: IEEE Transactions on Multimedia (Engineering & Technology: Telecommunications)

CiteScore: 11.70

Self-citation rate: 11.00%

Articles published: 576

Review time: 5.5 months

Journal description: The IEEE Transactions on Multimedia delves into diverse aspects of multimedia technology and applications, covering circuits, networking, signal processing, systems, software, and systems integration. The scope aligns with the Fields of Interest of the sponsors, ensuring a comprehensive exploration of research in multimedia.