{"title":"用于工业容器标识和自然场景文本识别的维数解耦视觉语言转换器","authors":"Zhangzhao Liang , Ying Xu , Yikui Zhai , Hufei Zhu , Jiangtao Xi , Pasquale Coscia , Angelo Genovese","doi":"10.1016/j.jii.2025.100926","DOIUrl":null,"url":null,"abstract":"<div><div>The exceptional performance of Deep Learning has significantly advanced the widespread application of text spotting across various downstream tasks, such as electronic document recognition, traffic sign recognition, and bill number recognition. A challenging and crucial application is Container Marking Text Spotting (CMTS), which aims to swiftly capture logistics information from container surfaces and enhance the overall operational efficiency of logistics systems. Unlike the extensively studied natural scene text, text on container markings typically comprises contextless texts (such as ”45G1”, ”CWFU 1810810”), presenting a unique spotting challenge. In addition, part of the vertical text and the widely used non-end-to-end models in this field also limit the performance of container text spotting. Overall, due to the lack of further research in the task of container text spotting, the performance of the current model is unsatisfactory. This greatly affects the intelligence and informatization of the container industry. Therefore, there is an urgent need for a high-performance and easy to deploy method to improve the spotting accuracy of container surface text. This can not only effectively reduce the cost of obtaining container information, but also improve the overall intelligence level of the industry. In this paper, we propose a Dimension Decoupling Vision-Language Transformer (DVLT) for achieving high-performance in CMTS tasks. To address the challenges of contextless texts, our approach incorporates a Semantic Augmentation Module that leverages prior knowledge without adding computational overhead during inference. Additionally, we introduce center-line proposals to enhance the model’s adaptability to vertical text. 
Finally, DVLT improves the model’s comprehensive text spotting capabilities through a novel Dimension Decoupling Decoder. DVLT is a completely end-to-end text spotting transformer, which achieved state-of-the-art on the CMTS task (dataset publicly available) and also demonstrated competitive results on well-known benchmarks such as CTW1500, ICDAR2015 and Total-Text. The code and dataset are available at: <span><span>https://github.com/yikuizhai/DVLT</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":55975,"journal":{"name":"Journal of Industrial Information Integration","volume":"48 ","pages":"Article 100926"},"PeriodicalIF":10.4000,"publicationDate":"2025-08-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Dimension Decoupling Vision-Language Transformer for industrial container marking and natural scene text spotting\",\"authors\":\"Zhangzhao Liang , Ying Xu , Yikui Zhai , Hufei Zhu , Jiangtao Xi , Pasquale Coscia , Angelo Genovese\",\"doi\":\"10.1016/j.jii.2025.100926\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>The exceptional performance of Deep Learning has significantly advanced the widespread application of text spotting across various downstream tasks, such as electronic document recognition, traffic sign recognition, and bill number recognition. A challenging and crucial application is Container Marking Text Spotting (CMTS), which aims to swiftly capture logistics information from container surfaces and enhance the overall operational efficiency of logistics systems. Unlike the extensively studied natural scene text, text on container markings typically comprises contextless texts (such as ”45G1”, ”CWFU 1810810”), presenting a unique spotting challenge. In addition, part of the vertical text and the widely used non-end-to-end models in this field also limit the performance of container text spotting. 
Overall, due to the lack of further research in the task of container text spotting, the performance of the current model is unsatisfactory. This greatly affects the intelligence and informatization of the container industry. Therefore, there is an urgent need for a high-performance and easy to deploy method to improve the spotting accuracy of container surface text. This can not only effectively reduce the cost of obtaining container information, but also improve the overall intelligence level of the industry. In this paper, we propose a Dimension Decoupling Vision-Language Transformer (DVLT) for achieving high-performance in CMTS tasks. To address the challenges of contextless texts, our approach incorporates a Semantic Augmentation Module that leverages prior knowledge without adding computational overhead during inference. Additionally, we introduce center-line proposals to enhance the model’s adaptability to vertical text. Finally, DVLT improves the model’s comprehensive text spotting capabilities through a novel Dimension Decoupling Decoder. DVLT is a completely end-to-end text spotting transformer, which achieved state-of-the-art on the CMTS task (dataset publicly available) and also demonstrated competitive results on well-known benchmarks such as CTW1500, ICDAR2015 and Total-Text. 
The code and dataset are available at: <span><span>https://github.com/yikuizhai/DVLT</span><svg><path></path></svg></span>.</div></div>\",\"PeriodicalId\":55975,\"journal\":{\"name\":\"Journal of Industrial Information Integration\",\"volume\":\"48 \",\"pages\":\"Article 100926\"},\"PeriodicalIF\":10.4000,\"publicationDate\":\"2025-08-20\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Journal of Industrial Information Integration\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S2452414X25001499\",\"RegionNum\":1,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Industrial Information Integration","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S2452414X25001499","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS","Score":null,"Total":0}
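The abstract does not specify how the Dimension Decoupling Decoder is built. As a general-concept illustration only (not the authors' actual design), "decoupling" attention across spatial dimensions is often realized as axial attention: full 2D self-attention over an H×W feature map is replaced by attention along rows, then along columns, cutting the score matrix from (H·W)² entries to H·W·(H+W). A minimal NumPy sketch, with the features themselves standing in for learned Q/K/V projections:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def axis_attention(x, axis):
    """Self-attention among positions along `axis` of a (H, W, C) map,
    computed independently for each position on the other axis.
    Q = K = V = x for brevity (no learned projections in this sketch)."""
    x_m = np.moveaxis(x, axis, 1)                 # (M, L, C): L = attended axis
    scores = x_m @ x_m.transpose(0, 2, 1)         # (M, L, L) pairwise scores
    scores /= np.sqrt(x.shape[-1])                # scale by sqrt(C)
    out = softmax(scores, axis=-1) @ x_m          # (M, L, C) weighted values
    return np.moveaxis(out, 1, axis)              # back to (H, W, C)

H, W, C = 8, 16, 32
feats = np.random.default_rng(0).normal(size=(H, W, C))

# Decoupled 2D attention: attend along width (rows), then along height
# (columns) -- O(H*W*(H+W)) score entries instead of O((H*W)**2).
out = axis_attention(axis_attention(feats, axis=1), axis=0)
assert out.shape == (H, W, C)
```

Each output position still aggregates information from its entire row and column, and stacking both passes gives a full 2D receptive field at far lower cost; whether DVLT's decoder follows this particular factorization is an assumption here, not a claim from the paper.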
Journal Introduction:
The Journal of Industrial Information Integration focuses on the industry's transition toward industrial integration and informatization, covering not only hardware and software but also information integration. It serves as an interdisciplinary forum for researchers, practitioners, and policy makers, promoting advances in industrial information integration and addressing its challenges, open issues, and solutions.
The Journal of Industrial Information Integration welcomes papers on foundational, technical, and practical aspects of industrial information integration, emphasizing the complex and cross-disciplinary topics that arise in industrial integration. Techniques from mathematical science, computer science, computer engineering, electrical and electronic engineering, manufacturing engineering, and engineering management are crucial in this context.