{"title":"用于工业容器标识和自然场景文本识别的维数解耦视觉语言转换器","authors":"Zhangzhao Liang , Ying Xu , Yikui Zhai , Hufei Zhu , Jiangtao Xi , Pasquale Coscia , Angelo Genovese","doi":"10.1016/j.jii.2025.100926","DOIUrl":null,"url":null,"abstract":"<div><div>The exceptional performance of Deep Learning has significantly advanced the widespread application of text spotting across various downstream tasks, such as electronic document recognition, traffic sign recognition, and bill number recognition. A challenging and crucial application is Container Marking Text Spotting (CMTS), which aims to swiftly capture logistics information from container surfaces and enhance the overall operational efficiency of logistics systems. Unlike the extensively studied natural scene text, text on container markings typically comprises contextless texts (such as ”45G1”, ”CWFU 1810810”), presenting a unique spotting challenge. In addition, part of the vertical text and the widely used non-end-to-end models in this field also limit the performance of container text spotting. Overall, due to the lack of further research in the task of container text spotting, the performance of the current model is unsatisfactory. This greatly affects the intelligence and informatization of the container industry. Therefore, there is an urgent need for a high-performance and easy to deploy method to improve the spotting accuracy of container surface text. This can not only effectively reduce the cost of obtaining container information, but also improve the overall intelligence level of the industry. In this paper, we propose a Dimension Decoupling Vision-Language Transformer (DVLT) for achieving high-performance in CMTS tasks. To address the challenges of contextless texts, our approach incorporates a Semantic Augmentation Module that leverages prior knowledge without adding computational overhead during inference. Additionally, we introduce center-line proposals to enhance the model’s adaptability to vertical text. 
Finally, DVLT improves the model’s comprehensive text spotting capabilities through a novel Dimension Decoupling Decoder. DVLT is a completely end-to-end text spotting transformer, which achieved state-of-the-art on the CMTS task (dataset publicly available) and also demonstrated competitive results on well-known benchmarks such as CTW1500, ICDAR2015 and Total-Text. The code and dataset are available at: <span><span>https://github.com/yikuizhai/DVLT</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":55975,"journal":{"name":"Journal of Industrial Information Integration","volume":"48 ","pages":"Article 100926"},"PeriodicalIF":10.4000,"publicationDate":"2025-08-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Dimension Decoupling Vision-Language Transformer for industrial container marking and natural scene text spotting\",\"authors\":\"Zhangzhao Liang , Ying Xu , Yikui Zhai , Hufei Zhu , Jiangtao Xi , Pasquale Coscia , Angelo Genovese\",\"doi\":\"10.1016/j.jii.2025.100926\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>The exceptional performance of Deep Learning has significantly advanced the widespread application of text spotting across various downstream tasks, such as electronic document recognition, traffic sign recognition, and bill number recognition. A challenging and crucial application is Container Marking Text Spotting (CMTS), which aims to swiftly capture logistics information from container surfaces and enhance the overall operational efficiency of logistics systems. Unlike the extensively studied natural scene text, text on container markings typically comprises contextless texts (such as ”45G1”, ”CWFU 1810810”), presenting a unique spotting challenge. In addition, part of the vertical text and the widely used non-end-to-end models in this field also limit the performance of container text spotting. 
Overall, due to the lack of further research in the task of container text spotting, the performance of the current model is unsatisfactory. This greatly affects the intelligence and informatization of the container industry. Therefore, there is an urgent need for a high-performance and easy to deploy method to improve the spotting accuracy of container surface text. This can not only effectively reduce the cost of obtaining container information, but also improve the overall intelligence level of the industry. In this paper, we propose a Dimension Decoupling Vision-Language Transformer (DVLT) for achieving high-performance in CMTS tasks. To address the challenges of contextless texts, our approach incorporates a Semantic Augmentation Module that leverages prior knowledge without adding computational overhead during inference. Additionally, we introduce center-line proposals to enhance the model’s adaptability to vertical text. Finally, DVLT improves the model’s comprehensive text spotting capabilities through a novel Dimension Decoupling Decoder. DVLT is a completely end-to-end text spotting transformer, which achieved state-of-the-art on the CMTS task (dataset publicly available) and also demonstrated competitive results on well-known benchmarks such as CTW1500, ICDAR2015 and Total-Text. 
The code and dataset are available at: <span><span>https://github.com/yikuizhai/DVLT</span><svg><path></path></svg></span>.</div></div>\",\"PeriodicalId\":55975,\"journal\":{\"name\":\"Journal of Industrial Information Integration\",\"volume\":\"48 \",\"pages\":\"Article 100926\"},\"PeriodicalIF\":10.4000,\"publicationDate\":\"2025-08-20\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Journal of Industrial Information Integration\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S2452414X25001499\",\"RegionNum\":1,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Industrial Information Integration","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S2452414X25001499","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS","Score":null,"Total":0}
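The abstract does not specify how the Dimension Decoupling Decoder is built. As a general-concept illustration only (not the authors' actual design), "decoupling" attention across spatial dimensions is often realized as axial attention: full 2D self-attention over an H×W feature map is replaced by attention along rows, then along columns, cutting the score matrix from (H·W)² entries to H·W·(H+W). A minimal NumPy sketch, with the features themselves standing in for learned Q/K/V projections:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def axis_attention(x, axis):
    """Self-attention among positions along `axis` of a (H, W, C) map,
    computed independently for each position on the other axis.
    Q = K = V = x for brevity (no learned projections in this sketch)."""
    x_m = np.moveaxis(x, axis, 1)                 # (M, L, C): L = attended axis
    scores = x_m @ x_m.transpose(0, 2, 1)         # (M, L, L) pairwise scores
    scores /= np.sqrt(x.shape[-1])                # scale by sqrt(C)
    out = softmax(scores, axis=-1) @ x_m          # (M, L, C) weighted values
    return np.moveaxis(out, 1, axis)              # back to (H, W, C)

H, W, C = 8, 16, 32
feats = np.random.default_rng(0).normal(size=(H, W, C))

# Decoupled 2D attention: attend along width (rows), then along height
# (columns) -- O(H*W*(H+W)) score entries instead of O((H*W)**2).
out = axis_attention(axis_attention(feats, axis=1), axis=0)
assert out.shape == (H, W, C)
```

Each output position still aggregates information from its entire row and column, and stacking both passes gives a full 2D receptive field at far lower cost; whether DVLT's decoder follows this particular factorization is an assumption here, not a claim from the paper.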
Journal Introduction:
The Journal of Industrial Information Integration focuses on the industry's transition toward industrial integration and informatization, covering not only hardware and software but also information integration. It serves as an interdisciplinary forum for researchers, practitioners, and policy makers, promoting advances in industrial information integration and addressing its challenges, open issues, and solutions.
The Journal of Industrial Information Integration welcomes papers on foundational, technical, and practical aspects of industrial information integration, emphasizing the complex and cross-disciplinary topics that arise in industrial integration. Techniques from mathematical science, computer science, computer engineering, electrical and electronic engineering, manufacturing engineering, and engineering management are crucial in this context.