TransTab: A transformer-based approach for table detection and tabular data extraction from scanned document images

Yongzhou Wang, Wenliang Lv, Weijie Wu, Guanheng Xie, BiBo Lu, ChunYang Wang, Chao Zhan, Baishun Su

Machine Learning with Applications, Volume 20, Article 100665 (published 2025-05-08). DOI: 10.1016/j.mlwa.2025.100665
Table detection and content extraction are crucial tasks in document analysis. Traditional convolutional neural network (CNN) methods often face limitations when dealing with complex tables, such as cross-column, cross-row, and multi-dimensional tables. Although existing methods perform well on simpler tables, their effectiveness often falls short of practical application needs for complex layouts. The structural intricacy of tables requires more advanced recognition and extraction strategies, particularly for the precise localization and extraction of rows and columns. To address the shortcomings of traditional methods in handling complex table structures, this paper proposes an end-to-end document table detection and content extraction method based on the Transformer, named TransTab. TransTab overcomes the limitations of traditional CNN approaches by incorporating a Vision Transformer (ViT) into the table recognition task, enabling it to handle complex table structures that span columns and rows. The self-attention mechanism of the ViT allows the model to capture long-range dependencies within the table, yielding high accuracy in detecting table boundaries, cell separations, and internal table structures. The paper also introduces separate modules for table detection and column detection, responsible for recognizing the overall table structure and accurately positioning columns, respectively. Through this modular design, the model adapts better to tables with diverse, complex layouts, improving its ability to process intricate tables. Finally, EasyOCR is employed to extract text from the detected tables. Experimental results demonstrate that TransTab outperforms state-of-the-art methods across several metrics. This research provides a novel solution for the automatic recognition and processing of document tables, paving the way for future developments in document analysis tasks.
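The abstract describes a detect-then-extract pipeline: a transformer-based detector localizes tables (and their columns) in a scanned page, and OCR then reads the text inside each detected region. TransTab's own code is not available here, so the following is a minimal, hypothetical sketch of that style of pipeline built from publicly available stand-ins: Microsoft's Table Transformer checkpoint ("microsoft/table-transformer-detection") via the Hugging Face transformers library for detection, and EasyOCR for text extraction. The checkpoint name, score threshold, and helper functions are illustrative assumptions and are not taken from the paper; TransTab's ViT-based table and column detection modules would replace the detection stage shown below.

```python
# Illustrative sketch only: this is NOT TransTab. It uses public stand-ins
# (Table Transformer + EasyOCR) to mirror the detect-then-extract flow the
# abstract describes.
import numpy as np
import torch
from PIL import Image
from transformers import AutoImageProcessor, TableTransformerForObjectDetection
import easyocr


def detect_tables(image_path, score_threshold=0.8):
    """Detect table bounding boxes in a scanned page with a DETR-style transformer."""
    image = Image.open(image_path).convert("RGB")
    processor = AutoImageProcessor.from_pretrained("microsoft/table-transformer-detection")
    model = TableTransformerForObjectDetection.from_pretrained("microsoft/table-transformer-detection")

    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # Convert raw logits/boxes to absolute-pixel boxes above the threshold.
    target_sizes = torch.tensor([image.size[::-1]])  # PIL size is (w, h); model wants (h, w)
    results = processor.post_process_object_detection(
        outputs, threshold=score_threshold, target_sizes=target_sizes
    )[0]
    return image, [box.tolist() for box in results["boxes"]]


def extract_table_text(image, boxes, languages=("en",)):
    """Crop each detected table region and read its text with EasyOCR."""
    reader = easyocr.Reader(list(languages))
    texts = []
    for (x0, y0, x1, y1) in boxes:
        crop = image.crop((x0, y0, x1, y1))
        # detail=0 returns plain text strings instead of (box, text, confidence) triples.
        texts.append(reader.readtext(np.array(crop), detail=0))
    return texts


if __name__ == "__main__":
    page_image, table_boxes = detect_tables("scanned_page.png")  # hypothetical input file
    for i, rows in enumerate(extract_table_text(page_image, table_boxes)):
        print(f"Table {i}: {rows}")
```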