TransTab: A transformer-based approach for table detection and tabular data extraction from scanned document images

Yongzhou Wang, Wenliang Lv, Weijie Wu, Guanheng Xie, BiBo Lu, ChunYang Wang, Chao Zhan, Baishun Su

Machine Learning with Applications, Volume 20, Article 100665 (published 2025-05-08). DOI: 10.1016/j.mlwa.2025.100665
Table detection and content extraction are crucial tasks in document analysis. Traditional convolutional neural network (CNN) methods often face limitations when dealing with complex tables, such as cross-column, cross-row, and multi-dimensional tables. Although existing methods perform well on simpler tables, their effectiveness often falls short of practical application needs for complex layouts. The structural intricacy of tables requires more advanced recognition and extraction strategies, particularly for the precise localization and extraction of rows and columns. To address the shortcomings of traditional methods in handling complex table structures, this paper proposes an end-to-end document table detection and content extraction method based on the Transformer, named TransTab. TransTab overcomes the limitations of traditional CNN approaches by incorporating a Vision Transformer (ViT) into the table recognition task, enabling it to handle complex table structures that span columns and rows. The self-attention mechanism of the ViT allows the model to capture long-range dependencies within the table, yielding high accuracy in detecting table boundaries, cell separations, and internal table structures. The paper also introduces separate modules for table detection and column detection, responsible for recognizing the overall table structure and accurately positioning columns, respectively. Through this modular design, the model adapts better to tables with diverse, complex layouts, improving its ability to process intricate tables. Finally, EasyOCR is employed to extract text from the detected tables. Experimental results demonstrate that TransTab outperforms state-of-the-art methods across several metrics. This research provides a novel solution for the automatic recognition and processing of document tables, paving the way for future developments in document analysis tasks.
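The abstract describes a detect-then-extract pipeline: a transformer-based detector localizes tables (and their columns) in a scanned page, and OCR then reads the text inside each detected region. TransTab's own code is not available here, so the following is a minimal, hypothetical sketch of that style of pipeline built from publicly available stand-ins: Microsoft's Table Transformer checkpoint ("microsoft/table-transformer-detection") via the Hugging Face transformers library for detection, and EasyOCR for text extraction. The checkpoint name, score threshold, and helper functions are illustrative assumptions and are not taken from the paper; TransTab's ViT-based table and column detection modules would replace the detection stage shown below.

```python
# Illustrative sketch only: this is NOT TransTab. It uses public stand-ins
# (Table Transformer + EasyOCR) to mirror the detect-then-extract flow the
# abstract describes.
import numpy as np
import torch
from PIL import Image
from transformers import AutoImageProcessor, TableTransformerForObjectDetection
import easyocr


def detect_tables(image_path, score_threshold=0.8):
    """Detect table bounding boxes in a scanned page with a DETR-style transformer."""
    image = Image.open(image_path).convert("RGB")
    processor = AutoImageProcessor.from_pretrained("microsoft/table-transformer-detection")
    model = TableTransformerForObjectDetection.from_pretrained("microsoft/table-transformer-detection")

    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # Convert raw logits/boxes to absolute-pixel boxes above the threshold.
    target_sizes = torch.tensor([image.size[::-1]])  # PIL size is (w, h); model wants (h, w)
    results = processor.post_process_object_detection(
        outputs, threshold=score_threshold, target_sizes=target_sizes
    )[0]
    return image, [box.tolist() for box in results["boxes"]]


def extract_table_text(image, boxes, languages=("en",)):
    """Crop each detected table region and read its text with EasyOCR."""
    reader = easyocr.Reader(list(languages))
    texts = []
    for (x0, y0, x1, y1) in boxes:
        crop = image.crop((x0, y0, x1, y1))
        # detail=0 returns plain text strings instead of (box, text, confidence) triples.
        texts.append(reader.readtext(np.array(crop), detail=0))
    return texts


if __name__ == "__main__":
    page_image, table_boxes = detect_tables("scanned_page.png")  # hypothetical input file
    for i, rows in enumerate(extract_table_text(page_image, table_boxes)):
        print(f"Table {i}: {rows}")
```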