Efficient extraction of experimental data from line charts using advanced machine learning techniques
Authors: Wenjin Yang, Jie He, Xiaotong Zhang
Journal: Graphical Models, volume 139, Article 101259
Publication date: 2025-03-25
Publication type: Journal Article
DOI: 10.1016/j.gmod.2025.101259
URL: https://www.sciencedirect.com/science/article/pii/S1524070325000062
Impact factor: 2.5
JCR: Q2 (Computer Science, Software Engineering)
Open access: no
Citation count: 0
Abstract
Line charts are a common data visualization tool in scientific research and business analysis, and they encapsulate rich experimental data. However, existing data extraction tools suffer from low levels of automation and have difficulty handling complex charts. This paper proposes a novel method for extracting data from line charts that reformulates the extraction problem as an instance segmentation task. It introduces a Mamba-enhanced Transformer mask query method, along with a curve mask-guided training approach, to address challenges in curve detection such as long-range dependencies and intersecting curves. Additionally, YOLOv9 is used to detect and classify chart elements, and a text recognition dataset comprising approximately 100K charts is constructed. An LSTM-based attention mechanism is employed for precise scale value recognition. Lastly, we present a method for automatically converting image data into structured JSON data, significantly enhancing the efficiency and accuracy of data extraction. Experimental results demonstrate that the method handles complex charts efficiently and accurately, achieving an average extraction accuracy of 93% on public datasets and significantly surpassing current state-of-the-art methods. This research provides an efficient foundation for large-scale scientific data analysis and machine learning model development, advancing the field of automated data extraction technology.
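The abstract does not spell out how detected curves become structured JSON, so the following is only a minimal sketch of one plausible final step, not the authors' implementation. It assumes linear axes, that each axis has already been calibrated from two recognized tick values, and that each curve is available as a list of pixel coordinates. The function names (`linear_axis_map`, `curves_to_json`) and the JSON layout are hypothetical.

```python
import json


def linear_axis_map(pixel_a, value_a, pixel_b, value_b):
    """Map a pixel coordinate to a data value on a linear axis.

    Assumption: two recognized ticks (pixel position, numeric value)
    are enough to calibrate the axis.
    """
    if pixel_a == pixel_b:
        raise ValueError("calibration ticks must be at distinct pixel positions")
    scale = (value_b - value_a) / (pixel_b - pixel_a)
    return lambda pixel: value_a + (pixel - pixel_a) * scale


def curves_to_json(curves, x_ticks, y_ticks):
    """Convert per-curve pixel coordinates to data values and emit JSON.

    curves:  dict mapping a curve name to a list of (px, py) pixel pairs
    x_ticks: (pixel_a, value_a, pixel_b, value_b) for the x axis
    y_ticks: the same four values for the y axis
    """
    to_x = linear_axis_map(*x_ticks)
    to_y = linear_axis_map(*y_ticks)
    series = []
    for name, pixels in curves.items():
        points = [
            {"x": round(to_x(px), 6), "y": round(to_y(py), 6)}
            for px, py in pixels
        ]
        series.append({"name": name, "points": points})
    return json.dumps({"series": series}, indent=2)


# Hypothetical input: one curve traced at three pixel positions.
example_curves = {"curve_0": [(100, 400), (200, 300), (300, 200)]}

# x axis: pixel 100 is 0.0 and pixel 300 is 10.0.
# y axis: pixel 400 is 0.0 and pixel 200 is 1.0 (image rows grow downward).
result = curves_to_json(
    example_curves,
    x_ticks=(100, 0.0, 300, 10.0),
    y_ticks=(400, 0.0, 200, 1.0),
)
print(result)
```

Because image rows grow downward, the y-axis calibration naturally yields a negative scale, so no separate flip step is needed. Logarithmic axes or rotated charts would need a different mapping.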
About the journal:
Graphical Models is recognized internationally as a highly rated, top-tier journal and is focused on the creation, geometric processing, animation, and visualization of graphical models and on their applications in engineering, science, culture, and entertainment. GMOD provides its readers with thoroughly reviewed and carefully selected papers that disseminate exciting innovations, that teach rigorous theoretical foundations, that propose robust and efficient solutions, or that describe ambitious systems or applications in a variety of topics.
We invite papers in five categories: research (contributions of novel theoretical or practical approaches or solutions), survey (opinionated views of the state-of-the-art and challenges in a specific topic), system (the architecture and implementation details of an innovative architecture for a complete system that supports model/animation design, acquisition, analysis, visualization...), application (description of a novel application of known techniques and evaluation of its impact), or lecture (an elegant and inspiring perspective on previously published results that clarifies them and teaches them in a new way).
GMOD offers its authors an accelerated review, feedback from experts in the field, immediate online publication of accepted papers, no restriction on color and length (when justified by the content) in the online version, and a broad promotion of published papers. A prestigious group of editors selected from among the premier international researchers in their fields oversees the review process.