A Font Setting Based Bayesian Model to Extract Mathematical Expression in PDF Files

2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR). Pub Date: 2017-11-01. DOI: 10.1109/ICDAR.2017.129

Xing Wang, Jyh-Charn S. Liu


Citations: 10

Abstract

This paper proposes a Font Setting based Bayesian (FSB) model to extract mathematical expressions (MEs) from Portable Document Format (PDF) files. The FSB model is a self-adaptive, unsupervised algorithm that first uses rules to identify ME and non-ME (NME) content and then extracts the remaining MEs using Bayesian inference, based on the observation that MEs tend to be represented repeatedly in a particular style. PDF files are first processed with a PDF parser, and the document layout is analyzed with a projection-profile-cutting algorithm to detect columns and lines. Heuristic rules derived from knowledge of math usage and writing practices are employed to reason about the posterior probability of a character being ME vs. NME, conditional upon its font and value information. Based on the character-level posterior probabilities, Bayesian inference is used to infer whether a non-separable character set (NSCS) is an ME. Consecutive (fragmented) ME NSCSs are merged to produce the final results. Experimental results show that our approach achieves a false rate of 0.006 (0.135) and a miss rate of 0.111 (0.093) for IME (EME) extraction. For NSCS classification, our approach achieves 93.1% precision, 90.5% recall, and an F1 score of 0.918. The processing time is markedly shorter than that of supervised machine learning techniques, and the extracted information and analytics products can be used for high-level applications.
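The abstract does not give the model's formulas, so the following Python sketch is only one plausible reading of the three inference steps it names: estimating a character-level posterior from rule-labeled characters, classifying an NSCS from those posteriors, and merging consecutive ME NSCSs. Everything specific here is an assumption made for illustration: the function names, the use of font alone (the paper also conditions on the character value), the naive-Bayes independence between characters, the uniform prior, the Laplace smoothing, and the example font names.

```python
from collections import Counter
from math import log


def font_posteriors(seed_chars, prior_me=0.5, alpha=1.0):
    """Estimate P(ME | font) from rule-labeled characters.

    seed_chars: iterable of (font_name, is_me) pairs produced by the
    heuristic rules. Laplace smoothing (alpha) is an assumption.
    """
    me, nme = Counter(), Counter()
    for font, is_me in seed_chars:
        (me if is_me else nme)[font] += 1
    fonts = set(me) | set(nme)
    n_me, n_nme = sum(me.values()), sum(nme.values())
    post = {}
    for f in fonts:
        # Smoothed likelihoods P(font | ME) and P(font | NME).
        p_f_me = (me[f] + alpha) / (n_me + alpha * len(fonts))
        p_f_nme = (nme[f] + alpha) / (n_nme + alpha * len(fonts))
        num = p_f_me * prior_me
        post[f] = num / (num + p_f_nme * (1.0 - prior_me))  # Bayes' rule
    return post


def nscs_is_me(fonts, post, default=0.5):
    """Classify one NSCS by summing per-character log-odds.

    Assumes characters are independent and the prior is uniform, so the
    NSCS log-odds is the sum of the character log-odds. Unseen fonts
    contribute nothing (posterior 0.5).
    """
    log_odds = sum(log(post.get(f, default) / (1.0 - post.get(f, default)))
                   for f in fonts)
    return log_odds > 0.0


def merge_me_runs(labels):
    """Return (start, end) ranges, end exclusive, of consecutive ME NSCSs."""
    runs, start = [], None
    for i, is_me in enumerate(labels):
        if is_me and start is None:
            start = i
        elif not is_me and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(labels)))
    return runs


# Illustrative seed labels: a math italic font mostly ME, a text font mostly NME.
seeds = ([("CMMI10", True)] * 8 + [("CMSY10", True)] * 3
         + [("CMR10", True)] * 2 + [("CMR10", False)] * 20)
post = font_posteriors(seeds)

# Each inner list holds the fonts of the characters in one NSCS on a line.
line = [["CMR10"] * 3, ["CMMI10", "CMSY10", "CMR10"], ["CMMI10"],
        ["CMR10"] * 4, ["CMSY10", "CMMI10"]]
labels = [nscs_is_me(nscs, post) for nscs in line]
me_ranges = merge_me_runs(labels)
```

In this toy input the second and third NSCSs are adjacent and both classified as ME, so they are merged into a single range, which mirrors the abstract's final step of joining fragmented MEs. Because the posteriors are estimated from the document's own rule-labeled characters, no external training data is needed, which is consistent with the abstract's description of the method as self-adaptive and unsupervised.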

