A Font Setting Based Bayesian Model to Extract Mathematical Expression in PDF Files

Xing Wang, Jyh-Charn S. Liu
Published in: 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR)
Publication date: 2017-11-01
DOI: 10.1109/ICDAR.2017.129 (https://doi.org/10.1109/ICDAR.2017.129)

Abstract

This paper proposes a Font Setting based Bayesian (FSB) model to extract mathematical expressions (MEs) from portable document format (PDF) files. The FSB model is a self-adaptive, unsupervised algorithm that first uses rules to identify MEs and non-MEs (NMEs), and then extracts the remaining MEs using Bayesian inference, based on the observation that MEs tend to be represented repeatedly in a particular style. PDF files are first processed with a PDF parser, and the document layout is analyzed with a projection-profile-cutting-based algorithm to detect columns and lines. Heuristic rules derived from knowledge of math usage and writing practices are employed to reason about the posterior probability of a character being ME vs. NME, conditional on its font and value information. Based on the character-level posterior probabilities, Bayesian inference is used to decide whether a non-separable character set (NSCS) is ME or not. Consecutive (fragmented) ME NSCSs are merged to produce the final results. Experimental results show that our approach achieves a false rate of 0.006 (0.135) and a miss rate of 0.111 (0.093) for IME (EME) extraction. For NSCS classification, our approach achieves 93.1% precision, 90.5% recall, and an F1 score of 0.918. The processing time is markedly shorter than that of supervised machine learning techniques, and the extracted information and analytics products can be used for high-level applications.
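
The abstract gives no formulas, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes a naive Bayes formulation in which rule-labelled "seed" characters supply font likelihoods, and the characters of an NSCS are treated as conditionally independent given the class. It uses the font only, although the abstract also conditions on the character value. All function names, the smoothing constant, the prior, and the font names (`CMMI10`, `Times-Roman`) are hypothetical.

```python
from collections import defaultdict
from math import exp, log


def font_likelihoods(seed, smoothing=1.0):
    """Estimate P(font | NME) and P(font | ME) from rule-labelled characters.

    seed: iterable of (font_name, is_me) pairs produced by the rule stage.
    Returns a dict mapping font_name -> (P(font | NME), P(font | ME)).
    Laplace smoothing is an assumption made for this sketch.
    """
    counts = defaultdict(lambda: [0, 0])  # font -> [NME count, ME count]
    for font, is_me in seed:
        counts[font][int(is_me)] += 1
    total_nme = sum(c[0] for c in counts.values())
    total_me = sum(c[1] for c in counts.values())
    n_fonts = len(counts)
    return {
        font: (
            (nme + smoothing) / (total_nme + smoothing * n_fonts),
            (me + smoothing) / (total_me + smoothing * n_fonts),
        )
        for font, (nme, me) in counts.items()
    }


def nscs_posterior(fonts, likelihoods, prior_me=0.5):
    """Posterior P(ME | fonts of the characters in one NSCS).

    Characters are combined in log-odds space under the independence
    assumption. Fonts never seen in the seed set are treated as
    uninformative (likelihood ratio of 1).
    """
    log_odds = log(prior_me / (1.0 - prior_me))
    for font in fonts:
        p_nme, p_me = likelihoods.get(font, (1.0, 1.0))
        log_odds += log(p_me) - log(p_nme)
    return 1.0 / (1.0 + exp(-log_odds))


# Hypothetical seed labels: a math italic font is mostly ME, a text font mostly NME.
seed = (
    [("CMMI10", True)] * 8
    + [("CMMI10", False)] * 1
    + [("Times-Roman", False)] * 20
    + [("Times-Roman", True)] * 1
)
lk = font_likelihoods(seed)
p_math = nscs_posterior(["CMMI10", "CMMI10"], lk)
p_text = nscs_posterior(["Times-Roman", "Times-Roman"], lk)
print(f"P(ME | two CMMI10 chars)      = {p_math:.3f}")
print(f"P(ME | two Times-Roman chars) = {p_text:.3f}")
```

In this sketch an NSCS would be labelled ME when its posterior exceeds a chosen threshold (0.5 is a natural default), after which adjacent ME NSCSs could be merged as the abstract describes. Because the likelihoods are re-estimated from each document's own rule-labelled characters, the scheme needs no training corpus, which matches the "self-adaptive unsupervised" description.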