Kenichi Iwatsuki, T. Sagara, T. Hara, Akiko Aizawa
{"title":"Detecting In-line Mathematical Expressions in Scientific Documents","authors":"Kenichi Iwatsuki, T. Sagara, T. Hara, Akiko Aizawa","doi":"10.1145/3103010.3121041","DOIUrl":null,"url":null,"abstract":"One of the issues in extracting natural language sentences from PDF documents is the identification of non-textual elements in a sentence. In this paper, we report our preliminary results on the identification of in-line mathematical expressions. We first construct a manually annotated corpus and apply conditional random field (CRF) for the math-zone identification using both layout features, such as font types, and linguistic features, such as context n-grams, obtained from PDF documents. Although our method is naive and uses a small amount of annotated training data, our method achieved an 88.95% F-measure compared with 22.81% for existing math OCR software.","PeriodicalId":200469,"journal":{"name":"Proceedings of the 2017 ACM Symposium on Document Engineering","volume":"21 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2017-08-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"22","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2017 ACM Symposium on Document Engineering","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3103010.3121041","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 22
Abstract
One of the issues in extracting natural language sentences from PDF documents is the identification of non-textual elements in a sentence. In this paper, we report our preliminary results on the identification of in-line mathematical expressions. We first construct a manually annotated corpus and apply conditional random field (CRF) for the math-zone identification using both layout features, such as font types, and linguistic features, such as context n-grams, obtained from PDF documents. Although our method is naive and uses a small amount of annotated training data, our method achieved an 88.95% F-measure compared with 22.81% for existing math OCR software.