科学文献中内联数学表达式的检测

Proceedings of the 2017 ACM Symposium on Document Engineering Pub Date : 2017-08-31 DOI:10.1145/3103010.3121041

Kenichi Iwatsuki, T. Sagara, T. Hara, Akiko Aizawa

{"title":"科学文献中内联数学表达式的检测","authors":"Kenichi Iwatsuki, T. Sagara, T. Hara, Akiko Aizawa","doi":"10.1145/3103010.3121041","DOIUrl":null,"url":null,"abstract":"One of the issues in extracting natural language sentences from PDF documents is the identification of non-textual elements in a sentence. In this paper, we report our preliminary results on the identification of in-line mathematical expressions. We first construct a manually annotated corpus and apply conditional random field (CRF) for the math-zone identification using both layout features, such as font types, and linguistic features, such as context n-grams, obtained from PDF documents. Although our method is naive and uses a small amount of annotated training data, our method achieved an 88.95% F-measure compared with 22.81% for existing math OCR software.","PeriodicalId":200469,"journal":{"name":"Proceedings of the 2017 ACM Symposium on Document Engineering","volume":"21 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2017-08-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"22","resultStr":"{\"title\":\"Detecting In-line Mathematical Expressions in Scientific Documents\",\"authors\":\"Kenichi Iwatsuki, T. Sagara, T. Hara, Akiko Aizawa\",\"doi\":\"10.1145/3103010.3121041\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"One of the issues in extracting natural language sentences from PDF documents is the identification of non-textual elements in a sentence. In this paper, we report our preliminary results on the identification of in-line mathematical expressions. We first construct a manually annotated corpus and apply conditional random field (CRF) for the math-zone identification using both layout features, such as font types, and linguistic features, such as context n-grams, obtained from PDF documents. Although our method is naive and uses a small amount of annotated training data, our method achieved an 88.95% F-measure compared with 22.81% for existing math OCR software.\",\"PeriodicalId\":200469,\"journal\":{\"name\":\"Proceedings of the 2017 ACM Symposium on Document Engineering\",\"volume\":\"21 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2017-08-31\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"22\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 2017 ACM Symposium on Document Engineering\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3103010.3121041\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2017 ACM Symposium on Document Engineering","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3103010.3121041","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 22

摘要

从PDF文档中提取自然语言句子的问题之一是句子中非文本元素的识别。在本文中，我们报告了我们的初步结果识别的在线数学表达式。我们首先构建一个手动注释的语料库，并使用从PDF文档获得的布局特征(如字体类型)和语言特征(如上下文n-grams)为数学区识别应用条件随机场(CRF)。虽然我们的方法很幼稚，使用了少量带注释的训练数据，但我们的方法达到了88.95%的F-measure，而现有数学OCR软件的F-measure为22.81%。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Detecting In-line Mathematical Expressions in Scientific Documents

One of the issues in extracting natural language sentences from PDF documents is the identification of non-textual elements in a sentence. In this paper, we report our preliminary results on the identification of in-line mathematical expressions. We first construct a manually annotated corpus and apply conditional random field (CRF) for the math-zone identification using both layout features, such as font types, and linguistic features, such as context n-grams, obtained from PDF documents. Although our method is naive and uses a small amount of annotated training data, our method achieved an 88.95% F-measure compared with 22.81% for existing math OCR software.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Proceedings of the 2017 ACM Symposium on Document Engineering

自引率

0.00%

发文量