Detecting In-line Mathematical Expressions in Scientific Documents

Proceedings of the 2017 ACM Symposium on Document Engineering. Pub Date: 2017-08-31. DOI: 10.1145/3103010.3121041

Kenichi Iwatsuki, T. Sagara, T. Hara, Akiko Aizawa

Citations: 22

Abstract

One of the issues in extracting natural language sentences from PDF documents is the identification of non-textual elements in a sentence. In this paper, we report our preliminary results on the identification of in-line mathematical expressions. We first construct a manually annotated corpus and then apply a conditional random field (CRF) to math-zone identification, using both layout features, such as font types, and linguistic features, such as context n-grams, obtained from PDF documents. Although our method is naive and uses only a small amount of annotated training data, it achieved an F-measure of 88.95%, compared with 22.81% for existing math OCR software.
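
To make the combination of layout and linguistic features concrete, the following is a minimal sketch of per-token feature extraction of the kind a CRF sequence tagger could consume. Everything in it is an assumption for illustration: the `Token` class, the font names, the feature names, the one-token context window, and the `B-MATH` / `I-MATH` / `O` label scheme are hypothetical and are not taken from the paper, which does not list its exact feature set here.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Token:
    text: str  # surface string extracted from the PDF
    font: str  # font name reported for the token (hypothetical values)


def token_features(tokens: List[Token], i: int, window: int = 1) -> Dict[str, object]:
    """Feature dict for token i: layout (font) cues plus context n-gram cues."""
    tok = tokens[i]
    feats: Dict[str, object] = {
        "bias": 1.0,
        "text.lower": tok.text.lower(),
        # Layout features (assumed): raw font name and a crude italic flag.
        "font": tok.font,
        "font.is_italic": "italic" in tok.font.lower(),
        "is_single_char": len(tok.text) == 1,
        "has_digit": any(c.isdigit() for c in tok.text),
    }
    # Context features: text and font of neighbours within the window.
    for d in range(1, window + 1):
        for off in (-d, d):
            j = i + off
            if 0 <= j < len(tokens):
                feats[f"{off:+d}:text.lower"] = tokens[j].text.lower()
                feats[f"{off:+d}:font"] = tokens[j].font
            else:
                feats[f"{off:+d}:pad"] = True  # sentence boundary marker
    # A word bigram with the previous token, as one example of a context n-gram.
    if i > 0:
        feats["bigram.prev"] = tokens[i - 1].text.lower() + " " + tok.text.lower()
    return feats


def sentence_features(tokens: List[Token]) -> List[Dict[str, object]]:
    """One feature dict per token, in sentence order."""
    return [token_features(tokens, i) for i in range(len(tokens))]


def bio_to_spans(labels: List[str]) -> List[Tuple[int, int]]:
    """Convert B-MATH / I-MATH / O labels to (start, end) token spans, end exclusive."""
    spans: List[Tuple[int, int]] = []
    start = None
    for i, lab in enumerate(labels):
        if lab == "B-MATH":
            if start is not None:
                spans.append((start, i))
            start = i
        elif lab == "I-MATH":
            if start is None:
                start = i  # tolerate an I- label with no preceding B-
        else:
            if start is not None:
                spans.append((start, i))
                start = None
    if start is not None:
        spans.append((start, len(labels)))
    return spans


if __name__ == "__main__":
    sent = [Token("where", "Times-Roman"), Token("x", "Times-Italic"), Token("is", "Times-Roman")]
    for feats in sentence_features(sent):
        print(feats)
    print(bio_to_spans(["O", "B-MATH", "O"]))
```

A list of such dicts per sentence is the input shape accepted by common CRF toolkits such as sklearn-crfsuite, though that is not necessarily the toolkit the authors used. Predicted label sequences can then be turned into math zones with `bio_to_spans` and compared against the annotated spans to compute precision, recall, and F-measure.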

