Authors’ names extraction from scanned documents

2007 2nd International Conference on Digital Information Management. Pub Date: 2007-10-01. DOI: 10.1109/ICDIM.2007.4444202

Manabu Ohta, Shun Yamasaki, Takayuki Yakushi, A. Takasu

Citations: 2

Abstract

Authors' names are a critical bibliographic element when searching or browsing academic articles stored in digital libraries. However, extracting such bibliographic data from printed documents still requires human intervention and is therefore not cost-effective, even with document image-processing techniques such as optical character recognition (OCR). In this paper, we describe an automatic authors' names extraction method for scanned academic articles with OCR mark-up. The proposed method first uses layout analysis to extract authors' blocks, which contain the characters assumed to be author names or delimiters. It then uses a specifically designed hidden Markov model (HMM) to label the unsegmented character strings in each block as either author or delimiter. We applied the proposed method to Japanese academic articles. The experiments showed that the proposed method correctly extracted more than 99% of authors' blocks with manual tuning, and that the proposed HMM correctly labeled more than 95% of the author name strings.
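
To make the labeling step concrete, the following is a minimal sketch of how an HMM can tag each character of an unsegmented authors' block as author (`A`) or delimiter (`D`) using Viterbi decoding. It is not the model from the paper: the two-state topology, the probabilities in `START`, `TRANS`, and `EMIT`, the `DELIMITER_CHARS` set, and the function names are all assumptions chosen for illustration, and the sample names are placeholders.

```python
import math

STATES = ("A", "D")  # A = author-name character, D = delimiter character

# Hypothetical, hand-set parameters (not taken from the paper).
START = {"A": 0.9, "D": 0.1}
TRANS = {
    "A": {"A": 0.8, "D": 0.2},
    "D": {"A": 0.6, "D": 0.4},
}
EMIT = {
    "A": {"other": 0.95, "delim": 0.05},
    "D": {"other": 0.05, "delim": 0.95},
}

# Assumed delimiter characters: ASCII comma, ideographic comma, fullwidth
# comma, middle dot, ASCII space, ideographic space.
DELIMITER_CHARS = set(",\u3001\uff0c\u30fb \u3000")


def char_class(ch):
    """Map a character to a coarse observation symbol."""
    return "delim" if ch in DELIMITER_CHARS else "other"


def viterbi(text):
    """Return the most likely state sequence for text, computed in log space."""
    if not text:
        return []
    obs = [char_class(ch) for ch in text]
    score = {s: math.log(START[s]) + math.log(EMIT[s][obs[0]]) for s in STATES}
    backpointers = []
    for symbol in obs[1:]:
        new_score, pointer = {}, {}
        for s in STATES:
            # Best predecessor state for landing in state s at this position.
            best = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = (
                score[best] + math.log(TRANS[best][s]) + math.log(EMIT[s][symbol])
            )
            pointer[s] = best
        score = new_score
        backpointers.append(pointer)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointer in reversed(backpointers):
        state = pointer[state]
        path.append(state)
    path.reverse()
    return path


def extract_names(text):
    """Group consecutive author-labeled characters into name strings."""
    names, current = [], []
    for ch, label in zip(text, viterbi(text)):
        if label == "A":
            current.append(ch)
        elif current:
            names.append("".join(current))
            current = []
    if current:
        names.append("".join(current))
    return names


print(extract_names("山田太郎,鈴木花子"))  # ['山田太郎', '鈴木花子']
```

With only two coarse observation classes, this toy model does little more than split on delimiter characters. A model of the kind the abstract describes would presumably need richer states and emissions to cope with OCR errors and ambiguous characters, and those details are not given here.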

