Latest Publications in Document Recognition and Retrieval

Automatic Transcription of Historical Newsprint by Leveraging the Kaldi Speech Recognition Toolkit
Document Recognition and Retrieval Pub Date: 2016-02-17 DOI: 10.2352/ISSN.2470-1173.2016.17.DRR-062
Patrick Schone, Alan B. Cannaday, S. Stewart, Rachael Day, J. Schone
Citations: 5
Cuckoos among Your Data: A Quality Control Method to Retrieve Mislabeled Writer Identities from Handwriting Datasets
Document Recognition and Retrieval Pub Date: 2016-02-17 DOI: 10.2352/ISSN.2470-1173.2016.17.DRR-056
Vlad Atanasiu
Citations: 5
Improving a deep convolutional neural network architecture for character recognition
Document Recognition and Retrieval Pub Date: 2016-02-17 DOI: 10.2352/ISSN.2470-1173.2016.17.DRR-060
B. Cirstea, Laurence Likforman-Sulem
Citations: 6
Integrating Text Recognition for Overlapping Text Detection in Maps
Document Recognition and Retrieval Pub Date: 2016-02-17 DOI: 10.2352/ISSN.2470-1173.2016.17.DRR-061
N. Nazari, Tianxiang Tan, Yao-Yi Chiang
Detecting overlapping text from map images is a challenging problem. Previous algorithms generally assume specific cartographic styles (e.g., road shapes and text format) and are difficult to adjust for handling different map types. In this paper, we build on our previous text recognition work, Strabo, to develop an algorithm for detecting overlapping characters from non-text symbols. We call this algorithm Overlapping Text Detection (OTD). OTD uses the recognition results and locations of detected text labels (from Strabo) to detect potential areas that contain overlapping text. Next, OTD classifies these areas as either text or non-text regions based on their shape descriptions (including the ratio of the number of foreground pixels to the area size, the number of connected components, and the number of holes). The average precision and recall of OTD in classifying text and non-text regions were 77% and 86%, respectively. We show that OTD improved the precision and recall of text detection in Strabo by 19% and 41%, respectively, and produced higher accuracy compared to a state-of-the-art text/graphic separation algorithm.
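The shape descriptors named in the OTD abstract (foreground-pixel ratio, connected-component count, hole count) can be sketched in a few lines; the binary-grid representation and function names below are illustrative, not taken from the paper:

```python
from collections import deque

def _components(grid, value):
    """4-connected components of cells equal to `value`; returns a list of sets."""
    h, w = len(grid), len(grid[0])
    seen, comps = set(), []
    for y in range(h):
        for x in range(w):
            if grid[y][x] == value and (y, x) not in seen:
                comp, q = set(), deque([(y, x)])
                seen.add((y, x))
                while q:
                    cy, cx = q.popleft()
                    comp.add((cy, cx))
                    for ny, nx in ((cy+1, cx), (cy-1, cx), (cy, cx+1), (cy, cx-1)):
                        if 0 <= ny < h and 0 <= nx < w and \
                           grid[ny][nx] == value and (ny, nx) not in seen:
                            seen.add((ny, nx))
                            q.append((ny, nx))
                comps.append(comp)
    return comps

def shape_description(region):
    """Foreground-pixel ratio, connected-component count, and hole count
    (holes = background components that never touch the region border)."""
    h, w = len(region), len(region[0])
    ratio = sum(row.count(1) for row in region) / (h * w)
    cc = len(_components(region, 1))
    holes = sum(
        1 for comp in _components(region, 0)
        if not any(y in (0, h - 1) or x in (0, w - 1) for y, x in comp)
    )
    return ratio, cc, holes

# An "O"-like glyph: one component enclosing one hole.
o_glyph = [
    [1, 1, 1, 1],
    [1, 0, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
]
print(shape_description(o_glyph))  # → (0.75, 1, 1)
```

Text regions tend to have moderate fill ratios and many small components, while solid map symbols have high fill and few holes, which is what makes these three numbers discriminative.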
Citations: 7
Revisiting Known-Item Retrieval in Degraded Document Collections
Document Recognition and Retrieval Pub Date: 2016-02-17 DOI: 10.2352/ISSN.2470-1173.2016.17.DRR-065
Jason J. Soo, O. Frieder
Optical character recognition software converts an image of text to a text document but typically degrades the document's contents. Correcting such degradation so that the document set can be queried effectively is the focus of this work. The described approach uses a fusion of substring generation rules and context-aware analysis to correct these errors. Evaluation was facilitated by two publicly available datasets from TREC-5's Confusion Track, with estimated error rates of 5% and 20%. On the 5% dataset, we demonstrate a statistically significant improvement over the prior art and over Solr's mean reciprocal rank (MRR). On the 20% dataset, we demonstrate a statistically significant improvement over Solr and similar performance to the prior art. The described approach achieves an MRR of 0.6627 and 0.4924 on collections with error rates of approximately 5% and 20%, respectively.
Introduction: Documents that are not electronically readable are increasingly difficult to manage, search, and maintain. Optical character recognition (OCR) is used to digitize these documents but frequently produces a degraded copy. We develop a search system capable of searching such degraded documents. Our approach sustains a higher search accuracy rate than the prior art, as evaluated on the TREC-5 Confusion Track datasets. Additionally, the approach is domain and language agnostic, increasing its applicability. In the United States, two federal initiatives are underway focused on the digitization of health records. First, the federal government is incentivizing local and private hospitals to switch from paper to electronic health records to improve the quality of care [3]. Second, the Department of Veterans Affairs (VA) has an initiative to eliminate all paper health records by 2015 [2]. Both processes require converting paper records to digital images and, hopefully, indexing the digitized images to support searching. These efforts either leverage or can leverage OCR to query the newly created records and improve quality of service. These are but a few of the many examples demonstrating the importance of OCR. An OCR process is composed of two main parts. First is the conversion of an image to text by identifying characters and words in images [8, 17]. Second, the resulting text is post-processed to identify and correct errors from the first phase; techniques here range from simple dictionary checks to statistical methods. Our research focuses on this latter phase. Some work in the second phase has attempted to optimize an algorithm's parameters by training on portions of the dataset [16]. However, such an approach does not generalize to other OCR collections. Other work focuses on specialized situations: handwritten documents [15]; signs and historical markers/documents [13, 9]. Still other works hinge on assumptions: that the OCR engine exposes a confidence level for each processed word [7]; that online resources will allow the sys…
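Mean reciprocal rank, the metric reported in this abstract, is easy to state for known-item retrieval, where each query has exactly one relevant document; a minimal sketch (function and document names are hypothetical):

```python
def mean_reciprocal_rank(ranked_results, known_items):
    """MRR over a set of queries: the average of 1/rank of the single known
    relevant document in each result list (0 if it is never retrieved)."""
    total = 0.0
    for results, target in zip(ranked_results, known_items):
        rr = 0.0
        for rank, doc in enumerate(results, start=1):
            if doc == target:
                rr = 1.0 / rank
                break
        total += rr
    return total / len(known_items)

# Three known-item queries: targets found at rank 1, rank 2, and not at all.
runs = [["d3", "d1"], ["d9", "d7"], ["d2", "d4"]]
targets = ["d3", "d7", "d8"]
print(mean_reciprocal_rank(runs, targets))  # → (1 + 0.5 + 0) / 3 = 0.5
```

An MRR of 0.6627 on the 5% collection thus roughly means the known item appears, on average, between rank 1 and rank 2.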
Citations: 2
Information Extraction from Resume Documents in PDF Format
Document Recognition and Retrieval Pub Date: 2016-02-17 DOI: 10.2352/ISSN.2470-1173.2016.17.DRR-064
Jiaze Chen, Liangcai Gao, Zhi Tang
Citations: 20
Language Identification in Document Images
Document Recognition and Retrieval Pub Date: 2016-02-17 DOI: 10.2352/ISSN.2470-1173.2016.17.DRR-058
Philippine Barlas, David Hebert, Clément Chatelain, Sébastien Adam, T. Paquet
This paper presents a system dedicated to automatic language identification of text regions in heterogeneous and complex documents. This system is able to process documents with mixed printed and handwritten text and various layouts. To handle such a problem, we propose a system that performs the following sub-tasks: writing type identification (printed/handwritten), script identification, and language identification. The methods for writing type recognition and script discrimination are based on the analysis of connected components, while the language identification approach relies on a statistical text analysis, which requires a recognition engine. We evaluate the system on a new public dataset and present detailed results on the three tasks. Our system outperforms the Google plug-in evaluated on the ground-truth transcriptions of the same dataset.
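The abstract does not detail its statistical text analysis, but a common way to identify the language of recognized text is character n-gram profile matching; the sketch below is an assumption for illustration, not the authors' method, and the tiny reference profiles are invented:

```python
from collections import Counter

def ngram_profile(text, n=3):
    """Character trigram counts, with padding spaces to capture word boundaries."""
    text = f" {text.lower()} "
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def identify_language(text, profiles):
    """Score each language by the overlap between the text's trigram counts
    and that language's reference profile; return the best-scoring language."""
    grams = ngram_profile(text)
    def overlap(profile):
        return sum(min(count, profile[g]) for g, count in grams.items())
    return max(profiles, key=lambda lang: overlap(profiles[lang]))

# Toy profiles built from one sentence each; real profiles use large corpora.
profiles = {
    "en": ngram_profile("the quick brown fox jumps over the lazy dog"),
    "fr": ngram_profile("le renard brun saute par dessus le chien paresseux"),
}
print(identify_language("the dog jumps", profiles))  # → "en"
```

Because the text comes from a recognition engine, such profiles must be robust to OCR/HTR character errors, which is one reason the paper treats script and writing-type identification as separate upstream steps.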
Citations: 5
Arrowhead detection in biomedical images
Document Recognition and Retrieval Pub Date: 2016-02-17 DOI: 10.2352/ISSN.2470-1173.2016.17.DRR-054
K. Santosh, Naved Alam, P. Roy, L. Wendling, Sameer Kiran Antani, G. Thoma
Medical images in biomedical documents tend to be complex by nature and often contain several regions that are annotated using arrows. Arrowhead detection is a critical precursor to region-of-interest (ROI) labeling and image content analysis. To detect arrowheads, images are first binarized using a fuzzy binarization technique to segment a set of candidates based on the connected component principle. To select arrow candidates, we use convexity-defect-based filtering, followed by template matching via dynamic programming. A similarity score computed via dynamic time warping (DTW) confirms the presence of arrows in the image. Our test on biomedical images from the ImageCLEF 2010 collection demonstrates the effectiveness of the technique.
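Dynamic time warping, used here to score candidate arrowheads against templates, aligns two feature sequences of different lengths; a minimal sketch on 1-D sequences (the feature values are invented for illustration):

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two feature sequences, e.g.
    contour descriptors of a candidate shape and of an arrowhead template.
    Each cell takes the local cost plus the cheapest of the three
    predecessor alignments (insert, delete, match)."""
    inf = float("inf")
    n, m = len(a), len(b)
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]

template = [0.0, 1.0, 0.0]         # idealized arrowhead profile
candidate = [0.0, 0.1, 1.0, 0.1]   # same shape, stretched and slightly noisy
print(dtw_distance(template, candidate))  # → 0.2
```

A low DTW distance to the template confirms an arrow candidate even when the candidate is stretched or sampled at a different rate, which plain Euclidean comparison would penalize.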
Citations: 3
Training a calligraphy style classifier on a non-representative training set
Document Recognition and Retrieval Pub Date: 2016-02-17 DOI: 10.2352/ISSN.2470-1173.2016.17.DRR-052
G. Nagy
Calligraphy collections are being scanned into document images for preservation and accessibility. The digitization technology is mature and calligraphy character recognition is well underway, but automatic calligraphy style classification is lagging. Special style features are developed to measure the style similarity of calligraphy character images with different stroke configurations and GB (or Unicode) labels. Recognizing the five main styles is easiest when a style-labeled sample of the same character (i.e., the same GB code) from the same work and scribe is available. Even samples of characters with different GB codes from the same work help. Style classification is most difficult when the training data has no comparable characters from the same work. These distinctions are quantified by distance statistics between the underlying feature distributions. Style classification is more accurate when several character samples from the same work are available. In adverse practical scenarios, when labeled versions of unknown works are not available for training the classifier, Borda Count voting and adaptive classification of style-sensitive feature vectors from seven characters of the same work raise the ~70% single-sample baseline accuracy to ~90%.
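Borda Count voting over per-character style rankings can be sketched as follows. The five style names follow the standard taxonomy of Chinese calligraphy, and the rankings are invented for illustration; the paper's actual feature vectors and per-character classifier are not reproduced here:

```python
from collections import defaultdict

STYLES = ["seal", "clerical", "regular", "running", "cursive"]

def borda_vote(per_character_rankings):
    """Combine per-character style rankings: each ranking awards k-1 points
    to its top style, k-2 to the next, and so on; the style with the most
    points across all characters wins."""
    k = len(STYLES)
    points = defaultdict(int)
    for ranking in per_character_rankings:
        for pos, style in enumerate(ranking):
            points[style] += k - 1 - pos
    return max(points, key=points.get)

# Hypothetical rankings from three characters of the same work: no single
# character is decisive, but the vote settles on "regular".
rankings = [
    ["regular", "running", "clerical", "seal", "cursive"],
    ["running", "regular", "cursive", "clerical", "seal"],
    ["regular", "cursive", "running", "seal", "clerical"],
]
print(borda_vote(rankings))  # → "regular"
```

Voting across several characters of one work exploits the fact that a work is written in a single style, which is how the paper lifts the ~70% single-sample accuracy toward ~90%.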
Citations: 3
Intelligent Pen: A Least Cost Search Approach to Stroke Extraction in Historical Documents
Document Recognition and Retrieval Pub Date: 2016-02-17 DOI: 10.2352/ISSN.2470-1173.2016.17.DRR-057
Kevin L. Bauer, W. Barrett
(Abstract of the Master of Science thesis by Kevin L. Bauer, Department of Computer Science, BYU.) Extracting strokes from handwriting in historical documents provides high-level features for the challenging problem of handwriting recognition. Such handwriting often contains noise, faint or incomplete strokes, strokes with gaps, overlapping ascenders and descenders, and competing lines when embedded in a table or form, making it unsuitable for local line-following algorithms or associated binarization schemes. We introduce Intelligent Pen for piece-wise optimal stroke extraction. Extracted strokes are stitched together to provide a complete trace of the handwriting. Intelligent Pen formulates stroke extraction as a set of piece-wise optimal paths, extracted and assembled in cost order. As such, Intelligent Pen is robust to noise, gaps, faint handwriting, and even competing lines and strokes. Intelligent Pen traces compare closely with the shape as well as the order in which the handwriting was written. A quantitative comparison with an ICDAR handwritten stroke dataset shows Intelligent Pen traces to be within 0.78 pixels (mean difference) of the manually created strokes.
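The least-cost-search idea behind a piece-wise optimal path can be illustrated with Dijkstra's algorithm on a pixel-cost grid, where dark ink is cheap to traverse and background is expensive; this is a simplified sketch of the general technique, not the thesis's implementation:

```python
import heapq

def least_cost_path(cost, start, goal):
    """Dijkstra over a 4-connected pixel grid: cost[y][x] is the price of
    stepping onto a pixel (low on dark ink, high on background), so the
    cheapest start-to-goal path follows the stroke."""
    h, w = len(cost), len(cost[0])
    dist = {start: cost[start[0]][start[1]]}
    prev = {}
    heap = [(dist[start], start)]
    while heap:
        d, (y, x) = heapq.heappop(heap)
        if (y, x) == goal:                      # reconstruct the path
            path = [(y, x)]
            while path[-1] != start:
                path.append(prev[path[-1]])
            return path[::-1]
        if d > dist[(y, x)]:
            continue                            # stale heap entry
        for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
            if 0 <= ny < h and 0 <= nx < w:
                nd = d + cost[ny][nx]
                if nd < dist.get((ny, nx), float("inf")):
                    dist[(ny, nx)] = nd
                    prev[(ny, nx)] = (y, x)
                    heapq.heappush(heap, (nd, (ny, nx)))
    return None

# A faint stroke (cost 1) winding through expensive background (cost 9):
grid = [
    [1, 1, 9],
    [9, 1, 9],
    [9, 1, 1],
]
path = least_cost_path(grid, (0, 0), (2, 2))
print(path)  # → [(0, 0), (0, 1), (1, 1), (2, 1), (2, 2)]
```

Because the search is globally optimal between endpoints, it bridges gaps and faint segments that would derail a greedy local line follower, which is the property the thesis builds on.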
Citations: 0