{"title":"Automatic Transcription of Historical Newsprint by Leveraging the Kaldi Speech Recognition Toolkit","authors":"Patrick Schone, Alan B. Cannaday, S. Stewart, Rachael Day, J. Schone","doi":"10.2352/ISSN.2470-1173.2016.17.DRR-062","DOIUrl":"https://doi.org/10.2352/ISSN.2470-1173.2016.17.DRR-062","url":null,"abstract":"","PeriodicalId":152377,"journal":{"name":"Document Recognition and Retrieval","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-02-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130504375","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Cuckoos among Your Data: A Quality Control Method to Retrieve Mislabeled Writer Identities from Handwriting Datasets","authors":"Vlad Atanasiu","doi":"10.2352/ISSN.2470-1173.2016.17.DRR-056","DOIUrl":"https://doi.org/10.2352/ISSN.2470-1173.2016.17.DRR-056","url":null,"abstract":"","PeriodicalId":152377,"journal":{"name":"Document Recognition and Retrieval","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-02-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134159027","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Improving a deep convolutional neural network architecture for character recognition","authors":"B. Cirstea, Laurence Likforman-Sulem","doi":"10.2352/ISSN.2470-1173.2016.17.DRR-060","DOIUrl":"https://doi.org/10.2352/ISSN.2470-1173.2016.17.DRR-060","url":null,"abstract":"","PeriodicalId":152377,"journal":{"name":"Document Recognition and Retrieval","volume":"49 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-02-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125399443","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Integrating Text Recognition for Overlapping Text Detection in Maps","authors":"N. Nazari, Tianxiang Tan, Yao-Yi Chiang","doi":"10.2352/ISSN.2470-1173.2016.17.DRR-061","DOIUrl":"https://doi.org/10.2352/ISSN.2470-1173.2016.17.DRR-061","url":null,"abstract":"Detecting overlapping text from map images is a challenging problem. Previous algorithms generally assume specific cartographic styles (e.g., road shapes and text format) and are difficult to adjust for handling different map types. In this paper, we build on our previous text recognition work, Strabo, to develop an algorithm for detecting overlapping characters from non-text symbols. We call this algorithm Overlapping Text Detection (OTD). OTD uses the recognition results and locations of detected text labels (from Strabo) to detect potential areas that contain overlapping text. Next, OTD classifies these areas as either text or non-text regions based on their shape descriptions (including the ratio of the number of foreground pixels to area size, the number of connected components, and the number of holes). The average precision and recall of OTD in classifying text and non-text regions were 77% and 86%, respectively. We show that OTD improved the precision and recall of text detection in Strabo by 19% and 41%, respectively, and produced higher accuracy compared to a state-of-the-art text/graphic separation algorithm.","PeriodicalId":152377,"journal":{"name":"Document Recognition and Retrieval","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-02-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121270833","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Revisiting Known-Item Retrieval in Degraded Document Collections","authors":"Jason J. Soo, O. Frieder","doi":"10.2352/ISSN.2470-1173.2016.17.DRR-065","DOIUrl":"https://doi.org/10.2352/ISSN.2470-1173.2016.17.DRR-065","url":null,"abstract":"Optical character recognition software converts an image of text to a text document but typically degrades the document’s contents. Correcting such degradation to enable the document set to be queried effectively is the focus of this work. The described approach uses a fusion of substring generation rules and context-aware analysis to correct these errors. Evaluation was facilitated by two publicly available datasets from TREC-5’s Confusion Track containing estimated error rates of 5% and 20%. On the 5% dataset, we demonstrate a statistically significant improvement over the prior art and Solr’s mean reciprocal rank (MRR). On the 20% dataset, we demonstrate a statistically significant improvement over Solr, and have similar performance to the prior art. The described approach achieves an MRR of 0.6627 and 0.4924 on collections with error rates of approximately 5% and 20%, respectively. Introduction Documents that are not electronically readable are increasingly difficult to manage, search, and maintain. Optical character recognition (OCR) is used to digitize these documents, but frequently produces a degraded copy. We develop a search system capable of searching such degraded documents. Our approach sustains a higher search accuracy rate than the prior art as evaluated using the TREC-5 Confusion Track datasets. Additionally, the approach developed is domain- and language-agnostic, increasing its applicability. In the United States there are two federal initiatives underway focused on the digitization of health records. First, the federal government is incentivizing local and private hospitals to switch from paper to electronic health records to improve the quality of care [3]. 
Second, the Department of Veterans Affairs (VA) has an initiative to eliminate all paper health records by 2015 [2]. Both processes require converting paper records to digital images, and – hopefully – indexing of the digitized images to support searching. These efforts are either leveraging or can leverage OCR to query the newly created records to improve quality of service. These are but a few of the many examples demonstrating the importance of OCR. An OCR process is composed of two main parts. First is the conversion of an image to text by identifying characters and words from images [8, 17]. Second, the resulting text is post-processed to identify and correct errors introduced during the first phase. Techniques in this process can range from simple dictionary checks to statistical methods. Our research focuses on the latter phase. Some work in the second phase has attempted to optimize the algorithm’s parameters by training algorithms on portions of the dataset [16]. However, such an approach does not generalize to other OCR collections. Other work focuses on specialized situations: handwritten documents [15]; signs, historical markers/documents [13, 9]. Other works hinge on assumptions: the OCR exposes a confidence level for each processed word [7]; online resources will allow the sys","PeriodicalId":152377,"journal":{"name":"Document Recognition and Retrieval","volume":"93 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-02-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126223057","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Information Extraction from Resume Documents in PDF Format","authors":"Jiaze Chen, Liangcai Gao, Zhi Tang","doi":"10.2352/ISSN.2470-1173.2016.17.DRR-064","DOIUrl":"https://doi.org/10.2352/ISSN.2470-1173.2016.17.DRR-064","url":null,"abstract":"","PeriodicalId":152377,"journal":{"name":"Document Recognition and Retrieval","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-02-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129464349","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Language Identification in Document Images","authors":"Philippine Barlas, David Hebert, Clément Chatelain, Sébastien Adam, T. Paquet","doi":"10.2352/ISSN.2470-1173.2016.17.DRR-058","DOIUrl":"https://doi.org/10.2352/ISSN.2470-1173.2016.17.DRR-058","url":null,"abstract":"This paper presents a system dedicated to automatic language identification of text regions in heterogeneous and complex documents. This system is able to process documents with mixed printed and handwritten text and various layouts. To handle such a problem, we propose a system that performs the following sub-tasks: writing type identification (printed/handwritten), script identification and language identification. The methods for the writing type recognition and the script discrimination are based on the analysis of the connected components, while the language identification approach relies on a statistical text analysis, which requires a recognition engine. We evaluate the system on a new public dataset and present detailed results on the three tasks. Our system outperforms the Google plug-in evaluated on the ground-truth transcriptions of the same dataset.","PeriodicalId":152377,"journal":{"name":"Document Recognition and Retrieval","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-02-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124856857","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Arrowhead detection in biomedical images","authors":"K. Santosh, Naved Alam, P. Roy, L. Wendling, Sameer Kiran Antani, G. Thoma","doi":"10.2352/ISSN.2470-1173.2016.17.DRR-054","DOIUrl":"https://doi.org/10.2352/ISSN.2470-1173.2016.17.DRR-054","url":null,"abstract":"Medical images in biomedical documents tend to be complex by nature and often contain several regions that are annotated using arrows. Arrowhead detection is a critical precursor to region-of-interest (ROI) labeling and image content analysis. To detect arrowheads, images are first binarized using a fuzzy binarization technique to segment a set of candidates based on the connected-component principle. To select arrow candidates, we use convexity-defect-based filtering, which is followed by template matching via dynamic programming. The similarity score via dynamic time warping (DTW) confirms the presence of arrows in the image. Our test on biomedical images from the imageCLEF 2010 collection demonstrates the effectiveness of the technique.","PeriodicalId":152377,"journal":{"name":"Document Recognition and Retrieval","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-02-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125526254","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Training a calligraphy style classifier on a non-representative training set","authors":"G. Nagy","doi":"10.2352/ISSN.2470-1173.2016.17.DRR-052","DOIUrl":"https://doi.org/10.2352/ISSN.2470-1173.2016.17.DRR-052","url":null,"abstract":"Calligraphy collections are being scanned into document images for preservation and accessibility. The digitization technology is mature and calligraphy character recognition is well underway, but automatic calligraphy style classification is lagging. Special style features are developed to measure style similarity of calligraphy character images of different stroke configurations and GB (or Unicode) labels. Recognizing the five main styles is easiest when a style-labeled sample of the same character (i.e., same GB code) from the same work and scribe is available. Even samples of characters with different GB codes from same work help. Style classification is most difficult when the training data has no comparable characters from the same work. These distinctions are quantified by distance statistics between the underlying feature distributions. Style classification is more accurate when several character samples from the same work are available. 
In adverse practical scenarios, when labeled versions of unknown works are not available for training the classifier, Borda Count voting and adaptive classification of style-sensitive feature vectors from seven characters of the same work raise the ~70% single-sample baseline accuracy to ~90%.","PeriodicalId":152377,"journal":{"name":"Document Recognition and Retrieval","volume":"61 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-02-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116714002","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Intelligent Pen: A Least Cost Search Approach to Stroke Extraction in Historical Documents","authors":"Kevin L. Bauer, W. Barrett","doi":"10.2352/ISSN.2470-1173.2016.17.DRR-057","DOIUrl":"https://doi.org/10.2352/ISSN.2470-1173.2016.17.DRR-057","url":null,"abstract":"Extracting strokes from handwriting in historical documents provides high-level features for the challenging problem of handwriting recognition. Such handwriting often contains noise, faint or incomplete strokes, strokes with gaps, overlapping ascenders and descenders and competing lines when embedded in a table or form, making it unsuitable for local line following algorithms or associated binarization schemes. We introduce Intelligent Pen for piece-wise optimal stroke extraction. Extracted strokes are stitched together to provide a complete trace of the handwriting. Intelligent Pen formulates stroke extraction as a set of piece-wise optimal paths, extracted and assembled in cost order. As such, Intelligent Pen is robust to noise, gaps, faint handwriting and even competing lines and strokes. Intelligent Pen traces compare closely with the shape as well as the order in which the handwriting was written. 
A quantitative comparison with an ICDAR handwritten stroke data set shows Intelligent Pen traces to be within 0.78 pixels (mean difference) of the manually created strokes.","PeriodicalId":152377,"journal":{"name":"Document Recognition and Retrieval","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-02-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132868848","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}