{"title":"Making Europe's Historical Newspapers Searchable","authors":"Clemens Neudecker, A. Antonacopoulos","doi":"10.1109/DAS.2016.83","DOIUrl":"https://doi.org/10.1109/DAS.2016.83","url":null,"abstract":"This paper provides a rare glimpse into the overall approach for the refinement, i.e. the enrichment of scanned historical newspapers with text and layout recognition, in the Europeana Newspapers project. Within three years, the project processed more than 10 million pages of historical newspapers from 12 national and major libraries to produce the largest open access and fully searchable text collection of digital historical newspapers in Europe. In this, a wide variety of legal, logistical, technical and other challenges were encountered. After introducing the background issues in newspaper digitization in Europe, the paper discusses the technical aspects of refinement in greater detail. It explains what decisions were taken in the design of the large-scale processing workflow to address these challenges, what were the results produced and what were identified as best practices.","PeriodicalId":197359,"journal":{"name":"2016 12th IAPR Workshop on Document Analysis Systems (DAS)","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121259410","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
J. I. Toledo, A. Fornés, Jordi Cucurull-Juan, J. Lladós
{"title":"Election Tally Sheets Processing System","authors":"J. I. Toledo, A. Fornés, Jordi Cucurull-Juan, J. Lladós","doi":"10.1109/DAS.2016.37","DOIUrl":"https://doi.org/10.1109/DAS.2016.37","url":null,"abstract":"In paper based elections, manual tallies at polling station level produce myriads of documents. These documents share a common form-like structure and a reduced vocabulary worldwide. On the other hand, each tally sheet is filled by a different writer and on different countries, different scripts are used. We present a complete document analysis system for electoral tally sheet processing combining state of the art techniques with a new handwriting recognition subprocess based on unsupervised feature discovery with Variational Autoencoders and sequence classification with BLSTM neural networks. The whole system is designed to be script independent and allows a fast and reliable results consolidation process with reduced operational cost.","PeriodicalId":197359,"journal":{"name":"2016 12th IAPR Workshop on Document Analysis Systems (DAS)","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114899520","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Rathin Radhakrishnan Nair, Nishant Sankaran, Ifeoma Nwogu, V. Govindaraju
{"title":"Understanding Line Plots Using Bayesian Network","authors":"Rathin Radhakrishnan Nair, Nishant Sankaran, Ifeoma Nwogu, V. Govindaraju","doi":"10.1109/DAS.2016.73","DOIUrl":"https://doi.org/10.1109/DAS.2016.73","url":null,"abstract":"Information graphics, such as bar charts, graphs, plots etc. in scientific documents primarily facilitate better understanding of information. Graphics are a key component in technical documents as they are simplified representations of complex ideas. When the traditional optical character recognition (OCR) systems is used on digitized documents, we lose the ideas conveyed in these information graphics since OCRs typically work only on text. And although in more recent times, tools have been developed to extract information graphics from pdf files, they still do not intelligently interpret the contents of the extracted graphics. We therefore propose a method for identifying the intended messages of line plots using a Bayesian network. We accomplish this by first extracting a dense set of points in from a line plot and then represent the entire line plot as a sequence of trends. We then implement a Bayesian network for reasoning about the messages conveyed by the line plots and their trends. We validate our approach by performing experiments on a dataset obtained from computer science conference publications and evaluate the performance of the network against the messages generated by human end users. The resulting intended message gives holistic information about the line plot(s) as well as lower level information about the trends that make up the plot.","PeriodicalId":197359,"journal":{"name":"2016 12th IAPR Workshop on Document Analysis Systems (DAS)","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133432344","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SDK Reinvented: Document Image Analysis Methods as RESTful Web Services","authors":"Marcel Würsch, R. Ingold, M. Liwicki","doi":"10.1109/DAS.2016.56","DOIUrl":"https://doi.org/10.1109/DAS.2016.56","url":null,"abstract":"Document Image Analysis (DIA) systems become ever more advanced, but also more complex -- computationally, and logically. This increases the difficulty of integrating existing state-of-the-art approaches into new research or into practical workflows. The current approach to sharing software is publishing source code -- leaving the burden to the integrator -- or creating a Software Development Kit (SDK) which is often restricted to one programming language. We present DIVAServices a framework for sharing and accessing DIA methods within the research community and beyond. Using a RESTful web service architecture we provide access to the methods, leading to only one system on which the binaries of methods need to be maintained. All it takes for a developer to use an algorithm is a simple HTTP request with the image data and parameters for the method and they will receive the computed results in a format that allows for seamless integration into any kind of workflow or for further processing. Furthermore, DIVAServices is open-source, enabling other research groups or libraries to host their own instance in their environment. Using this framework, future DIA systems can be built on the shoulders of well tested algorithms, accessible to everyone.","PeriodicalId":197359,"journal":{"name":"2016 12th IAPR Workshop on Document Analysis Systems (DAS)","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134367542","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multilingual OCR for Indic Scripts","authors":"Minesh Mathew, A. Singh, C. V. Jawahar","doi":"10.1109/DAS.2016.68","DOIUrl":"https://doi.org/10.1109/DAS.2016.68","url":null,"abstract":"In Indian scenario, a document analysis system has to support multiple languages at the same time. With emerging multilingualism in urban India, often bilingual, trilingual or even more languages need to be supported. This demands development of a multilingual OCR system which can work seamlessly across Indic scripts. In our approach the script is identified at word level, prior to the recognition of the word. An end-to-end RNN based architecture which can detect the script and recognize the text in a segmentation-free manner is proposed for this purpose. We demonstrate the approach for 12 Indian languages and English. It is observed that, even with the similar architecture, performance on Indian languages are poorer compared to English. We investigate this further. Our approach is evaluated on a large corpus comprising of thousands of pages. The Hindi OCR is compared with other popular OCRs for the language, as a further testimony for the efficacy of our method.","PeriodicalId":197359,"journal":{"name":"2016 12th IAPR Workshop on Document Analysis Systems (DAS)","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121120983","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MSIO: MultiSpectral Document Image BinarizatIOn","authors":"Markus Diem, Fabian Hollaus, Robert Sablatnig","doi":"10.1109/DAS.2016.39","DOIUrl":"https://doi.org/10.1109/DAS.2016.39","url":null,"abstract":"MultiSpectral (MS) imaging enriches document digitization by increasing the spectral resolution. We present a methodology which detects a target ink in document images by taking into account this additional information. The proposed method performs a rough foreground estimation to localize possible ink regions. Then, the Adaptive Coherence Estimator (ACE), a target detection algorithm, transforms the MS input space into a single gray-scale image where values close to one indicate ink. A spatial segmentation using GrabCut on the target detection's output is computed to create the final binary image. To find a baseline performance, the method is evaluated on the three most recent Document Image Binarization COntests (DIBCO) despite the fact that they only provide RGB images. In addition, an evaluation on three publicly available MS datasets is carried out. The presented methodology achieved the highest performance at the MultiSpectral Text Extraction (MS-TEx) contest 2015.","PeriodicalId":197359,"journal":{"name":"2016 12th IAPR Workshop on Document Analysis Systems (DAS)","volume":"184 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123746226","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Emilio Granell, Verónica Romero, C. Martínez-Hinarejos
{"title":"An Interactive Approach with Off-Line and On-Line Handwritten Text Recognition Combination for Transcribing Historical Documents","authors":"Emilio Granell, Verónica Romero, C. Martínez-Hinarejos","doi":"10.1109/DAS.2016.45","DOIUrl":"https://doi.org/10.1109/DAS.2016.45","url":null,"abstract":"Automatic transcription of historical documents is becoming an important research topic, specially because of the increasing number of digitised historical documents that libraries and archives are publishing. However, state-of-the-art handwritten text recognition systems are far from being perfect. Therefore, to have perfect transcriptions, human expert revision is required to really produce a transcription of standard quality. In this context, an interactive assistive scenario, where the automatic system and the human transcriber cooperate to generate the perfect transcription, would allow for a more effective approach. In this paper we present a multimodal interactive transcription system where user feedback is provided by means of touchscreen pen strokes, traditional keyboard and mouse operations. The combination of both the main and the feedback data stream is based on the use of Confusion Networks derived from the output of the on-line and off-line handwritten text recognition systems. The use of the proposed combination help to optimise overall performance and usability.","PeriodicalId":197359,"journal":{"name":"2016 12th IAPR Workshop on Document Analysis Systems (DAS)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116985684","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"OCR Accuracy Prediction Method Based on Blur Estimation","authors":"V. C. Kieu, F. Cloppet, N. Vincent","doi":"10.1109/DAS.2016.50","DOIUrl":"https://doi.org/10.1109/DAS.2016.50","url":null,"abstract":"In this paper, we propose an OCR accuracy prediction method based on a local blur estimation since blur is one of the important factors that mostly damage OCR accuracy. First, we apply the blur estimation on synthetic blurred images by using Gaussian and motion blur in order to investigate the relation between blur effect and character size regarding OCR accuracy. This relation is considered as a blur-character size feature to define a classifier. Finally, the classifier can separate characters of a given document into three classes: readable, intermediate, and non-readable classes. Therefore, the quality score of the document is inferred from the three classes. The proposed method is evaluated on a published database and on an industrial one. The correlation with OCR accuracy is also given to compare with the state-of-the-art methods.","PeriodicalId":197359,"journal":{"name":"2016 12th IAPR Workshop on Document Analysis Systems (DAS)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129195403","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Natural Scene Character Recognition Using Robust PCA and Sparse Representation","authors":"Zheng Zhang, Yong Xu, Cheng-Lin Liu","doi":"10.1109/DAS.2016.32","DOIUrl":"https://doi.org/10.1109/DAS.2016.32","url":null,"abstract":"Natural scene character recognition is challenging due to the cluttered background, which is hard to separate from text. In this paper, we propose a novel method for robust scene character recognition. Specifically, we first use robust principal component analysis (PCA) to denoise character image by recovering the missing low-rank component and filtering out the sparse noise term, and then use a simple Histogram of oriented Gradient (HOG) to perform image feature extraction, and finally, use a sparse representation based classifier for recognition. In experiments on four public datasets, namely the Char74K dataset, ICADAR 2003 robust reading dataset, Street View Text (SVT) dataset and IIIT5K-word dataset, our method was demonstrated to be competitive with the state-of-the-art methods.","PeriodicalId":197359,"journal":{"name":"2016 12th IAPR Workshop on Document Analysis Systems (DAS)","volume":"82 3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123237194","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Frédéric Rayar, T. Mondal, Sabine Barrat, F. Bouali, G. Venturini
{"title":"Visual Analysis System for Features and Distances Qualitative Assessment: Application to Word Image Matching","authors":"Frédéric Rayar, T. Mondal, Sabine Barrat, F. Bouali, G. Venturini","doi":"10.1109/DAS.2016.17","DOIUrl":"https://doi.org/10.1109/DAS.2016.17","url":null,"abstract":"In this paper, a visual analysis system to qualitatively assess the features and distance functions that are used for calculating dissimilarity between two word images is presented. Computation of dissimilarity between two images is the prerequisite for image matching, indexing and retrieval problems. First, the features are extracted from the word images and a distance between each image to others is computed and represented in a matrix form. Then, based on this distance matrix, a proximity graph is built to structure the set of word images and highlight their topology. The proposed visual analysis system is a web based platform that allows visualisation and interactions on the obtained graph. This interactive visualisation tool inherently helps users to quickly analyse and understand the relevance and robustness of selected features and corresponding distance function in a unsupervised way, i.e. without any ground truth. Experiments are performed on a handwritten dataset of segmented words. Three types of features and four distance functions are considered to describe and compare the word images. Theses material are leveraged to evaluate the relevance of the built graph, and the usefulness of the platform.","PeriodicalId":197359,"journal":{"name":"2016 12th IAPR Workshop on Document Analysis Systems (DAS)","volume":"88 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125559351","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}