{"title":"Handwritten and Machine-Printed Text Discrimination Using a Template Matching Approach","authors":"Mehryar Emambakhsh, Yulan He, I. Nabney","doi":"10.1109/DAS.2016.22","DOIUrl":"https://doi.org/10.1109/DAS.2016.22","url":null,"abstract":"We propose a novel template matching approach for the discrimination of handwritten and machine-printed text. We first pre-process the scanned document images by performing denoising, circles/lines exclusion and word-block level segmentation. We then align and match characters in a flexible sized gallery with the segmented regions, using parallelised normalised cross-correlation. The experimental results over the Pattern Recognition & Image Analysis Research Lab-Natural History Museum (PRImA-NHM) dataset show remarkably high robustness of the algorithm in classifying cluttered, occluded and noisy samples, in addition to those with significant high missing data. The algorithm, which gives 84.0% classification rate with false positive rate 0.16 over the dataset, does not require training samples and generates compelling results as opposed to the training-based approaches, which have used the same benchmark.","PeriodicalId":197359,"journal":{"name":"2016 12th IAPR Workshop on Document Analysis Systems (DAS)","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121766685","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"General Pattern Run-Length Transform for Writer Identification","authors":"Sheng He, Lambert Schomaker","doi":"10.1109/DAS.2016.42","DOIUrl":"https://doi.org/10.1109/DAS.2016.42","url":null,"abstract":"In this paper we present a novel textural-based feature for writer identification: the General Pattern Run-Length Transform (GPRLT), which is the histogram of the run-length of any complex patterns. The GPRLT can be computed on the binary images (GPRLT bin) or on the gray scale images (GPRLT gray) without using any binarization or segmentation methods. Experimental results show that the GPRLT gray achieves even higher performance than the GPRLT bin for writer identification. The writer identification performance on the challenging CERUG-EN data set demonstrates that the proposed methods outperform state-of-the-art algorithms. Our source code and data set are available on www.ai.rug.nl/~sheng/dflib.","PeriodicalId":197359,"journal":{"name":"2016 12th IAPR Workshop on Document Analysis Systems (DAS)","volume":"45 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-06-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124729390","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Automatic Synthesis of Historical Arabic Text for Word-Spotting","authors":"M. Kassis, Jihad El-Sana","doi":"10.1109/DAS.2016.9","DOIUrl":"https://doi.org/10.1109/DAS.2016.9","url":null,"abstract":"We present a novel framework for automatic and efficient synthesis of historical handwritten Arabic text. The main purpose of this framework is to assist word spotting and keyword searching in handwritten historical documents. The proposed framework consists of two main procedures: building a letter connectivity map and synthesizing words. A letter connectivity map includes multiple instances of the various shape of each letter, since a letter in Arabic usually has multiple shapes depends in its position in the word. Each map represents one writer and encodes the specific handwriting style. The letter connectivity map is used to guide the synthesis of any Arabic continuous subword, word, or sentence. The proposed framework automatically generates the letter connectivity map annotation from a several pages historical pages previously annotated. Once the letter connectivity map is available our framework can synthesis the pictorial representation of any Arabic word or sentence from their text representation. The writing style of the synthesized text resembles the writing style of the input pages. The synthesized words can be used in word-spotting and many other historical document processing applications. The proposed approach provides an intuitive and easy-to-use framework to search for a keyword in the rest of the manuscript. Our experimental study shows that our approach enables accurate results in word spotting algorithms.","PeriodicalId":197359,"journal":{"name":"2016 12th IAPR Workshop on Document Analysis Systems (DAS)","volume":"38 ","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120981193","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Fuzzy Integral for Combining SVM-Based Handwritten Soft-Biometrics Prediction","authors":"Nesrine Bouadjenek, H. Nemmour, Y. Chibani","doi":"10.1109/DAS.2016.27","DOIUrl":"https://doi.org/10.1109/DAS.2016.27","url":null,"abstract":"This work addresses soft-biometrics prediction from handwriting analysis, which aims to predict the writer's gender, age range and handedness. Three SVM predictors associated each to a specific data feature are developed and subsequently combined to aggregate a robust prediction. For the combination step, Sugeno's Fuzzy Integral is proposed. Experiments are conducted on public Arabic and English handwriting datasets. The performance assessment is carried out comparatively to individual systems as well as to max and average rules, using independent and blended corpuses. The results obtained demon-strated the usefulness of the Fuzzy Integral, which provides a gain of more than 4% over individual systems as well as other combination rules. Moreover, with respect to the state of the art methods, the proposed approach seems to be much more relevant.","PeriodicalId":197359,"journal":{"name":"2016 12th IAPR Workshop on Document Analysis Systems (DAS)","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129586880","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Combination of Structural and Factual Descriptors for Document Stream Segmentation","authors":"Romain Karpinski, A. Belaïd","doi":"10.1109/DAS.2016.21","DOIUrl":"https://doi.org/10.1109/DAS.2016.21","url":null,"abstract":"This paper extends a previous work being done by [4]. Having no information about the document separation in the flow, the system operates progressively by examining successive pairs of pages looking for continuity or rupture descriptors. Four document levels have been introduced to better extract those descriptors and reduce the ambiguity in their extraction: records, technical documents, fundamental documents and cases. At each level, structural and factual descriptors are first extracted and then compared between pairs of pages or documents. To reinforce the descriptor interest and focus the system on equivalent descriptors in the pairs, the descriptors are accompanied by their context. The extraction of the context is facilitated by the determination of the physical and logical structure in the pages. Contextual rules based on these descriptors are employed for the determination of either a continuity, a rupture or an uncertainty between the pairs. To overcome the problem of information emptiness in the current page, a logbook is used to gather the descriptors in all the previous pages of the record and a buffer allows to delay the comparison. These latter points were added to the previous work that widely reinforce the current system increasing its precision of more than 6%.","PeriodicalId":197359,"journal":{"name":"2016 12th IAPR Workshop on Document Analysis Systems (DAS)","volume":"208 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121633129","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Recognition-Based Approach of Numeral Extraction in Handwritten Chemistry Documents Using Contextual Knowledge","authors":"N. Ghanmi, A. Belaïd","doi":"10.1109/DAS.2016.54","DOIUrl":"https://doi.org/10.1109/DAS.2016.54","url":null,"abstract":"This paper presents a complete procedure that uses contextual and syntactic information to identify and recognize amount fields in the table regions of chemistry documents. The proposed method is composed of two main modules. Firstly, a structural analysis based on connected component (CC) dimensions and positions identifies some special symbols and clusters other CCs into three groups: fragment of characters, isolated characters or connected characters. Then, a specific processing is performed on each group of CCs. The fragment of characters are merged with the nearest character or string using geometric relationship based rules. The characters are sent to a recognition module to identify the numeral components. For the connected characters, the final decision on the string nature (numeric or non-numeric) is made based on a global score computed on the full string using the height regularity property and the recognition probabilities of its segmented fragments. Finally, a simple syntactic verification at table row level is conducted in order to correct eventual errors. The experimental tests are carried out on real-world chemistry documents provided by our industrial partner eNovalys. The obtained results show the effectiveness of the proposed system in extracting amount fields.","PeriodicalId":197359,"journal":{"name":"2016 12th IAPR Workshop on Document Analysis Systems (DAS)","volume":"48 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131323771","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"OCRoRACT: A Sequence Learning OCR System Trained on Isolated Characters","authors":"A. Ul-Hasan, S. S. Bukhari, A. Dengel","doi":"10.1109/DAS.2016.51","DOIUrl":"https://doi.org/10.1109/DAS.2016.51","url":null,"abstract":"Digitizing historical documents is crucial in preserving the literary heritage. With the availability of low cost capturing devices, libraries and institutes all over the world have old literature preserved in the form of scanned documents. However, searching through these scanned images is still a tedious job as one is unable to search through them. Contemporary machine learning approaches have been applied successfully to recognize text in both printed and handwriting form, however, these approaches require a lot of transcribed training data in order to obtain satisfactory performance. Transcribing the documents manually is a laborious and costly task, requiring many man-hours and language-specific expertise. This paper presents a generic iterative training framework to address this issue. The proposed framework is not only applicable to historical documents, but for present-day documents as well, where manually transcribed training data is unavailable. Starting with the minimal information available, the proposed approach iteratively corrects the training and generalization errors. Specifically, we have used a segmentation-based OCR method to train on individual symbols and then use the semi-corrected recognized text lines as the ground-truth data for segmentation-free sequence learning, which learns to correct the errors in the ground-truth by incorporating context-aware processing. The proposed approach is applied to a collection of 15th century Latin documents. The iterative procedure using segmentation-free OCR was able to reduce the initial character error of about 23% (obtained from segmentation-based OCR) to less than 7% in few iterations.","PeriodicalId":197359,"journal":{"name":"2016 12th IAPR Workshop on Document Analysis Systems (DAS)","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115239773","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"QATIP -- An Optical Character Recognition System for Arabic Heritage Collections in Libraries","authors":"Felix Stahlberg, S. Vogel","doi":"10.1109/DAS.2016.81","DOIUrl":"https://doi.org/10.1109/DAS.2016.81","url":null,"abstract":"Nowadays, commercial optical character recognition (OCR) software achieves very high accuracy on high-quality scans of modern Arabic documents. However, a large fraction of Arabic heritage collections in libraries is usually more challenging - e.g. consisting of typewritten documents, early prints, and historical manuscripts. In this paper, we present our end-user oriented QATIP system for OCR in such documents. The recognition is based on the Kaldi toolkit and sophisticated text image normalization. This paper contains two main contributions: First, we describe the QATIP interface for libraries which consists of both a graphical user interface for adding and monitoring jobs and a web API for automated access. Second, we suggest novel approaches for language modelling and ligature modelling for continuous Arabic OCR. We test our QATIP system on an early print and a historical manuscript and report substantial improvements - e.g. 12.6% character error rate with QATIP compared to 51.8% with the best OCR product in our experimental setup (Tesseract).","PeriodicalId":197359,"journal":{"name":"2016 12th IAPR Workshop on Document Analysis Systems (DAS)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116409671","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Complete System for Text Line Extraction Using Convolutional Neural Networks and Watershed Transform","authors":"Joan Pastor-Pellicer, Muhammad Zeshan Afzal, M. Liwicki, María José Castro Bleda","doi":"10.1109/DAS.2016.58","DOIUrl":"https://doi.org/10.1109/DAS.2016.58","url":null,"abstract":"We present a novel Convolutional Neural Network based method for the extraction of text lines, which consists of an initial Layout Analysis followed by the estimation of the Main Body Area (i.e., the text area between the baseline and the corpus line) for each text line. Finally, a region-based method using watershed transform is performed on the map of the Main Body Area for extracting the resulting lines. We have evaluated the new system on the IAM-HisDB, a publicly available dataset containing historical documents, outperforming existing learning-based text line extraction methods, which consider the problem as pixel labelling problem into text and non-text regions.","PeriodicalId":197359,"journal":{"name":"2016 12th IAPR Workshop on Document Analysis Systems (DAS)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129829755","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Interactive Definition and Tuning of One-Class Classifiers for Document Image Classification","authors":"Nathalie Girard, Roger Trullo, Sabine Barrat, N. Ragot, Jean-Yves Ramel","doi":"10.1109/DAS.2016.46","DOIUrl":"https://doi.org/10.1109/DAS.2016.46","url":null,"abstract":"With mass of data, document image classification systems have to face new trends like being able to process heterogeneous data streams efficiently. Generally, when processing data streams, few knowledge is available about the content of the possible streams. Furthermore, as getting labelled data is costly, the classification model has to be learned from few available labelled examples. To handle such specific context, we think that combining one-class classifiers could be a very interesting alternative to quickly define and tune classification systems dedicated to different document streams. The main interest of one-class classifiers is that no interdependence occurs between each classifier model allowing easy removal, addition or modification of classes of documents. Such reconfiguration will not have any impact on the other classifiers. It is also noticeable that each classifier can use a different set of features compared to the other to handle the same class or even different classes. In return, as only one class is well-specified during the learning step, one-class classifiers have to be defined carefully to obtain good performances. It is more difficult to select the representative training examples and the discriminative features with only positive examples. To overcome these difficulties, we have defined a complete framework offering different methods that can help a system designer to define and tune one-class classifier models. The aims are to make easier the selection of good training examples and of suitable features depending on the class to recognize into the document stream. For that purpose, the proposed methods compute different measures to evaluate the relevance of the available features and training examples. Moreover, a visualization of the decision space according to selected examples and features is proposed to help such a choice and, an automatic tuning is proposed for the parameters of the models according to the class to recognize when a validation stream is available. The pertinence of the proposed framework is illustrated on two different use cases (a real data stream and a public data set).","PeriodicalId":197359,"journal":{"name":"2016 12th IAPR Workshop on Document Analysis Systems (DAS)","volume":"130 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127101200","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}