Anna Scius-Bertrand, Simon Gabay, Juliette Janes, L. Petkovic, Caroline Corbieres, Thibault Clérice
{"title":"The BIR database – Identifying typographic emphasis in list-like historical documents","authors":"Anna Scius-Bertrand, Simon Gabay, Juliette Janes, L. Petkovic, Caroline Corbieres, Thibault Clérice","doi":"10.1145/3476887.3476913","DOIUrl":"https://doi.org/10.1145/3476887.3476913","url":null,"abstract":"Layout analysis and optical character recognition have become traditional tasks for processing historical prints, but are now insufficient. Additional information is found in typographic emphasis, such as bold and italic letters. They carry semantic meaning (titles, emphasis...) and also outline the structure of the page (entries, sub-parts...). Retrieving such data is therefore crucial for information extraction and automatic document structuring. In this paper, we introduce the Bold-Italic-Regular (BIR) database, which contains 285 pages of scanned, list-like historical prints that have been annotated at word level with bold and italic emphasis. Baseline results are provided for word detection and style classification using state-of-the-art deep neural network models, highlighting promising possibilities, such as near-human performance for isolated word classification, but also demonstrating limitations for the task at hand.","PeriodicalId":166776,"journal":{"name":"The 6th International Workshop on Historical Document Imaging and Processing","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-09-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121264397","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Clemens Neudecker, Konstantin Baierer, Mike Gerber, C. Clausner, A. Antonacopoulos, S. Pletschacher
{"title":"A survey of OCR evaluation tools and metrics","authors":"Clemens Neudecker, Konstantin Baierer, Mike Gerber, C. Clausner, A. Antonacopoulos, S. Pletschacher","doi":"10.1145/3476887.3476888","DOIUrl":"https://doi.org/10.1145/3476887.3476888","url":null,"abstract":"The millions of pages of historical documents that are digitized in libraries are increasingly used in contexts that have more specific requirements for OCR quality than keyword search. How to comprehensively, efficiently and reliably assess the quality of OCR results against the background of mass digitization, when ground truth can only ever be produced for very small numbers? Due to gaps in specifications, results from OCR evaluation tools can return different results, and due to differences in implementation, even commonly used error rates are often not directly comparable. OCR evaluation metrics and sampling methods are also not sufficient where they do not take into account the accuracy of layout analysis, since for advanced use cases like Natural Language Processing or the Digital Humanities, accurate layout analysis and detection of the reading order are crucial. We provide an overview of OCR evaluation metrics and tools, describe two advanced use cases for OCR results, and perform an OCR evaluation experiment with multiple evaluation tools and different metrics for two distinct datasets. We analyze the differences and commonalities in light of the presented use cases and suggest areas for future work.","PeriodicalId":166776,"journal":{"name":"The 6th International Workshop on Historical Document Imaging and Processing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-09-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129010618","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Visual Analysis of Chapbooks Printed in Scotland","authors":"Abhishek Dutta, G. Bergel, Andrew Zisserman","doi":"10.1145/3476887.3476893","DOIUrl":"https://doi.org/10.1145/3476887.3476893","url":null,"abstract":"Chapbooks were short, cheap printed booklets produced in large quantities in Scotland, England, Ireland, North America and much of Europe between roughly the seventeenth and nineteenth centuries. A form of popular literature containing songs, stories, poems, games, riddles, religious writings and other content designed to appeal to a wide readership, they were frequently illustrated, particularly on their title-pages. This paper describes the visual analysis of such chapbook illustrations. We automatically extract all the illustrations contained in the National Library of Scotland Chapbooks Printed in Scotland dataset, and create a visual search engine to search this dataset using full or part-illustrations as queries. We also cluster these illustrations based on their visual content, and provide keyword-based search of the metadata associated with each publication. The visual search; clustering of illustrations based on visual content; and metadata search features enable researchers to forensically analyse the chapbooks dataset and to discover unnoticed relationships between its elements. We release all annotations and software tools described in this paper to enable reproduction of the results presented and to allow extension of the methodology described to datasets of a similar nature.","PeriodicalId":166776,"journal":{"name":"The 6th International Workshop on Historical Document Imaging and Processing","volume":"40 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-09-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123958825","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Generalized Template Matching for Semi-structured Text","authors":"G. Nagy","doi":"10.1145/3476887.3476895","DOIUrl":"https://doi.org/10.1145/3476887.3476895","url":null,"abstract":"Conventional template matching for named entity recognition on book-length text strings is generalized by allowing search phrases to capture distant tokens. Combined with word-type tagging and format variants (alternative name/date formats), a few initial templates (class—search-phrase—extract-phrase triples) can label most of the significant tokens. The program then uses its book-length statistics of tag-label associations to suggest candidate text for further template construction. The method serves as a preprocessor for error-free extraction of semantic relations from text obeying explicit semi-structure constraints. On three sample books of genealogical records, an F-measure of over 0.99 was achieved with less than 3 hours’ user time on each book.","PeriodicalId":166776,"journal":{"name":"The 6th International Workshop on Historical Document Imaging and Processing","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-09-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126435482","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Text Detection and Recognition by using CNNs in the Austro-Hungarian Historical Military Mapping Survey","authors":"Y. Can, M. E. Kabadayı","doi":"10.1145/3476887.3476904","DOIUrl":"https://doi.org/10.1145/3476887.3476904","url":null,"abstract":"Historical maps include precious data about historical, geographical and economic perspectives of a period. However, several unique challenges and opportunities accompany historical maps compared to modern maps, such as low-quality images, degraded manuscripts and the huge quantity of non-annotated digital map collections. In the recent decade, Convolutional Neural Networks (CNNs) are applied to solve various image processing problems, but they need enormous annotated data to have accurate results. In this work, we annotated text regions of the Third Military Mapping Survey of Austria-Hungary historical map series conducted between 1884 and 1918 manually and made them accessible for researchers. Then, we detected the pixel-wise positions of text regions by employing the deep neural network architecture and recognized them with encouraging error rates.","PeriodicalId":166776,"journal":{"name":"The 6th International Workshop on Historical Document Imaging and Processing","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-09-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117094733","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Mélodie Boillet, Martin Maarand, T. Paquet, Christopher Kermorvant
{"title":"Including Keyword Position in Image-based Models for Act Segmentation of Historical Registers","authors":"Mélodie Boillet, Martin Maarand, T. Paquet, Christopher Kermorvant","doi":"10.1145/3476887.3476905","DOIUrl":"https://doi.org/10.1145/3476887.3476905","url":null,"abstract":"The segmentation of complex images into semantic regions has seen a growing interest these last years with the advent of Deep Learning. Until recently, most existing methods for Historical Document Analysis focused on the visual appearance of documents, ignoring the rich information that textual content can offer. However, the segmentation of complex documents into semantic regions is sometimes impossible relying only on visual features and recent models embed both visual and textual information. In this paper, we focus on the use of both visual and textual information for segmenting historical registers into structured and meaningful units such as acts. An act is a text recording containing valuable knowledge such as demographic information (baptism, marriage or death) or royal decisions (donation or pardon). We propose a simple pipeline to enrich document images with the position of text lines containing key-phrases and show that running a standard image-based layout analysis system on these images can lead to significant gains. Our experiments show that the detection of acts increases from 38 % of mAP to 74 % when adding textual information, in real use-case conditions where text lines positions and content are extracted with an automatic recognition system.","PeriodicalId":166776,"journal":{"name":"The 6th International Workshop on Historical Document Imaging and Processing","volume":"100 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-09-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132275206","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"GloSAT Historical Measurement Table Dataset: Enhanced Table Structure Recognition Annotation for Downstream Historical Data Rescue","authors":"Juliusz Ziomek, S. Middleton","doi":"10.1145/3476887.3476890","DOIUrl":"https://doi.org/10.1145/3476887.3476890","url":null,"abstract":"Understanding and extracting tables from documents is a research problem that has been studied for decades. Table structure recognition is the labelling of components within a detected table, which can be detected automatically or manually provided. This paper presents the GloSAT historical measurement table dataset designed to train table structure recognition models for use in downstream historical data rescue applications. The dataset contains 500 scanned and manually annotated images of pages from meteorological measurement logbooks. We enhance standard full table and individual cell annotations by adding additional annotations for headings, headers, and table bodies. We also provide annotations for coarse segmentation cells consisting of multiple data cells logically grouped by ruling lines of ink or whitespace in the table, which often represent data cells that are semantically grouped. Our dataset annotations are provided in VOC2007 and ICDAR-2019 Competition on Table Detection and Recognition (cTDaR-19) XML formats, and our dataset can easily be aggregated with the cTDaR-19 dataset. We report results running a series of benchmark algorithms on our new dataset, concluding that post-processing is very important for performance, and that page style is not as significant a feature as table type on model performance.","PeriodicalId":166776,"journal":{"name":"The 6th International Workshop on Historical Document Imaging and Processing","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-09-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134487453","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
D. Ezra, Bronson Brown-deVost, P. Jablonski, Hayim Lapin, Benjamin Kiessling, Elena Lolli
{"title":"BiblIA - a General Model for Medieval Hebrew Manuscripts and an Open Annotated Dataset","authors":"D. Ezra, Bronson Brown-deVost, P. Jablonski, Hayim Lapin, Benjamin Kiessling, Elena Lolli","doi":"10.1145/3476887.3476896","DOIUrl":"https://doi.org/10.1145/3476887.3476896","url":null,"abstract":"The paper presents Open Source generalized models for recognition and page segmentation, intended for use on the eScriptorium platform or kraken OCR engine, of Medieval Hebrew manuscripts in square script that arrive at a character accuracy of more than 97% on the validation set and a dataset consisting of 202 pages from almost 100 different literary manuscripts with layout annotation (regions and lines) as well as transcription. The manuscript pages are sourced from material in different script types, geographical, and chronological origins. In addition we describe the bootstrapping procedure that enabled us to create most of the dataset automatically through text-image alignment.","PeriodicalId":166776,"journal":{"name":"The 6th International Workshop on Historical Document Imaging and Processing","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-09-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124469887","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}