Christian Bartz, Laurenz Seidel, Duy-Hung Nguyen, Joseph Bethge, Haojin Yang, C. Meinel
Synthetic Data for the Analysis of Archival Documents: Handwriting Determination
2020 Digital Image Computing: Techniques and Applications (DICTA), published 2020-11-29
DOI: https://doi.org/10.1109/DICTA51227.2020.9363410
Citation count: 1
Abstract
Archives contain a wealth of information and are invaluable for historical research. Thanks to digitization, many archives are preserved in a digital format, which makes their documents easier to share and access. Handwriting and handwritten notes are common in archives and carry information that cannot be extracted by analyzing documents with Optical Character Recognition (OCR) for printed text. In this paper, we present an approach for determining whether a scan of a document contains handwriting. As a preprocessing step, this approach can help to identify documents that need further analysis with a full recognition pipeline. Our method consists of a deep neural network that classifies whether a document contains handwriting. It is designed to overcome the most significant challenge when working with archival data: the scarcity of annotated training data. To this end, we introduce a data generation method that produces synthetic training data for the proposed network. Our experiments show that our model, trained on synthetic data, can achieve promising results on a real-world dataset from an art-historical archive.
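The abstract does not say how the synthetic training data is produced. Purely to illustrate the general idea of generating labelled pages with and without handwriting-like marks, the toy Python sketch below may help. Everything in it is an assumption made for illustration and not the authors' method: the function names, the horizontal bars standing in for printed text, the random-walk strokes standing in for handwriting, and the 64x64 binary page format are all hypothetical.

```python
import random


def blank_page(height, width):
    """Return a white page as a list of rows; 0 = white, 1 = ink."""
    return [[0] * width for _ in range(height)]


def draw_printed_lines(page, rng, line_height=3, gap=5, margin=4):
    """Fill horizontal bars as a crude stand-in for printed text lines (assumption)."""
    height, width = len(page), len(page[0])
    y = margin
    while y + line_height <= height - margin:
        end = rng.randint(width // 2, width - margin)  # ragged right edge
        for row in range(y, y + line_height):
            for x in range(margin, end):
                page[row][x] = 1
        y += line_height + gap


def draw_stroke(page, rng, steps=40):
    """Draw a random-walk stroke as a crude stand-in for handwriting (assumption)."""
    height, width = len(page), len(page[0])
    y, x = rng.randrange(height), rng.randrange(width)
    for _ in range(steps):
        page[y][x] = 1
        # Drift up or down by at most one pixel, and never move left.
        y = min(max(y + rng.choice((-1, 0, 1)), 0), height - 1)
        x = min(max(x + rng.choice((0, 1)), 0), width - 1)


def make_sample(rng, height=64, width=64, p_handwriting=0.5):
    """Return (page, label); label is 1 if strokes were added, else 0."""
    page = blank_page(height, width)
    draw_printed_lines(page, rng)
    label = 1 if rng.random() < p_handwriting else 0
    if label:
        for _ in range(rng.randint(1, 3)):
            draw_stroke(page, rng)
    return page, label


def make_dataset(n, seed=0):
    """Generate n labelled toy pages reproducibly from a seed."""
    rng = random.Random(seed)
    return [make_sample(rng) for _ in range(n)]
```

Calling `make_dataset(1000)` yields a list of `(page, label)` pairs that a binary classifier could be trained on. A realistic generator would presumably render actual printed and handwritten text onto document-like backgrounds, which this sketch makes no attempt to do.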