D. Ezra, Bronson Brown-deVost, P. Jablonski, Hayim Lapin, Benjamin Kiessling, Elena Lolli
BiblIA - a General Model for Medieval Hebrew Manuscripts and an Open Annotated Dataset
The 6th International Workshop on Historical Document Imaging and Processing, 2021-09-05
DOI: 10.1145/3476887.3476896 (https://doi.org/10.1145/3476887.3476896)
Citation count: 6
Abstract
This paper presents open-source generalized models for text recognition and page segmentation of medieval Hebrew manuscripts in square script, intended for use on the eScriptorium platform or with the kraken OCR engine. The models reach a character accuracy of more than 97% on the validation set. The paper also presents a dataset of 202 pages from almost 100 different literary manuscripts, with layout annotation (regions and lines) as well as transcription. The manuscript pages are drawn from material of different script types and of different geographical and chronological origins. In addition, we describe the bootstrapping procedure that enabled us to create most of the dataset automatically through text-image alignment.
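To make the accuracy figure concrete, the sketch below shows one common way a character accuracy can be computed: one minus the character-level edit distance between a reference transcription and the recognized text, divided by the reference length. This definition is an assumption for illustration; the abstract does not state the exact formula used, and the function names `levenshtein` and `character_accuracy` are hypothetical, not part of kraken or eScriptorium.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of character insertions, deletions, and substitutions
    needed to turn `ref` into `hyp` (classic dynamic-programming form)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a reference character
                            curr[j - 1] + 1,      # insert a hypothesis character
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_accuracy(ref: str, hyp: str) -> float:
    """Assumed definition: 1 - edit_distance / reference_length."""
    if not ref:
        raise ValueError("reference transcription must be non-empty")
    return 1.0 - levenshtein(ref, hyp) / len(ref)


if __name__ == "__main__":
    # Illustrative strings only, not data from the paper:
    # one substituted letter in a six-letter word.
    print(character_accuracy("בראשית", "בראשות"))
```

Under this assumed definition, an accuracy above 97% corresponds to fewer than about three character errors per 100 reference characters.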