D. Ezra, Bronson Brown-deVost, P. Jablonski, Hayim Lapin, Benjamin Kiessling, Elena Lolli
Published in: The 6th International Workshop on Historical Document Imaging and Processing, 2021-09-05. DOI: 10.1145/3476887.3476896 (https://doi.org/10.1145/3476887.3476896)
BiblIA - a General Model for Medieval Hebrew Manuscripts and an Open Annotated Dataset
The paper presents open-source generalized models for recognition and page segmentation of medieval Hebrew manuscripts in square script, intended for use on the eScriptorium platform or with the kraken OCR engine. The models reach a character accuracy of more than 97% on the validation set. The paper also presents a dataset of 202 pages from almost 100 different literary manuscripts, with layout annotation (regions and lines) as well as transcription. The manuscript pages are drawn from material of different script types and of different geographical and chronological origins. In addition, we describe the bootstrapping procedure that enabled us to create most of the dataset automatically through text-image alignment.
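To make the "character accuracy" figure concrete, here is a minimal sketch of how such a score is commonly computed. It assumes that character accuracy means one minus the total Levenshtein edit distance divided by the total number of ground-truth characters; the abstract does not state the exact formula, and the function names `edit_distance` and `character_accuracy` are illustrative, not part of kraken or eScriptorium.

```python
from typing import Iterable, Tuple


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in code points."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion from ref
                            curr[j - 1] + 1,      # insertion into ref
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def character_accuracy(pairs: Iterable[Tuple[str, str]]) -> float:
    """Assumed definition: 1 - (total edits / total ground-truth characters).

    `pairs` yields (ground_truth_line, recognized_line) tuples.
    """
    pairs = list(pairs)
    total_chars = sum(len(ref) for ref, _ in pairs)
    if total_chars == 0:
        raise ValueError("ground truth contains no characters")
    total_errors = sum(edit_distance(ref, hyp) for ref, hyp in pairs)
    return 1.0 - total_errors / total_chars


if __name__ == "__main__":
    # One substituted letter in a six-letter ground-truth line.
    lines = [("בראשית", "בראשות")]
    print(f"{character_accuracy(lines):.4f}")
```

Under this definition, an accuracy above 97% corresponds to fewer than three character-level edits per 100 ground-truth characters. The sketch counts Unicode code points, so any combining marks in the transcription would each count as one character.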