BiblIA - a General Model for Medieval Hebrew Manuscripts and an Open Annotated Dataset

The 6th International Workshop on Historical Document Imaging and Processing. Pub Date: 2021-09-05. DOI: 10.1145/3476887.3476896

D. Ezra, Bronson Brown-deVost, P. Jablonski, Hayim Lapin, Benjamin Kiessling, Elena Lolli


Citations: 6

Abstract

The paper presents BiblIA: open-source generalized models for text recognition and page segmentation of medieval Hebrew manuscripts in square script, intended for use on the eScriptorium platform or with the kraken OCR engine. The models reach a character accuracy of more than 97% on the validation set. The paper also presents an open dataset of 202 pages from almost 100 different literary manuscripts, with layout annotation (regions and lines) as well as transcription. The manuscript pages are drawn from material of different script types and of different geographical and chronological origins. In addition, we describe the bootstrapping procedure that enabled us to create most of the dataset automatically through text-image alignment.
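The abstract reports character accuracy without defining it. As a rough illustration only, the sketch below uses one common definition: one minus the Levenshtein edit distance between the reference transcription and the model output, divided by the reference length. The abstract does not confirm that the authors or kraken compute the figure this way. The function names `levenshtein` and `character_accuracy` and the sample strings are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_accuracy(reference: str, hypothesis: str) -> float:
    """Assumed definition: 1 - edit_distance / len(reference), floored at 0."""
    if not reference:
        return 1.0 if not hypothesis else 0.0
    errors = levenshtein(reference, hypothesis)
    return max(0.0, 1.0 - errors / len(reference))


if __name__ == "__main__":
    # Hypothetical ground-truth line and a model output with one wrong letter.
    reference = "בראשית ברא"
    hypothesis = "בראשית כרא"
    print(f"character accuracy: {character_accuracy(reference, hypothesis):.2%}")
```

Under this definition, accuracy above 97% corresponds to fewer than three character errors per 100 reference characters.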

