Proceedings of the 3rd International Conference on Digital Access to Textual Cultural Heritage: Latest Publications

Towards a Higher Accuracy of Optical Character Recognition of Chinese Rare Books in Making Use of Text Model
Hsiang-An Wang, Pin-Ting Liu
DOI: 10.1145/3322905.3322922 (published 2019-05-08)
Abstract: The legibility of the text of rare books is often subject to precarious conditions: natural decay, erosion, and ink bleed caused by flawed printing methods centuries ago often make such texts difficult to recognize. This difficulty compounds the challenge for optical character recognition (OCR), whose task is to convert images of printed text into machine-encoded text once a rare book has been digitized. To reduce OCR errors for rare books, this research applies N-gram, long short-term memory (LSTM), and backward and forward N-gram (BF N-gram) statistical text models, trained on substantial amounts of text, to develop a more accurate OCR pipeline. We build N-gram, LSTM, and BF N-gram models at varying character lengths and experiment with different quantities of text to find the configuration that gives the best character recognition performance. Once the best-performing text model is identified, further experiments determine the most appropriate time and method for correcting OCR errors with the aid of the text model. Our experiments suggest that correction with the text model yields more accurate OCR results than relying on the OCR models alone.
Citations: 1
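The abstract does not include the authors' implementation. As a rough illustration of how a character N-gram text model can re-rank OCR output, the sketch below counts character trigrams on a reference corpus and greedily keeps, at each position, the OCR candidate the model prefers in its left context; the function names, the greedy strategy, and the smoothing constant are illustrative assumptions, not the paper's method.

```python
from collections import defaultdict

def train_char_ngrams(corpus_lines, n=3):
    """Count character n-grams and their (n-1)-character histories."""
    grams, hists = defaultdict(int), defaultdict(int)
    for line in corpus_lines:
        padded = "^" * (n - 1) + line
        for i in range(len(line)):
            grams[padded[i:i + n]] += 1
            hists[padded[i:i + n - 1]] += 1
    return grams, hists, n

def cond_prob(model, history, ch, vocab_size=8000):
    """Add-one smoothed P(ch | last n-1 characters of history).
    vocab_size is a nominal alphabet size, a placeholder assumption."""
    grams, hists, n = model
    h = ("^" * (n - 1) + history)[-(n - 1):]
    return (grams[h + ch] + 1) / (hists[h] + vocab_size)

def correct_line(model, candidates_per_position):
    """candidates_per_position: for each character position, the OCR engine's
    candidate characters. Greedily keep the candidate that the text model
    considers most probable given the characters chosen so far."""
    out = ""
    for cands in candidates_per_position:
        out += max(cands, key=lambda c: cond_prob(model, out, c))
    return out
```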
A New Strategy for Arabic OCR: Archigraphemes, Letter Blocks, Script Grammar, and Shape Synthesis
Thomas Milo, A. Martínez
DOI: 10.1145/3322905.3322928 (published 2019-05-08)
Abstract: Current OCR has limited capability for Arabic because its script models lack a scientific basis. We propose a new OCR strategy for Arabic based on (1) Islamic script grammar, including extended shaping, and (2) treating Arabic script as a multi-layered writing system. We analyse Arabic script as an allographic rendering of graphemic abstractions. Grapheme is a term adapted from phonology, analogous to the term phoneme. In phonology, the smallest functional unit of sound is the phoneme; it is not heard but perceived, and what one hears are contextually conditioned allophones. In Arabic orthography, the smallest functional unit of spelling is the grapheme; it is not seen but perceived, and what one sees are contextually conditioned allographs. In our analysis, the letter block is the minimum unit of Arabic script formation and therefore of script grammar. A letter block is a single allograph or a group of fused allographs surrounded by graphic space. The analogy with phonology can be pushed further: the archiphoneme is a bundle of shared features between two or more phonemes, minus their distinctive features; the archigrapheme is the bundle of shared features between two or more graphemes, minus their distinctive features. An archigraphemic letter block consists of one or more reduced allographs between spaces. The letter block follows the base line, and there can be ligatures between letter blocks. In our strategy the archigraphemic letter block also forms the minimum unit of OCR. We have (1) implemented an algorithm that reduces any Unicode text in Arabic script to archigraphemes and used it to create a list, in Unicode format, of all attested unique archigraphemic letter blocks on the internet, and (2) with this list, and applying extended Islamic script grammar, we can synthesize realistic images of all possible archigraphemic fusions in a given style. These two developments make it possible to create an OCR system for recognizing synthetic Arabic under controlled conditions for both basic and extended shaping in a given style. These two steps result in competence, after which the OCR system should be trained to tolerate the variation of performance in real documents. To interpret the identified letter blocks linguistically, a technique for parsing archigraphemes must be developed. For example, the single sequence of the three archigraphemic letter blocks EBD A LLH can be interpreted as several different surface texts such as abda-n li llaahi, abdu l-laahi, and inda l-laahi. To facilitate the linguistic phase of the process, the same list of unique archigraphemic letter blocks is designed to identify the language of the text under scrutiny. In this phase we can present: Islamic script synthesis; Unicode conversion from plene orthography to archigraphemic transliteration; the archigraphemic search algorithm; the list of unique archigraphemic letter blocks; and samples of authentic shape generation. These are the first steps towards …
Citations: 5
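The paper's reduction algorithm and letter-block list are not reproduced in the abstract. The sketch below only illustrates the core idea under simplifying assumptions: collapse Arabic letters that differ solely by their dots onto one skeleton representative, drop vocalisation marks, and close a letter block after each non-joining letter. The grouping table is deliberately partial (letters whose skeletons merge only in some positional forms are left out) and is not Milo and Martínez's actual system.

```python
# Letters in each group differ only by their dots, in every positional form.
DOT_GROUPS = ["بتث", "جحخ", "دذ", "رز", "سش", "صض", "طظ", "عغ"]
ARCHI = {ch: group[0] for group in DOT_GROUPS for ch in group}

# Vocalisation and gemination marks carry no skeletal information.
HARAKAT = set("\u064b\u064c\u064d\u064e\u064f\u0650\u0651\u0652\u0670")
# Letters that never connect to the following letter close a letter block.
NON_JOINING = set("اأإآءدذرزو")

def to_archigraphemes(word):
    """Reduce one Arabic word to a tuple of archigraphemic letter blocks."""
    blocks, current = [], ""
    for ch in word:
        if ch in HARAKAT:
            continue
        current += ARCHI.get(ch, ch)   # collapse dot-only distinctions
        if ch in NON_JOINING:          # non-joining letter ends the block
            blocks.append(current)
            current = ""
    if current:
        blocks.append(current)
    return tuple(blocks)

# to_archigraphemes("عبدالله") -> ("عبد", "ا", "لله"),
# the EBD A LLH letter-block sequence cited in the abstract.
```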
Deep Learning-Based Morphological Taggers and Lemmatizers for Annotating Historical Texts
Helmut Schmid
DOI: 10.1145/3322905.3322915 (published 2019-05-08)
Abstract: Part-of-speech tagging, morphological tagging, and lemmatization of historical texts pose special challenges due to the high spelling variability and the lack of large, high-quality training corpora. Researchers therefore often first map the words to their modern spelling and then annotate with tools trained on modern corpora. We show in this paper that high-quality part-of-speech tagging and lemmatization of historical texts are possible while operating directly on the historical spelling. We use a part-of-speech tagger based on bidirectional long short-term memory networks (LSTMs) [11] with character-based word representations, and lemmatize using an encoder-decoder system with attention. We achieve state-of-the-art results for modern German morphological tagging on the Tiger corpus and also on two historical corpora which have been used in previous work.
Citations: 19
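As a rough sketch of the tagger architecture named in the abstract (character-based word representations feeding a word-level bidirectional LSTM), the PyTorch module below builds one vector per word with a character BiLSTM and predicts one tag per word. All dimensions, layer choices, and names are placeholder assumptions, and the attention-based encoder-decoder lemmatizer is omitted.

```python
import torch
import torch.nn as nn

class CharWordTagger(nn.Module):
    """Character BiLSTM per word, word-level BiLSTM over the sentence,
    linear layer for per-word tag scores. Hyperparameters are placeholders."""

    def __init__(self, n_chars, n_tags, char_dim=64, word_dim=256, hidden=256):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim, padding_idx=0)
        self.char_lstm = nn.LSTM(char_dim, word_dim // 2,
                                 bidirectional=True, batch_first=True)
        self.word_lstm = nn.LSTM(word_dim, hidden // 2,
                                 bidirectional=True, batch_first=True)
        self.out = nn.Linear(hidden, n_tags)

    def forward(self, char_ids):
        # char_ids: (n_words, max_word_len) character indices of one sentence
        chars = self.char_emb(char_ids)                  # (n_words, len, char_dim)
        _, (h, _) = self.char_lstm(chars)                # h: (2, n_words, word_dim/2)
        words = torch.cat([h[0], h[1]], dim=-1)          # one vector per word
        context, _ = self.word_lstm(words.unsqueeze(0))  # (1, n_words, hidden)
        return self.out(context.squeeze(0))              # (n_words, n_tags)
```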
Optical Character Recognition for Coptic fonts: A multi-source approach for scholarly editions
E. Lincke, Kirill Bulert, Marco Büchler
DOI: 10.1145/3322905.3322931 (published 2019-05-08)
Abstract: In this paper we show that the OCR engine Ocropy can be trained for the rather complex and varied typefaces used to set Coptic. For each of the three fonts presented in this paper, we used a number of texts from scholarly editions with different philological and editorial standards, and texts from two different dialects of Coptic (Bohairic and Sahidic). Despite the complexity of the training data, we observed accuracy rates of 97.5%, and for one font even up to 99%.
Citations: 0
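Ocropy is conventionally trained on line-level ground truth, where each line image foo.png has a transcription foo.gt.txt beside it. The helper below is not part of the paper; it merely checks that a mixed training directory (several editions and dialects) is consistently paired before training.

```python
from pathlib import Path

def check_gt_pairs(gt_dir):
    """Return (images without a .gt.txt, transcriptions without a .png)
    for an Ocropy-style ground-truth directory of line pairs."""
    gt_dir = Path(gt_dir)
    images = {p.name[:-len(".png")] for p in gt_dir.glob("*.png")}
    texts = {p.name[:-len(".gt.txt")] for p in gt_dir.glob("*.gt.txt")}
    return sorted(images - texts), sorted(texts - images)

# missing_txt, missing_png = check_gt_pairs("path/to/ground-truth-lines")
```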
Improving OCR of historical newspapers and journals published in Finland
Senka Drobac, Pekka Kauppinen, Krister Lindén
DOI: 10.1145/3322905.3322914 (published 2019-05-08)
Abstract: This paper presents experiments on optical character recognition (OCR) of historical newspapers and journals published in Finland. The corpus has two main languages, Finnish and Swedish, and is written in both Blackletter and Antiqua fonts. We experiment with how much training data is needed to train high-accuracy models, and we try to train a joint model for both languages and all fonts. So far we have not been successful in getting one best model for everything, but it is promising that with the mixed model we get the best results on the Finnish test set, 95% CAR (character accuracy rate), which clearly surpasses previous results on this data set.
Citations: 3
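Reading CAR as character accuracy rate, i.e. one minus the character-level edit distance divided by the length of the ground truth, a straightforward reference implementation looks like the sketch below; it is an illustration, not the project's evaluation code.

```python
def edit_distance(a, b):
    """Character-level Levenshtein distance by dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def character_accuracy(ground_truth, ocr_output):
    """CAR = 1 - edit_distance / len(ground_truth), floored at 0."""
    if not ground_truth:
        return 1.0 if not ocr_output else 0.0
    dist = edit_distance(ground_truth, ocr_output)
    return max(0.0, 1.0 - dist / len(ground_truth))
```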
OCR-D: An end-to-end open source OCR framework for historical printed documents
Clemens Neudecker, Konstantin Baierer, M. Federbusch, Matthias Boenig, Kay-Michael Würzner, Volker Hartmann, Elisa Herrmann
DOI: 10.1145/3322905.3322917 (published 2019-05-08)
Abstract: Various research projects have been concerned with the development and adaptation of OCR methods specifically for historical printed documents (cf. METAe [20], IMPACT [1], eMOP [9]). However, these initiatives ended before the wide adoption of deep neural networks and, despite the various projects' achievements, there remains a lack of OCR software that is (a) comprehensive with regard to the challenges presented by the wide variety of historical documents and (b) available as ready-to-use Free Software. The OCR-D project aims to rectify that. In this paper we introduce the background of OCR-D, discuss the main challenges and shortcomings in the availability of open tools and resources for OCR of historical printed documents, and present the various software modules and related components (repositories, workflows) that are being made available through OCR-D. Finally, we provide an outlook on a number of remaining challenges that are not addressed by OCR-D and point out several examples of the positive community effects arising from the creation and sharing of open resources for historical German OCR.
Citations: 33
A-I-PoCoTo: Combining Automated and Interactive OCR Postcorrection
Tobias Englmeier, F. Fink, K. Schulz
DOI: 10.1145/3322905.3322908 (published 2019-05-08)
Abstract: PoCoTo is known as a web-based interactive tool for the postcorrection of OCR results on historical texts. In this paper we first introduce A-PoCoTo, a fully automated extension of PoCoTo designed for use in large-scale digitization projects. Among other features, A-PoCoTo takes into account the recognition results of several OCR engines on the given input text, and sentence context is used for refining rankings and decisions. Preliminary evaluation results are given. In view of the very high level of accuracy needed for many scholarly applications, it is questionable whether a fully automated process can always meet the standards expected by researchers in Digital Humanities. We describe the architecture of A-I-PoCoTo, a postcorrection system (under development) combining automated postcorrection as a first step and interactive postcorrection as an optional second step. In A-I-PoCoTo, the decisions and correction steps of the automated component are stored in a special protocol. Views offered by the graphical user interface help to efficiently confirm, reject, or improve these decisions as a first step of the manual postcorrection.
Citations: 8
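The abstract does not detail how the engines' outputs are combined or what the protocol stores. The sketch below only illustrates the general shape of such a component: aligned tokens from several OCR engines are treated as correction candidates, a pluggable context scorer ranks them, and every decision is recorded so that an interactive step can later confirm or reject it. All names, and the assumption that the engine outputs are already token-aligned, are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Decision:
    """One automatic correction decision, kept for later interactive review."""
    position: int
    scores: dict              # candidate token -> score from the context scorer
    chosen: str
    confirmed: bool = False   # to be set during interactive postcorrection

def merge_and_rank(engine_outputs, score_in_context):
    """engine_outputs: one token list per OCR engine, assumed already aligned
    so that position i refers to the same word in every engine's output.
    score_in_context(previous_tokens, candidate) -> float is a pluggable
    scorer (lexicon lookup, language model, ...)."""
    corrected, protocol = [], []
    for pos, variants in enumerate(zip(*engine_outputs)):
        scores = {cand: score_in_context(corrected, cand) for cand in set(variants)}
        best = max(scores, key=scores.get)
        corrected.append(best)
        protocol.append(Decision(pos, scores, best))
    return corrected, protocol
```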
Proceedings of the 3rd International Conference on Digital Access to Textual Cultural Heritage
A. Antonacopoulos, K. Schulz
DOI: 10.1145/3322905 (published 2014-05-19)
Abstract: We are delighted to present the program of the first international conference on Digital Access to Textual Cultural Heritage (DATeCH 2014). The aim of establishing this conference is to bring together researchers in the complementary fields of Document Image Analysis and Recognition, Computational Linguistics and Digital Humanities, as well as content holders and practitioners working on the creation, transformation and exploitation of historical documents in digital form. We strongly believe that there are very significant benefits in gathering such a multi-disciplinary group of experts, combining experiences and discussing ways forward for tackling the significant challenges and opportunities presented by historical documents.
Citations: 1