Proceedings of the 3rd International Conference on Digital Access to Textual Cultural Heritage: Latest Publications

Towards a Higher Accuracy of Optical Character Recognition of Chinese Rare Books in Making Use of Text Model
Hsiang-An Wang, Pin-Ting Liu
DOI: 10.1145/3322905.3322922 (published 2019-05-08)
Abstract: The legibility of the text of rare books is often subject to precarious conditions: natural decay, erosion, and ink bleed caused by flawed printing methods centuries ago often make such texts difficult to recognize. This difficulty compounds the challenge for optical character recognition (OCR), whose task is to convert images of printed text into machine-encoded text once a rare book has been digitized. To reduce OCR errors for rare books, this research applies N-gram, long short-term memory (LSTM), and backward and forward N-gram (BF N-gram) statistical text models, trained on substantial amounts of text, to develop a more accurate OCR pipeline. We build N-gram, LSTM, and BF N-gram models at varying character lengths and experiment with different quantities of text to find the configuration that gives the best character recognition performance. Once the best-performing text model is identified, further experiments determine the most appropriate time and method for correcting OCR errors with the aid of the text model. Our experiments suggest that correction with the text model yields more accurate OCR results than relying on the OCR models alone.
Citations: 1
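The abstract does not include the authors' implementation. As a rough illustration of how a character N-gram text model can re-rank OCR output, the sketch below counts character trigrams on a reference corpus and greedily keeps, at each position, the OCR candidate the model prefers in its left context; the function names, the greedy strategy, and the smoothing constant are illustrative assumptions, not the paper's method.

```python
from collections import defaultdict

def train_char_ngrams(corpus_lines, n=3):
    """Count character n-grams and their (n-1)-character histories."""
    grams, hists = defaultdict(int), defaultdict(int)
    for line in corpus_lines:
        padded = "^" * (n - 1) + line
        for i in range(len(line)):
            grams[padded[i:i + n]] += 1
            hists[padded[i:i + n - 1]] += 1
    return grams, hists, n

def cond_prob(model, history, ch, vocab_size=8000):
    """Add-one smoothed P(ch | last n-1 characters of history).
    vocab_size is a nominal alphabet size, a placeholder assumption."""
    grams, hists, n = model
    h = ("^" * (n - 1) + history)[-(n - 1):]
    return (grams[h + ch] + 1) / (hists[h] + vocab_size)

def correct_line(model, candidates_per_position):
    """candidates_per_position: for each character position, the OCR engine's
    candidate characters. Greedily keep the candidate that the text model
    considers most probable given the characters chosen so far."""
    out = ""
    for cands in candidates_per_position:
        out += max(cands, key=lambda c: cond_prob(model, out, c))
    return out
```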
A New Strategy for Arabic OCR: Archigraphemes, Letter Blocks, Script Grammar, and Shape Synthesis
Thomas Milo, A. Martínez
DOI: 10.1145/3322905.3322928 (published 2019-05-08)
Abstract: Current OCR has limited capability for Arabic because its script models lack a scientific basis. We propose a new OCR strategy for Arabic based on (1) Islamic script grammar, including extended shaping, and (2) treating Arabic script as a multi-layered writing system. We analyse Arabic script as an allographic rendering of graphemic abstractions. Grapheme is a term adapted from phonology, analogous to the term phoneme. In phonology, the smallest functional unit of sound is the phoneme; it is not heard but perceived, and what one hears are contextually conditioned allophones. In Arabic orthography, the smallest functional unit of spelling is the grapheme; it is not seen but perceived, and what one sees are contextually conditioned allographs. In our analysis, the letter block is the minimum unit of Arabic script formation and therefore of script grammar. A letter block is a single allograph or a group of fused allographs surrounded by graphic space. The analogy with phonology can be pushed further: the archiphoneme is a bundle of shared features between two or more phonemes, minus their distinctive features; the archigrapheme is the bundle of shared features between two or more graphemes, minus their distinctive features. An archigraphemic letter block consists of one or more reduced allographs between spaces. The letter block follows the base line, and there can be ligatures between letter blocks. In our strategy the archigraphemic letter block also forms the minimum unit of OCR. We have (1) implemented an algorithm that reduces any Unicode text in Arabic script to archigraphemes and used it to create a list, in Unicode format, of all attested unique archigraphemic letter blocks on the internet, and (2) with this list, and applying extended Islamic script grammar, we can synthesize realistic images of all possible archigraphemic fusions in a given style. These two developments make it possible to create an OCR system for recognizing synthetic Arabic under controlled conditions for both basic and extended shaping in a given style. These two steps result in competence, after which the OCR system should be trained to tolerate the variation of performance in real documents. To interpret the identified letter blocks linguistically, a technique for parsing archigraphemes must be developed. For example, the single sequence of the three archigraphemic letter blocks EBD A LLH can be interpreted as several different surface texts such as abda-n li llaahi, abdu l-laahi, and inda l-laahi. To facilitate the linguistic phase of the process, the same list of unique archigraphemic letter blocks is designed to identify the language of the text under scrutiny. In this phase we can present: Islamic script synthesis; Unicode conversion from plene orthography to archigraphemic transliteration; the archigraphemic search algorithm; the list of unique archigraphemic letter blocks; and samples of authentic shape generation. These are the first steps towards …
Citations: 5
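The paper's reduction algorithm and letter-block list are not reproduced in the abstract. The sketch below only illustrates the core idea under simplifying assumptions: collapse Arabic letters that differ solely by their dots onto one skeleton representative, drop vocalisation marks, and close a letter block after each non-joining letter. The grouping table is deliberately partial (letters whose skeletons merge only in some positional forms are left out) and is not Milo and Martínez's actual system.

```python
# Letters in each group differ only by their dots, in every positional form.
DOT_GROUPS = ["بتث", "جحخ", "دذ", "رز", "سش", "صض", "طظ", "عغ"]
ARCHI = {ch: group[0] for group in DOT_GROUPS for ch in group}

# Vocalisation and gemination marks carry no skeletal information.
HARAKAT = set("\u064b\u064c\u064d\u064e\u064f\u0650\u0651\u0652\u0670")
# Letters that never connect to the following letter close a letter block.
NON_JOINING = set("اأإآءدذرزو")

def to_archigraphemes(word):
    """Reduce one Arabic word to a tuple of archigraphemic letter blocks."""
    blocks, current = [], ""
    for ch in word:
        if ch in HARAKAT:
            continue
        current += ARCHI.get(ch, ch)   # collapse dot-only distinctions
        if ch in NON_JOINING:          # non-joining letter ends the block
            blocks.append(current)
            current = ""
    if current:
        blocks.append(current)
    return tuple(blocks)

# to_archigraphemes("عبدالله") -> ("عبد", "ا", "لله"),
# the EBD A LLH letter-block sequence cited in the abstract.
```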
Deep Learning-Based Morphological Taggers and Lemmatizers for Annotating Historical Texts
Helmut Schmid
DOI: 10.1145/3322905.3322915 (published 2019-05-08)
Abstract: Part-of-speech tagging, morphological tagging, and lemmatization of historical texts pose special challenges due to the high spelling variability and the lack of large, high-quality training corpora. Researchers therefore often first map the words to their modern spelling and then annotate with tools trained on modern corpora. We show in this paper that high-quality part-of-speech tagging and lemmatization of historical texts are possible while operating directly on the historical spelling. We use a part-of-speech tagger based on bidirectional long short-term memory networks (LSTMs) [11] with character-based word representations, and lemmatize using an encoder-decoder system with attention. We achieve state-of-the-art results for modern German morphological tagging on the Tiger corpus and also on two historical corpora which have been used in previous work.
Citations: 19
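As a rough sketch of the tagger architecture named in the abstract (character-based word representations feeding a word-level bidirectional LSTM), the PyTorch module below builds one vector per word with a character BiLSTM and predicts one tag per word. All dimensions, layer choices, and names are placeholder assumptions, and the attention-based encoder-decoder lemmatizer is omitted.

```python
import torch
import torch.nn as nn

class CharWordTagger(nn.Module):
    """Character BiLSTM per word, word-level BiLSTM over the sentence,
    linear layer for per-word tag scores. Hyperparameters are placeholders."""

    def __init__(self, n_chars, n_tags, char_dim=64, word_dim=256, hidden=256):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim, padding_idx=0)
        self.char_lstm = nn.LSTM(char_dim, word_dim // 2,
                                 bidirectional=True, batch_first=True)
        self.word_lstm = nn.LSTM(word_dim, hidden // 2,
                                 bidirectional=True, batch_first=True)
        self.out = nn.Linear(hidden, n_tags)

    def forward(self, char_ids):
        # char_ids: (n_words, max_word_len) character indices of one sentence
        chars = self.char_emb(char_ids)                  # (n_words, len, char_dim)
        _, (h, _) = self.char_lstm(chars)                # h: (2, n_words, word_dim/2)
        words = torch.cat([h[0], h[1]], dim=-1)          # one vector per word
        context, _ = self.word_lstm(words.unsqueeze(0))  # (1, n_words, hidden)
        return self.out(context.squeeze(0))              # (n_words, n_tags)
```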
Optical Character Recognition for Coptic fonts: A multi-source approach for scholarly editions
E. Lincke, Kirill Bulert, Marco Büchler
DOI: 10.1145/3322905.3322931 (published 2019-05-08)
Abstract: In this paper we show that the OCR engine Ocropy can be trained for the rather complex and varied typefaces used to set Coptic. For each of the three fonts presented in this paper, we used a number of texts from scholarly editions with different philological and editorial standards, and texts from two different dialects of Coptic (Bohairic and Sahidic). Despite the complexity of the training data, we observed accuracy rates of 97.5%, and for one font even up to 99%.
Citations: 0
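Ocropy is conventionally trained on line-level ground truth, where each line image foo.png has a transcription foo.gt.txt beside it. The helper below is not part of the paper; it merely checks that a mixed training directory (several editions and dialects) is consistently paired before training.

```python
from pathlib import Path

def check_gt_pairs(gt_dir):
    """Return (images without a .gt.txt, transcriptions without a .png)
    for an Ocropy-style ground-truth directory of line pairs."""
    gt_dir = Path(gt_dir)
    images = {p.name[:-len(".png")] for p in gt_dir.glob("*.png")}
    texts = {p.name[:-len(".gt.txt")] for p in gt_dir.glob("*.gt.txt")}
    return sorted(images - texts), sorted(texts - images)

# missing_txt, missing_png = check_gt_pairs("path/to/ground-truth-lines")
```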
Improving OCR of historical newspapers and journals published in Finland
Senka Drobac, Pekka Kauppinen, Krister Lindén
DOI: 10.1145/3322905.3322914 (published 2019-05-08)
Abstract: This paper presents experiments on optical character recognition (OCR) of historical newspapers and journals published in Finland. The corpus has two main languages, Finnish and Swedish, and is written in both Blackletter and Antiqua fonts. We experiment with how much training data is needed to train high-accuracy models, and we try to train a joint model for both languages and all fonts. So far we have not been successful in getting one best model for everything, but it is promising that with the mixed model we get the best results on the Finnish test set, 95% CAR (character accuracy rate), which clearly surpasses previous results on this data set.
Citations: 3
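Reading CAR as character accuracy rate, i.e. one minus the character-level edit distance divided by the length of the ground truth, a straightforward reference implementation looks like the sketch below; it is an illustration, not the project's evaluation code.

```python
def edit_distance(a, b):
    """Character-level Levenshtein distance by dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def character_accuracy(ground_truth, ocr_output):
    """CAR = 1 - edit_distance / len(ground_truth), floored at 0."""
    if not ground_truth:
        return 1.0 if not ocr_output else 0.0
    dist = edit_distance(ground_truth, ocr_output)
    return max(0.0, 1.0 - dist / len(ground_truth))
```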
OCR-D: An end-to-end open source OCR framework for historical printed documents
Clemens Neudecker, Konstantin Baierer, M. Federbusch, Matthias Boenig, Kay-Michael Würzner, Volker Hartmann, Elisa Herrmann
DOI: 10.1145/3322905.3322917 (published 2019-05-08)
Abstract: Various research projects have been concerned with the development and adaptation of OCR methods specifically for historical printed documents (cf. METAe [20], IMPACT [1], eMOP [9]). However, these initiatives ended before the wide adoption of deep neural networks and, despite the various projects' achievements, there remains a lack of OCR software that is (a) comprehensive with regard to the challenges presented by the wide variety of historical documents and (b) available as ready-to-use Free Software. The OCR-D project aims to rectify that. In this paper we introduce the background of OCR-D, discuss the main challenges and shortcomings in the availability of open tools and resources for OCR of historical printed documents, and present the various software modules and related components (repositories, workflows) that are being made available through OCR-D. Finally, we provide an outlook on a number of remaining challenges that are not addressed by OCR-D and point out several examples of the positive community effects arising from the creation and sharing of open resources for historical German OCR.
Citations: 33
A-I-PoCoTo: Combining Automated and Interactive OCR Postcorrection
Tobias Englmeier, F. Fink, K. Schulz
DOI: 10.1145/3322905.3322908 (published 2019-05-08)
Abstract: PoCoTo is known as a web-based interactive tool for the postcorrection of OCR results on historical texts. In this paper we first introduce A-PoCoTo, a fully automated extension of PoCoTo designed for use in large-scale digitization projects. Among other features, A-PoCoTo takes into account the recognition results of several OCR engines on the given input text, and sentence context is used for refining rankings and decisions. Preliminary evaluation results are given. In view of the very high level of accuracy needed for many scholarly applications, it is questionable whether a fully automated process can always meet the standards expected by researchers in Digital Humanities. We describe the architecture of A-I-PoCoTo, a postcorrection system (under development) combining automated postcorrection as a first step and interactive postcorrection as an optional second step. In A-I-PoCoTo, the decisions and correction steps of the automated component are stored in a special protocol. Views offered by the graphical user interface help to efficiently confirm, reject, or improve these decisions as a first step of the manual postcorrection.
Citations: 8
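The abstract does not detail how the engines' outputs are combined or what the protocol stores. The sketch below only illustrates the general shape of such a component: aligned tokens from several OCR engines are treated as correction candidates, a pluggable context scorer ranks them, and every decision is recorded so that an interactive step can later confirm or reject it. All names, and the assumption that the engine outputs are already token-aligned, are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Decision:
    """One automatic correction decision, kept for later interactive review."""
    position: int
    scores: dict              # candidate token -> score from the context scorer
    chosen: str
    confirmed: bool = False   # to be set during interactive postcorrection

def merge_and_rank(engine_outputs, score_in_context):
    """engine_outputs: one token list per OCR engine, assumed already aligned
    so that position i refers to the same word in every engine's output.
    score_in_context(previous_tokens, candidate) -> float is a pluggable
    scorer (lexicon lookup, language model, ...)."""
    corrected, protocol = [], []
    for pos, variants in enumerate(zip(*engine_outputs)):
        scores = {cand: score_in_context(corrected, cand) for cand in set(variants)}
        best = max(scores, key=scores.get)
        corrected.append(best)
        protocol.append(Decision(pos, scores, best))
    return corrected, protocol
```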
Proceedings of the 3rd International Conference on Digital Access to Textual Cultural Heritage
A. Antonacopoulos, K. Schulz
DOI: 10.1145/3322905 (published 2014-05-19)
Abstract: We are delighted to present the program of the first international conference on Digital Access to Textual Cultural Heritage (DATeCH 2014). The aim of establishing this conference is to bring together researchers in the complementary fields of Document Image Analysis and Recognition, Computational Linguistics and Digital Humanities, as well as content holders and practitioners working on the creation, transformation and exploitation of historical documents in digital form. We strongly believe that there are very significant benefits in gathering such a multi-disciplinary group of experts, combining experiences and discussing ways forward for tackling the significant challenges and opportunities presented by historical documents.
Citations: 1