Attributing Authorship in the Noisy Digitized Correspondence of Jacob and Wilhelm Grimm

G. Franzini, M. Kestemont, Gabriela Rotari, Melina Jander, Jeremi K. Ochab, E. Franzini, Joanna Byszuk, Jan Rybicki
Frontiers Digit. Humanit., published 2018-04-05. DOI: 10.3389/fdigh.2018.00004 (https://doi.org/10.3389/fdigh.2018.00004)
Citations: 21

Abstract

This article presents the results of a multidisciplinary project aimed at better understanding the impact of different digitization strategies in computational text analysis. More specifically, it describes an effort to automatically discern the authorship of Jacob and Wilhelm Grimm in a body of uncorrected correspondence processed by HTR (Handwritten Text Recognition) and OCR (Optical Character Recognition), reporting on the effect this noise has on the analyses necessary to computationally identify the different writing styles of the two brothers. In summary, our findings show that OCR digitization serves as a reliable proxy for the more painstaking process of manual digitization, at least when it comes to authorship attribution. Our results suggest that attribution is viable even when using training and test sets from different digitization pipelines. With regard to HTR, this research demonstrates that even though automated transcription significantly increases the risk of text misclassification when compared to OCR, a cleanliness above ≈ 20% is already sufficient to achieve a higher-than-chance probability of correct binary attribution.
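The idea of binary attribution on noisy transcriptions can be illustrated with a minimal sketch. It is not the method used in the article, whose features and classifier are not given in the abstract. It assumes character trigram frequencies as style features, cosine similarity to one profile per candidate author, and a hypothetical noise model in which each character is replaced by a random letter with probability `error_rate` (so cleanliness is roughly `1 - error_rate`). All function names and the toy texts are illustrative assumptions.

```python
import random
from collections import Counter
from math import sqrt


def char_ngrams(text, n=3):
    """Return relative frequencies of character n-grams in text."""
    text = text.lower()
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()} if total else {}


def cosine(p, q):
    """Cosine similarity between two sparse frequency profiles."""
    dot = sum(v * q.get(gram, 0.0) for gram, v in p.items())
    norm_p = sqrt(sum(v * v for v in p.values()))
    norm_q = sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0


def add_noise(text, error_rate, rng):
    """Hypothetical noise model: swap each character for a random letter
    with probability error_rate (a crude stand-in for OCR/HTR errors)."""
    alphabet = "abcdefghijklmnopqrstuvwxyz"
    return "".join(
        rng.choice(alphabet) if rng.random() < error_rate else ch
        for ch in text
    )


def attribute(text, profiles, n=3):
    """Return the candidate author whose profile is most similar to text."""
    query = char_ngrams(text, n)
    return max(profiles, key=lambda author: cosine(query, profiles[author]))


if __name__ == "__main__":
    # Toy placeholder texts, not real letters by either brother.
    train = {
        "author_a": "the quick brown fox jumps over the lazy dog again and again",
        "author_b": "lorem ipsum dolor sit amet consectetur adipiscing elit sed do",
    }
    profiles = {author: char_ngrams(t) for author, t in train.items()}
    rng = random.Random(0)
    noisy = add_noise(train["author_a"], 0.3, rng)
    print(noisy, "->", attribute(noisy, profiles))
```

Repeating the last step over many texts and a range of `error_rate` values would give a rough accuracy-versus-cleanliness curve for this toy setup. Such a curve says nothing about the thresholds reported in the article.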