{"title":"Page Layout Analysis of Text-heavy Historical Documents: a Comparison of Textual and Visual Approaches","authors":"Sven Najem-Meyer, Matteo Romanello","doi":"10.48550/arXiv.2212.13924","DOIUrl":"https://doi.org/10.48550/arXiv.2212.13924","url":null,"abstract":"Page layout analysis is a fundamental step in document processing that segments a page into regions of interest. With highly complex layouts and mixed scripts, scholarly commentaries are text-heavy documents which remain challenging for state-of-the-art models. Their layout varies considerably across editions and their most important regions are mainly defined by semantic rather than graphical characteristics such as position or appearance. This setting calls for a comparison between textual, visual and hybrid approaches. We therefore assess the performance of two transformers (LayoutLMv3 and RoBERTa) and an object-detection network (YOLOv5). Although results show a clear advantage in favor of the latter, we also list several caveats to this finding. In addition to our experiments, we release a dataset of ca. 300 annotated pages sampled from 19th-century commentaries.","PeriodicalId":191971,"journal":{"name":"Workshop on Computational Humanities Research","volume":"46 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-12-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133199820","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Boosting Word Frequencies in Authorship Attribution","authors":"Maciej Eder","doi":"10.48550/arXiv.2211.01289","DOIUrl":"https://doi.org/10.48550/arXiv.2211.01289","url":null,"abstract":"In this paper, I introduce a simple method of computing relative word frequencies for authorship attribution and similar stylometric tasks. Rather than computing relative frequencies as the number of occurrences of a given word divided by the total number of tokens in a text, I argue that a more efficient normalization factor is the total number of relevant tokens only. The notion of relevant words includes synonyms and, usually, a few dozen other words that are in some way semantically similar to the word in question. To determine such a semantic background, a word embedding model can be used. The proposed method outperforms classical most-frequent-word approaches substantially, usually by a few percentage points depending on the input settings.","PeriodicalId":191971,"journal":{"name":"Workshop on Computational Humanities Research","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125944291","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Reviewer Preferences and Gender Disparities in Aesthetic Judgments","authors":"I. Lassen, Yuri Bizzoni, Telam Peura, M. Thomsen, K. Nielbo","doi":"10.48550/arXiv.2206.08697","DOIUrl":"https://doi.org/10.48550/arXiv.2206.08697","url":null,"abstract":"Aesthetic preferences are considered highly subjective, resulting in inherently noisy judgements of aesthetic objects, yet certain aspects of aesthetic judgement display convergent trends over time. This paper presents a study that uses literary reviews as a proxy for aesthetic judgement in order to identify systematic components that can be attributed to bias. Specifically, we find that judgement of literary quality in newspapers displays a gender bias in favor of male writers. Male reviewers have a same-gender preference, while female reviewers show an opposite-gender preference. While alternative accounts exist of this apparent gender disparity, we argue that it reflects a cultural gender antagonism.","PeriodicalId":191971,"journal":{"name":"Workshop on Computational Humanities Research","volume":"181 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115079091","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Measuring the Acceleration of the Social Construction of Time using the BOE (Boletin Oficial del Estado)","authors":"E. Fernández, Mirco Schönfeld, J. Pfeffer","doi":"10.5281/ZENODO.4357663","DOIUrl":"https://doi.org/10.5281/ZENODO.4357663","url":null,"abstract":"The Practice of Conceptual History, by Reinhart Koselleck, explores the idea that there is a direct relationship between technological advancements and an acceleration in the social construction of time. This paper will quantify this theory by measuring information density and information variety of narratives in a BOE (Boletín Oficial del Estado) dataset of thirty years (1988-2018). Using Quantitative Narrative Analysis, we will define a narrative unit as a triplet of Subject, Verb, Object (SVO), and we will define information density (ID) as the ratio of narrative units per words per year. Afterwards, we will quantify the different contexts of narratives to measure information variety (IV) by constructing a network of semantic closeness from trained word embeddings. This paper will present an increased IV and ID over the observation time, indicating more and more facts being reported. The results will show evidence of an acceleration of the social construction of time.","PeriodicalId":191971,"journal":{"name":"Workshop on Computational Humanities Research","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122827680","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Cultural Accumulation and Improvement in Online Fan Fiction","authors":"Federico Pianzola, Alberto Acerbi, S. Rebora","doi":"10.31219/osf.io/4wjnm","DOIUrl":"https://doi.org/10.31219/osf.io/4wjnm","url":null,"abstract":"We analyse stories in Harry Potter fan fiction published on Archive of Our Own (AO3), using concepts from cultural evolution. In particular, we focus on cumulative cultural evolution, that is, the idea that cultural systems improve with time, drawing on previous innovations. In this study we examine two features of cumulative culture: accumulation and improvement. First, we show that stories in Harry Potter’s fan fiction accumulate cultural traits—unique tags, in our analysis—through time, both globally and at the level of single stories. Second, more recent stories are also liked more by readers than earlier stories. Our research illustrates the potential of the combination of cultural evolution theory and digital literary studies, and it paves the way for the study of the effects of online digital media on cultural cumulation.","PeriodicalId":191971,"journal":{"name":"Workshop on Computational Humanities Research","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124489937","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Toward a Thermodynamics of Meaning","authors":"Jonathan Scott Enderle","doi":"10.5281/ZENODO.4302259","DOIUrl":"https://doi.org/10.5281/ZENODO.4302259","url":null,"abstract":"As language models such as GPT-3 become increasingly successful at generating realistic text, questions about what purely text-based modeling can learn about the world have become more urgent. Is text purely syntactic, as skeptics argue? Or does it in fact contain some semantic information that a sufficiently sophisticated language model could use to learn about the world without any additional inputs? This paper describes a new model that suggests some qualified answers to those questions. By theorizing the relationship between text and the world it describes as an equilibrium relationship between a thermodynamic system and a much larger reservoir, this paper argues that even very simple language models do learn structural facts about the world, while also proposing relatively precise limits on the nature and extent of those facts. This perspective promises not only to answer questions about what language models actually learn, but also to explain the consistent and surprising success of cooccurrence prediction as a meaning-making strategy in AI.","PeriodicalId":191971,"journal":{"name":"Workshop on Computational Humanities Research","volume":"13 2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129235321","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}