Boosting Word Frequencies in Authorship Attribution

Workshop on Computational Humanities Research, Pub Date: 2022-11-03, DOI: 10.48550/arXiv.2211.01289

Maciej Eder

Citations: 2

Abstract

In this paper, I introduce a simple method of computing relative word frequencies for authorship attribution and similar stylometric tasks. Rather than computing relative frequencies as the number of occurrences of a given word divided by the total number of tokens in a text, I argue that a more efficient normalization factor is the total number of relevant tokens only. The notion of relevant words includes synonyms and, usually, a few dozen other words that are in some way semantically similar to the word in question. To determine such a semantic background, a word embedding model can be used. The proposed method outperforms classical most-frequent-word approaches substantially, usually by a few percentage points depending on the input settings.
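The contrast between the two normalizations can be illustrated with a minimal Python sketch. It is not the author's implementation. The function names, the toy text, and the hand-written `neighbors` mapping are all hypothetical; in practice the neighbors would come from a word embedding model. The sketch also assumes that the target word itself counts among the relevant tokens in the denominator, which the abstract does not state explicitly.

```python
from collections import Counter


def classic_relative_freq(tokens, word):
    """Occurrences of `word` divided by the total number of tokens."""
    counts = Counter(tokens)
    return counts[word] / len(tokens) if tokens else 0.0


def boosted_relative_freq(tokens, word, neighbors):
    """Occurrences of `word` divided by the number of relevant tokens only.

    Assumption: the relevant tokens are the word itself plus its
    semantic neighbors, given here as a plain dict (hypothetical).
    """
    counts = Counter(tokens)
    relevant = {word, *neighbors.get(word, ())}
    denominator = sum(counts[w] for w in relevant)
    return counts[word] / denominator if denominator else 0.0


# Toy text and a hand-written neighbor list, for illustration only.
tokens = "the man saw a woman and the man greeted the person".split()
neighbors = {"man": ["woman", "person", "boy"]}

classic = classic_relative_freq(tokens, "man")             # 2 of 11 tokens
boosted = boosted_relative_freq(tokens, "man", neighbors)  # 2 of 4 relevant tokens
print(f"classic: {classic:.3f}, boosted: {boosted:.3f}")
```

Here the word "man" accounts for 2 of 11 tokens under the classical normalization, but for 2 of the 4 tokens within its semantic neighborhood, so the boosted value reflects the choice of "man" over its near-synonyms rather than its share of the whole text.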

