Boosting Word Frequencies in Authorship Attribution

Workshop on Computational Humanities Research · Pub Date: 2022-11-03 · DOI: 10.48550/arXiv.2211.01289

Maciej Eder

Citations: 2

Abstract

Boosting Word Frequencies in Authorship Attribution

In this paper, I introduce a simple method of computing relative word frequencies for authorship attribution and similar stylometric tasks. Rather than computing relative frequencies as the number of occurrences of a given word divided by the total number of tokens in a text, I argue that a more efficient normalization factor is the total number of relevant tokens only. The notion of relevant words includes synonyms and, usually, a few dozen other words that are in some way semantically similar to the word in question. To determine such a semantic background, a word embedding model can be used. The proposed method substantially outperforms classical most-frequent-word approaches, usually by a few percentage points depending on the input settings.
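The difference between the two normalizations described in the abstract can be illustrated with a minimal Python sketch. It is not the author's implementation. It assumes that the "relevant tokens" for a word are the word itself plus its semantic neighbors, and it uses a hand-written, hypothetical `neighbors` dictionary in place of the nearest-neighbor lists that a word embedding model would supply. The function names are likewise illustrative.

```python
from collections import Counter


def classic_relative_frequency(tokens, word):
    """Occurrences of `word` divided by the total number of tokens."""
    if not tokens:
        return 0.0
    return Counter(tokens)[word] / len(tokens)


def boosted_relative_frequency(tokens, word, neighbors):
    """Occurrences of `word` divided by the count of relevant tokens only.

    Assumption: the relevant tokens are the word itself plus the words
    listed for it in `neighbors` (a stand-in for embedding neighbors).
    """
    counts = Counter(tokens)
    relevant = {word} | set(neighbors.get(word, []))
    denominator = sum(counts[w] for w in relevant)
    return counts[word] / denominator if denominator else 0.0


# Hypothetical neighbor list; in practice this would come from a
# word embedding model rather than being written by hand.
neighbors = {"big": ["large", "huge"]}
tokens = "the big dog saw a large cat and a big huge bird".split()

classic = classic_relative_frequency(tokens, "big")
boosted = boosted_relative_frequency(tokens, "big", neighbors)
print(f"classic: {classic:.3f}, boosted: {boosted:.3f}")
```

In this toy text, "big" occurs twice among 12 tokens, so the classic frequency is 2/12. Among the four relevant tokens ("big" twice, "large" once, "huge" once) it accounts for half, so the boosted frequency is 0.5. The boosted value reflects how often the writer chooses "big" over its near-synonyms rather than how often the word appears overall.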

Source journal

Workshop on Computational Humanities Research

Self-citation rate

0.00%

Number of publications