Document dissimilarity within and across languages: A benchmarking study

R. Forsyth, S. Sharoff
{"title":"Document dissimilarity within and across languages: A benchmarking study","authors":"R. Forsyth, S. Sharoff","doi":"10.1093/LLC/FQT002","DOIUrl":null,"url":null,"abstract":"Quantifying the similarity or dissimilarity between documents is an important task in authorship attribution, information retrieval, plagiarism detection, text mining, and many other areas of linguistic computing. Numerous similarity indices have been devised and used, but relatively little attention has been paid to calibrating such indices against externally imposed standards, mainly because of the difficulty of establishing agreed reference levels of inter-text similarity. The present article introduces a multi-register corpus gathered for this purpose, in which each text has been located in a similarity space based on ratings by human readers. This provides a resource for testing similarity measures derived from computational text-processing against reference levels derived from human judgement, i.e. external to the texts themselves. We describe the results of a benchmarking study in five different languages in which some widely used meas- ures perform comparatively poorly. In particular, several alternative correlational measures (Pearson r, Spearman rho, tetrachoric correlation) consistently outper- form cosine similarity on our data. A method of using what we call 'anchor texts' to extend this method from monolingual inter-text similarity-scoring to inter-text similarity-scoring across languages is also proposed and tested.","PeriodicalId":235034,"journal":{"name":"Lit. Linguistic Comput.","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2014-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"35","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Lit. Linguistic Comput.","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1093/LLC/FQT002","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 35

Abstract

Quantifying the similarity or dissimilarity between documents is an important task in authorship attribution, information retrieval, plagiarism detection, text mining, and many other areas of linguistic computing. Numerous similarity indices have been devised and used, but relatively little attention has been paid to calibrating such indices against externally imposed standards, mainly because of the difficulty of establishing agreed reference levels of inter-text similarity. The present article introduces a multi-register corpus gathered for this purpose, in which each text has been located in a similarity space based on ratings by human readers. This provides a resource for testing similarity measures derived from computational text-processing against reference levels derived from human judgement, i.e. external to the texts themselves. We describe the results of a benchmarking study in five different languages in which some widely used measures perform comparatively poorly. In particular, several alternative correlational measures (Pearson r, Spearman rho, tetrachoric correlation) consistently outperform cosine similarity on our data. A method of using what we call 'anchor texts' to extend this method from monolingual inter-text similarity-scoring to inter-text similarity-scoring across languages is also proposed and tested.
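As a concrete illustration of the measures the abstract names, the sketch below contrasts cosine similarity with Pearson r and Spearman rho on toy bag-of-words vectors, and notes in comments one way the 'anchor texts' idea can carry similarity scoring across languages. This is a minimal sketch, not the authors' benchmark pipeline: the toy documents and helper names are invented, and the anchor mechanism shown is our reading of the abstract.

```python
# Hedged illustration -- NOT the authors' benchmark code. Tetrachoric
# correlation is omitted: it applies to presence/absence versions of the
# vectors and has no standard scipy implementation.
import numpy as np
from scipy.stats import pearsonr, spearmanr

def bow_vectors(docs):
    """Word-frequency vectors over the shared vocabulary of all docs."""
    vocab = sorted({w for d in docs for w in d.split()})
    index = {w: i for i, w in enumerate(vocab)}
    X = np.zeros((len(docs), len(vocab)))
    for row, doc in enumerate(docs):
        for w in doc.split():
            X[row, index[w]] += 1.0
    return X

def cosine(u, v):
    """Cosine of the angle between two frequency vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

docs = [
    "the cat sat on the mat",
    "the cat lay on the rug",
    "stock prices fell sharply in early trading",
]
X = bow_vectors(docs)

# Pearson r is cosine similarity applied to mean-centred vectors, so it
# discounts each text's overall frequency level; Spearman rho correlates
# ranks instead. Either change can matter when frequency profiles are
# dominated by a few very common words.
print("cosine   :", cosine(X[0], X[1]))
print("pearson  :", pearsonr(X[0], X[1])[0])
print("spearman :", spearmanr(X[0], X[1])[0])

# Cross-language extension (our reading of the abstract, not a verbatim
# reimplementation): fix a set of anchor texts available in both languages,
# describe every text by its vector of similarities to the anchors, and
# compare those profiles across languages:
#   p_en = [measure(t_en, a) for a in anchors_en]
#   p_de = [measure(t_de, a) for a in anchors_de]
#   cross_lingual_similarity = pearsonr(p_en, p_de)[0]
```

On toy data like this the three monolingual scores often agree; the benchmark's point is that, calibrated against human-derived reference levels, they can diverge, with the correlational measures coming out ahead.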