{"title":"Attributing authorship via the perplexity of authorial language models.","authors":"Weihang Huang, Akira Murakami, Jack Grieve","doi":"10.1371/journal.pone.0327081","DOIUrl":null,"url":null,"abstract":"<p><p>Authorship attribution is the task of identifying the most likely author of a questioned document from a set of candidate authors, where each candidate is represented by a writing sample. A wide range of quantitative methods for inferring authorship have been developed in stylometry, but the rise of Large Language Models (LLMs) offers new opportunities in this field. In this paper, we introduce a technique for authorship attribution based on fine-tuned LLMs. Our approach involves first further pretraining LLMs for each candidate author based on their known writings and then assigning the questioned document to the author whose Authorial Language Model (ALM) finds the questioned document most predictable, measured as the perplexity of the questioned document. We find that our approach meets or exceeds the current state-of-the-art on several standard benchmarking datasets. In addition, we show how our approach can be used to measure the predictability of each word in a questioned document for a given candidate ALM, allowing the linguistic patterns that drive our attributions to be inspected directly. Finally, we analyze what types of words generally drive successful attributions, finding that content words classes are characterized by a higher density of authorship information than function word classes, challenging a long-standing assumption of stylometry.</p>","PeriodicalId":20189,"journal":{"name":"PLoS ONE","volume":"20 7","pages":"e0327081"},"PeriodicalIF":2.6000,"publicationDate":"2025-07-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12225838/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"PLoS ONE","FirstCategoryId":"103","ListUrlMain":"https://doi.org/10.1371/journal.pone.0327081","RegionNum":3,"RegionCategory":"综合性期刊","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2025/1/1 0:00:00","PubModel":"eCollection","JCR":"Q1","JCRName":"MULTIDISCIPLINARY SCIENCES","Score":null,"Total":0}
Citations: 0
Abstract
Authorship attribution is the task of identifying the most likely author of a questioned document from a set of candidate authors, where each candidate is represented by a writing sample. A wide range of quantitative methods for inferring authorship has been developed in stylometry, but the rise of Large Language Models (LLMs) offers new opportunities in this field. In this paper, we introduce a technique for authorship attribution based on fine-tuned LLMs. Our approach involves first further pretraining LLMs for each candidate author based on their known writings and then assigning the questioned document to the author whose Authorial Language Model (ALM) finds the questioned document most predictable, measured as the perplexity of the questioned document. We find that our approach meets or exceeds the current state of the art on several standard benchmarking datasets. In addition, we show how our approach can be used to measure the predictability of each word in a questioned document for a given candidate ALM, allowing the linguistic patterns that drive our attributions to be inspected directly. Finally, we analyze what types of words generally drive successful attributions, finding that content word classes are characterized by a higher density of authorship information than function word classes, challenging a long-standing assumption of stylometry.
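The attribution rule the abstract describes reduces to an argmin over per-author perplexities, where the perplexity of a document is the exponential of its mean per-token negative log-likelihood under a candidate's ALM. The sketch below shows how that scoring step might look with the Hugging Face transformers library; it is a minimal illustration assuming the per-author further pretraining has already been done, and the function names and `alm_paths` mapping are hypothetical placeholders rather than the authors' actual code.

```python
# Minimal sketch of perplexity-based attribution, assuming each candidate's
# ALM has already been further pretrained on that author's known writings
# and saved to disk. Paths and names here are illustrative, not from the paper.
import math

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def perplexity(model, tokenizer, text):
    """Perplexity of `text` under a causal LM: exp of the mean negative
    log-likelihood of its tokens."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # Passing the input ids as labels makes the model return the mean
        # next-token cross-entropy as `loss`.
        out = model(**enc, labels=enc["input_ids"])
    return math.exp(out.loss.item())


def token_surprisals(model, tokenizer, text):
    """Per-token surprisal (negative log-probability), for inspecting
    which tokens drive an attribution."""
    enc = tokenizer(text, return_tensors="pt")
    ids = enc["input_ids"]
    with torch.no_grad():
        logits = model(**enc).logits
    # Position t predicts token t+1, so align logits[:, :-1] with ids[:, 1:].
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    picked = log_probs.gather(2, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    tokens = tokenizer.convert_ids_to_tokens(ids[0])[1:]
    return list(zip(tokens, (-picked[0]).tolist()))


def attribute(questioned_doc, alm_paths):
    """Assign the document to the author whose ALM gives it the lowest
    perplexity. `alm_paths` maps author name -> saved ALM directory."""
    scores = {}
    for author, path in alm_paths.items():
        tokenizer = AutoTokenizer.from_pretrained(path)
        model = AutoModelForCausalLM.from_pretrained(path).eval()
        scores[author] = perplexity(model, tokenizer, questioned_doc)
    return min(scores, key=scores.get)
```

Two practical caveats for a real implementation: a questioned document longer than the model's context window would need chunked or sliding-window scoring, and the surprisals above are over subword tokens, which would have to be aggregated back to words for the kind of word-level inspection the abstract describes.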
About the journal:
PLOS ONE is an international, peer-reviewed, open-access, online publication. PLOS ONE welcomes reports on primary research from any scientific discipline. It provides:
* Open access: freely accessible online; authors retain copyright
* Fast publication times
* Peer review by expert, practicing researchers
* Post-publication tools to indicate quality and impact
* Community-based dialogue on articles
* Worldwide media coverage