Forensic authorship profiling using geolocated social media data: A corpus linguistic and cartographic approach
Author: Dana Roemling
Journal: Applied Corpus Linguistics 5(3), Article 100146
Published: 2025-08-24
Publication type: Journal Article
DOI: 10.1016/j.acorp.2025.100146
URL: https://www.sciencedirect.com/science/article/pii/S2666799125000292
Citations: 0
Abstract
This paper explores the use of corpus-based methods for regional authorship profiling in forensic linguistics. Traditional approaches depend on linguistic expertise to identify regional markers, but this has limitations: it relies on an analyst’s intuition and potentially outdated dialect resources. Furthermore, traditional dialectology typically does not support word frequency analysis.
This study argues for the use of large, geolocated datasets to modernise regional authorship profiling. Unlike traditional dialect atlases, corpora provide access to contemporary, naturally occurring data, allowing for nuanced frequency analyses. Spatial statistics, such as Moran’s I, and tools like R allow for the rapid visualisation of regional linguistic patterns, enhancing both analysis and communication in legal contexts.
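For readers unfamiliar with the statistic, the standard textbook definition of global Moran's I is given here for orientation. The notation is an assumption of this note, not taken from the paper: x_i is the relative frequency of a word in region i, x̄ is its mean over all N regions, and w_ij is a spatial weight expressing how close regions i and j are. The abstract does not say which weighting scheme the study uses.

```latex
I = \frac{N}{\sum_{i=1}^{N}\sum_{j=1}^{N} w_{ij}}
    \cdot
    \frac{\sum_{i=1}^{N}\sum_{j=1}^{N} w_{ij}\,(x_i - \bar{x})(x_j - \bar{x})}
         {\sum_{i=1}^{N} (x_i - \bar{x})^2}
```

Values near +1 indicate that similar frequencies cluster in neighbouring regions, values near 0 indicate no spatial pattern, and negative values indicate that neighbouring regions tend to differ.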
Using a case study based on a corpus of 15 million social media posts, this paper demonstrates the advantages of corpus-based methods in regional authorship profiling. It finds that for the 10,000 most frequent words in the dataset, Moran’s I values ranged from 0.071 to 0.768 (mean = 0.329), with strongly regional terms such as etz (“now”; I = 0.739) and guad (“good”; I = 0.511) showing clear spatial clustering. This data-driven, spatial statistical approach enables the extraction of regional markers without relying on expert intuition. Consequently, the approach provides a more objective and scalable method for identifying regional language patterns, enhancing forensic casework while also reducing the reliance on potentially outdated dialect resources.
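As a rough illustration of how a global Moran's I value like those reported above is obtained, the following minimal Python sketch computes the statistic for toy data. It is not the author's pipeline (the abstract mentions R). The six-region row layout, the binary neighbour weights, the frequency values, and the function names `morans_i` and `chain_weights` are all hypothetical choices made for this example.

```python
def morans_i(values, weights):
    """Global Moran's I for per-region values and a square weights matrix."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]          # deviations from the mean
    total_weight = sum(sum(row) for row in weights)
    # Weighted cross-products of deviations over all region pairs
    cross = sum(
        weights[i][j] * dev[i] * dev[j]
        for i in range(n)
        for j in range(n)
    )
    squared = sum(d * d for d in dev)
    return (n / total_weight) * (cross / squared)


def chain_weights(n):
    """Binary weights for n regions in a row: 1 if adjacent, else 0.

    A hypothetical layout for illustration; real analyses derive weights
    from map geometry (shared borders or distances).
    """
    return [[1.0 if abs(i - j) == 1 else 0.0 for j in range(n)] for i in range(n)]


# Hypothetical relative frequencies of one word across six adjacent regions
clustered = [0.9, 0.8, 0.7, 0.2, 0.1, 0.0]    # high values sit together
alternating = [0.9, 0.0, 0.9, 0.0, 0.9, 0.0]  # neighbours always differ

w = chain_weights(6)
print("clustered:  ", morans_i(clustered, w))
print("alternating:", morans_i(alternating, w))
```

The clustered word yields a clearly positive value, the pattern a regionally marked form such as etz would be expected to show, while the alternating word yields a negative value. Ranking many words by this statistic is one way to surface candidate regional markers without relying on intuition.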