Forensic authorship profiling using geolocated social media data: A corpus linguistic and cartographic approach
Dana Roemling. Applied Corpus Linguistics, Volume 5, Issue 3, Article 100146. Published 2025-08-24. DOI: 10.1016/j.acorp.2025.100146. https://www.sciencedirect.com/science/article/pii/S2666799125000292
This paper explores the use of corpus-based methods for regional authorship profiling in forensic linguistics. Traditional approaches depend on linguistic expertise to identify regional markers, but this has limitations: it relies on an analyst’s intuition and potentially outdated dialect resources. Furthermore, traditional dialectology typically does not support word frequency analysis.
This study argues for the use of large, geolocated datasets to modernise regional authorship profiling. Unlike traditional dialect atlases, corpora provide access to contemporary, naturally occurring data, allowing for nuanced frequency analyses. Spatial statistics, such as Moran’s I, and tools like R allow for the rapid visualisation of regional linguistic patterns, enhancing both analysis and communication in legal contexts.
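For reference, the global Moran's I statistic mentioned here is conventionally defined as below. The notation is the standard textbook one and is an assumption on our part, not taken from the paper: x_i is the value observed at location i (for example, the relative frequency of a word), \bar{x} is its mean over all N locations, and w_{ij} is a spatial weight expressing how strongly locations i and j count as neighbours. The abstract does not say which weighting scheme the study uses.

```latex
I = \frac{N}{\sum_{i=1}^{N}\sum_{j=1}^{N} w_{ij}}
    \cdot
    \frac{\sum_{i=1}^{N}\sum_{j=1}^{N} w_{ij}\,(x_i - \bar{x})(x_j - \bar{x})}
         {\sum_{i=1}^{N} (x_i - \bar{x})^2}
```

Values near +1 indicate that similar values cluster in space, values near 0 indicate no spatial pattern, and negative values indicate that neighbouring locations tend to differ.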
Using a case study based on a corpus of 15 million social media posts, this paper demonstrates the advantages of corpus-based methods in regional authorship profiling. It finds that for the 10,000 most frequent words in the dataset, Moran’s I values ranged from 0.071 to 0.768 (mean = 0.329), with strongly regional terms such as etz (“now”; I = 0.739) and guad (“good”; I = 0.511) showing clear spatial clustering. This data-driven, spatial statistical approach enables the extraction of regional markers without relying on expert intuition. Consequently, the approach provides a more objective and scalable method for identifying regional language patterns, enhancing forensic casework while also reducing the reliance on potentially outdated dialect resources.
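The following is a minimal sketch of how Moran's I separates a regionally clustered word from a non-clustered one. It assumes a toy 4×4 grid of locations with rook adjacency weights (two cells are neighbours when they share an edge). The grid, the weights, the function names, and the frequency values are all illustrative assumptions; they are not the paper's data, weighting scheme, or R workflow.

```python
def morans_i(values, weights):
    """Global Moran's I for values x_i and a spatial weight matrix w_ij."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    w_total = sum(sum(row) for row in weights)
    # Cross-products of deviations, summed over all neighbouring pairs.
    num = sum(weights[i][j] * dev[i] * dev[j]
              for i in range(n) for j in range(n))
    den = sum(d * d for d in dev)
    return (n / w_total) * (num / den)


def rook_weights(coords):
    """Binary weights (hypothetical scheme): 1 if two grid cells share an edge."""
    n = len(coords)
    weights = [[0.0] * n for _ in range(n)]
    for i, (xi, yi) in enumerate(coords):
        for j, (xj, yj) in enumerate(coords):
            if abs(xi - xj) + abs(yi - yj) == 1:
                weights[i][j] = 1.0
    return weights


# Toy locations: a 4x4 grid standing in for geographic regions.
coords = [(x, y) for y in range(4) for x in range(4)]
w = rook_weights(coords)

# Made-up relative frequencies of a word per location.
# "clustered": used only in the southern half (y < 2), like a dialect word.
clustered = [1.0 if y < 2 else 0.0 for (x, y) in coords]
# "checker": alternates between neighbours, i.e. no regional clustering.
checker = [float((x + y) % 2) for (x, y) in coords]

clustered_i = morans_i(clustered, w)
checker_i = morans_i(checker, w)

print(round(clustered_i, 3))  # 0.667: strong positive spatial autocorrelation
print(round(checker_i, 3))    # -1.0: neighbours always differ
```

In a real analysis, each location's value would be a word's relative frequency computed from the geolocated posts, and ranking words by their Moran's I would surface candidate regional markers.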