Forensic authorship profiling using geolocated social media data: A corpus linguistic and cartographic approach

Impact factor: 2.1

Applied Corpus Linguistics. Pub Date: 2025-08-24. DOI: 10.1016/j.acorp.2025.100146

Dana Roemling

Volume 5, Issue 3, Article 100146. https://www.sciencedirect.com/science/article/pii/S2666799125000292


Abstract

This paper explores the use of corpus-based methods for regional authorship profiling in forensic linguistics. Traditional approaches depend on linguistic expertise to identify regional markers, but this has limitations: it relies on an analyst’s intuition and potentially outdated dialect resources. Furthermore, traditional dialectology typically does not support word frequency analysis.

This study argues for the use of large, geolocated datasets to modernise regional authorship profiling. Unlike traditional dialect atlases, corpora provide access to contemporary, naturally occurring data, allowing for nuanced frequency analyses. Spatial statistics, such as Moran’s I, and tools like R allow for the rapid visualisation of regional linguistic patterns, enhancing both analysis and communication in legal contexts.

Using a case study based on a corpus of 15 million social media posts, this paper demonstrates the advantages of corpus-based methods in regional authorship profiling. It finds that for the 10,000 most frequent words in the dataset, Moran’s I values ranged from 0.071 to 0.768 (mean = 0.329), with strongly regional terms such as etz (“now”; I = 0.739) and guad (“good”; I = 0.511) showing clear spatial clustering. This data-driven, spatial statistical approach enables the extraction of regional markers without relying on expert intuition. Consequently, the approach provides a more objective and scalable method for identifying regional language patterns, enhancing forensic casework while also reducing the reliance on potentially outdated dialect resources.
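As an illustration of the statistic reported above, the following is a minimal sketch of global Moran's I, computed as I = (N / W) * sum_ij w_ij (x_i - mean)(x_j - mean) / sum_i (x_i - mean)^2, where x_i is a word's relative frequency at location i, w_ij is a spatial weight, and W is the sum of all weights. The abstract does not describe the weighting scheme or the R workflow used in the study, so the distance-threshold weights, the function names `morans_i` and `threshold_weights`, and the toy 4 x 4 grid below are assumptions made purely for illustration.

```python
from math import hypot


def threshold_weights(coords, max_dist):
    """Binary spatial weights: w_ij = 1 if 0 < distance(i, j) <= max_dist.

    This weighting scheme is an assumption for the sketch, not the paper's.
    """
    n = len(coords)
    weights = [[0.0] * n for _ in range(n)]
    for i, (xi, yi) in enumerate(coords):
        for j, (xj, yj) in enumerate(coords):
            if i != j and hypot(xi - xj, yi - yj) <= max_dist:
                weights[i][j] = 1.0
    return weights


def morans_i(values, weights):
    """Global Moran's I for one variable observed at n locations."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    total_weight = sum(sum(row) for row in weights)
    variance_sum = sum(d * d for d in dev)
    if total_weight == 0 or variance_sum == 0:
        raise ValueError("Moran's I is undefined for zero weights or constant values")
    cross = sum(
        weights[i][j] * dev[i] * dev[j]
        for i in range(n)
        for j in range(n)
    )
    return (n / total_weight) * (cross / variance_sum)


if __name__ == "__main__":
    # Hypothetical 4 x 4 grid of locations with unit spacing.
    coords = [(x, y) for y in range(4) for x in range(4)]
    w = threshold_weights(coords, max_dist=1.0)

    # Toy relative frequencies: a word used only in the "west" half,
    # versus a word whose usage alternates between neighbouring cells.
    clustered = [1.0 if x < 2 else 0.0 for x, _ in coords]
    alternating = [float((x + y) % 2) for x, y in coords]

    print("clustered:", morans_i(clustered, w))
    print("alternating:", morans_i(alternating, w))
```

In this toy setting the regionally concentrated word yields a positive value and the alternating word a negative one, which is the sense in which higher values such as those reported for etz and guad indicate spatial clustering.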




Source journal