Forensic authorship profiling using geolocated social media data: A corpus linguistic and cartographic approach

Impact factor: 2.1
Dana Roemling
Journal: Applied Corpus Linguistics, Volume 5, Issue 3, Article 100146
Published: 2025-08-24
Publication type: Journal Article
DOI: 10.1016/j.acorp.2025.100146
URL: https://www.sciencedirect.com/science/article/pii/S2666799125000292
Citations: 0

Abstract

This paper explores the use of corpus-based methods for regional authorship profiling in forensic linguistics. Traditional approaches depend on linguistic expertise to identify regional markers, but this has limitations: it relies on an analyst’s intuition and potentially outdated dialect resources. Furthermore, traditional dialectology typically does not support word frequency analysis.
This study argues for the use of large, geolocated datasets to modernise regional authorship profiling. Unlike traditional dialect atlases, corpora provide access to contemporary, naturally occurring data, allowing for nuanced frequency analyses. Spatial statistics, such as Moran’s I, and tools like R allow for the rapid visualisation of regional linguistic patterns, enhancing both analysis and communication in legal contexts.
Using a case study based on a corpus of 15 million social media posts, this paper demonstrates the advantages of corpus-based methods in regional authorship profiling. It finds that for the 10,000 most frequent words in the dataset, Moran’s I values ranged from 0.071 to 0.768 (mean = 0.329), with strongly regional terms such as etz (“now”; I = 0.739) and guad (“good”; I = 0.511) showing clear spatial clustering. This data-driven, spatial statistical approach enables the extraction of regional markers without relying on expert intuition. Consequently, the approach provides a more objective and scalable method for identifying regional language patterns, enhancing forensic casework while also reducing the reliance on potentially outdated dialect resources.
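To make the statistic named above concrete, the following minimal sketch computes global Moran's I for a toy set of four locations. It is only an illustration under stated assumptions: the function names `morans_i` and `chain_weights`, the binary "adjacent neighbours" weights, and the example values are hypothetical and are not the paper's data, weighting scheme, or implementation (the abstract mentions R; Python is used here purely for illustration).

```python
def morans_i(values, weights):
    """Global Moran's I for `values` given a square spatial weights matrix.

    I = (n / W) * sum_ij w_ij (x_i - mean)(x_j - mean) / sum_i (x_i - mean)^2
    where W is the sum of all weights.
    """
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]  # deviations from the mean
    total_weight = sum(sum(row) for row in weights)
    # Weighted cross-products of deviations over all ordered pairs (i, j)
    cross = sum(
        weights[i][j] * dev[i] * dev[j]
        for i in range(n)
        for j in range(n)
    )
    variance_sum = sum(d * d for d in dev)
    return (n / total_weight) * (cross / variance_sum)


def chain_weights(n):
    """Hypothetical weights: locations on a line, neighbours if adjacent."""
    return [[1 if abs(i - j) == 1 else 0 for j in range(n)] for i in range(n)]


# Toy relative frequencies of a word at four locations along a line (assumed values)
clustered = [10, 9, 1, 0]    # high values sit next to high values
alternating = [10, 0, 9, 1]  # high values sit next to low values
w = chain_weights(4)

print(round(morans_i(clustered, w), 3))
print(round(morans_i(alternating, w), 3))
```

In this sketch the clustered pattern yields a positive value and the alternating pattern a negative one, which is the sense in which higher Moran's I values for words such as etz indicate stronger spatial clustering.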
Source journal: Applied Corpus Linguistics (Linguistics and Language)
CiteScore: 1.30
Self-citation rate: 0.00%
Articles published: 0
Review time: 70 days