{"title":"探索大型语言模型在数字文本取证中用于作者特征描述任务的潜力","authors":"","doi":"10.1016/j.fsidi.2024.301814","DOIUrl":null,"url":null,"abstract":"<div><div>The rapid advancement of large language models (LLMs) has opened up new possibilities for various natural language processing tasks. This study explores the potential of LLMs for author profiling in digital text forensics, which involves identifying characteristics such as age and gender from writing style—a crucial task in forensic investigations of anonymous or pseudonymous communications. Experiments were conducted using state-of-the-art LLMs, including Polyglot, EEVE, and Bllossom, to evaluate their performance in author profiling. Different fine-tuning strategies, such as full fine-tuning, Low-Rank Adaptation (LoRA), and Quantized LoRA (QLoRA), were compared to determine the most effective methods for adapting LLMs to the specific needs of this task. The results show that fine-tuned LLMs can effectively predict authors’ age and gender based on their writing styles, with Polyglot-based models generally outperforming EEVE and Bllossom models. Additionally, LoRA and QLoRA strategies significantly reduce computational costs and memory requirements while maintaining performance comparable to full fine-tuning. However, error analysis reveals limitations in the current LLM-based approach, including difficulty in capturing subtle linguistic variations across age groups and potential biases from pre-training data. These challenges are discussed and future research directions to address them are proposed. This study underscores the potential of LLMs in author profiling for digital text forensics, suggesting promising avenues for further exploration and refinement.</div></div>","PeriodicalId":48481,"journal":{"name":"Forensic Science International-Digital Investigation","volume":null,"pages":null},"PeriodicalIF":2.0000,"publicationDate":"2024-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Exploring the potential of large language models for author profiling tasks in digital text forensics\",\"authors\":\"\",\"doi\":\"10.1016/j.fsidi.2024.301814\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>The rapid advancement of large language models (LLMs) has opened up new possibilities for various natural language processing tasks. This study explores the potential of LLMs for author profiling in digital text forensics, which involves identifying characteristics such as age and gender from writing style—a crucial task in forensic investigations of anonymous or pseudonymous communications. Experiments were conducted using state-of-the-art LLMs, including Polyglot, EEVE, and Bllossom, to evaluate their performance in author profiling. Different fine-tuning strategies, such as full fine-tuning, Low-Rank Adaptation (LoRA), and Quantized LoRA (QLoRA), were compared to determine the most effective methods for adapting LLMs to the specific needs of this task. The results show that fine-tuned LLMs can effectively predict authors’ age and gender based on their writing styles, with Polyglot-based models generally outperforming EEVE and Bllossom models. Additionally, LoRA and QLoRA strategies significantly reduce computational costs and memory requirements while maintaining performance comparable to full fine-tuning. However, error analysis reveals limitations in the current LLM-based approach, including difficulty in capturing subtle linguistic variations across age groups and potential biases from pre-training data. These challenges are discussed and future research directions to address them are proposed. This study underscores the potential of LLMs in author profiling for digital text forensics, suggesting promising avenues for further exploration and refinement.</div></div>\",\"PeriodicalId\":48481,\"journal\":{\"name\":\"Forensic Science International-Digital Investigation\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":2.0000,\"publicationDate\":\"2024-10-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Forensic Science International-Digital Investigation\",\"FirstCategoryId\":\"3\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S2666281724001380\",\"RegionNum\":4,\"RegionCategory\":\"医学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q3\",\"JCRName\":\"COMPUTER SCIENCE, INFORMATION SYSTEMS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Forensic Science International-Digital Investigation","FirstCategoryId":"3","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S2666281724001380","RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
Exploring the potential of large language models for author profiling tasks in digital text forensics
The rapid advancement of large language models (LLMs) has opened up new possibilities for various natural language processing tasks. This study explores the potential of LLMs for author profiling in digital text forensics, which involves identifying characteristics such as age and gender from writing style—a crucial task in forensic investigations of anonymous or pseudonymous communications. Experiments were conducted using state-of-the-art LLMs, including Polyglot, EEVE, and Bllossom, to evaluate their performance in author profiling. Different fine-tuning strategies, such as full fine-tuning, Low-Rank Adaptation (LoRA), and Quantized LoRA (QLoRA), were compared to determine the most effective methods for adapting LLMs to the specific needs of this task. The results show that fine-tuned LLMs can effectively predict authors’ age and gender based on their writing styles, with Polyglot-based models generally outperforming EEVE and Bllossom models. Additionally, LoRA and QLoRA strategies significantly reduce computational costs and memory requirements while maintaining performance comparable to full fine-tuning. However, error analysis reveals limitations in the current LLM-based approach, including difficulty in capturing subtle linguistic variations across age groups and potential biases from pre-training data. These challenges are discussed and future research directions to address them are proposed. This study underscores the potential of LLMs in author profiling for digital text forensics, suggesting promising avenues for further exploration and refinement.