AuthorNet: Leveraging attention-based early fusion of transformers for low-resource authorship attribution

IF 7.5 · CAS Tier 1, Computer Science · Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE
Md. Rajib Hossain, Mohammed Moshiul Hoque, M. Ali Akber Dewan, Enamul Hoque, Nazmul Siddique
{"title":"作者网:利用基于注意力的早期融合转换器实现低资源作者归属","authors":"Md. Rajib Hossain ,&nbsp;Mohammed Moshiul Hoque ,&nbsp;M. Ali Akber Dewan ,&nbsp;Enamul Hoque ,&nbsp;Nazmul Siddique","doi":"10.1016/j.eswa.2024.125643","DOIUrl":null,"url":null,"abstract":"<div><div>Authorship Attribution (AA) is crucial for identifying the author of a given text from a pool of suspects, especially with the widespread use of the internet and electronic devices. However, most AA research has primarily focused on high-resource languages like English, leaving low-resource languages such as Bengali relatively unexplored. Challenges faced in this domain include the absence of benchmark corpora, a lack of context-aware feature extractors, limited availability of tuned hyperparameters, and OOV issues. To address these challenges, this study introduces AuthorNet for authorship attribution using attention-based early fusion of transformer-based language models, i.e., concatenation of an embeddings output of two existing models that were fine-tuned. AuthorNet consists of three key modules: Feature extraction, Fine-tuning and selection of best-performing models, and Attention-based early fusion. To evaluate the performance of AuthorNet, a number of experiments using four benchmark corpora have been conducted. The results demonstrated exceptional accuracy: 98.86 ± 0.01%, 99.49 ± 0.01%, 97.91 ± 0.01%, and 99.87 ± 0.01% for four corpora. Notably, AuthorNet outperformed all foundation models, achieving accuracy improvements ranging from 0.24% to 2.92% across the four corpora.</div></div>","PeriodicalId":50461,"journal":{"name":"Expert Systems with Applications","volume":"262 ","pages":"Article 125643"},"PeriodicalIF":7.5000,"publicationDate":"2024-11-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"AuthorNet: Leveraging attention-based early fusion of transformers for low-resource authorship attribution\",\"authors\":\"Md. Rajib Hossain ,&nbsp;Mohammed Moshiul Hoque ,&nbsp;M. Ali Akber Dewan ,&nbsp;Enamul Hoque ,&nbsp;Nazmul Siddique\",\"doi\":\"10.1016/j.eswa.2024.125643\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>Authorship Attribution (AA) is crucial for identifying the author of a given text from a pool of suspects, especially with the widespread use of the internet and electronic devices. However, most AA research has primarily focused on high-resource languages like English, leaving low-resource languages such as Bengali relatively unexplored. Challenges faced in this domain include the absence of benchmark corpora, a lack of context-aware feature extractors, limited availability of tuned hyperparameters, and OOV issues. To address these challenges, this study introduces AuthorNet for authorship attribution using attention-based early fusion of transformer-based language models, i.e., concatenation of an embeddings output of two existing models that were fine-tuned. AuthorNet consists of three key modules: Feature extraction, Fine-tuning and selection of best-performing models, and Attention-based early fusion. To evaluate the performance of AuthorNet, a number of experiments using four benchmark corpora have been conducted. The results demonstrated exceptional accuracy: 98.86 ± 0.01%, 99.49 ± 0.01%, 97.91 ± 0.01%, and 99.87 ± 0.01% for four corpora. 
Notably, AuthorNet outperformed all foundation models, achieving accuracy improvements ranging from 0.24% to 2.92% across the four corpora.</div></div>\",\"PeriodicalId\":50461,\"journal\":{\"name\":\"Expert Systems with Applications\",\"volume\":\"262 \",\"pages\":\"Article 125643\"},\"PeriodicalIF\":7.5000,\"publicationDate\":\"2024-11-04\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Expert Systems with Applications\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0957417424025107\",\"RegionNum\":1,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Expert Systems with Applications","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0957417424025107","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0

Abstract

Authorship Attribution (AA) is crucial for identifying the author of a given text from a pool of suspects, especially with the widespread use of the internet and electronic devices. However, most AA research has focused on high-resource languages such as English, leaving low-resource languages such as Bengali relatively unexplored. Challenges in this domain include the absence of benchmark corpora, a lack of context-aware feature extractors, the limited availability of tuned hyperparameters, and out-of-vocabulary (OOV) issues. To address these challenges, this study introduces AuthorNet, which performs authorship attribution using attention-based early fusion of transformer-based language models, i.e., concatenation of the embedding outputs of two fine-tuned existing models. AuthorNet consists of three key modules: feature extraction; fine-tuning and selection of the best-performing models; and attention-based early fusion. To evaluate the performance of AuthorNet, a number of experiments were conducted on four benchmark corpora. The results demonstrated exceptional accuracy: 98.86 ± 0.01%, 99.49 ± 0.01%, 97.91 ± 0.01%, and 99.87 ± 0.01% on the four corpora. Notably, AuthorNet outperformed all foundation models, achieving accuracy improvements ranging from 0.24% to 2.92% across the four corpora.
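The early-fusion idea in the abstract lends itself to a compact illustration. The sketch below (PyTorch with the Hugging Face transformers API) shows one plausible reading: two fine-tuned transformer encoders embed the same input, their token embeddings are concatenated, and a learned attention layer pools the fused sequence before an author classifier. The class name, the additive-attention form, and the assumption that both encoders share a tokenizer are illustrative choices for this sketch, not the paper's exact design.

```python
import torch
import torch.nn as nn
from transformers import AutoModel


class EarlyFusionAttribution(nn.Module):
    """Sketch of attention-based early fusion of two transformer encoders."""

    def __init__(self, model_a: str, model_b: str, num_authors: int):
        super().__init__()
        # Assumption: both checkpoints are already fine-tuned for the AA task
        # and share a tokenizer/vocabulary (e.g., two variants of one base model).
        self.enc_a = AutoModel.from_pretrained(model_a)
        self.enc_b = AutoModel.from_pretrained(model_b)
        fused_dim = self.enc_a.config.hidden_size + self.enc_b.config.hidden_size
        # Additive attention over the fused token sequence (illustrative choice).
        self.attn_score = nn.Linear(fused_dim, 1)
        self.classifier = nn.Linear(fused_dim, num_authors)

    def forward(self, input_ids: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        h_a = self.enc_a(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        h_b = self.enc_b(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        fused = torch.cat([h_a, h_b], dim=-1)        # early fusion: concatenate embeddings
        scores = self.attn_score(fused).squeeze(-1)  # (batch, seq_len)
        scores = scores.masked_fill(attention_mask == 0, float("-inf"))
        weights = torch.softmax(scores, dim=-1).unsqueeze(-1)
        pooled = (weights * fused).sum(dim=1)        # attention-pooled document vector
        return self.classifier(pooled)               # logits over candidate authors
```

Fusing before pooling lets the attention weights see both encoders' views of each token at once, which is what distinguishes early (representation-level) fusion from late (score-level) fusion of separately pooled models.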
Source journal
Expert Systems with Applications (Engineering Technology - Engineering: Electronic & Electrical)
CiteScore: 13.80
Self-citation rate: 10.60%
Annual articles: 2045
Review time: 8.7 months
Journal description: Expert Systems With Applications is an international journal dedicated to the exchange of information on expert and intelligent systems used globally in industry, government, and universities. The journal emphasizes original papers covering the design, development, testing, implementation, and management of these systems, offering practical guidelines. It spans various sectors such as finance, engineering, marketing, law, project management, information management, medicine, and more. The journal also welcomes papers on multi-agent systems, knowledge management, neural networks, knowledge discovery, data mining, and other related areas, excluding applications to military/defense systems.