{"title":"使用经过微调的GPT-2模型从文本中识别母语。","authors":"Yuzhe Nie","doi":"10.7717/peerj-cs.2909","DOIUrl":null,"url":null,"abstract":"<p><p>Native language identification (NLI) is a critical task in computational linguistics, supporting applications such as personalized language learning, forensic analysis, and machine translation. This study investigates the use of a fine-tuned GPT-2 model to enhance NLI accuracy. Using the NLI-PT dataset, we preprocess and fine-tune GPT-2 to classify the native language of learners based on their Portuguese-written texts. Our approach leverages deep learning techniques, including tokenization, embedding extraction, and multi-layer transformer-based classification. Experimental results show that our fine-tuned GPT-2 model significantly outperforms traditional machine learning methods (<i>e.g</i>., SVM, Random Forest) and other pre-trained language models (<i>e.g</i>., BERT, RoBERTa, BioBERT), achieving a weighted F1 score of 0.9419 and an accuracy of 94.65%. These results show that large transformer models work well for native language identification and can help guide future research in personalized language tools and artificial intelligence (AI)-based education.</p>","PeriodicalId":54224,"journal":{"name":"PeerJ Computer Science","volume":"11 ","pages":"e2909"},"PeriodicalIF":3.5000,"publicationDate":"2025-05-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12192634/pdf/","citationCount":"0","resultStr":"{\"title\":\"Native language identification from text using a fine-tuned GPT-2 model.\",\"authors\":\"Yuzhe Nie\",\"doi\":\"10.7717/peerj-cs.2909\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><p>Native language identification (NLI) is a critical task in computational linguistics, supporting applications such as personalized language learning, forensic analysis, and machine translation. This study investigates the use of a fine-tuned GPT-2 model to enhance NLI accuracy. Using the NLI-PT dataset, we preprocess and fine-tune GPT-2 to classify the native language of learners based on their Portuguese-written texts. Our approach leverages deep learning techniques, including tokenization, embedding extraction, and multi-layer transformer-based classification. Experimental results show that our fine-tuned GPT-2 model significantly outperforms traditional machine learning methods (<i>e.g</i>., SVM, Random Forest) and other pre-trained language models (<i>e.g</i>., BERT, RoBERTa, BioBERT), achieving a weighted F1 score of 0.9419 and an accuracy of 94.65%. 
These results show that large transformer models work well for native language identification and can help guide future research in personalized language tools and artificial intelligence (AI)-based education.</p>\",\"PeriodicalId\":54224,\"journal\":{\"name\":\"PeerJ Computer Science\",\"volume\":\"11 \",\"pages\":\"e2909\"},\"PeriodicalIF\":3.5000,\"publicationDate\":\"2025-05-28\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12192634/pdf/\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"PeerJ Computer Science\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://doi.org/10.7717/peerj-cs.2909\",\"RegionNum\":4,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"2025/1/1 0:00:00\",\"PubModel\":\"eCollection\",\"JCR\":\"Q2\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"PeerJ Computer Science","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.7717/peerj-cs.2909","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2025/1/1 0:00:00","PubModel":"eCollection","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Native language identification from text using a fine-tuned GPT-2 model.
Native language identification (NLI) is a critical task in computational linguistics, supporting applications such as personalized language learning, forensic analysis, and machine translation. This study investigates the use of a fine-tuned GPT-2 model to improve NLI accuracy. Using the NLI-PT dataset, we preprocess the data and fine-tune GPT-2 to classify learners' native languages from their Portuguese-language texts. Our approach leverages deep learning techniques, including tokenization, embedding extraction, and multi-layer transformer-based classification. Experimental results show that the fine-tuned GPT-2 model significantly outperforms traditional machine learning methods (e.g., SVM, Random Forest) and other pre-trained language models (e.g., BERT, RoBERTa, BioBERT), achieving a weighted F1 score of 0.9419 and an accuracy of 94.65%. These findings indicate that large transformer models are well suited to native language identification and can help guide future research on personalized language tools and artificial intelligence (AI)-based education.
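To illustrate the pipeline the abstract describes, here is a minimal sketch of fine-tuning GPT-2 as a sequence classifier for NLI, assuming the Hugging Face transformers and datasets libraries. It is not the authors' released code: the label count, file names, column names, and hyperparameters are illustrative assumptions, and the paper's NLI-PT preprocessing is not reproduced.

    # Minimal sketch: GPT-2 fine-tuning for native language identification.
    # Assumptions (not from the paper): 5 L1 classes, and local CSVs with a
    # "text" column (Portuguese learner text) and an integer "label" column.
    from datasets import load_dataset
    from transformers import (GPT2ForSequenceClassification, GPT2TokenizerFast,
                              Trainer, TrainingArguments)

    NUM_L1_CLASSES = 5  # assumed number of native-language labels

    tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
    tokenizer.pad_token = tokenizer.eos_token  # GPT-2 ships without a pad token

    model = GPT2ForSequenceClassification.from_pretrained(
        "gpt2", num_labels=NUM_L1_CLASSES)
    model.config.pad_token_id = tokenizer.pad_token_id

    # Hypothetical file names; NLI-PT itself must be obtained separately.
    data = load_dataset("csv", data_files={"train": "train.csv",
                                           "validation": "dev.csv"})

    def tokenize(batch):
        # Tokenize learner texts into fixed-length input IDs and attention masks.
        return tokenizer(batch["text"], truncation=True,
                         padding="max_length", max_length=256)

    data = data.map(tokenize, batched=True)
    data = data.rename_column("label", "labels")
    data.set_format("torch", columns=["input_ids", "attention_mask", "labels"])

    # Illustrative hyperparameters, not those reported in the paper.
    args = TrainingArguments(output_dir="gpt2-nli-pt",
                             per_device_train_batch_size=8,
                             num_train_epochs=3,
                             learning_rate=2e-5)

    trainer = Trainer(model=model, args=args,
                      train_dataset=data["train"],
                      eval_dataset=data["validation"])
    trainer.train()

In this sketch, GPT2ForSequenceClassification classifies from the hidden state of the last non-padding token projected to the label logits; the paper's exact classification head and training setup may differ.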
Journal introduction:
PeerJ Computer Science is an open access journal covering all subject areas in computer science, backed by a prestigious advisory board and more than 300 academic editors.