Leveraging large language models for spelling correction in Turkish
Author: Ceren Guzel Turhan
Journal: PeerJ Computer Science, volume 11, article e2889
Published: 2025-06-16
Publication type: Journal Article
DOI: https://doi.org/10.7717/peerj-cs.2889
PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12192738/pdf/
Journal impact factor: 3.5 (JCR Q2, Computer Science, Artificial Intelligence)
Citations: 0
Abstract
The field of natural language processing (NLP) has progressed rapidly, particularly with the rise of large language models (LLMs), which enhance our understanding of the intrinsic structures of languages in a cross-linguistic manner for complex NLP tasks. However, misspellings commonly encountered in human-written text adversely affect the language understanding of LLMs in various NLP tasks, as well as in spelling-sensitive applications such as auto-proofreading and chatbots. This study therefore focuses on spelling correction in Turkish, an agglutinative language whose nature makes the task significantly more challenging. To address this, the research introduces a novel dataset, NoisyWikiTr, and uses it to explore encoder-only models based on bidirectional encoder representations from transformers (BERT) alongside existing auto-correction tools. For the first time, as far as is known, encoder-only models based on BERT are presented as subword prediction models, and encoder-decoder models based on the T5 (Text-to-Text Transfer Transformer) architecture are fine-tuned for this task in Turkish. A comprehensive comparison of these models highlights the advantages of context-based approaches over traditional, context-free auto-correction tools. The findings also reveal that, among LLMs, a language-specific sequence-to-sequence model outperforms both cross-lingual sequence-to-sequence models and encoder-only models in handling realistic misspellings.
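To make the contrast with context-free tools concrete, the following is a minimal sketch of a dictionary-plus-edit-distance corrector. It is not one of the tools evaluated in the study; the toy lexicon and the function names `edit_distance` and `correct_word` are hypothetical and chosen only for illustration. The point it shows is that several inflected Turkish forms can sit at the same edit distance from a misspelling, and a context-free corrector has no way to choose among them.

```python
from typing import Iterable


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling one-row table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def correct_word(word: str, lexicon: Iterable[str]) -> str:
    """Return the closest lexicon entry; ties go to the earliest entry."""
    return min(lexicon, key=lambda cand: edit_distance(word, cand))


# Hypothetical toy lexicon: "eve" (to the house), "evde" (in the house),
# "kitap" (book). A real lexicon would need far more inflected forms.
LEXICON = ["eve", "evde", "kitap"]

# "evd" is one edit away from both "eve" and "evde"; without sentence
# context the choice between them depends only on lexicon order.
suggestion = correct_word("evd", LEXICON)
```

A context-based model, by comparison, can condition on the surrounding words when ranking such candidates, which is the advantage the abstract attributes to the BERT- and T5-based approaches.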
About the journal:
PeerJ Computer Science is the new open access journal covering all subject areas in computer science, with the backing of a prestigious advisory board and more than 300 academic editors.