{"title":"Slovak morphological tokenizer using the Byte-Pair Encoding algorithm.","authors":"Dávid Držík, Frantisek Forgac","doi":"10.7717/peerj-cs.2465","DOIUrl":null,"url":null,"abstract":"<p><p>This study introduces a new approach to text tokenization, SlovaK Morphological Tokenizer (SKMT), which integrates the morphology of the Slovak language into the training process using the Byte-Pair Encoding (BPE) algorithm. Unlike conventional tokenizers, SKMT focuses on preserving the integrity of word roots in individual tokens, crucial for maintaining lexical meaning. The methodology involves segmenting and extracting word roots from morphological dictionaries and databases, followed by <i>corpus</i> preprocessing and training SKMT alongside a traditional BPE tokenizer. Comparative evaluation against existing tokenizers demonstrates SKMT's outstanding ability to maintain root integrity, achieving 99.7% root integrity compared to SlovakBERT (90.5%) and a pureBPE tokenizer (93.1%). Further validation involved fine-tuning models on a sentiment classification NLP task, where models trained with SKMT achieved an F1-score improvement of 3.5% over those trained with conventional BPE tokenization, followed by a focus on the Semantic Textual Similarity (STS) task. These findings suggest that training language models on the SKMT tokenizer significantly enhances model performance and quality.</p>","PeriodicalId":54224,"journal":{"name":"PeerJ Computer Science","volume":"10 ","pages":"e2465"},"PeriodicalIF":3.5000,"publicationDate":"2024-11-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11622830/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"PeerJ Computer Science","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.7717/peerj-cs.2465","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2024/1/1 0:00:00","PubModel":"eCollection","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
引用次数: 0
Abstract
This study introduces a new approach to text tokenization, SlovaK Morphological Tokenizer (SKMT), which integrates the morphology of the Slovak language into the training process using the Byte-Pair Encoding (BPE) algorithm. Unlike conventional tokenizers, SKMT focuses on preserving the integrity of word roots in individual tokens, crucial for maintaining lexical meaning. The methodology involves segmenting and extracting word roots from morphological dictionaries and databases, followed by corpus preprocessing and training SKMT alongside a traditional BPE tokenizer. Comparative evaluation against existing tokenizers demonstrates SKMT's outstanding ability to maintain root integrity, achieving 99.7% root integrity compared to SlovakBERT (90.5%) and a pureBPE tokenizer (93.1%). Further validation involved fine-tuning models on a sentiment classification NLP task, where models trained with SKMT achieved an F1-score improvement of 3.5% over those trained with conventional BPE tokenization, followed by a focus on the Semantic Textual Similarity (STS) task. These findings suggest that training language models on the SKMT tokenizer significantly enhances model performance and quality.
期刊介绍:
PeerJ Computer Science is the new open access journal covering all subject areas in computer science, with the backing of a prestigious advisory board and more than 300 academic editors.