Afrikaans Literary Genre Recognition using Embeddings and Pre-Trained Multilingual Language Models
E. Kotzé, Burgert Senekal
2024 International Conference on Artificial Intelligence, Computer, Data Sciences and Applications (ACDSA), pp. 1-6, February 2024
DOI: 10.1109/ACDSA59508.2024.10467838
Citations: 0
Abstract
Automatic literary genre recognition is pivotal in various domains, including digital libraries, literary studies, and computational linguistics. Efficiently categorizing texts into genres, such as poetry or prose, facilitates the organization and retrieval of literary works, enhancing accessibility for readers, researchers, and academics. Recognizing genre-specific patterns, themes, and stylistic elements enables in-depth analysis and comprehension of literary texts. This study focuses on fine-tuning several state-of-the-art multilingual pre-trained language models, including mBERT, DistilmBERT, and XLM-RoBERTa, to distinguish between Afrikaans poetry and prose. A baseline Support Vector Machine (SVM) classifier and a self-attention transformer model were also trained for comparison. Results demonstrated that the SVM model with text-embedding-ada-002 embeddings achieved the highest test F1-score of 0.936. The XLM-RoBERTa model exhibited the best performance during validation with an F1-score of 0.924, while the DistilmBERT model surpassed all others, including the SVM, during testing, achieving the highest F1-score of 0.942. Notably, the self-attention model demonstrated comparable F1-scores for training (0.923) and testing (0.929), establishing itself as the second-best performing genre classifier. This study contributes to advancing automatic literary genre recognition in Afrikaans by exploring the capabilities of state-of-the-art multilingual language models and traditional classifiers, providing insights into their comparative performance and potential applications in real-world scenarios.
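The baseline described above (an SVM trained on precomputed text embeddings) can be sketched as follows. This is a minimal illustration, not the authors' code: the synthetic vectors stand in for real text-embedding-ada-002 embeddings (which are 1536-dimensional), and the data, dimensions, and class separation are all assumptions for demonstration only.

```python
# Hypothetical sketch of an SVM genre classifier over text embeddings.
# Real embeddings (e.g. text-embedding-ada-002) would come from an
# embedding API; here we use synthetic vectors as a stand-in.
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)

# Toy "embeddings": 0 = prose, 1 = poetry (labels are illustrative).
X_train = rng.normal(size=(40, 16))
y_train = np.array([0] * 20 + [1] * 20)
# Shift the poetry vectors so the toy classes are linearly separable.
X_train[y_train == 1] += 1.5

clf = SVC(kernel="linear")
clf.fit(X_train, y_train)

# Held-out toy test set, generated the same way.
X_test = rng.normal(size=(10, 16))
y_test = np.array([0] * 5 + [1] * 5)
X_test[y_test == 1] += 1.5

test_f1 = f1_score(y_test, clf.predict(X_test))
print(f"test F1: {test_f1:.3f}")
```

In the paper's setting, the pipeline is the same shape: embed each poem or prose passage once, then train the SVM on the fixed vectors; unlike fine-tuning mBERT or DistilmBERT, no gradient updates touch the embedding model itself.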