{"title":"使用嵌入和预训练多语言语言模型识别南非荷兰语文学体裁","authors":"E. Kotzé, Burgert Senekal","doi":"10.1109/ACDSA59508.2024.10467838","DOIUrl":null,"url":null,"abstract":"Automatic literary genre recognition is pivotal in various domains, including digital libraries, literary studies, and computational linguistics. Efficiently categorizing texts into genres, such as poetry or prose, facilitates the organization and retrieval of literary works, enhancing accessibility for readers, researchers, and academics. Recognizing genre-specific patterns, themes, and stylistic elements enables in-depth analysis and comprehension of literary texts. This study focuses on fine-tuning several state-of-the-art multilingual pre-trained language models, including mBERT, DistilmBERT, and XLM-RoBERTa, to distinguish between Afrikaans poetry and prose. A baseline Support Vector Machine (SVM) classifier and a self-attention transformer model were also trained for comparison.Results demonstrated that the SVM model with text-embeddings-ada-002 embeddings achieved the highest test F1-score of 0.936. The XLM-RoBERTa model exhibited the best performance during validation with an F1-score of 0.924, while the DistilmBERT model surpassed all others, including the SVM during testing, achieving the highest F1-score of 0.942. Notably, the self-attention model demonstrated comparable F1-scores for training (0.923) and testing (0.929), establishing itself as the second-best performing genre classifier.This study contributes to advancing automatic literary genre recognition in Afrikaans by exploring the capabilities of state-of-the-art multilingual language models and traditional classifiers, providing insights into their comparative performance and potential applications in real-world scenarios.","PeriodicalId":518964,"journal":{"name":"2024 International Conference on Artificial Intelligence, Computer, Data Sciences and Applications (ACDSA)","volume":"44 ","pages":"1-6"},"PeriodicalIF":0.0000,"publicationDate":"2024-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Afrikaans Literary Genre Recognition using Embeddings and Pre-Trained Multilingual Language Models\",\"authors\":\"E. Kotzé, Burgert Senekal\",\"doi\":\"10.1109/ACDSA59508.2024.10467838\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Automatic literary genre recognition is pivotal in various domains, including digital libraries, literary studies, and computational linguistics. Efficiently categorizing texts into genres, such as poetry or prose, facilitates the organization and retrieval of literary works, enhancing accessibility for readers, researchers, and academics. Recognizing genre-specific patterns, themes, and stylistic elements enables in-depth analysis and comprehension of literary texts. This study focuses on fine-tuning several state-of-the-art multilingual pre-trained language models, including mBERT, DistilmBERT, and XLM-RoBERTa, to distinguish between Afrikaans poetry and prose. A baseline Support Vector Machine (SVM) classifier and a self-attention transformer model were also trained for comparison.Results demonstrated that the SVM model with text-embeddings-ada-002 embeddings achieved the highest test F1-score of 0.936. The XLM-RoBERTa model exhibited the best performance during validation with an F1-score of 0.924, while the DistilmBERT model surpassed all others, including the SVM during testing, achieving the highest F1-score of 0.942. 
Notably, the self-attention model demonstrated comparable F1-scores for training (0.923) and testing (0.929), establishing itself as the second-best performing genre classifier.This study contributes to advancing automatic literary genre recognition in Afrikaans by exploring the capabilities of state-of-the-art multilingual language models and traditional classifiers, providing insights into their comparative performance and potential applications in real-world scenarios.\",\"PeriodicalId\":518964,\"journal\":{\"name\":\"2024 International Conference on Artificial Intelligence, Computer, Data Sciences and Applications (ACDSA)\",\"volume\":\"44 \",\"pages\":\"1-6\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-02-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2024 International Conference on Artificial Intelligence, Computer, Data Sciences and Applications (ACDSA)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ACDSA59508.2024.10467838\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2024 International Conference on Artificial Intelligence, Computer, Data Sciences and Applications (ACDSA)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ACDSA59508.2024.10467838","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0
Abstract
Automatic literary genre recognition is pivotal in various domains, including digital libraries, literary studies, and computational linguistics. Efficiently categorizing texts into genres, such as poetry or prose, facilitates the organization and retrieval of literary works, enhancing accessibility for readers, researchers, and academics. Recognizing genre-specific patterns, themes, and stylistic elements enables in-depth analysis and comprehension of literary texts. This study focuses on fine-tuning several state-of-the-art multilingual pre-trained language models, including mBERT, DistilmBERT, and XLM-RoBERTa, to distinguish between Afrikaans poetry and prose. A baseline Support Vector Machine (SVM) classifier and a self-attention transformer model were also trained for comparison. Results demonstrated that the SVM model with text-embedding-ada-002 embeddings achieved a test F1-score of 0.936. The XLM-RoBERTa model exhibited the best performance during validation with an F1-score of 0.924, while the DistilmBERT model surpassed all others, including the SVM, during testing, achieving the highest F1-score of 0.942. Notably, the self-attention model demonstrated comparable F1-scores for training (0.923) and testing (0.929), establishing itself as the second-best performing genre classifier. This study contributes to advancing automatic literary genre recognition in Afrikaans by exploring the capabilities of state-of-the-art multilingual language models and traditional classifiers, providing insights into their comparative performance and potential applications in real-world scenarios.
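The abstract describes two modelling routes: fine-tuning multilingual transformer encoders (mBERT, DistilmBERT, XLM-RoBERTa) for binary poetry-versus-prose classification, and an SVM baseline trained on dense text embeddings. The paper's code and corpus are not included here, so the following is only a minimal sketch of how such a fine-tuning setup is commonly assembled with the Hugging Face Transformers library; the Afrikaans snippets, the 0 = poetry / 1 = prose label encoding, and the hyperparameters are placeholder assumptions rather than the authors' actual configuration.

```python
# Hypothetical sketch (not the authors' code): fine-tuning DistilmBERT for
# binary Afrikaans genre classification (poetry vs. prose) with Hugging Face
# Transformers. Texts, labels, and hyperparameters below are placeholders.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

texts = ["Die wind waai oor die vlakte", "Hy het die deur stadig oopgemaak en ingestap"]
labels = [0, 1]  # assumed encoding: 0 = poetry, 1 = prose

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-multilingual-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-multilingual-cased", num_labels=2)

def tokenize(batch):
    # Pad/truncate to a fixed length so the default collator can batch examples.
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=256)

train_ds = Dataset.from_dict({"text": texts, "label": labels}).map(tokenize, batched=True)

args = TrainingArguments(output_dir="genre-distilmbert", num_train_epochs=3,
                         per_device_train_batch_size=16, learning_rate=2e-5)
Trainer(model=model, args=args, train_dataset=train_ds).train()
```

For the embedding-based SVM baseline, a scikit-learn sketch along these lines illustrates the pipeline; a random matrix stands in for the 1536-dimensional text-embedding-ada-002 vectors, and the embedding-retrieval step is omitted.

```python
# Hypothetical baseline sketch: a linear SVM over precomputed document
# embeddings, evaluated with the F1-score as in the paper. Random features
# stand in for real embeddings of Afrikaans poems and prose passages.
import numpy as np
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1536))      # placeholder for text-embedding-ada-002 vectors
y = rng.integers(0, 2, size=200)      # assumed encoding: 0 = poetry, 1 = prose

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
clf = SVC(kernel="linear").fit(X_train, y_train)
print("test F1:", f1_score(y_test, clf.predict(X_test)))
```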