Transformer-based Pouranic topic classification in Indian mythology
Apurba Paul, Srijan Seal, Dipankar Das
Sādhanā, published 2024-09-19
DOI: 10.1007/s12046-024-02598-6 (https://doi.org/10.1007/s12046-024-02598-6)
Abstract
Topic classification, needed to comprehend the subject matter or theme of Indian mythology, is a challenging task. It can enhance the performance of NLP-based systems, such as recommendation and semantic search engines, when they deal with mythological texts. This research develops transformer-based models for automated topic classification of Indian mythological documents, addressing the challenges of organizing and analyzing this rich and diverse corpus. We introduce PouranicTopic, a new annotated dataset containing over 200k verses from 7 major Hindu texts with canto, topic, and sentence labels. Two additional datasets, Similarity-based and Log-likelihood-based, are created using sentence clustering techniques. The BERT, RoBERTa, and DistilBERT models are evaluated for canto and topic classification on these datasets. Clustering greatly improves the results on the Similarity-based dataset, but the Log-likelihood-based dataset remains challenging.