Transformer-based Pouranic topic classification in Indian mythology

Apurba Paul, Srijan Seal, Dipankar Das
Sādhanā, published 2024-09-19. DOI: 10.1007/s12046-024-02598-6 (https://doi.org/10.1007/s12046-024-02598-6)

Abstract

Topic classification of Indian mythological texts is a challenging task, yet it is key to understanding their subject matter and themes. It can also enhance the performance of NLP-based systems, such as recommendation and semantic search engines, when they handle mythological texts. This research develops transformer-based models for automated topic classification of Indian mythological documents, addressing the challenges of organizing and analyzing this rich and diverse corpus. We introduce PouranicTopic, a new annotated dataset containing over 200k verses from 7 major Hindu texts with canto, topic, and sentence labels. Two additional datasets, Similarity-based and Log-likelihood-based, are created using sentence clustering techniques. The BERT, RoBERTa, and DistilBERT models are evaluated for canto and topic classification on these datasets. Clustering greatly improves the results on the Similarity-based dataset, but the Log-likelihood-based dataset remains challenging.
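The abstract does not say which sentence clustering technique produces the Similarity-based dataset, so the following is only a minimal sketch of one possible approach: greedy grouping of verses by bag-of-words cosine similarity. The function names, the 0.6 threshold, and the toy English lines standing in for verses are all assumptions made for illustration, not the authors' method or data.

```python
import math
from collections import Counter


def bow(text):
    """Lowercased bag-of-words count vector for one verse."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster_by_similarity(verses, threshold=0.6):
    """Greedy single-pass clustering (hypothetical helper).

    Each verse joins the first cluster whose seed verse is at least
    `threshold` similar to it; otherwise it starts a new cluster.
    """
    seeds = []   # vector of the first verse in each cluster
    labels = []  # cluster index assigned to each verse, in input order
    for verse in verses:
        vec = bow(verse)
        for idx, seed in enumerate(seeds):
            if cosine(vec, seed) >= threshold:
                labels.append(idx)
                break
        else:
            seeds.append(vec)
            labels.append(len(seeds) - 1)
    return labels


# Toy stand-ins for verses (assumed data, not taken from PouranicTopic).
verses = [
    "the king rules the kingdom",
    "the king rules the city",
    "the sage meditates in the forest",
    "the sage meditates by the river",
]
labels = cluster_by_similarity(verses)
print(labels)
```

In a real pipeline the count vectors would more plausibly be replaced by sentence embeddings from a transformer encoder, and the resulting cluster labels could then serve as grouping information when building a derived dataset.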

