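A Probabilistic Method for Hierarchical Multisubject Classification of Documents Based on Multilingual Subject Term Vocabularies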

Nikolaos Makris; Stamatina K. Koutsileou; Nikolaos Mitrou
IEEE Open Journal of the Computer Society, vol. 6, pp. 1294-1305, published 2025-07-23
DOI: 10.1109/OJCS.2025.3592254
IEEE Xplore: https://ieeexplore.ieee.org/document/11095338/
Citations: 0

Abstract

Hierarchical Multilabel Classification (HMC) is a challenging task in information retrieval, especially for scientific textbooks, where the objective is to assign multiple labels that adhere to a hierarchical taxonomy. This research presents a new language-neutral methodology for HMC that represents documents as normalised weighted distributions of well-defined subjects across hierarchical levels, based on a hierarchical subject term vocabulary. In contrast to typical methods that depend on machine learning models, the proposed approach uses Bayesian formulas, thereby obviating the need for resource-intensive training processes at various hierarchical levels. The method integrates refined pre-processing techniques, such as natural language processing (NLP) and filtering of non-distinctive terms, to enhance classification accuracy. It employs Bayesian inference along with real-time and cached computations across all hierarchical levels, yielding an effective, time-efficient and interpretable classification method while ensuring scalability for large datasets. Experimental results demonstrate that the algorithm classifies scientific textbooks across hierarchical subject tiers with significant precision and recall and retrieves semantically related scientific textbooks, thereby verifying its efficacy in tasks requiring hierarchical subject classification. This study thus presents a streamlined, interpretable alternative to model-dependent HMC approaches, making it particularly appropriate for real-world applications in educational and scientific fields. Furthermore, two public web user interfaces were published as part of this study: the first is built on Skosmos and illustrates the hierarchical structure of the subject term vocabulary, while the second applies the HMC method to present, in real time, the subject classification of English and Greek textual data.
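
The abstract does not give the paper's formulas, so the following is only a minimal sketch of the general idea it describes: scoring a document against a hierarchical subject term vocabulary with Bayes' rule, filtering non-distinctive terms, caching the per-term posteriors, and rolling leaf weights up to the parent level. The toy vocabulary, the uniform prior, the likelihood P(term | subject) = 1 / (number of terms of the subject), the `max_subject_share` threshold, and all function names are assumptions made for illustration, not the authors' method.

```python
from collections import Counter, defaultdict

# Hypothetical toy vocabulary: leaf subject -> (parent subject, subject terms).
VOCAB = {
    "Algebra": ("Mathematics", {"matrix", "field", "vector"}),
    "Calculus": ("Mathematics", {"integral", "derivative", "vector"}),
    "Optics": ("Physics", {"lens", "wave", "field", "vector"}),
}


def term_posteriors(vocab, max_subject_share=0.7):
    """Precompute (cache) P(subject | term) with Bayes' rule and a uniform prior.

    Terms that occur in more than `max_subject_share` of all subjects are
    treated as non-distinctive and dropped (an assumed filtering rule).
    """
    subjects_of = defaultdict(set)
    for subject, (_, terms) in vocab.items():
        for term in terms:
            subjects_of[term].add(subject)

    posteriors = {}
    for term, subjects in subjects_of.items():
        if len(subjects) / len(vocab) > max_subject_share:
            continue  # non-distinctive term, e.g. "vector" in the toy data
        # Assumed likelihood: P(term | subject) = 1 / |terms of subject|.
        # The uniform prior P(subject) cancels in the normalisation.
        likelihood = {s: 1.0 / len(vocab[s][1]) for s in subjects}
        total = sum(likelihood.values())
        posteriors[term] = {s: p / total for s, p in likelihood.items()}
    return posteriors


def classify(tokens, vocab, posteriors):
    """Return normalised subject distributions at the parent and leaf levels."""
    counts = Counter(t for t in tokens if t in posteriors)
    leaf = defaultdict(float)
    for term, n in counts.items():
        for subject, p in posteriors[term].items():
            leaf[subject] += n * p
    total = sum(leaf.values())
    if total == 0:
        return {}, {}  # no vocabulary term found in the document
    leaf = {s: w / total for s, w in leaf.items()}

    # A parent's weight is the sum of its children's weights.
    parent = defaultdict(float)
    for subject, w in leaf.items():
        parent[vocab[subject][0]] += w
    return dict(parent), leaf


if __name__ == "__main__":
    cached = term_posteriors(VOCAB)
    doc = "the matrix and the integral near a lens lens vector".split()
    top_level, leaf_level = classify(doc, VOCAB, cached)
    print(top_level)
    print(leaf_level)
```

Because the posteriors depend only on the vocabulary, they can be computed once and reused for every document, which is one plausible reading of the "cached computations" mentioned in the abstract; only the per-document counting and normalisation happen at request time.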