Vector Space Model based Topic Retrieval from Bengali Documents

Topu Dash Roy, Shamima Khatun, Rubina Begum, Al Mehdi Saadat Chowdhury
Published in: 2018 International Conference on Innovations in Science, Engineering and Technology (ICISET), pp. 60-63, 2018-10-01. DOI: 10.1109/ICISET.2018.8745587 (https://doi.org/10.1109/ICISET.2018.8745587)
Citations: 3

Abstract

This work identifies the topic of a Bengali text document using the Vector Space Model, a traditional similarity-based retrieval model. The model is well established in the research community but, to the best of our knowledge, has not previously been applied to Bengali topic retrieval. We therefore evaluate four settings of the vector space model: TF-IDF weighting with Euclidean distance, TF-IDF weighting with Manhattan distance, TF-IDF weighting with cosine similarity, and an improved document scoring scheme. The K-nearest neighbor algorithm is then used to retrieve the topic of a query document. For training and testing, we also created a large corpus of Bengali text documents. On this corpus, the best retrieval accuracy obtained is 93.33%.
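The three TF-IDF settings and the K-nearest neighbor step named in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and not the authors' implementation: the function names, the particular TF-IDF variant (relative term frequency times log(N/df)), the majority-vote rule, the value k=3, and the toy English tokens standing in for tokenized Bengali text. The improved document scoring scheme is not described in the abstract, so it is not sketched here.

```python
import math
from collections import Counter

def weigh(tokens, idf):
    """Sparse TF-IDF vector for one document (terms unseen in training are dropped)."""
    tf = Counter(tokens)
    return {t: (c / len(tokens)) * idf[t] for t, c in tf.items() if t in idf}

def tfidf_vectors(docs):
    """Build IDF from the training documents and return (vectors, idf)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log(n / df[t]) for t in df}
    return [weigh(doc, idf) for doc in docs], idf

def euclidean(a, b):
    return math.sqrt(sum((a.get(t, 0.0) - b.get(t, 0.0)) ** 2 for t in a.keys() | b.keys()))

def manhattan(a, b):
    return sum(abs(a.get(t, 0.0) - b.get(t, 0.0)) for t in a.keys() | b.keys())

def cosine_similarity(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def knn_topic(query_tokens, train_docs, train_topics, k=3, measure="cosine"):
    """Return the majority topic among the k training documents closest to the query."""
    vectors, idf = tfidf_vectors(train_docs)
    q = weigh(query_tokens, idf)
    if measure == "cosine":
        # Higher similarity means closer, so negate it to sort ascending.
        scored = [(-cosine_similarity(q, v), topic) for v, topic in zip(vectors, train_topics)]
    else:
        dist = euclidean if measure == "euclidean" else manhattan
        scored = [(dist(q, v), topic) for v, topic in zip(vectors, train_topics)]
    scored.sort(key=lambda pair: pair[0])
    votes = Counter(topic for _, topic in scored[:k])
    return votes.most_common(1)[0][0]

# Hypothetical toy corpus; real input would be tokenized Bengali documents.
train_docs = [
    ["cricket", "match", "team", "run"],
    ["football", "team", "goal", "match"],
    ["election", "vote", "party", "minister"],
    ["parliament", "minister", "vote", "law"],
]
train_topics = ["sports", "sports", "politics", "politics"]
query = ["team", "match", "goal"]

for measure in ("euclidean", "manhattan", "cosine"):
    print(measure, knn_topic(query, train_docs, train_topics, k=3, measure=measure))
```

Sparse dictionaries are used instead of dense arrays so the sketch stays dependency-free; a larger corpus would more likely use a sparse matrix library.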