Vector Space Model based Topic Retrieval from Bengali Documents

2018 International Conference on Innovations in Science, Engineering and Technology (ICISET). Pub Date: 2018-10-01. DOI: 10.1109/ICISET.2018.8745587

Topu Dash Roy, Shamima Khatun, Rubina Begum, Al Mehdi Saadat Chowdhury

Citations: 3

Abstract

This work attempts to find the topic of a Bengali text document using a traditional similarity-based retrieval model, the Vector Space Model. The model is well established in the research community but, to the best of our knowledge, has never been tried for Bengali topic retrieval. We therefore use four settings of the vector space model: the TF-IDF weighting scheme with Euclidean distance, the TF-IDF weighting scheme with Manhattan distance, the TF-IDF weighting scheme with cosine similarity, and an improved document scoring scheme. The K-nearest neighbor algorithm is then used to retrieve the topic of a query document. For training and testing, we have also created a large corpus of Bengali text documents. On this corpus, our best retrieval accuracy is 93.33%.
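The pipeline outlined in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The toy corpus, the English placeholder tokens standing in for tokenized Bengali text, the IDF variant `log(N / df)`, the value `k=3`, and all function names are assumptions made for illustration. The improved document scoring scheme is omitted because the abstract does not describe it.

```python
import math
from collections import Counter


def weight(tokens, idf):
    """Sparse TF-IDF vector for one tokenized document (terms unseen in training are dropped)."""
    tf = Counter(tokens)
    return {t: (c / len(tokens)) * idf[t] for t, c in tf.items() if t in idf}


def fit_tfidf(docs):
    """Return (vectors, idf) for a list of tokenized training documents."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log(n / df[t]) for t in df}  # assumed IDF variant
    return [weight(doc, idf) for doc in docs], idf


def euclidean(a, b):
    return math.sqrt(sum((a.get(t, 0.0) - b.get(t, 0.0)) ** 2 for t in set(a) | set(b)))


def manhattan(a, b):
    return sum(abs(a.get(t, 0.0) - b.get(t, 0.0)) for t in set(a) | set(b))


def cosine_distance(a, b):
    """1 - cosine similarity, so that smaller means closer, like the other two measures."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def knn_topic(query_vec, train_vecs, labels, distance, k=3):
    """Majority topic among the k training documents closest to the query."""
    ranked = sorted(range(len(train_vecs)), key=lambda i: distance(query_vec, train_vecs[i]))
    votes = Counter(labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy corpus; a real run would use tokenized Bengali documents.
train_docs = [
    ["cricket", "match", "team", "run"],
    ["football", "team", "goal", "match"],
    ["election", "vote", "party", "minister"],
    ["parliament", "vote", "minister", "law"],
]
labels = ["sports", "sports", "politics", "politics"]

train_vecs, idf = fit_tfidf(train_docs)
query_vec = weight(["team", "match", "goal"], idf)

for name, dist in [("euclidean", euclidean), ("manhattan", manhattan), ("cosine", cosine_distance)]:
    print(name, knn_topic(query_vec, train_vecs, labels, dist))
```

Cosine similarity is converted to a distance here only so that one `knn_topic` function can rank neighbors for all three settings. Ranking by descending similarity would give the same neighbors.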



