Topu Dash Roy, Shamima Khatun, Rubina Begum, Al Mehdi Saadat Chowdhury
Vector Space Model based Topic Retrieval from Bengali Documents
2018 International Conference on Innovations in Science, Engineering and Technology (ICISET), pp. 60-63. Published 2018-10-01.
DOI: 10.1109/ICISET.2018.8745587 (https://doi.org/10.1109/ICISET.2018.8745587)
Citations: 3
Abstract
This work attempts to identify the topic of a Bengali text document using the Vector Space Model, a traditional similarity-based retrieval model. The model is well established in the research community but, to the best of our knowledge, has not previously been applied to Bengali topic retrieval. We therefore evaluate four configurations of the vector space model: TF-IDF weighting with Euclidean distance, TF-IDF weighting with Manhattan distance, TF-IDF weighting with cosine similarity, and an improved document scoring scheme. The K-nearest neighbor algorithm is then used to retrieve the topic of a query document. For training and testing, we also created a large corpus of Bengali text documents. On this corpus, the best retrieval accuracy obtained is 93.33%.
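The pipeline described in the abstract (TF-IDF vectors, a distance or similarity measure, then a K-nearest neighbor vote) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the IDF formula `log(N/df) + 1`, the value `k=3`, and the tiny romanized toy corpus are all hypothetical, and the "improved document scoring scheme" is omitted because the abstract does not define it.

```python
import math
from collections import Counter


def weigh(tokens, idf):
    """Sparse TF-IDF vector for one tokenized document (terms unseen in training are dropped)."""
    tf = Counter(tokens)
    return {t: (c / len(tokens)) * idf[t] for t, c in tf.items() if t in idf}


def fit_tfidf(docs):
    """Compute IDF over the training documents and return (vectors, idf)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    # Assumed IDF variant: log(N / df) + 1, so terms in every document keep a nonzero weight.
    idf = {t: math.log(n / df[t]) + 1.0 for t in df}
    return [weigh(doc, idf) for doc in docs], idf


def euclidean(a, b):
    keys = set(a) | set(b)
    return math.sqrt(sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys))


def manhattan(a, b):
    keys = set(a) | set(b)
    return sum(abs(a.get(k, 0.0) - b.get(k, 0.0)) for k in keys)


def cosine_distance(a, b):
    """1 - cosine similarity, so that smaller always means closer."""
    dot = sum(w * b.get(k, 0.0) for k, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def knn_topic(query_tokens, train_vecs, labels, idf, distance, k=3):
    """Majority topic among the k training documents closest to the query."""
    q = weigh(query_tokens, idf)
    ranked = sorted(range(len(train_vecs)), key=lambda i: distance(q, train_vecs[i]))
    votes = Counter(labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy, already-tokenized corpus (hypothetical placeholder tokens, not data from the paper).
train_docs = [
    ["cricket", "match", "team", "win"],
    ["football", "team", "goal", "match"],
    ["election", "vote", "party", "minister"],
    ["minister", "party", "parliament", "vote"],
]
train_labels = ["sports", "sports", "politics", "politics"]
train_vecs, idf = fit_tfidf(train_docs)

for name, dist in [("euclidean", euclidean), ("manhattan", manhattan), ("cosine", cosine_distance)]:
    topic = knn_topic(["team", "match", "goal"], train_vecs, train_labels, idf, dist)
    print(name, topic)
```

Cosine similarity is converted to a distance here so that all three measures can share one ranking routine. On real Bengali text the tokenization step, which this sketch assumes has already been done, would need language-appropriate handling.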