Term Clustering and Confidence Measurement in Document Clustering
K. Csorba, I. Vajk
2006 IEEE International Conference on Computational Cybernetics, 2006-08-01
DOI: https://doi.org/10.1109/ICCCYB.2006.305694
Citation count: 2
Abstract
This paper presents a novel topic-based document clustering technique for situations in which not every document needs to be assigned to a cluster. Under such conditions, the clustering system can provide a much cleaner result by rejecting the classification of documents with an ambiguous topic. This is achieved by applying a confidence measurement to every classification result and discarding documents whose confidence value is below a predefined lower limit. In other words, the system returns the classification for a document only if it is sure about it; otherwise, the document is marked as "unsure". Besides this ability, the confidence measurement allows the use of much stronger term filtering, performed by a novel, supervised term cluster creation and term filtering algorithm, which is also presented in this paper.
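The rejection step described in the abstract can be illustrated with a minimal sketch. The abstract does not say how the confidence value is computed, so the sketch assumes one simple possibility: the margin between the best and second-best cluster scores. The function name `classify_with_rejection`, the `scores` and `labels` inputs, and the `min_confidence` parameter are all hypothetical and are not taken from the paper.

```python
from typing import Sequence

UNSURE = "unsure"


def classify_with_rejection(
    scores: Sequence[float],
    labels: Sequence[str],
    min_confidence: float,
) -> str:
    """Return the best-scoring label, or UNSURE if the confidence is too low.

    Assumption (not from the paper): confidence is the gap between the
    highest and the second-highest cluster score.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if len(scores) < 2:
        raise ValueError("at least two clusters are needed to compute a margin")

    # Rank clusters by score, highest first.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    best_score, best_label = ranked[0]
    second_score, _ = ranked[1]

    confidence = best_score - second_score
    # Documents below the predefined lower limit are rejected, not forced into a cluster.
    return best_label if confidence >= min_confidence else UNSURE
```

With a limit of 0.3, a document scored `[0.9, 0.1]` against two clusters would keep its best label, while one scored `[0.5, 0.45]` would be marked as "unsure". Raising the limit trades coverage for cleaner clusters, which is the trade-off the abstract describes.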