Similarity model and term association for document categorization
Huaizhong Kou, G. Gardarin
Proceedings. 13th International Workshop on Database and Expert Systems Applications, 2002-09-02
DOI: 10.1109/DEXA.2002.1045908 (https://doi.org/10.1109/DEXA.2002.1045908)
Citations: 13
Abstract
Euclidean-distance and cosine-based similarity models are both widely used to measure document similarity in information retrieval and document categorization. Both models assume that term vectors are orthogonal. This assumption does not hold in practice, and as a result these models ignore associations between terms. In the document categorization context, we analyze the properties of the term-document space, the term-category space and the category-document space. Then, without assuming term independence, we propose a new mathematical model to estimate the association between terms and define an ε-similarity model of documents. The model makes as much use as possible of the category membership already recorded in the corpus, with the objective of improving categorization performance. Experiments were carried out with a k-NN classifier over the Reuters-5178 corpus. The empirical results show that using term association can improve the effectiveness of the categorization system, and that the ε-similarity model outperforms models without term association.
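The abstract does not give the formula for the ε-similarity model, so the following is only a minimal sketch of the general idea it describes: a cosine measure in which terms are not assumed orthogonal. The association matrix `assoc`, its values, the three-term vocabulary, and the helper names (`bilinear`, `association_cosine`, `knn_classify`) are all illustrative assumptions, not the authors' model. With the identity matrix the measure reduces to the ordinary cosine; with off-diagonal entries, documents that share no terms can still be similar.

```python
import math
from collections import Counter

def bilinear(u, v, assoc):
    """Generalized inner product u^T A v, where A holds term-term associations."""
    return sum(u[i] * assoc[i][j] * v[j]
               for i in range(len(u)) for j in range(len(v)))

def association_cosine(u, v, assoc):
    """Cosine similarity under the inner product induced by assoc.

    assoc is assumed symmetric and positive semi-definite (hypothetical input).
    """
    denom = math.sqrt(bilinear(u, u, assoc) * bilinear(v, v, assoc))
    return bilinear(u, v, assoc) / denom if denom > 0 else 0.0

def knn_classify(query, labeled_docs, assoc, k=1):
    """Majority vote among the k labeled documents most similar to query."""
    ranked = sorted(labeled_docs,
                    key=lambda item: association_cosine(query, item[0], assoc),
                    reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical vocabulary: ["car", "automobile", "banana"]
identity = [[1.0, 0.0, 0.0],
            [0.0, 1.0, 0.0],
            [0.0, 0.0, 1.0]]
# Made-up association values: "car" and "automobile" are strongly related.
assoc = [[1.0, 0.8, 0.0],
         [0.8, 1.0, 0.0],
         [0.0, 0.0, 1.0]]

doc_car = [1.0, 0.0, 0.0]
doc_automobile = [0.0, 1.0, 0.0]
doc_banana = [0.0, 0.0, 1.0]

plain = association_cosine(doc_car, doc_automobile, identity)   # 0.0
related = association_cosine(doc_car, doc_automobile, assoc)    # 0.8

training = [(doc_banana, "food"), (doc_automobile, "vehicles")]
predicted = knn_classify(doc_car, training, assoc, k=1)         # "vehicles"

print(plain, related, predicted)
```

Under the orthogonality assumption the "car" and "automobile" documents have similarity 0, so a k-NN classifier has no basis for preferring the "vehicles" neighbor; with the association matrix the same pair scores 0.8 and the neighbor is ranked first.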