{"title":"基于可扩展相似性的聚类层次聚类框架的聚类假设检验","authors":"Xinyu Wang, Julien Ah-Pine, J. Darmont","doi":"10.24348/coria.2017.RJCRI_15","DOIUrl":null,"url":null,"abstract":"The Cluster Hypothesis is the fundamental assumption of using clustering in Information Retrieval. It states that similar documents tend to be relevant to the same query. Past research works extensively test this hypothesis using agglomerative hierarchical clustering (AHC) methods. However, their conclusions are not consistent concerning retrieval effectiveness for a given clustering method. The main limit of these works is the scalability issue of AHC. In this paper, we extend our previous work to a new test of the cluster hypothesis by applying a scalable similarity-based AHC framework. Principally, the input pairwise cosine similarity matrix is sparsified by given threshold values to reduce memory usage and running time. Our experiments show that even when the similarity matrix is largely sparsified, retrieval effectiveness is retained for all tested methods. Moreover, for two clustering methods, complete link and average link, they do not always dominate the other methods as reported in past works.","PeriodicalId":390974,"journal":{"name":"Conférence en Recherche d'Infomations et Applications","volume":"12 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2017-03-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":"{\"title\":\"A New Test of Cluster Hypothesis Using a Scalable Similarity-Based Agglomerative Hierarchical Clustering Framework\",\"authors\":\"Xinyu Wang, Julien Ah-Pine, J. Darmont\",\"doi\":\"10.24348/coria.2017.RJCRI_15\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"The Cluster Hypothesis is the fundamental assumption of using clustering in Information Retrieval. It states that similar documents tend to be relevant to the same query. Past research works extensively test this hypothesis using agglomerative hierarchical clustering (AHC) methods. However, their conclusions are not consistent concerning retrieval effectiveness for a given clustering method. The main limit of these works is the scalability issue of AHC. In this paper, we extend our previous work to a new test of the cluster hypothesis by applying a scalable similarity-based AHC framework. Principally, the input pairwise cosine similarity matrix is sparsified by given threshold values to reduce memory usage and running time. Our experiments show that even when the similarity matrix is largely sparsified, retrieval effectiveness is retained for all tested methods. Moreover, for two clustering methods, complete link and average link, they do not always dominate the other methods as reported in past works.\",\"PeriodicalId\":390974,\"journal\":{\"name\":\"Conférence en Recherche d'Infomations et Applications\",\"volume\":\"12 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2017-03-29\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"1\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Conférence en Recherche d'Infomations et Applications\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.24348/coria.2017.RJCRI_15\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Conférence en Recherche d'Infomations et Applications","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.24348/coria.2017.RJCRI_15","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
A New Test of Cluster Hypothesis Using a Scalable Similarity-Based Agglomerative Hierarchical Clustering Framework
The Cluster Hypothesis is the fundamental assumption of using clustering in Information Retrieval. It states that similar documents tend to be relevant to the same query. Past research works extensively test this hypothesis using agglomerative hierarchical clustering (AHC) methods. However, their conclusions are not consistent concerning retrieval effectiveness for a given clustering method. The main limit of these works is the scalability issue of AHC. In this paper, we extend our previous work to a new test of the cluster hypothesis by applying a scalable similarity-based AHC framework. Principally, the input pairwise cosine similarity matrix is sparsified by given threshold values to reduce memory usage and running time. Our experiments show that even when the similarity matrix is largely sparsified, retrieval effectiveness is retained for all tested methods. Moreover, for two clustering methods, complete link and average link, they do not always dominate the other methods as reported in past works.