{"title":"基于半监督图聚类的种子选择","authors":"C. Le, V. Vu, L. K. Oanh, Nguyen Thi Hai Yen","doi":"10.15625/1813-9663/35/4/14123","DOIUrl":null,"url":null,"abstract":"Though clustering algorithms have long history, nowadays clustering topic still attracts a lot of attention because of the need of efficient data analysis tools in many applications such as social network, electronic commerce, GIS, etc. Recently, semi-supervised clustering, for example, semi-supervised K-Means, semi-supervised DBSCAN, semi-supervised graph-based clustering (SSGC) etc., which uses side information, has received a great deal of attention. Generally, there are two forms of side information: seed form (labeled data) and constraint form (must-link, cannot-link). By integrating information provided by the user or domain expert, the semi-supervised clustering can produce expected results. In fact, clustering results usually depend on side information provided, so different side information will produce different results of clustering. In some cases, the performance of clustering may decrease if the side information is not carefully chosen. This paper addresses the problem of efficient collection of seeds for semi-supervised clustering, especially for graph based clustering by seeding (SSGC). The properly collected seeds can boost the quality of clustering and minimize the number of queries solicited from the user. For this purpose, we have developed an active learning algorithm (called SKMMM) for the seeds collection task, which identifies candidates to solicit users by using the K-Means and min-max algorithms. Experiments conducted on real data sets from UCI and a real collected document data set show the effectiveness of our approach compared with other methods.","PeriodicalId":15444,"journal":{"name":"Journal of Computer Science and Cybernetics","volume":"94 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2019-10-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"CHOOSING SEEDS FOR SEMI-SUPERVISED GRAPH BASED CLUSTERING\",\"authors\":\"C. Le, V. Vu, L. K. Oanh, Nguyen Thi Hai Yen\",\"doi\":\"10.15625/1813-9663/35/4/14123\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Though clustering algorithms have long history, nowadays clustering topic still attracts a lot of attention because of the need of efficient data analysis tools in many applications such as social network, electronic commerce, GIS, etc. Recently, semi-supervised clustering, for example, semi-supervised K-Means, semi-supervised DBSCAN, semi-supervised graph-based clustering (SSGC) etc., which uses side information, has received a great deal of attention. Generally, there are two forms of side information: seed form (labeled data) and constraint form (must-link, cannot-link). By integrating information provided by the user or domain expert, the semi-supervised clustering can produce expected results. In fact, clustering results usually depend on side information provided, so different side information will produce different results of clustering. In some cases, the performance of clustering may decrease if the side information is not carefully chosen. This paper addresses the problem of efficient collection of seeds for semi-supervised clustering, especially for graph based clustering by seeding (SSGC). The properly collected seeds can boost the quality of clustering and minimize the number of queries solicited from the user. For this purpose, we have developed an active learning algorithm (called SKMMM) for the seeds collection task, which identifies candidates to solicit users by using the K-Means and min-max algorithms. Experiments conducted on real data sets from UCI and a real collected document data set show the effectiveness of our approach compared with other methods.\",\"PeriodicalId\":15444,\"journal\":{\"name\":\"Journal of Computer Science and Cybernetics\",\"volume\":\"94 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2019-10-31\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Journal of Computer Science and Cybernetics\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.15625/1813-9663/35/4/14123\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Computer Science and Cybernetics","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.15625/1813-9663/35/4/14123","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
CHOOSING SEEDS FOR SEMI-SUPERVISED GRAPH BASED CLUSTERING
Though clustering algorithms have long history, nowadays clustering topic still attracts a lot of attention because of the need of efficient data analysis tools in many applications such as social network, electronic commerce, GIS, etc. Recently, semi-supervised clustering, for example, semi-supervised K-Means, semi-supervised DBSCAN, semi-supervised graph-based clustering (SSGC) etc., which uses side information, has received a great deal of attention. Generally, there are two forms of side information: seed form (labeled data) and constraint form (must-link, cannot-link). By integrating information provided by the user or domain expert, the semi-supervised clustering can produce expected results. In fact, clustering results usually depend on side information provided, so different side information will produce different results of clustering. In some cases, the performance of clustering may decrease if the side information is not carefully chosen. This paper addresses the problem of efficient collection of seeds for semi-supervised clustering, especially for graph based clustering by seeding (SSGC). The properly collected seeds can boost the quality of clustering and minimize the number of queries solicited from the user. For this purpose, we have developed an active learning algorithm (called SKMMM) for the seeds collection task, which identifies candidates to solicit users by using the K-Means and min-max algorithms. Experiments conducted on real data sets from UCI and a real collected document data set show the effectiveness of our approach compared with other methods.