Validation Measures for Clustering Algorithms Incorporating Biological Information
S. Datta, S. Datta
First International Multi-Symposiums on Computer and Computational Sciences (IMSCCS'06), 2006-06-20
DOI: https://doi.org/10.1109/IMSCCS.2006.139
Citations: 5
Abstract
Cluster analysis is the most commonly performed procedure on a set of gene expression profiles and is often regarded as a first step. A closely related problem is selecting, from the long list of clustering algorithms that currently exist, one that is optimal in some sense. In this paper, we propose two validation measures, each with two parts: one measures the statistical consistency (stability) of the clusters produced, and the other represents their biological functional consistency. A good clustering algorithm should have small values for these measures. We illustrate our methods using two sets of expression profiles obtained from a breast cancer data set. Six well-known clustering algorithms were evaluated: UPGMA, k-means, Diana, Fanny, model-based clustering, and SOM. Although the exact ordering depends on the particular data set (expression profiles) used and the validation measure employed, UPGMA appears overall to be optimal for the cancer data set we considered.
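The abstract does not give formulas for the measures, so the following is only a minimal sketch of how a stability component might be computed, under stated assumptions. It assumes a leave-one-condition-out scheme: cluster the full profiles, cluster again with one expression column removed, and average the proportion of each gene's original cluster-mates that it loses. The names `stability_score` and `sign_of_mean` are hypothetical, and `sign_of_mean` is a toy stand-in for a real algorithm such as UPGMA or k-means. As in the abstract, a smaller value indicates a better (more stable) clustering.

```python
from typing import Callable, List, Sequence

# A clustering function maps a list of expression profiles to one label per gene.
ClusterFn = Callable[[List[List[float]]], List[int]]


def stability_score(profiles: Sequence[Sequence[float]], cluster_fn: ClusterFn) -> float:
    """Average proportion of cluster-mates a gene loses when one column is dropped.

    Assumed leave-one-condition-out scheme (not taken from the abstract).
    Returns a value in [0, 1]; lower means more stable.
    """
    n_genes = len(profiles)
    n_cond = len(profiles[0])
    full = cluster_fn([list(p) for p in profiles])

    total = 0.0
    for drop in range(n_cond):
        # Re-cluster with condition `drop` removed from every profile.
        reduced = [[v for j, v in enumerate(p) if j != drop] for p in profiles]
        part = cluster_fn(reduced)
        for g in range(n_genes):
            mates_full = {h for h in range(n_genes) if full[h] == full[g]}
            mates_part = {h for h in range(n_genes) if part[h] == part[g]}
            overlap = len(mates_full & mates_part) / len(mates_full)
            total += 1.0 - overlap
    return total / (n_genes * n_cond)


def sign_of_mean(rows: List[List[float]]) -> List[int]:
    """Toy two-cluster rule (hypothetical): split genes by the sign of their mean."""
    return [0 if sum(r) / len(r) < 0 else 1 for r in rows]


if __name__ == "__main__":
    toy_profiles = [[3, -1, -1], [1, 1, 1], [-1, -1, -1]]
    print(stability_score(toy_profiles, sign_of_mean))
```

Passing the clustering algorithm in as a function keeps the score independent of any particular method, so the same call could in principle be repeated for each candidate algorithm and the results compared.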