{"title":"Quality-based distance measures and applications to clustering","authors":"D. Taverna, M. Brun, E. Dougherty, Yidong Chen","doi":"10.1109/LSSA.2006.250390","DOIUrl":null,"url":null,"abstract":"When analyzing biological data sets, a common approach is to partition the data into clusters. Examples of this include finding a subset of genes with co-regulated expression among experiments, grouping similar disease phenotypes, or implicating regions of genetic variation in disease. The ability to separate the data into subsets depends upon the structure of the distribution of points and the choice of clustering algorithm. Furthermore, the biological relevance of the clustering results is biased by the variation among the data points themselves. We introduce a mathematical quality-based distance metric which will allow all data, regardless of its error, to be included in analysis without the need to introduce a cutoff. This removes the need to exclude points or to change the dimensionality. The advantage of this approach is shown by clustering simulated data with added noise","PeriodicalId":360097,"journal":{"name":"2006 IEEE/NLM Life Science Systems and Applications Workshop","volume":"5 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2006-11-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2006 IEEE/NLM Life Science Systems and Applications Workshop","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/LSSA.2006.250390","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0
Abstract
When analyzing biological data sets, a common approach is to partition the data into clusters. Examples of this include finding a subset of genes with co-regulated expression among experiments, grouping similar disease phenotypes, or implicating regions of genetic variation in disease. The ability to separate the data into subsets depends upon the structure of the distribution of points and the choice of clustering algorithm. Furthermore, the biological relevance of the clustering results is biased by the variation among the data points themselves. We introduce a mathematical quality-based distance metric which will allow all data, regardless of its error, to be included in analysis without the need to introduce a cutoff. This removes the need to exclude points or to change the dimensionality. The advantage of this approach is shown by clustering simulated data with added noise