CM-test: An Innovative Divergence Measurement and Its Application in Diabetes Gene Expression Data Analysis
Authors: L. Liang, Shiyong Lu, Yi Lu, P. Dhawan, D. Kumar
Published in: 2006 IEEE International Conference on Granular Computing, 2006-05-10
DOI: 10.1109/GRC.2006.1635794 (https://doi.org/10.1109/GRC.2006.1635794)
Citations: 5
Abstract
An important problem in data analysis is to effectively measure the divergence between two sets of values of a feature, each drawn from a group of samples under a particular condition. Such a measurement is the foundation for identifying the critical features that contribute to the difference between the two conditions. The two traditional methods, the t-test and the Wilcoxon rank-sum test, measure this divergence indirectly: the former uses the difference of the means of the two groups, and the latter uses the sum of the ranks from one of the groups. In this paper, we propose an innovative approach based on fuzzy set theory, the Cluster Misclassification test (CM-test), to quantify the divergence directly and robustly. To validate our approach, we conducted experiments on both synthetic and real diabetes gene expression datasets. On the synthetic datasets, we observed that CM-test effectively quantifies the divergence of two sets. On the real diabetes dataset, we observed that eight of the top ten genes identified by CM-test have been confirmed in the literature to be associated with diabetes. We suggest the remaining two genes, M95610 and M88461, as potential diabetic genes for further biological investigation. We therefore recommend CM-test as another effective method for measuring the divergence of two sets, complementing the t-test and the Wilcoxon rank-sum test in practice.
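To make the two indirect measures mentioned in the abstract concrete, the following minimal Python sketch computes a t statistic (difference of group means over its standard error) and a Wilcoxon rank-sum statistic (sum of the pooled ranks of one group). It is illustrative only: the function names `welch_t` and `rank_sum`, the choice of the unequal-variance (Welch) form of the t statistic, and the sample values are assumptions, not taken from the paper. CM-test itself is not sketched because the abstract does not define it.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """t statistic: difference of the group means divided by its standard error.

    Uses the unequal-variance (Welch) form; this choice is an assumption.
    """
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se


def rank_sum(a, b):
    """Wilcoxon rank-sum statistic: sum of the ranks of group `a` in the pooled sample.

    Tied values receive the average of the ranks they span.
    """
    pooled = sorted(list(a) + list(b))
    avg_rank = {}
    i = 0
    while i < len(pooled):
        j = i
        # Advance j past every value tied with pooled[i]
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        # The tied block occupies 1-based ranks i+1 .. j; store their mean
        avg_rank[pooled[i]] = (i + 1 + j) / 2
        i = j
    return sum(avg_rank[x] for x in a)


# Hypothetical expression values of one gene under two conditions
group_a = [1.0, 2.0, 3.0]
group_b = [4.0, 5.0, 6.0]
print(welch_t(group_a, group_b))
print(rank_sum(group_a, group_b))
```

Both statistics summarize each group through a single quantity (a mean or a rank total), which is the sense in which the abstract calls them indirect measures of divergence.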