CM-test: An Innovative Divergence Measurement and Its Application in Diabetes Gene Expression Data Analysis

2006 IEEE International Conference on Granular Computing. Pub Date: 2006-05-10. DOI: 10.1109/GRC.2006.1635794

L. Liang, Shiyong Lu, Yi Lu, P. Dhawan, D. Kumar


Citations: 5

Abstract

One important problem in data analysis is to effectively measure the divergence of two sets of values of a feature, each taken from a group of samples with a particular condition. Such a measurement is the foundation for identifying critical features that contribute to the difference between the two conditions. The two traditional methods, the t-test and the Wilcoxon rank sum test, measure this divergence indirectly, using the difference of the means of the two groups and the sum of the ranks from one of the groups, respectively. In this paper, we propose an innovative approach based on fuzzy set theory, the Cluster Misclassification test (CM-test), to quantify the divergence directly and robustly. To validate our approach, we conducted experiments on both synthetic and real diabetes gene expression datasets. On the synthetic datasets, we observed that CM-test effectively quantifies the divergence of two sets. On the real diabetes dataset, we observed that eight of the top ten genes identified by CM-test have been confirmed in the literature to be associated with diabetes. We suggest the remaining two genes, M95610 and M88461, as potential diabetic genes for further biological investigation. We therefore recommend CM-test as another effective method for measuring the divergence of two sets, complementing the t-test and the Wilcoxon rank sum test in practice.
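To make the two indirect measures named in the abstract concrete, here is a minimal Python sketch of the quantities they are built on: a t statistic derived from the difference of the group means, and the sum of the ranks of one group in the pooled sample. The function names `welch_t` and `rank_sum` and the sample values are illustrative assumptions, and the Welch (unequal-variance) form is also an assumption, since the abstract says only "t-test". The sketch does not implement CM-test, which the abstract does not define.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """t statistic based on the difference of the two group means.

    Assumption: Welch's unequal-variance form; the abstract only says "t-test".
    """
    standard_error = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error


def rank_sum(a, b):
    """Sum of the ranks of group a within the pooled sample.

    Tied values receive the average of the ranks they occupy.
    """
    pooled = sorted(list(a) + list(b))
    rank_of = {}
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        # Positions i..j-1 hold 1-based ranks i+1..j; store their mean.
        rank_of[pooled[i]] = (i + 1 + j) / 2
        i = j
    return sum(rank_of[x] for x in a)


# Hypothetical expression values for one gene under two conditions.
condition_a = [1.0, 2.0, 3.0]
condition_b = [4.0, 5.0, 6.0]

print(welch_t(condition_a, condition_b))   # negative: group a has the lower mean
print(rank_sum(condition_a, condition_b))  # 6.0: group a holds ranks 1, 2, 3
```

Both functions reduce each group to a summary (a mean or a set of ranks) before comparing, which is the sense in which the abstract calls these measures indirect.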

