Minimum entropy clustering and applications to gene expression analysis.

Proceedings. IEEE Computational Systems Bioinformatics Conference. Pub Date: 2004-01-01. DOI: 10.1109/csb.2004.1332427

Haifeng Li, Keshu Zhang, Tao Jiang

Pages: 142-51

Citations: 0

Abstract

Clustering is a common methodology for analyzing gene expression data. In this paper, we present a new clustering algorithm from an information-theoretic point of view. First, we propose the minimum entropy criterion (measured on a posteriori probabilities), which is the conditional entropy of the clusters given the observations. Fano's inequality indicates that it could be a good criterion for clustering. We generalize the criterion by replacing Shannon's entropy with Havrda-Charvat's structural alpha-entropy. Interestingly, when alpha = 2, the minimum entropy criterion based on structural alpha-entropy is equal to the probability of error of the nearest neighbor method. This is further evidence that the proposed criterion is good for clustering. Using a non-parametric approach to estimate the a posteriori probabilities, we then establish an efficient iterative algorithm to minimize the entropy. The experimental results show that the clustering algorithm performs significantly better than k-means/medians, hierarchical clustering, SOM, and EM in terms of the adjusted Rand index. In particular, our algorithm performs very well even when the correct number of clusters is unknown. In addition, most clustering algorithms produce poor partitions in the presence of outliers, while our method can correctly reveal the structure of the data and effectively identify outliers at the same time.
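The criterion described in the abstract can be illustrated with a small sketch. It is not the authors' implementation. It assumes the a posteriori probabilities are already available as one row per observation, and it uses a commonly cited normalization of the Havrda-Charvat structural alpha-entropy, H_alpha(p) = (2^(1-alpha) - 1)^(-1) (sum_i p_i^alpha - 1), which may differ from the paper's by a constant factor. The function names are hypothetical.

```python
import math


def shannon_entropy(p):
    """Shannon entropy in bits of one posterior vector p (zero terms are skipped)."""
    return -sum(q * math.log2(q) for q in p if q > 0)


def structural_alpha_entropy(p, alpha):
    """Havrda-Charvat structural alpha-entropy of one posterior vector p.

    Assumed normalization: (2**(1 - alpha) - 1)**-1 * (sum(p_i**alpha) - 1),
    defined for alpha > 0 and alpha != 1.
    """
    if alpha <= 0 or alpha == 1:
        raise ValueError("alpha must be positive and different from 1")
    scale = 1.0 / (2.0 ** (1.0 - alpha) - 1.0)
    return scale * (sum(q ** alpha for q in p) - 1.0)


def mean_entropy(posteriors, entropy_fn):
    """Empirical criterion: entropy of the cluster posteriors averaged over observations."""
    return sum(entropy_fn(row) for row in posteriors) / len(posteriors)


# Hypothetical posteriors for three observations and two clusters.
posteriors = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
print(mean_entropy(posteriors, shannon_entropy))
# With alpha = 2 the assumed formula reduces to 2 * (1 - sum(p_i**2)).
print(mean_entropy(posteriors, lambda p: structural_alpha_entropy(p, 2.0)))
```

Lower values mean that observations are assigned to clusters with more certainty, which is what a minimum entropy partition aims for. How the posteriors are estimated and how the criterion is minimized are left out here, since the abstract does not give those details.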
