Clustering mixed numerical and low quality categorical data: significance metrics on a yeast example

Information Quality in Information Systems. Pub Date: 2005-06-17. DOI: 10.1145/1077501.1077517

Bill Andreopoulos, Aijun An, Xiaogang Wang


Citations: 7

Abstract

We present the M-BILCOM algorithm for clustering mixed numerical and categorical data sets in which the categorical attribute values (CAs) are not certain to be correct and carry associated confidence values (CVs) from 0.0 to 1.0 that represent their certainty of correctness. M-BILCOM performs bi-level clustering of mixed data sets, resembling a Bayesian process. We have applied M-BILCOM to yeast data sets in which the CAs were perturbed randomly and CVs were assigned to indicate the confidence of correctness of the CAs. On such mixed data sets M-BILCOM outperforms other clustering algorithms, such as AutoClass. We have also applied M-BILCOM to real numerical data sets from gene expression studies on yeast, incorporating CAs that represent Gene Ontology annotations on the genes and CVs that represent Gene Ontology Evidence Codes on the CAs. We apply novel significance metrics to the CAs in the resulting clusters to extract the most significant CAs based on their frequencies and their CVs in the cluster. For genomic data sets, we use the most significant CAs in a cluster to predict gene function.
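The abstract does not define the significance metrics, so the following Python sketch is only an illustration of the general idea of ranking the CAs in a cluster by both frequency and confidence. The function name `rank_categorical_values`, the input layout (one list of `(CA, CV)` pairs per gene), and the scoring rule (sum of CVs divided by cluster size) are all assumptions made for this example and should not be read as the M-BILCOM metric itself.

```python
from collections import defaultdict


def rank_categorical_values(cluster):
    """Rank categorical attribute values (CAs) in one cluster.

    `cluster` is a list of objects (e.g. genes); each object is a list of
    (ca, cv) pairs, where cv is a confidence value in [0.0, 1.0].

    Hypothetical score (an assumption, not the paper's metric):
        score(ca) = sum of cv over objects carrying ca / cluster size
    so a CA scores highly only if it is both frequent and trusted.
    """
    if not cluster:
        return []

    weighted = defaultdict(float)
    for annotations in cluster:
        # Count each CA at most once per object, using its highest CV.
        best = {}
        for ca, cv in annotations:
            if not 0.0 <= cv <= 1.0:
                raise ValueError(f"confidence value out of range: {cv}")
            best[ca] = max(best.get(ca, 0.0), cv)
        for ca, cv in best.items():
            weighted[ca] += cv

    size = len(cluster)
    scores = {ca: total / size for ca, total in weighted.items()}
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Toy cluster of four genes with made-up annotations and confidences.
cluster = [
    [("ribosome biogenesis", 1.0), ("transport", 0.25)],
    [("ribosome biogenesis", 0.5)],
    [("transport", 0.5)],
    [("ribosome biogenesis", 0.75)],
]
ranked = rank_categorical_values(cluster)
for ca, score in ranked:
    print(f"{ca}: {score:.4f}")
```

Under this toy rule, the top-ranked CA of a cluster could serve as a candidate functional label for unannotated genes in the same cluster, which mirrors the prediction use described above; a real evaluation would need the metrics and Evidence Code to CV mapping from the paper.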


Source journal

Information Quality in Information Systems

Self-citation rate

0.00%

Articles published