Clustering binary fingerprint vectors with missing values for DNA array data analysis.

Andres Figueroa, James Borneman, Tao Jiang
{"title":"Clustering binary fingerprint vectors with missing values for DNA array data analysis.","authors":"Andres Figueroa,&nbsp;James Borneman,&nbsp;Tao Jiang","doi":"","DOIUrl":null,"url":null,"abstract":"<p><p>Oligonucleotide fingerprinting is a powerful DNA array based method to characterize cDNA and ribosomal RNA gene (rDNA) libraries and has many applications including gene expression profiling and DNA clone classification. We are especially interested in the latter application. A key step in the method is the cluster analysis of fingerprint data obtained from DNA array hybridization experiments. Most of the existing approaches to clustering use (normalized) real intensity values and thus do not treat positive and negative hybridization signals equally (positive signals are much more emphasized). In this paper, we consider a discrete approach. Fingerprint data are first normalized and binarized using control DNA clones. Because there may exist unresolved (or missing) values in this binarization process, we formulate the clustering of (binary) oligonucleotide fingerprints as a combinatorial optimization problem that attempts to identify clusters and resolve the missing values in the fingerprints simultaneously. We study the computational complexity of this clustering problem and a natural parameterized version, and present an efficient greedy algorithm based on MINIMUM CLIQUE PARTITION on graphs. The algorithm takes advantage of some unique properties of the graphs considered here, which allow us to efficiently find the maximum cliques as well as some special maximal cliques. Our experimental results on simulated and real data demonstrate that the algorithm runs faster and performs better than some popular hierarchical and graph-based clustering methods. The results on real data from DNA clone classification also suggest that this discrete approach is more accurate than clustering methods based on real intensity values, in terms of separating clones that have different characteristics with respect to the given oligonucleotide probes.</p>","PeriodicalId":87204,"journal":{"name":"Proceedings. IEEE Computer Society Bioinformatics Conference","volume":"2 ","pages":"38-47"},"PeriodicalIF":0.0000,"publicationDate":"2003-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings. IEEE Computer Society Bioinformatics Conference","FirstCategoryId":"1085","ListUrlMain":"","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Oligonucleotide fingerprinting is a powerful DNA array based method to characterize cDNA and ribosomal RNA gene (rDNA) libraries and has many applications including gene expression profiling and DNA clone classification. We are especially interested in the latter application. A key step in the method is the cluster analysis of fingerprint data obtained from DNA array hybridization experiments. Most of the existing approaches to clustering use (normalized) real intensity values and thus do not treat positive and negative hybridization signals equally (positive signals are much more emphasized). In this paper, we consider a discrete approach. Fingerprint data are first normalized and binarized using control DNA clones. Because there may exist unresolved (or missing) values in this binarization process, we formulate the clustering of (binary) oligonucleotide fingerprints as a combinatorial optimization problem that attempts to identify clusters and resolve the missing values in the fingerprints simultaneously. We study the computational complexity of this clustering problem and a natural parameterized version, and present an efficient greedy algorithm based on MINIMUM CLIQUE PARTITION on graphs. The algorithm takes advantage of some unique properties of the graphs considered here, which allow us to efficiently find the maximum cliques as well as some special maximal cliques. Our experimental results on simulated and real data demonstrate that the algorithm runs faster and performs better than some popular hierarchical and graph-based clustering methods. The results on real data from DNA clone classification also suggest that this discrete approach is more accurate than clustering methods based on real intensity values, in terms of separating clones that have different characteristics with respect to the given oligonucleotide probes.

基于缺失值的二值指纹向量聚类分析。
寡核苷酸指纹图谱是一种基于DNA阵列的功能强大的cDNA和核糖体RNA基因(rDNA)文库表征方法,在基因表达谱和DNA克隆分类等方面有着广泛的应用。我们对后一种应用特别感兴趣。该方法的关键步骤是对DNA阵列杂交实验获得的指纹数据进行聚类分析。大多数现有的聚类方法使用(归一化的)真实强度值,因此不能平等地对待正杂化信号和负杂化信号(更强调正杂化信号)。在本文中,我们考虑一个离散的方法。首先使用对照DNA克隆对指纹数据进行归一化和二值化。由于在二值化过程中可能存在未解析值(或缺失值),我们将(二值)寡核苷酸指纹聚类作为一个组合优化问题,试图同时识别聚类并解决指纹中的缺失值。研究了该聚类问题的计算复杂度和自然参数化版本,提出了一种高效的基于最小CLIQUE划分的贪心算法。该算法利用图的一些独特性质,使我们能够有效地找到最大团和一些特殊的最大团。在模拟和真实数据上的实验结果表明,该算法比一些流行的分层聚类和基于图的聚类方法运行速度更快,性能更好。从DNA克隆分类的真实数据的结果也表明,这种离散方法比基于真实强度值的聚类方法更准确,在分离克隆具有不同的特征相对于给定的寡核苷酸探针。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:604180095
Book学术官方微信