Comparative Analysis of Sequence Clustering Methods for Deduplication of Biological Databases

Qingyu Chen, Yu Wan, Xiuzhen Zhang, Yang Lei, J. Zobel, Karin M. Verspoor
Journal of Data and Information Quality (JDIQ), pages 1-27. Published 2018-01-27. DOI: 10.1145/3131611 (https://doi.org/10.1145/3131611)

Abstract

The massive volumes of data in biological sequence databases provide a remarkable resource for large-scale biological studies. However, the underlying data quality of these resources is a critical concern. A particular challenge is duplication, in which multiple records have similar sequences, creating a high level of redundancy that impacts database storage, curation, and search. Biological database deduplication has two direct applications: for database curation, where detected duplicates are removed to improve curation efficiency, and for database search, where detected duplicate sequences may be flagged but remain available to support analysis. Clustering methods have been widely applied to biological sequences for database deduplication. Since an exhaustive all-by-all pairwise comparison of sequences cannot scale for a high volume of data, heuristic approaches have been recruited, such as the use of simple similarity thresholds. In this article, we present a comparison between CD-HIT and UCLUST, the two best-known clustering tools for sequence database deduplication. Our contributions include a detailed assessment of the redundancy remaining after deduplication, application of standard clustering evaluation metrics to quantify the cohesion and separation of the clusters generated by each method, and a biological case study that assesses intracluster function annotation consistency to demonstrate the impact of these factors on a practical application of the sequence clustering methods. Our results show that the trade-off between efficiency and accuracy becomes acute when low threshold values are used and when cluster sizes are large. This evaluation leads to practical recommendations for users for more effective uses of the sequence clustering tools for deduplication.
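
The threshold-based heuristic mentioned above can be illustrated with a minimal sketch. It is not the CD-HIT or UCLUST implementation: it only shows the general greedy idea of comparing each sequence against cluster representatives instead of against every other sequence. The function names `similarity` and `greedy_cluster` are hypothetical, and `difflib.SequenceMatcher` is used as a toy stand-in for alignment-based sequence identity.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Toy similarity in [0, 1]; an assumed stand-in for sequence identity."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def greedy_cluster(sequences, threshold):
    """Greedy incremental clustering against cluster representatives.

    Returns a list of (representative_index, member_indices) tuples.
    """
    # Process longer sequences first so that each representative is the
    # longest member of its cluster (ties keep the input order).
    order = sorted(range(len(sequences)), key=lambda i: -len(sequences[i]))
    clusters = []
    for i in order:
        for rep, members in clusters:
            # Compare only with the representative, never with other members.
            if similarity(sequences[rep], sequences[i]) >= threshold:
                members.append(i)
                break
        else:
            # No representative is similar enough: start a new cluster.
            clusters.append((i, [i]))
    return clusters


if __name__ == "__main__":
    seqs = ["ACGTACGTAC", "ACGTACGTAA", "TTTTGGGGCC"]
    for rep, members in greedy_cluster(seqs, threshold=0.8):
        print(seqs[rep], [seqs[m] for m in members])
```

Because members are compared only with their representative, two members of the same cluster are never compared with each other, and sequences in different clusters may still be similar. This is one way that redundancy can remain after deduplication, and the effect would be expected to grow as the threshold is lowered.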