Matching of structural motifs using hashing on residue labels and geometric filtering for protein function prediction.

Computational systems bioinformatics. Computational Systems Bioinformatics Conference. Pub Date: 2008-01-01. DOI: 10.1142/9781848162648_0014

Mark Moll, L. Kavraki

Volume 7, pages 157-68.

Citations: 21

Abstract

There is an increasing number of proteins with known structure but unknown function. Determining their function would have a significant impact on understanding diseases and designing new therapeutics. However, experimental protein function determination is expensive and very time-consuming. Computational methods can facilitate function determination by identifying proteins that have high structural and chemical similarity. Our focus is on methods that determine binding site similarity. Although several such methods exist, it remains challenging to quickly find all functionally related matches for structural motifs in large data sets with high specificity. In this context, a structural motif is a set of 3D points annotated with physicochemical information that characterize a molecular function. We propose a new method called LabelHash that creates hash tables of n-tuples of residues for a set of targets. Using these hash tables, we can quickly look up partial matches to a motif and expand those matches to complete matches. We show that by applying only very mild geometric constraints we can find statistically significant matches with extremely high specificity in very large data sets and for very general structural motifs. We demonstrate that our method requires a reasonable amount of storage when employing a simple geometric filter and further improves on the specificity of our previous work while maintaining very high sensitivity. We evaluate our algorithm on 20 homolog classes, using a non-redundant version of the Protein Data Bank as the background data set. We use cluster analysis to examine why certain classes of homologs are more difficult to classify than others. The LabelHash algorithm is implemented on a web server at http://kavrakilab.org/labelhash/.
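The hash-then-expand idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names (`build_table`, `fits`, `match`), the parameters (`n`, `max_dist`, `tol`), the toy coordinates, and the pairwise-distance test used to accept a match are all assumptions made for illustration. The abstract does not specify the geometric filter or the matching criterion, so a simple distance cutoff and a distance-difference tolerance stand in for them here.

```python
from collections import defaultdict
from itertools import combinations, permutations
from math import dist

# A residue is (label, (x, y, z)); a target is a list of residues.

def build_table(targets, n=2, max_dist=10.0):
    """Hash ordered n-tuples of residues by their labels.

    Assumed filter: a tuple is stored only if all of its pairwise
    distances are at most max_dist.
    """
    table = defaultdict(list)
    for tid, residues in targets.items():
        for idxs in permutations(range(len(residues)), n):
            pts = [residues[i][1] for i in idxs]
            if all(dist(p, q) <= max_dist for p, q in combinations(pts, 2)):
                key = tuple(residues[i][0] for i in idxs)
                table[key].append((tid, idxs))
    return table

def fits(motif, residues, assigned, k, j, tol):
    """Check whether residue j can play the role of motif point k.

    Labels must agree, j must be unused, and distances to the already
    assigned residues must agree with the motif within tol (assumed test).
    """
    if residues[j][0] != motif[k][0] or j in assigned[:k]:
        return False
    return all(
        abs(dist(motif[k][1], motif[m][1])
            - dist(residues[j][1], residues[assigned[m]][1])) <= tol
        for m in range(k)
    )

def match(motif, targets, table, n=2, tol=1.5):
    """Look up partial matches for the first n motif points, then expand."""
    key = tuple(label for label, _ in motif[:n])
    results = []
    for tid, idxs in table.get(key, []):
        residues = targets[tid]
        # Keep the seed only if it is geometrically consistent with the motif.
        if not all(fits(motif, residues, idxs, k, idxs[k], tol) for k in range(n)):
            continue
        partials = [list(idxs)]
        for k in range(n, len(motif)):
            # Extend every surviving partial match by one motif point.
            partials = [
                p + [j]
                for p in partials
                for j in range(len(residues))
                if fits(motif, residues, p, k, j, tol)
            ]
        results.extend((tid, tuple(p)) for p in partials)
    return results

# Toy data (hypothetical coordinates, not taken from any real structure).
targets = {
    "t1": [("HIS", (0, 0, 0)), ("ASP", (3, 0, 0)), ("SER", (0, 4, 0)), ("GLY", (20, 20, 20))],
    "t2": [("HIS", (0, 0, 0)), ("ASP", (9, 0, 0)), ("SER", (0, 4, 0))],
}
motif = [("HIS", (1, 1, 1)), ("ASP", (4, 1, 1)), ("SER", (1, 5, 1))]

table = build_table(targets, n=2, max_dist=10.0)
matches = match(motif, targets, table, n=2, tol=0.5)
print(matches)
```

Storing every ordered tuple keeps the lookup trivial, at the cost of an n! blow-up in table size; a real system would presumably use a canonical ordering and a more compact layout.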


