GenDiS3 database: census on the prevalence of protein domain superfamilies of known structure in the entire sequence database.

IF 3.6 4区生物学 Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY

Database: The Journal of Biological Databases and Curation Pub Date : 2025-05-09 DOI:10.1093/database/baaf035

Sarthak Joshi, Shailendu Mohapatra, Dhwani Kumar, Adwait Joshi, Meenakshi Iyer, Ramanathan Sowdhamini

{"title":"GenDiS3 database: census on the prevalence of protein domain superfamilies of known structure in the entire sequence database.","authors":"Sarthak Joshi, Shailendu Mohapatra, Dhwani Kumar, Adwait Joshi, Meenakshi Iyer, Ramanathan Sowdhamini","doi":"10.1093/database/baaf035","DOIUrl":null,"url":null,"abstract":"<p><p>Despite the vast amount of sequence data available, a significant disparity exists between the number of protein sequences identified and the relatively few structures that have been resolved. This disparity highlights the challenge in structural biology to bridge the gap between sequence information and 3D structural data, and the necessity for robust databases capable of linking distant homologs to known structures. Studies have indicated that there are a limited number of structural folds, despite the vast diversity of proteins. Hence, computational tools can enhance our ability to classify protein sequences, much before their structures are determined or their functions are characterized, thereby bridging the gap between sequence and structural data. GenDiS (Genomic Distribution of Superfamilies) is a repository with information on the genomic distribution of protein domain superfamilies, involving a one-time computational exercise to search for trusted homologs of protein domains of known structures against the vast sequence database. We have updated this database employing advanced bioinformatics tools, including DELTA-BLAST (domain enhanced lookup time accelerated BLAST) for initial detection of hits and HMMSCAN for validation, significantly improving the accuracy of domain identification. Using these tools, over 151 million sequence homologs for 2060 superfamilies [SCOPe (Structural Classification of Proteins extended)] were identified and 116 million out of them were validated as true positives. Through a case study on glycolysis-related enzymes, variations in domain architectures of these enzymes are explored, revealing evolutionary changes and functional diversity among these essential proteins. We present another case, LOG gene, where one can tune in and find significant mutations across the evolutionary lineage. The GenDiS database, GenDiS3, and the associated tools made available at https://caps.ncbs.res.in/gendis3/ offer a powerful resource for researchers in functional annotation and evolutionary studies. Database URL: https://caps.ncbs.res.in/gendis3/.</p>","PeriodicalId":10923,"journal":{"name":"Database: The Journal of Biological Databases and Curation","volume":"2025 ","pages":""},"PeriodicalIF":3.6000,"publicationDate":"2025-05-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12063530/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Database: The Journal of Biological Databases and Curation","FirstCategoryId":"99","ListUrlMain":"https://doi.org/10.1093/database/baaf035","RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"MATHEMATICAL & COMPUTATIONAL BIOLOGY","Score":null,"Total":0}

引用次数: 0

Abstract

Despite the vast amount of sequence data available, a significant disparity exists between the number of protein sequences identified and the relatively few structures that have been resolved. This disparity highlights the challenge in structural biology to bridge the gap between sequence information and 3D structural data, and the necessity for robust databases capable of linking distant homologs to known structures. Studies have indicated that there are a limited number of structural folds, despite the vast diversity of proteins. Hence, computational tools can enhance our ability to classify protein sequences, much before their structures are determined or their functions are characterized, thereby bridging the gap between sequence and structural data. GenDiS (Genomic Distribution of Superfamilies) is a repository with information on the genomic distribution of protein domain superfamilies, involving a one-time computational exercise to search for trusted homologs of protein domains of known structures against the vast sequence database. We have updated this database employing advanced bioinformatics tools, including DELTA-BLAST (domain enhanced lookup time accelerated BLAST) for initial detection of hits and HMMSCAN for validation, significantly improving the accuracy of domain identification. Using these tools, over 151 million sequence homologs for 2060 superfamilies [SCOPe (Structural Classification of Proteins extended)] were identified and 116 million out of them were validated as true positives. Through a case study on glycolysis-related enzymes, variations in domain architectures of these enzymes are explored, revealing evolutionary changes and functional diversity among these essential proteins. We present another case, LOG gene, where one can tune in and find significant mutations across the evolutionary lineage. The GenDiS database, GenDiS3, and the associated tools made available at https://caps.ncbs.res.in/gendis3/ offer a powerful resource for researchers in functional annotation and evolutionary studies. Database URL: https://caps.ncbs.res.in/gendis3/.

查看原文本刊更多论文

GenDiS3数据库：对整个序列数据库中已知结构的蛋白质结构域超家族的流行情况进行普查。

尽管有大量可用的序列数据，但已确定的蛋白质序列数量与已解决的相对较少的结构之间存在显着差异。这种差异凸显了结构生物学在序列信息和3D结构数据之间建立桥梁的挑战，以及建立能够将遥远同源物与已知结构联系起来的强大数据库的必要性。研究表明，尽管蛋白质种类繁多，但结构褶皱的数量有限。因此，计算工具可以提高我们对蛋白质序列进行分类的能力，在它们的结构被确定或功能被表征之前，从而弥合了序列和结构数据之间的差距。GenDiS（基因组分布超家族）是蛋白质结构域超家族基因组分布信息的存储库，涉及一次性计算练习，以搜索已知结构的蛋白质结构域的可靠同源物，而不是庞大的序列数据库。我们使用先进的生物信息学工具更新了该数据库，包括用于初始检测命中的DELTA-BLAST（域增强查找时间加速BLAST）和用于验证的HMMSCAN，显着提高了域识别的准确性。使用这些工具，鉴定了2060个超家族[SCOPe (Structural Classification of Proteins extended)]的1.51亿个序列同源物，其中1.16亿个被验证为真阳性。通过对糖酵解相关酶的案例研究，探讨了这些酶结构域结构的变化，揭示了这些必需蛋白质的进化变化和功能多样性。我们提出了另一种情况，LOG基因，在这种情况下，人们可以调谐并发现进化谱系中的重大突变。GenDiS数据库，GenDiS3和相关的工具在https://caps.ncbs.res.in/gendis3/上提供了功能注释和进化研究的研究人员一个强大的资源。数据库地址：https://caps.ncbs.res.in/gendis3/。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

Database: The Journal of Biological Databases and Curation MATHEMATICAL & COMPUTATIONAL BIOLOGY-

CiteScore

9.00

自引率

3.40%

发文量

100

审稿时长

>12 weeks

期刊介绍： Huge volumes of primary data are archived in numerous open-access databases, and with new generation technologies becoming more common in laboratories, large datasets will become even more prevalent. The archiving, curation, analysis and interpretation of all of these data are a challenge. Database development and biocuration are at the forefront of the endeavor to make sense of this mounting deluge of data. Database: The Journal of Biological Databases and Curation provides an open access platform for the presentation of novel ideas in database research and biocuration, and aims to help strengthen the bridge between database developers, curators, and users.