FP-Zernike: an open-source structural database construction toolkit for fast structure retrieval

Junhai Qi, Chenjie Feng, Yulin Shi, Jianyi Yang, Fa Zhang, Guojun Li, Renmin Han
{"title":"FP-Zernike: an open-source structural database construction toolkit for fast structure retrieval","authors":"Junhai Qi, Chenjie Feng, Yulin Shi, Jianyi Yang, Fa Zhang, Guojun Li, Renmin Han","doi":"10.1093/gpbjnl/qzae007","DOIUrl":null,"url":null,"abstract":"\n The release of AlphaFold2 has sparked a rapid expansion in protein model databases. Efficient protein structure retrieval is crucial for the analysis of structure models, while measuring the similarity between structures is the key challenge in structural retrieval. Although existing structure alignment algorithms can address this challenge, they are often time-consuming. Currently, the state-of-the-art approach involves converting protein structures into three-dimensional (3D) Zernike descriptors and assessing similarity using Euclidean distance. However, the methods for computing 3D Zernike descriptors mainly rely on structural surfaces and are predominantly web-based, thus limiting their application in studying custom datasets. To overcome this limitation, we developed FP-Zernike, a user-friendly toolkit for computing different types of Zernike descriptors based on feature points. Users simply need to enter a single line of command to calculate the Zernike descriptors of all structures in customized datasets. FP-Zernike outperforms the leading method in terms of retrieval accuracy and binary classification accuracy across diverse benchmark datasets. In addition, we showed the application of FP-Zernike in the construction of the descriptor database and the protocol used for the Protein Data Bank (PDB) dataset to facilitate the local deployment of this tool for interested readers. Our demonstration contained 590,685 structures, and at this scale, our system required only 4–9 seconds to complete a retrieval. The experiments confirmed that it achieved the state-of-the-art accuracy level. FP-Zernike is an open-source toolkit, with the source code and related data accessible at https://ngdc.cncb.ac.cn/biocode/tools/BT007365/releases/0.1, as well as through a webserver at http://www.structbioinfo.cn/.","PeriodicalId":170516,"journal":{"name":"Genomics, Proteomics & Bioinformatics","volume":"16 9","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-01-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Genomics, Proteomics & Bioinformatics","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1093/gpbjnl/qzae007","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

The release of AlphaFold2 has sparked a rapid expansion in protein model databases. Efficient protein structure retrieval is crucial for the analysis of structure models, while measuring the similarity between structures is the key challenge in structural retrieval. Although existing structure alignment algorithms can address this challenge, they are often time-consuming. Currently, the state-of-the-art approach involves converting protein structures into three-dimensional (3D) Zernike descriptors and assessing similarity using Euclidean distance. However, the methods for computing 3D Zernike descriptors mainly rely on structural surfaces and are predominantly web-based, thus limiting their application in studying custom datasets. To overcome this limitation, we developed FP-Zernike, a user-friendly toolkit for computing different types of Zernike descriptors based on feature points. Users simply need to enter a single line of command to calculate the Zernike descriptors of all structures in customized datasets. FP-Zernike outperforms the leading method in terms of retrieval accuracy and binary classification accuracy across diverse benchmark datasets. In addition, we showed the application of FP-Zernike in the construction of the descriptor database and the protocol used for the Protein Data Bank (PDB) dataset to facilitate the local deployment of this tool for interested readers. Our demonstration contained 590,685 structures, and at this scale, our system required only 4–9 seconds to complete a retrieval. The experiments confirmed that it achieved the state-of-the-art accuracy level. FP-Zernike is an open-source toolkit, with the source code and related data accessible at https://ngdc.cncb.ac.cn/biocode/tools/BT007365/releases/0.1, as well as through a webserver at http://www.structbioinfo.cn/.
FP-Zernike:用于快速结构检索的开源结构数据库构建工具包
AlphaFold2 的发布引发了蛋白质模型数据库的迅速扩展。高效的蛋白质结构检索对结构模型分析至关重要,而测量结构之间的相似性则是结构检索的关键挑战。虽然现有的结构配准算法可以解决这一难题,但它们往往耗费大量时间。目前,最先进的方法是将蛋白质结构转换为三维(3D)Zernike 描述符,并使用欧氏距离评估相似性。然而,计算三维泽尔尼克描述符的方法主要依赖于结构表面,而且主要基于网络,因此限制了它们在研究定制数据集方面的应用。为了克服这一限制,我们开发了 FP-Zernike,这是一个基于特征点计算不同类型 Zernike 描述符的用户友好型工具包。用户只需输入一行命令,就能计算定制数据集中所有结构的 Zernike 描述符。在各种基准数据集上,FP-Zernike 在检索准确率和二元分类准确率方面都优于领先的方法。此外,我们还展示了 FP-Zernike 在构建描述符数据库中的应用,以及用于蛋白质数据库(PDB)数据集的协议,以方便感兴趣的读者在本地部署这一工具。我们的演示包含 590,685 个结构,在这种规模下,我们的系统只需要 4-9 秒就能完成一次检索。实验证实,它达到了最先进的准确度水平。FP-Zernike 是一个开源工具包,源代码和相关数据可通过 https://ngdc.cncb.ac.cn/biocode/tools/BT007365/releases/0.1 以及 http://www.structbioinfo.cn/ 的网络服务器访问。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信