ProLanGO2

Kyle Hippe, Sola Gbenro, Renzhi Cao
Published in: Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics
Publication date: 2020-09-21
DOI: 10.1145/3388440.3414701 (https://doi.org/10.1145/3388440.3414701)
Citation count: 5

Abstract

Predicting protein function from protein sequence is a major challenge in computational biology. Traditional methods that search protein sequences against existing databases may not work well in practice, particularly when the database contains little or no homologous sequence. We introduce ProLanGO2, a method that uses natural language processing and machine learning techniques to predict protein function with the protein sequence as input. The method has been benchmarked blindly in the latest Critical Assessment of protein Function Annotation algorithms (CAFA 4) experiment. ProLanGO2 differs from the earlier ProLanGO in several ways. First, the latest version of the UniProt database is used. Second, the UniProt database is filtered by the newly created fragment sequence database (FSD) to prepare the protein sequence language. Third, an Encoder-Decoder network, a model consisting of two RNNs (an encoder and a decoder), is trained on the dataset. Fourth, if no k-mers of a protein sequence exist in the FSD, we take the ten GO terms with the highest probability across all UniProt sequences that contain no k-mers in the FSD, and use those ten GO terms as a backup prediction for the new protein sequence. Finally, we selected the 100 best-performing models and explored all combinations of those models to select the best-performing ensemble model. We benchmarked these combinations on the CAFA 3 dataset and selected the three top-performing ensemble models for prediction in the latest CAFA 4 experiment, where we participated as CaoLab. We also evaluated ProLanGO2 on 253 unseen sequences taken from the UniProt database and compared it with several other protein function prediction methods. The results show that our method performs strongly among sequence-based protein function prediction methods.
Our method is available on GitHub: https://github.com/caorenzhi/ProLanGO2.git.
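
The backup rule described in the abstract (the fourth change) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the k-mer lengths, and the `model` callable are hypothetical, and "highest probability" is assumed here to mean the highest relative frequency among annotated sequences that share no k-mer with the FSD.

```python
from collections import Counter

def kmers(seq, ks=(3, 4, 5)):
    """Yield every substring of length k, for each k in ks (k values are assumed)."""
    for k in ks:
        for i in range(len(seq) - k + 1):
            yield seq[i:i + k]

def known_fragments(seq, fsd, ks=(3, 4, 5)):
    """Return the k-mers of seq that are present in the fragment database (a set)."""
    return [frag for frag in kmers(seq, ks) if frag in fsd]

def backup_go_terms(annotated, fsd, top_n=10, ks=(3, 4, 5)):
    """Pick the top_n most frequent GO terms among sequences with no k-mer in fsd.

    annotated: iterable of (sequence, list_of_GO_terms) pairs.
    Frequency stands in for "probability"; this is an assumption.
    """
    counts = Counter()
    for seq, terms in annotated:
        if not known_fragments(seq, fsd, ks):
            counts.update(terms)
    return [term for term, _ in counts.most_common(top_n)]

def predict(seq, fsd, model, backup, ks=(3, 4, 5)):
    """Use the model when the sequence has known fragments, else the backup list.

    model is a hypothetical callable mapping a list of fragments to GO terms;
    it stands in for the trained Encoder-Decoder network.
    """
    frags = known_fragments(seq, fsd, ks)
    if not frags:
        return list(backup)
    return model(frags)
```

In this sketch the backup list is computed once from the annotated training sequences and then reused for every query sequence that shares no fragment with the FSD, so such sequences still receive a prediction.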