BERTDom: Protein Domain Boundary Prediction Using BERT

IF 0.7 | CAS Zone 4 (Computer Science) | JCR Q4 (Computer Science, Artificial Intelligence)
Ahmad Haseeb, Maryam Bashir, Aamir Wali
{"title":"BERTDom: Protein Domain Boundary Prediction Using BERT","authors":"Ahmad Haseeb, Maryam Bashir, Aamir Wali","doi":"10.31577/cai_2023_3_667","DOIUrl":null,"url":null,"abstract":". The domains of a protein provide an insight on the functions that the protein can perform. Delineation of proteins using high-throughput experimental methods is difficult and a time-consuming task. Template-free and sequence-based computational methods that mainly rely on machine learning techniques can be used. However, some of the drawbacks of computational methods are low accuracy and their limitation in predicting different types of multi-domain proteins. Biological language modeling and deep learning techniques can be useful in such situations. In this study, we propose BERTDom for segmenting protein sequences. BERTDOM uses BERT for feature representation and stacked bi-directional long short term memory for classification. We pre-train BERT from scratch on a corpus of protein sequences obtained from UniProt knowledge base with reference clusters. For comparison, we also used two other deep learning architectures: LSTM and feed-forward neural networks. We also experimented with protein-to-vector (Pro2Vec) feature representation that uses word2vec to encode protein bio-words. For testing, three other bench-marked datasets were used. The experimental re-sults on benchmarks datasets show that BERTDom produces the best F-score as compared to other template-based and template-free protein domain boundary prediction methods. Employing deep learning architectures can significantly improve domain boundary prediction. Furthermore, BERT used extensively in NLP for feature representation, has shown promising results when used for encoding bio-words. The code is available at https://github.com/maryam988/BERTDom-Code .","PeriodicalId":55215,"journal":{"name":"Computing and Informatics","volume":"42 1","pages":"667-689"},"PeriodicalIF":0.7000,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computing and Informatics","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.31577/cai_2023_3_667","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q4","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0

Abstract

The domains of a protein provide insight into the functions the protein can perform. Delineating protein domains using high-throughput experimental methods is a difficult and time-consuming task. Template-free, sequence-based computational methods that rely mainly on machine learning techniques can be used instead. However, computational methods suffer from low accuracy and are limited in their ability to predict different types of multi-domain proteins. Biological language modeling and deep learning techniques can be useful in such situations. In this study, we propose BERTDom for segmenting protein sequences. BERTDom uses BERT for feature representation and a stacked bi-directional long short-term memory (BiLSTM) network for classification. We pre-train BERT from scratch on a corpus of protein sequences obtained from the UniProt knowledge base reference clusters. For comparison, we also used two other deep learning architectures: LSTM and feed-forward neural networks. We also experimented with a protein-to-vector (Pro2Vec) feature representation that uses word2vec to encode protein bio-words. For testing, three other benchmark datasets were used. The experimental results on these benchmark datasets show that BERTDom produces the best F-score compared to other template-based and template-free protein domain boundary prediction methods. Employing deep learning architectures can significantly improve domain boundary prediction. Furthermore, BERT, used extensively in NLP for feature representation, has shown promising results when used for encoding bio-words. The code is available at https://github.com/maryam988/BERTDom-Code.
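As a concrete illustration of the pipeline the abstract describes, the following is a minimal sketch of the classification stage: per-residue feature vectors (such as BERT would produce) fed to a stacked bidirectional LSTM that emits one boundary/non-boundary logit per position. The use of PyTorch, the hidden size, and the layer count are assumptions made for this sketch, not the authors' reported configuration.

```python
# Minimal sketch: stacked BiLSTM over per-residue BERT embeddings.
# Dimensions and layer counts are illustrative assumptions.
import torch
import torch.nn as nn

class BoundaryClassifier(nn.Module):
    def __init__(self, embed_dim=768, hidden_dim=256, num_layers=2):
        super().__init__()
        # num_layers > 1 gives the "stacked" bi-directional LSTM.
        self.bilstm = nn.LSTM(
            input_size=embed_dim,
            hidden_size=hidden_dim,
            num_layers=num_layers,
            batch_first=True,
            bidirectional=True,
        )
        # 2 * hidden_dim: forward and backward states are concatenated.
        self.head = nn.Linear(2 * hidden_dim, 1)

    def forward(self, bert_embeddings):
        # bert_embeddings: (batch, seq_len, embed_dim), one vector per residue.
        states, _ = self.bilstm(bert_embeddings)
        # One domain-boundary logit per residue position.
        return self.head(states).squeeze(-1)

# Dummy usage: 4 sequences of length 128 with random stand-in "BERT" features.
model = BoundaryClassifier()
logits = model(torch.randn(4, 128, 768))
print(logits.shape)  # torch.Size([4, 128])
```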
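The Pro2Vec baseline mentioned in the abstract can likewise be sketched: word2vec is trained over overlapping k-mers ("bio-words") extracted from protein sequences, giving each bio-word a dense embedding. The 3-mer word length, the gensim library, the vector size, and the toy corpus are assumptions for illustration; in the paper the training corpus comes from UniProt reference clusters.

```python
# Minimal Pro2Vec-style sketch: word2vec over overlapping 3-mer "bio-words".
from gensim.models import Word2Vec

def to_biowords(sequence, k=3):
    # Slide a window of size k: "MKTAY" -> ["MKT", "KTA", "TAY"].
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

# Toy sequences; the real corpus would be far larger (UniProt clusters).
corpus = [to_biowords(s) for s in ["MKTAYIAKQR", "GAVLIMKTAY", "QRNDCEGAVL"]]

# Train word2vec on the bio-word "sentences"; hyperparameters are assumptions.
model = Word2Vec(sentences=corpus, vector_size=64, window=5, min_count=1, sg=1)

# Each bio-word now maps to a dense vector usable as a residue-level feature.
print(model.wv["MKT"].shape)  # (64,)
```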
Source Journal

Computing and Informatics
Category: Engineering & Technology, Computer Science: Artificial Intelligence
CiteScore: 1.60
Self-citation rate: 14.30%
Articles per year: 19
Review time: 9 months
Journal Description
Main journal topics: computer architectures and networking; parallel and distributed computing; theoretical foundations; software engineering; knowledge and information engineering. Apart from the main topics given above, the Editorial Board welcomes papers from other areas of computing and informatics.