Single-Sequence-Based Protein Secondary Structure Prediction using One-Hot and Chemical Encodings of Amino Acids

Hoa Trinh, Satish Kumar Thittamaranahalli
arXiv - QuanBio - Biomolecules. Published 2024-07-06. DOI: arxiv-2407.05173 (https://doi.org/arxiv-2407.05173)
Citations: 0

Abstract

In protein secondary structure prediction, each amino acid in sequence is typically treated as a distinct category and represented by a one-hot vector. In this study, we developed two novel chemical representations for amino acids utilizing molecular fingerprints and the dimensionality reduction algorithm FastMap. We demonstrate that the two new chemical encodings can provide additional information about the interactions of amino acids in sequences that an LSTM-based model cannot capture with one-hot encoding alone. Compared to the latest LSTM-based model used in the single-sequence-based method SPOT-1D-Single, our ensemble model utilizing one-hot and chemical encodings achieves better accuracy across most test sets while requiring approximately nine times fewer trainable parameters for each encoding model. Our single-sequence-based method is valuable for its simplicity, lower resource requirements, and independence from external sequence data. It is beneficial when quick or preliminary predictions are needed or when data on homologous sequences is scarce.
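
The two kinds of input encoding mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. `one_hot_encode` shows the usual one-hot representation of a sequence. `fastmap` is the classic FastMap embedding computed from a pairwise distance matrix; the abstract does not say which FastMap variant, fingerprint type, or distance measure the authors use, so the function simply accepts any distance matrix, and the residue ordering, function names, and the exact farthest-pair pivot search are assumptions made for illustration.

```python
import numpy as np

# The 20 standard amino acids in one-letter code (this ordering is an arbitrary choice).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(sequence):
    """Return an (L, 20) matrix with a single 1 per row; unknown residues stay all-zero."""
    encoded = np.zeros((len(sequence), len(AMINO_ACIDS)), dtype=np.float32)
    for pos, aa in enumerate(sequence):
        idx = AA_INDEX.get(aa)
        if idx is not None:
            encoded[pos, idx] = 1.0
    return encoded


def fastmap(dist, k):
    """Embed n objects into k dimensions from an (n, n) distance matrix (classic FastMap)."""
    n = dist.shape[0]
    d2 = np.asarray(dist, dtype=np.float64) ** 2  # residual squared distances
    coords = np.zeros((n, k))
    for dim in range(k):
        # Pivot choice: classic FastMap uses a linear-time heuristic; with only a
        # handful of objects (20 amino acids) the exact farthest pair is affordable.
        a, b = np.unravel_index(np.argmax(d2), d2.shape)
        dab2 = d2[a, b]
        if dab2 <= 1e-12:  # every residual distance is zero: nothing left to embed
            break
        # Cosine-law projection of every object onto the line through pivots a and b.
        x = (d2[a] + dab2 - d2[b]) / (2.0 * np.sqrt(dab2))
        coords[:, dim] = x
        # Subtract the part of each squared distance explained by this coordinate;
        # clamp at zero because non-Euclidean distances can yield negative residuals.
        d2 = np.maximum(d2 - (x[:, None] - x[None, :]) ** 2, 0.0)
    return coords
```

Given a 20 x 20 distance matrix between amino acids (for example, one derived from molecular fingerprints), `fastmap(dist, k)` would yield one k-dimensional vector per amino acid, and a sequence could then be encoded by looking up `coords[AA_INDEX[aa]]` for each residue, in the same way that `one_hot_encode` looks up a one-hot row.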