Single-Sequence-Based Protein Secondary Structure Prediction using One-Hot and Chemical Encodings of Amino Acids
Hoa Trinh, Satish Kumar Thittamaranahalli
arXiv - QuanBio - Biomolecules, 2024-07-06
DOI: arxiv-2407.05173 (https://doi.org/arxiv-2407.05173)
Abstract
In protein secondary structure prediction, each amino acid in a sequence is
typically treated as a distinct category and represented by a one-hot vector.
In this study, we developed two novel chemical representations for amino acids
using molecular fingerprints and the dimensionality-reduction algorithm
FastMap. We demonstrate that the two new chemical encodings can provide
additional information about the interactions of amino acids in sequences that
an LSTM-based model cannot capture with one-hot encoding alone. Compared to the
latest LSTM-based model used in the single-sequence-based method
SPOT-1D-Single, our ensemble model utilizing one-hot and chemical encodings
achieves better accuracy across most test sets while requiring approximately
nine times fewer trainable parameters for each encoding model. Our
single-sequence-based method is valuable for its simplicity, lower resource
requirements, and independence from external sequence data. It is beneficial
when quick or preliminary predictions are needed or when data on homologous
sequences is scarce.
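
To make the two kinds of input concrete, the following is a minimal sketch and not the authors' implementation. It one-hot encodes a sequence and, separately, derives a low-dimensional "chemical" encoding by running the classic FastMap procedure on pairwise distances between per-residue bit vectors. Everything specific here is an assumption made for illustration: the bit vectors are random placeholders rather than real molecular fingerprints, Tanimoto (Jaccard) distance is one plausible choice of dissimilarity, the embedding dimension of 3 is arbitrary, and the function names are hypothetical. The FastMap variant used in the paper may differ.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter codes


def one_hot_encode(sequence):
    """Return an (L, 20) matrix with exactly one 1 per row."""
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    encoded = np.zeros((len(sequence), len(AMINO_ACIDS)))
    for pos, aa in enumerate(sequence):
        encoded[pos, index[aa]] = 1.0
    return encoded


def tanimoto_distance_matrix(bits):
    """Pairwise Tanimoto (Jaccard) distances between rows of a 0/1 matrix."""
    bits = bits.astype(bool)
    inter = (bits[:, None, :] & bits[None, :, :]).sum(-1)
    union = (bits[:, None, :] | bits[None, :, :]).sum(-1)
    # Two all-zero rows are treated as identical (similarity 1).
    sim = np.where(union > 0, inter / np.maximum(union, 1), 1.0)
    return 1.0 - sim


def fastmap(dist, k):
    """Classic FastMap: embed n objects in k dimensions from an (n, n) distance matrix."""
    n = dist.shape[0]
    d2 = dist.astype(float) ** 2  # residual squared distances
    coords = np.zeros((n, k))
    for axis in range(k):
        # Pivot heuristic: start at object 0 and jump to the farthest object twice.
        b = int(np.argmax(d2[0]))
        a = int(np.argmax(d2[b]))
        dab2 = d2[a, b]
        if dab2 <= 1e-12:
            break  # nothing left to explain; remaining axes stay zero
        # Law-of-cosines projection of every object onto the line through a and b.
        x = (d2[a] + dab2 - d2[b]) / (2.0 * np.sqrt(dab2))
        coords[:, axis] = x
        # Subtract what this axis explains; clamp because the input
        # distances need not be exactly Euclidean.
        d2 = np.maximum(d2 - (x[:, None] - x[None, :]) ** 2, 0.0)
    return coords


# Placeholder "fingerprints": random bits, NOT real chemistry.
rng = np.random.default_rng(0)
toy_fingerprints = rng.integers(0, 2, size=(len(AMINO_ACIDS), 32))
chem_table = fastmap(tanimoto_distance_matrix(toy_fingerprints), k=3)

sequence = "MKTAYIAK"  # arbitrary example sequence
one_hot_input = one_hot_encode(sequence)
chem_encoding = chem_table[[AMINO_ACIDS.index(aa) for aa in sequence]]

print(one_hot_input.shape, chem_encoding.shape)
```

In this sketch the two encodings are kept as separate per-residue inputs, in line with the abstract's description of an ensemble with one model per encoding. How the individual predictions are combined is not specified here.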