Huma Perveen (School of Mathematical and Physical Sciences, University of Sussex, Brighton, UK), Julie Weeds (School of Engineering and Informatics, University of Sussex, Brighton, UK)
Protein sequence classification using natural language processing techniques
arXiv - QuanBio - Quantitative Methods, published 2024-09-06, DOI: arxiv-2409.04491 (https://doi.org/arxiv-2409.04491)
Abstract
Proteins are essential to numerous biological functions, with their sequences
determining their roles within organisms. Traditional methods for determining
protein function are time-consuming and labor-intensive. This study addresses
the increasing demand for precise, effective, and automated protein sequence
classification methods by employing natural language processing (NLP)
techniques on a dataset comprising 75 target protein classes. We explored
various machine learning and deep learning models, including K-Nearest
Neighbors (KNN), Multinomial Naïve Bayes, Logistic Regression, Multi-Layer
Perceptron (MLP), Decision Tree, Random Forest, XGBoost, Voting and Stacking
classifiers, Convolutional Neural Network (CNN), Long Short-Term Memory (LSTM),
and transformer models (BertForSequenceClassification, DistilBERT, and
ProtBert). Experiments used amino acid n-grams with n from 1 to 4 for the
machine learning models and different sequence lengths for the CNN and LSTM models.
The KNN algorithm performed best on tri-gram data, with 70.0% accuracy and a
macro F1 score of 63.0%. The Voting classifier achieved 74.0% accuracy and the
highest F1 score, 65.0%, while the Stacking classifier reached 75.0% accuracy
and an F1 score of 64.0%. ProtBert achieved the highest accuracy among the
transformer models, 76.0%, with an F1 score of 61.0%; this F1 score was the
same for all three transformer models. Advanced NLP techniques,
particularly ensemble methods and transformer models, show great potential in
protein classification. Our results show that ensemble methods,
particularly soft Voting classifiers, achieved superior results, and they
highlight the importance of sufficient training data and of addressing
sequence similarity across different classes.
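
The abstract mentions three ideas that are easy to illustrate in a few lines: amino acid n-grams with n from 1 to 4, soft voting across models, and the gap between accuracy and macro F1. The following is a minimal pure-Python sketch, not the authors' implementation. The function names, the toy sequences, and the class labels ("kinase", "hydrolase", "a", "b") are hypothetical and chosen only for illustration.

```python
from collections import Counter


def ngram_counts(sequence, n_min=1, n_max=4):
    """Count overlapping amino acid n-grams for every n in [n_min, n_max]."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(sequence) - n + 1):
            counts[sequence[i:i + n]] += 1
    return counts


def soft_vote(probabilities):
    """Average per-class probabilities from several models and return the top class.

    `probabilities` is a list of dicts, one per model, mapping class -> probability.
    """
    classes = probabilities[0].keys()
    averaged = {
        c: sum(p[c] for p in probabilities) / len(probabilities) for c in classes
    }
    return max(averaged, key=averaged.get)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over the classes present in y_true.

    Assumption: only classes that occur in y_true are averaged.
    """
    scores = []
    for c in sorted(set(y_true)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Toy sequence: 5 uni-grams + 4 bi-grams + 3 tri-grams + 2 four-grams.
    print(ngram_counts("MKVLA"))

    # One confident model can outweigh two mildly opposed models under soft voting.
    model_outputs = [
        {"kinase": 0.9, "hydrolase": 0.1},
        {"kinase": 0.4, "hydrolase": 0.6},
        {"kinase": 0.4, "hydrolase": 0.6},
    ]
    print(soft_vote(model_outputs))

    # Imbalanced toy labels: always predicting the majority class.
    y_true = ["a"] * 8 + ["b"] * 2
    y_pred = ["a"] * 10
    print(accuracy(y_true, y_pred), macro_f1(y_true, y_pred))
```

In the imbalanced toy example, a model that always predicts the majority class still has high accuracy, but its macro F1 is much lower because the minority class contributes an F1 of zero. With 75 target classes, this is one plausible reason the reported macro F1 scores sit below the reported accuracies, although the abstract itself does not state the cause.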