Protein sequence classification using natural language processing techniques

arXiv - QuanBio - Quantitative Methods. Pub Date: 2024-09-06. DOI: arxiv-2409.04491

Huma Perveen (School of Mathematical and Physical Sciences, University of Sussex, Brighton, UK), Julie Weeds (School of Engineering and Informatics, University of Sussex, Brighton, UK)


Abstract

Proteins are essential to numerous biological functions, with their sequences determining their roles within organisms. Traditional methods for determining protein function are time-consuming and labor-intensive. This study addresses the increasing demand for precise, effective, and automated protein sequence classification methods by employing natural language processing (NLP) techniques on a dataset comprising 75 target protein classes. We explored various machine learning and deep learning models, including K-Nearest Neighbors (KNN), Multinomial Naïve Bayes, Logistic Regression, Multi-Layer Perceptron (MLP), Decision Tree, Random Forest, XGBoost, Voting and Stacking classifiers, Convolutional Neural Network (CNN), Long Short-Term Memory (LSTM), and transformer models (BertForSequenceClassification, DistilBERT, and ProtBert). Experiments were conducted using amino acid n-grams of length 1 to 4 for the machine learning models and different sequence lengths for the CNN and LSTM models. The KNN algorithm performed best on tri-gram data, with 70.0% accuracy and a macro F1 score of 63.0%. The Voting classifier achieved the best performance with 74.0% accuracy and an F1 score of 65.0%, while the Stacking classifier reached 75.0% accuracy and an F1 score of 64.0%. ProtBert demonstrated the highest performance among the transformer models, with an accuracy of 76.0% and an F1 score of 61.0%, which is the same for all three transformer models. Advanced NLP techniques, particularly ensemble methods and transformer models, show great potential in protein classification. Our results demonstrate that ensemble methods, particularly soft Voting classifiers, achieved superior results, highlighting the importance of sufficient training data and of addressing sequence similarity across different classes.
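The abstract describes representing sequences as amino acid n-grams of length 1 to 4 but does not say how they were extracted. The following is a minimal sketch of one common approach, assuming overlapping character n-grams counted per sequence. The function name `amino_acid_ngrams` and the toy sequence are hypothetical and are not taken from the paper.

```python
from collections import Counter


def amino_acid_ngrams(sequence, n_min=1, n_max=4):
    """Count overlapping amino acid n-grams of lengths n_min..n_max.

    Assumption: n-grams overlap (stride 1); the paper may have tokenized
    differently.
    """
    seq = sequence.strip().upper()
    counts = Counter()
    for n in range(n_min, n_max + 1):
        # Slide a window of width n over the sequence one residue at a time.
        for i in range(len(seq) - n + 1):
            counts[seq[i:i + n]] += 1
    return counts


# Hypothetical toy sequence, used only to illustrate the feature extraction.
example = "MKVLAAGMKV"
features = amino_acid_ngrams(example)
print(features["MKV"])         # count of the tri-gram "MKV"
print(sum(features.values()))  # total number of 1- to 4-grams
```

Count vectors of this kind could then be passed to classifiers such as KNN or Multinomial Naïve Bayes. Restricting `n_min` and `n_max` to 3 would give tri-gram features only, the setting the abstract reports as best for KNN.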
