Prediction of viral families and hosts of single-stranded RNA viruses based on K-Mer coding from phylogenetic gene sequences
Authors: Bahar Çiftçi, Ramazan Tekin
Journal: Computational Biology and Chemistry
Published: 2024-05-31
DOI: 10.1016/j.compbiolchem.2024.108114
URL: https://www.sciencedirect.com/science/article/pii/S1476927124001026
Billions of virus species exist worldwide, and viruses, the smallest parasitic entities, pose a serious threat. Combating the diseases they cause requires an understanding of their genetic structure. Given the wide diversity and rapid evolution of viruses, there is a critical need to classify viral species and their potential hosts quickly and accurately, both to better understand transmission dynamics and to facilitate the development of targeted therapies. To this end, this study investigates the classification of RNA viruses from their genomic sequences using Machine Learning (ML) and Deep Learning (DL) models. The PhyVirus dataset, which consists of pathogenic single-stranded RNA viruses of Baltimore groups four (+ssRNA) and five (-ssRNA) with different hosts and species, was analyzed. The viral gene sequences in the dataset were encoded with the K-Mer coding technique, which represents a sequence by words of k consecutive bases for various lengths k. Classical ML algorithms (Random Forest, Gradient Boosting and Extra Trees) and a DL algorithm, the Fully Connected Deep Neural Network, were used to predict viral families and hosts. Classifier performance was analyzed in detail under different train-test ratios and different K-Mer word lengths (k-values). The results show that the Fully Connected Deep Neural Network achieved a success rate of 99.60% in predicting virus families, while the Extra Trees classifier achieved the highest success rate, 81.53%, in predicting virus hosts. This study is considered to be the first classification study in the literature on this dataset, which consists of gene sequences of single-stranded RNA viruses and has very large family and host diversity. The detailed investigation of how K-Mer word length affects classification into viral families and hosts makes this study particularly valuable. The study shows that ML and DL methods have the potential to produce valuable results in phylogenetic studies. In addition, the high performance values suggest that these methods can be used successfully in regenerative applications of gene sequences or in studies such as the elimination of losses in gene sequences.
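As a rough illustration of the K-Mer coding idea described in the abstract, the sketch below turns a sequence into a fixed-length feature vector by sliding a window of width k over it and counting each word. It is a minimal sketch under stated assumptions, not the authors' pipeline: the abstract does not say whether raw counts or normalized frequencies were used, which alphabet the PhyVirus sequences are stored in, or how ambiguous bases were handled. The function names `kmer_vocabulary` and `kmer_frequencies`, the `ACGU` alphabet, and the normalization step are all illustrative choices.

```python
from collections import Counter
from itertools import product

# Assumption: sequences use the RNA alphabet; DNA-style records would use "ACGT".
ALPHABET = "ACGU"


def kmer_vocabulary(k, alphabet=ALPHABET):
    """Return every possible word of length k in a fixed lexicographic order."""
    return ["".join(p) for p in product(alphabet, repeat=k)]


def kmer_frequencies(sequence, k, alphabet=ALPHABET):
    """Encode a sequence as normalized k-mer frequencies (one value per word).

    Hypothetical helper for illustration: windows containing symbols outside
    the alphabet (for example N) are ignored rather than imputed.
    """
    seq = sequence.upper()
    # Overlapping windows of width k, moved one base at a time.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    vocab = kmer_vocabulary(k, alphabet)
    total = sum(counts[word] for word in vocab)
    if total == 0:
        return [0.0] * len(vocab)
    return [counts[word] / total for word in vocab]


# A toy sequence, not taken from the dataset.
features = kmer_frequencies("AUGGCUAUGGA", k=2)
```

The vector has 4^k entries regardless of sequence length, so sequences of different lengths become comparable inputs for classifiers such as the tree ensembles or the fully connected network named in the abstract. Larger k values capture longer words but grow the feature space exponentially, which is one reason a study would compare several k-values.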
About the journal:
Computational Biology and Chemistry publishes original research papers and review articles in all areas of computational life sciences. High quality research contributions with a major computational component in the areas of nucleic acid and protein sequence research, molecular evolution, molecular genetics (functional genomics and proteomics), theory and practice of either biology-specific or chemical-biology-specific modeling, and structural biology of nucleic acids and proteins are particularly welcome. Exceptionally high quality research work in bioinformatics, systems biology, ecology, computational pharmacology, metabolism, biomedical engineering, epidemiology, and statistical genetics will also be considered.
Given their inherent uncertainty, protein modeling and molecular docking studies should be thoroughly validated. In the absence of experimental results for validation, complementary techniques such as molecular dynamics simulations with detailed free energy calculations should be used to support the major conclusions. Submissions of premature modeling exercises without additional biological insights will not be considered.
Review articles will generally be commissioned by the editors and should not be submitted to the journal without explicit invitation. However, prospective authors are welcome to send a brief (one to three pages) synopsis, which will be evaluated by the editors.