Dive into Machine Learning Algorithms for Influenza Virus Host Prediction with Hemagglutinin Sequences

Impact factor 1.2, JCR Q3, Computer Science
Yanhua Xu, D. Wojtczak
DOI: 10.48550/arXiv.2207.13842
URL: https://doi.org/10.48550/arXiv.2207.13842
Publication date: 2022-07-28
Publication type: Journal Article
Indexed via: Semantic Scholar
Citations: 7

Abstract

Influenza viruses mutate rapidly and can threaten public health, especially among vulnerable groups. Throughout history, influenza A viruses have caused pandemics across different species. Identifying the origin of a virus is important for preventing the spread of an outbreak. Recently, interest has grown in using machine learning algorithms to provide fast and accurate predictions for viral sequences. In this study, real testing data sets and a variety of evaluation metrics were used to evaluate machine learning algorithms at different taxonomic levels. Because hemagglutinin is the major protein in the immune response, only hemagglutinin sequences were used, represented by position-specific scoring matrices and word embeddings. The results suggest that the 5-grams-transformer neural network is the most effective algorithm for predicting viral sequence origins, with approximately 99.54% AUCPR, 98.01% F1 score and 96.60% MCC at a higher classification level, and approximately 94.74% AUCPR, 87.41% F1 score and 80.79% MCC at a lower classification level.
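The abstract names two ingredients that a small example can make concrete: n-gram tokens of a protein sequence (the input to a word embedding) and the F1 and MCC metrics. The following is a minimal sketch only, not the authors' pipeline. It assumes overlapping 5-grams with stride 1 (the abstract does not say whether the n-grams overlap) and the binary forms of F1 and MCC (the study likely uses multi-class variants). The function names `kmer_tokens`, `f1_score`, and `mcc`, the toy sequence, and the confusion counts are all hypothetical and chosen for illustration.

```python
import math


def kmer_tokens(sequence, k=5):
    """Split a sequence into overlapping k-grams (stride 1, an assumption)."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def f1_score(tp, fp, fn):
    """Binary F1 from confusion counts: 2TP / (2TP + FP + FN)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mcc(tp, tn, fp, fn):
    """Binary Matthews correlation coefficient from confusion counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Toy amino-acid string, made up for illustration (not real data).
tokens = kmer_tokens("MKTIIALSY", k=5)
print(tokens)

# Hypothetical confusion counts for a two-class host prediction.
print(f1_score(tp=8, fp=2, fn=2), mcc(tp=8, tn=8, fp=2, fn=2))
```

Each 5-gram would then be mapped to an integer index or embedding vector before being fed to a sequence model. MCC uses all four confusion-matrix cells, which is one reason it is often reported alongside F1 on imbalanced classes.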
Source journal
Bio-Algorithms and Med-Systems (Mathematical & Computational Biology)
CiteScore: 3.80
Self-citation rate: 0.00%
Articles published: 3
About the journal: Bio-Algorithms and Med-Systems (BAMS), edited by the Jagiellonian University Medical College, provides a forum for the exchange of information in the interdisciplinary fields of computational methods applied in medicine, presenting new algorithms and databases that enable progress in collaboration between medicine, informatics, physics, and biochemistry. Projects linking specialists from these disciplines are welcome. Articles in BAMS are published in English. Topics: bioinformatics, systems biology, telemedicine, e-learning in medicine, patient's electronic record, image processing, medical databases.