VirusPredictor: XGBoost-based software to predict virus-related sequences in human data.

Impact factor: 4.4 | Zone 3, Biology | JCR Q1, Biochemical Research Methods
Guangchen Liu, Xun Chen, Yihui Luan, Dawei Li
DOI: 10.1093/bioinformatics/btae192
Journal: Bioinformatics
Publication date: 2024-04-10
Publication type: Journal Article
Citations: 0

Abstract

VirusPredictor: XGBoost-based software to predict virus-related sequences in human data.
MOTIVATION
Discovering disease-causing pathogens, particularly viruses without reference genomes, poses a technical challenge because they are often unidentifiable through sequence alignment. Machine learning prediction on patient high-throughput sequences that cannot be mapped to human or pathogen genomes may reveal sequences originating from uncharacterized viruses. Currently, there is a lack of software specifically designed to accurately predict such viral sequences in human data.

RESULTS
We developed a fast XGBoost method and software, VirusPredictor, leveraging an in-house viral genome database. Our two-step XGBoost models first classify each query sequence into one of three groups: infectious virus, endogenous retrovirus (ERV), or non-ERV human. Prediction accuracy increased with sequence length: 0.76, 0.93, and 0.98 for sequences of 150-350 bp (Illumina short reads), 850-950 bp (Sanger sequencing data), and 2,000-5,000 bp, respectively. Sequences predicted to be from infectious viruses are then further classified into one of six virus taxonomic subgroups; here accuracy increased from 0.92 to > 0.98 as query sequences grew from 150-350 bp to > 850 bp. The results suggest that Illumina short reads should be de novo assembled into contigs (e.g., ∼1,000 bp or longer) before prediction whenever possible. We applied VirusPredictor to multiple real genomic and metagenomic datasets and obtained high accuracies. VirusPredictor, a user-friendly open-source Python program, is useful for predicting the origins of patients' unmappable sequences. This study is the first to classify ERVs in infectious viral sequence prediction. It is also the first study to combine virus subgroup predictions.

AVAILABILITY
www.dllab.org/software/VirusPredictor.html

SUPPLEMENTARY INFORMATION
Supplementary data are available at Bioinformatics online.
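
The two-step scheme in the abstract (a three-way first classification, followed by a subgroup classification only for sequences called infectious virus) can be illustrated with a minimal Python sketch. This is not VirusPredictor's code or API: the function names, the `toy_step1`/`toy_step2` stand-ins, and the `subgroup_1` label are hypothetical placeholders, and the real software uses trained XGBoost models whose features are not described in the abstract. Only the three group names, the length ranges, and the step-1 accuracies are taken from the text.

```python
from typing import Callable, Optional, Tuple

# Step-1 groups named in the abstract.
STEP1_LABELS = ("infectious virus", "ERV", "non-ERV human")

# Step-1 accuracies reported in the abstract, keyed by length range in bp.
REPORTED_STEP1_ACCURACY = {
    (150, 350): 0.76,    # Illumina short reads
    (850, 950): 0.93,    # Sanger sequencing data
    (2000, 5000): 0.98,  # longer sequences
}


def reported_accuracy(length_bp: int) -> Optional[float]:
    """Return the reported step-1 accuracy for a sequence length.

    Returns None when the length falls outside the ranges the
    abstract reports; no interpolation is attempted.
    """
    for (low, high), accuracy in REPORTED_STEP1_ACCURACY.items():
        if low <= length_bp <= high:
            return accuracy
    return None


def two_step_predict(
    seq: str,
    step1: Callable[[str], str],
    step2: Callable[[str], str],
) -> Tuple[str, Optional[str]]:
    """Run the cascade: step 2 is applied only to infectious-virus calls."""
    group = step1(seq)
    if group not in STEP1_LABELS:
        raise ValueError(f"unexpected step-1 label: {group!r}")
    subgroup = step2(seq) if group == "infectious virus" else None
    return group, subgroup


# Hypothetical stand-ins for the trained models, for demonstration only.
def toy_step1(seq: str) -> str:
    # Arbitrary placeholder rule; it has no biological meaning.
    return "infectious virus" if seq.startswith("ATG") else "non-ERV human"


def toy_step2(seq: str) -> str:
    # The abstract does not name the six subgroups, so use a placeholder.
    return "subgroup_1"


if __name__ == "__main__":
    print(two_step_predict("ATGCCGTA", toy_step1, toy_step2))
    print(reported_accuracy(200))
```

Passing the classifiers in as callables keeps the cascade logic separate from whichever models are plugged in, and `reported_accuracy` makes the abstract's point concrete: a 200 bp read falls in the 0.76 tier, which is why assembling short reads into longer contigs before prediction is recommended.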
Source journal: Bioinformatics
Category: Biology, Biochemical Research Methods
CiteScore: 11.20
Self-citation rate: 5.20%
Articles published: 753
Review time: 2.1 months
Journal description: The leading journal in its field, Bioinformatics publishes the highest quality scientific papers and review articles of interest to academic and industrial researchers. Its main focus is on new developments in genome bioinformatics and computational biology. Two distinct sections within the journal, Discovery Notes and Application Notes, focus on shorter papers; the former reports biologically interesting discoveries made using computational methods, and the latter explores the applications used for experiments.