VirusPredictor: XGBoost-based software to predict virus-related sequences in human data
Authors: Guangchen Liu, Xun Chen, Yihui Luan, Dawei Li
Journal: Bioinformatics (JCR Q1, Biochemical Research Methods; impact factor 4.4)
Published: 2024-04-10
DOI: 10.1093/bioinformatics/btae192 (https://doi.org/10.1093/bioinformatics/btae192)
Citations: 0
Abstract
VirusPredictor: XGBoost-based software to predict virus-related sequences in human data.
MOTIVATION
Discovering disease-causing pathogens, particularly viruses without reference genomes, poses a technical challenge because they are often unidentifiable through sequence alignment. Machine learning prediction applied to patient high-throughput sequences that cannot be mapped to human or pathogen genomes may reveal sequences originating from uncharacterized viruses. Currently, there is a lack of software specifically designed to accurately predict such viral sequences in human data.
RESULTS
We developed VirusPredictor, a fast XGBoost-based method and software package that leverages an in-house viral genome database. Its two-step XGBoost models first classify each query sequence into one of three groups: infectious virus, endogenous retrovirus (ERV), or non-ERV human. Prediction accuracy increased with sequence length: 0.76, 0.93, and 0.98 for sequences of 150-350 bp (Illumina short reads), 850-950 bp (Sanger sequencing data), and 2,000-5,000 bp, respectively. Sequences predicted to be from infectious viruses are then further classified into one of six virus taxonomic subgroups, with accuracy rising from 0.92 to > 0.98 as query length increased from 150-350 bp to > 850 bp. These results suggest that Illumina short reads should be de novo assembled into contigs (e.g., ∼1,000 bp or longer) before prediction whenever possible. We applied VirusPredictor to multiple real genomic and metagenomic datasets and obtained high accuracies. VirusPredictor, a user-friendly open-source Python program, is useful for predicting the origins of patients' unmappable sequences. This study is the first to classify ERVs in infectious viral sequence prediction and the first to combine virus subgroup predictions.
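The two-step routing described above can be illustrated with a minimal Python sketch. This is not VirusPredictor's code or API. The k-mer frequency features, the function names (`kmer_frequencies`, `two_step_predict`, `assembly_recommended`), the label strings, and the 1,000 bp cutoff are all assumptions made for illustration; the abstract does not state which features the models use or name the six subgroups. The classifiers are passed in as plain callables, so trained XGBoost models (or any other models) could be wrapped and supplied in their place.

```python
from collections import Counter
from itertools import product
from typing import Callable, Dict, List, Optional

# Hypothetical label strings for the three first-step groups named in the abstract.
STEP1_LABELS = ("infectious_virus", "ERV", "non_ERV_human")

# A classifier here is any callable mapping a feature vector to a label string.
Classifier = Callable[[List[float]], str]


def kmer_frequencies(seq: str, k: int = 3) -> List[float]:
    """Normalized k-mer frequencies in a fixed A/C/G/T order (4**k values).

    Assumption: k-mer features are an illustrative choice only; the abstract
    does not describe the features used by VirusPredictor.
    """
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Windows containing ambiguous bases (e.g. N) are not in `kmers` and are ignored.
    total = sum(counts[km] for km in kmers)
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[km] / total for km in kmers]


def two_step_predict(
    seq: str, step1: Classifier, step2: Classifier
) -> Dict[str, Optional[str]]:
    """Run step 1 on every sequence; run step 2 only on predicted infectious viruses."""
    features = kmer_frequencies(seq)
    group = step1(features)
    if group not in STEP1_LABELS:
        raise ValueError(f"unexpected first-step label: {group!r}")
    # ERV and non-ERV human predictions stop after the first step.
    subgroup = step2(features) if group == "infectious_virus" else None
    return {"group": group, "subgroup": subgroup}


def assembly_recommended(length_bp: int, min_contig_bp: int = 1000) -> bool:
    """Flag sequences shorter than an assumed contig length for de novo assembly.

    The default reflects the abstract's "~1,000 bp or longer" suggestion;
    the exact cutoff is an assumption.
    """
    return length_bp < min_contig_bp
```

For example, `two_step_predict(seq, step1, step2)` returns a dictionary whose `subgroup` entry is `None` unless the first step predicts an infectious virus. Keeping the second step conditional mirrors the abstract's description that only predicted infectious-virus sequences are assigned to a taxonomic subgroup.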
AVAILABILITY
www.dllab.org/software/VirusPredictor.html.
SUPPLEMENTARY INFORMATION
Supplementary data are available at Bioinformatics online.
About the journal:
The leading journal in its field, Bioinformatics publishes the highest quality scientific papers and review articles of interest to academic and industrial researchers. Its main focus is on new developments in genome bioinformatics and computational biology. Two distinct sections within the journal, Discovery Notes and Application Notes, focus on shorter papers: the former reports biologically interesting discoveries made using computational methods, and the latter explores the applications used for experiments.