Abdullatif Al-Najim, Sven Hauns, Van Dinh Tran, Rolf Backofen, Omer S Alkhnbashi
GigaScience, volume 14. Published 2025-01-06. DOI: 10.1093/gigascience/giaf037. Full text: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12080225/pdf/
HVSeeker: a deep-learning-based method for identification of host and viral DNA sequences.
Background: Bacteriophages are among the most abundant organisms on Earth, significantly impacting ecosystems and human society. The identification of viral sequences, especially novel ones, from mixed metagenomes is a critical first step in analyzing the viral components of host samples, and it plays a key role in many downstream tasks. However, the task is challenging because viruses evolve rapidly. The identification process typically involves two steps: distinguishing viral sequences from host sequences, and determining whether they come from novel viral genomes. Traditional metagenomic techniques that rely on sequence similarity with known entities often fall short, especially when dealing with short or novel genomes. Meanwhile, deep learning has demonstrated its efficacy across various domains, including bioinformatics.
Results: We have developed HVSeeker, a host/virus seeker method based on deep learning, to distinguish between bacterial and phage sequences. HVSeeker consists of two separate models: one analyzing DNA sequences and the other focusing on proteins. In addition to the robust architecture of HVSeeker, three distinct preprocessing methods were introduced to enhance the learning process: padding, contig assembly, and sliding window. This method has shown promising results on sequences of various lengths, ranging from 200 to 1,500 base pairs. Tested on both NCBI and IMGVR databases, HVSeeker outperformed several methods from the literature such as Seeker, Rnn-VirSeeker, DeepVirFinder, and PPR-Meta. Moreover, when compared with other methods on benchmark datasets, HVSeeker has shown better performance, establishing its effectiveness in identifying unknown phage genomes.
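To make two of the preprocessing ideas named above concrete, the following is a minimal sketch of padding and sliding-window extraction on a DNA string. It is not HVSeeker's implementation: the function names, the pad character "N", and the window and step sizes are all assumptions chosen for illustration. Contig assembly is left out because the abstract does not describe how it is carried out.

```python
from typing import List


def pad_sequence(seq: str, length: int, pad_char: str = "N") -> str:
    """Right-pad (or truncate) a sequence so it has exactly `length` characters.

    Hypothetical helper: the pad character "N" is an assumption, not taken
    from the paper.
    """
    return seq[:length].ljust(length, pad_char)


def sliding_windows(seq: str, window: int, step: int) -> List[str]:
    """Cut a sequence into fixed-size windows, advancing `step` bases each time.

    Sequences no longer than one window are returned unchanged as a single
    fragment. Trailing bases that do not fill a full window are dropped.
    """
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    if len(seq) <= window:
        return [seq]
    return [seq[i:i + window] for i in range(0, len(seq) - window + 1, step)]


if __name__ == "__main__":
    fragment = "ACGTACGT"
    print(pad_sequence("ACGT", 6))           # short read padded to length 6
    print(sliding_windows(fragment, 4, 2))   # overlapping windows of size 4
```

Under these assumptions, padding brings short reads up to a fixed model input length, while the sliding window turns one long sequence into several overlapping fixed-length training examples.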
Conclusions: These results demonstrate the exceptional structure of HVSeeker, which encompasses both the preprocessing methods and the model design. The advancements provided by HVSeeker are significant for identifying viral genomes and developing new therapeutic approaches, such as phage therapy. Therefore, HVSeeker serves as an essential tool in prokaryotic and phage taxonomy, offering a crucial first step toward analyzing the host-viral component of samples by identifying the host and viral sequences in mixed metagenomes.
About the journal:
GigaScience seeks to transform data dissemination and utilization in the life and biomedical sciences. As an online open-access open-data journal, it specializes in publishing "big-data" studies encompassing various fields. Its scope includes not only "omic" type data and the fields of high-throughput biology currently serviced by large public repositories, but also the growing range of more difficult-to-access data, such as imaging, neuroscience, ecology, cohort data, systems biology and other new types of large-scale shareable data.