Self-attention based deep learning model for predicting the coronavirus sequences from high-throughput sequencing data

ZhenNan Wang, ChaoMei Liu
medRxiv - Infectious Diseases, 2024-08-07. DOI: 10.1101/2024.08.07.24311618
Transformer models have achieved excellent results across a wide range of tasks, primarily owing to the self-attention mechanism. We explore using self-attention to detect coronavirus sequences in high-throughput sequencing data, offering a novel approach for accurately identifying emerging and highly variable coronavirus strains. Coronavirus and human genome data were obtained from the Genomic Data Commons (GDC) and the National Genomics Data Center (NGDC) databases. After preprocessing, a simulated high-throughput sequencing dataset of coronavirus-infected samples was constructed and divided into training, validation, and test sets. The self-attention-based model was trained on the training set and evaluated on the validation and test sets; SARS-CoV-2 genome data were collected as an independent test set. The results showed that the self-attention-based model outperformed traditional bioinformatics methods on both the test set and the independent test set, with a significant improvement in computation speed. The model sensitively and rapidly detects coronavirus sequences from high-throughput sequencing data while exhibiting strong generalization ability; it accurately identifies emerging and highly variable coronavirus strains, providing a new approach for detecting such viruses.
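The abstract does not describe the model's architecture in detail, but the core operation it names, self-attention over a sequence, can be sketched minimally. The snippet below is a hypothetical illustration only: it applies single-head scaled dot-product self-attention to a one-hot encoded DNA read, with arbitrary random weights and an assumed encoding scheme, not the authors' trained model.

```python
import numpy as np

BASES = "ACGT"

def one_hot(read: str) -> np.ndarray:
    """Encode a DNA read as a (length, 4) one-hot matrix (assumed encoding)."""
    idx = [BASES.index(b) for b in read]
    m = np.zeros((len(read), 4))
    m[np.arange(len(read)), idx] = 1.0
    return m

def self_attention(x: np.ndarray, wq: np.ndarray, wk: np.ndarray,
                   wv: np.ndarray) -> np.ndarray:
    """Single-head scaled dot-product self-attention over one read."""
    q, k, v = x @ wq, x @ wk, x @ wv
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    # Row-wise softmax: each position attends over every position in the read.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

# Arbitrary random projection weights for illustration only.
rng = np.random.default_rng(0)
d_model = 8
wq, wk, wv = (rng.normal(size=(4, d_model)) for _ in range(3))

x = one_hot("ACGTACGTAAGG")          # a toy 12-base read
out = self_attention(x, wq, wk, wv)  # (12, 8): one context vector per base
print(out.shape)
```

In a read-classification setting such as the one described, the per-position context vectors would typically be pooled and passed to a classifier head that scores each read as coronavirus or human; those downstream details are not specified in the abstract.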