X. Y. Lim, Jia Yee Lim, Weng Howe Chan, Hui Wen Nies
Detection of Potential Viral Sequence from Next Generation Sequencing Data Using Convolutional Neural Network
International Journal of Innovative Computing Information and Control, published 2023-05-30
DOI: 10.11113/ijic.v13n1.382 (https://doi.org/10.11113/ijic.v13n1.382)
Impact factor: 1.3; JCR: Q4 (Computer Science, Artificial Intelligence)
Citations: 0
Abstract
Next Generation Sequencing (NGS) is a modern sequencing technology that determines RNA and DNA sequences faster and at lower cost than earlier sequencing methods. The availability of NGS data has sparked numerous efforts in bioinformatics, especially in the study of genetic variation and viral sequence detection. Viral sequence detection is an important step in studying virus-induced diseases. Common detection methods align a sequence against existing databases, an approach that remains limited because these databases may be incomplete, which makes highly divergent viruses difficult to detect. Machine learning and deep learning have therefore been applied to this problem, learning from NGS data the patterns that distinguish viral sequences. This study focuses on viral sequence detection using a convolutional neural network (CNN). It investigates how a CNN model can be used to analyze NGS data and develops a CNN model for detecting potential viral sequences in NGS data. The CNN architecture is based on an existing design that is divided into two branches, a pattern branch and a frequency branch, which extract different aspects of information from the data and are then combined into a full model. This study further implemented a slightly modified architecture that includes an additional convolution layer and pooling layer. Parameter tuning was then performed to identify near-optimal parameters for the CNN and to assess their impact on performance. The optimized CNN model was evaluated on a dataset of 18,445 DNA sequences. The results show that the CNN model in this study outperformed the existing model in terms of area under the receiver operating characteristic curve (AUROC) for the full model (+0.1434).
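The two-branch idea and the AUROC metric described in the abstract can be illustrated with a minimal, untrained sketch in plain Python. Several points here are assumptions rather than details from the paper: the abstract does not state how each branch pools its convolution output, so this sketch assumes the pattern branch uses global max pooling (does a motif occur anywhere?) and the frequency branch uses global average pooling (how often does it occur?). The function names, filter values, and weights are hypothetical, and the additional convolution and pooling layers of the modified architecture are not modeled.

```python
import math

BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as rows of 4 channels (A, C, G, T).
    Unknown bases such as N become all-zero rows."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq.upper()]

def conv1d_relu(x, filters):
    """Valid 1D convolution followed by ReLU.
    x: list of L one-hot rows; filters: list of kernels, each k rows of 4 weights.
    Returns one activation list of length L - k + 1 per filter.
    Assumes the sequence is at least as long as the longest kernel."""
    out = []
    for kernel in filters:
        k = len(kernel)
        acts = []
        for i in range(len(x) - k + 1):
            s = sum(kernel[j][c] * x[i + j][c] for j in range(k) for c in range(4))
            acts.append(max(0.0, s))
        out.append(acts)
    return out

def two_branch_score(seq, pattern_filters, freq_filters, weights, bias):
    """Combine both branches into one sigmoid score in (0, 1)."""
    x = one_hot(seq)
    # Pattern branch (assumed): global max pooling keeps the strongest match.
    pattern = [max(a) for a in conv1d_relu(x, pattern_filters)]
    # Frequency branch (assumed): global average pooling keeps the match rate.
    freq = [sum(a) / len(a) for a in conv1d_relu(x, freq_filters)]
    features = pattern + freq  # concatenation forms the "full model" input
    z = bias + sum(w * f for w, f in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

def auroc(labels, scores):
    """AUROC as the probability that a random positive (label 1) scores
    higher than a random negative (label 0); ties count as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical hand-picked parameters, for illustration only (not trained).
pattern_filters = [[[1, 0, 0, 0], [0, 1, 0, 0]]]   # responds to the 2-mer "AC"
freq_filters = [[[0, 0, 1, 0]], [[0, 1, 0, 0]]]    # respond to single G, single C
weights = [0.5, 1.0, -1.0]
bias = -1.0

if __name__ == "__main__":
    reads = ["ACAC", "GGGG", "TTNT"]   # toy reads, not real data
    labels = [1, 0, 0]                 # toy labels, not real data
    scores = [two_branch_score(r, pattern_filters, freq_filters, weights, bias)
              for r in reads]
    print(scores, auroc(labels, scores))
```

In a real setting the filters and weights would be learned by gradient descent in a deep learning framework, and AUROC would be computed on held-out sequences; the sketch only shows how the two pooled feature sets are concatenated and scored.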
About the journal:
The primary aim of the International Journal of Innovative Computing, Information and Control (IJICIC) is to publish high-quality papers on new developments and trends, novel techniques and approaches, and innovative methodologies and technologies in the theory and applications of intelligent systems, information and control. The IJICIC is a peer-reviewed English-language journal and is published bimonthly.