Detection of Potential Viral Sequence from Next Generation Sequencing Data Using Convolutional Neural Network

Impact factor: 1.3 · JCR: Q4 (Computer Science, Artificial Intelligence)
X. Y. Lim, Jia Yee Lim, Weng Howe Chan, Hui Wen Nies
DOI: 10.11113/ijic.v13n1.382 (https://doi.org/10.11113/ijic.v13n1.382)
Journal: International Journal of Innovative Computing Information and Control
Publication date: 2023-05-30
Publication type: Journal Article
Citations: 0

Abstract

Next Generation Sequencing (NGS) is a modern sequencing technology that determines RNA and DNA sequences faster and at lower cost. The availability of NGS data has sparked numerous efforts in bioinformatics, especially in the study of genetic variation and viral sequence detection. Viral sequence detection is an important step in studying virus-induced diseases. Common detection methods align a sequence against existing databases; this approach remains limited because the databases may be incomplete, which makes highly divergent viruses difficult to detect. Machine learning and deep learning have therefore been applied to the problem, learning from NGS data the patterns that distinguish viral sequences. This study focuses on viral sequence detection using a convolutional neural network (CNN). It investigates how a CNN model can be used to analyse NGS data and develops a CNN model for detecting potential viral sequences in that data. The CNN architecture is based on an existing design that is divided into two branches, a pattern branch and a frequency branch, which extract different aspects of information from the data and are finally combined into a full model. This study also implements a slightly modified architecture that includes an additional convolution layer and pooling layer. Parameter tuning is then carried out to identify near-optimal parameters for the CNN and to show their impact on performance. The optimized CNN model is evaluated on a dataset of 18,445 DNA sequences. The results show that the full CNN model in this study outperforms the existing model in terms of area under the receiver operating characteristic curve (AUROC), with an improvement of +0.1434.
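The two-branch design described in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. The abstract does not say how the branches differ, so the sketch assumes that the pattern branch applies global max pooling (keeping the strongest motif match) and the frequency branch applies global average pooling (reflecting how often a motif occurs). The function names, filter counts, filter width, and random weights are all hypothetical, and the additional convolution and pooling layers of the modified architecture are not modelled.

```python
import math
import random

BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as L rows of 4 values; unknown bases (e.g. N) become all zeros."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq.upper()]

def conv1d_relu(x, filters):
    """Valid 1D convolution followed by ReLU.

    x is an L x 4 matrix and each filter is a k x 4 kernel.
    Returns one activation track of length L - k + 1 per filter.
    """
    tracks = []
    for kernel in filters:
        k = len(kernel)
        track = []
        for start in range(len(x) - k + 1):
            s = sum(kernel[i][c] * x[start + i][c] for i in range(k) for c in range(4))
            track.append(max(0.0, s))
        tracks.append(track)
    return tracks

def random_filters(rng, n_filters, width):
    """Hypothetical untrained filters, used only to make the sketch runnable."""
    return [[[rng.uniform(-1, 1) for _ in range(4)] for _ in range(width)]
            for _ in range(n_filters)]

def predict(seq, pattern_filters, frequency_filters, out_weights, out_bias):
    """Score a sequence (at least as long as the widest filter) as a value in (0, 1)."""
    x = one_hot(seq)
    # Pattern branch (assumption): global max pooling over each activation track.
    pattern = [max(t) for t in conv1d_relu(x, pattern_filters)]
    # Frequency branch (assumption): global average pooling over each activation track.
    frequency = [sum(t) / len(t) for t in conv1d_relu(x, frequency_filters)]
    # Full model: concatenate both branches and apply a logistic output unit.
    features = pattern + frequency
    logit = out_bias + sum(w * f for w, f in zip(out_weights, features))
    return 1.0 / (1.0 + math.exp(-logit))

rng = random.Random(0)
pattern_filters = random_filters(rng, 3, 5)
frequency_filters = random_filters(rng, 3, 5)
out_weights = [rng.uniform(-1, 1) for _ in range(6)]
score = predict("ACGTTGCAACGTAGCTAGCT", pattern_filters, frequency_filters, out_weights, 0.0)
print(score)
```

With untrained random weights the score carries no meaning; the sketch only shows how two differently pooled views of the same convolutional features can be merged into one prediction.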
Source journal
CiteScore: 3.20
Self-citation rate: 20.00%
Articles published: 0
Review time: 4.3 months
About the journal: The primary aim of the International Journal of Innovative Computing, Information and Control (IJICIC) is to publish high-quality papers on new developments and trends, novel techniques and approaches, and innovative methodologies and technologies in the theory and applications of intelligent systems, information and control. The IJICIC is a peer-reviewed English-language journal and is published bimonthly.