DNA Sequence Recognition using Image Representation

C. Luis A. Santamaría, H. Sarahí Zuñiga, I. H. P. Torres, M. J. S. García, Mario Rossainz López
Res. Comput. Sci., volume 22 1. Published 2019-12-31. Journal Article. DOI: https://doi.org/10.13053/rcs-148-3-9
Citations: 4

Abstract

In recent years, machine learning has made substantial progress on difficult classification problems. This article addresses two tasks: recognizing DNA sequences, and recognizing the boundaries between exons and introns, using a graphical representation of DNA sequences together with recent deep learning methods. The objective of this work is to classify DNA sequences using a convolutional neural network (CNN). For sequence recognition, the data set consisted of 1847 sequences covering four types of hepatitis C virus (types 1, 2, 3 and 6), taken from the repository available on the ViPR page. For the recognition of boundaries between exons and introns, the sequences came from the Molecular Biology (Splice-junction Gene Sequences) Data Set, available on the UCI page, which contains 3190 sequences in three classes: exon-intron boundary, intron-exon boundary, and neither. To process the DNA sequences, a representation method was designed in which each nitrogenous base is encoded as a gray level, so that a sequence forms a grayscale image. The generated images were used to train the CNN. The results obtained with the CNN trained on the hepatitis C virus database suggest that CNNs are suitable for classifying images generated from DNA sequences. This result led us to run the exon-intron boundary experiments on the UCI database, which yielded a training precision of 82%, a validation accuracy of 75% and an evaluation accuracy of 80.8%. It is concluded that the images of DNA sequences from the databases used can be classified.
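
The representation step described in the abstract can be illustrated with a minimal sketch. The abstract does not state which gray level is assigned to each base or how the pixels are arranged, so the `GRAY_LEVELS` mapping, the `UNKNOWN_LEVEL` padding value, the row-by-row layout, and the function name `sequence_to_image` are all assumptions made for illustration, not the authors' method.

```python
import numpy as np

# Assumed gray levels (0-255) per base; the abstract does not give the actual values.
GRAY_LEVELS = {"A": 64, "C": 128, "G": 192, "T": 255}
# Assumed level for padding and for any symbol other than A, C, G, T.
UNKNOWN_LEVEL = 0


def sequence_to_image(seq: str, width: int) -> np.ndarray:
    """Map each base to a gray level and lay the pixels out row by row."""
    pixels = [GRAY_LEVELS.get(base, UNKNOWN_LEVEL) for base in seq.upper()]
    # Pad the final row so the pixel count is a multiple of the image width.
    remainder = len(pixels) % width
    if remainder:
        pixels.extend([UNKNOWN_LEVEL] * (width - remainder))
    return np.array(pixels, dtype=np.uint8).reshape(-1, width)


image = sequence_to_image("ACGTTGCA", width=4)
print(image)
```

Each resulting array is a single-channel image that could be passed to a CNN as one training sample. A fixed `width` keeps the images the same shape when the sequences share a common length.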