Machine Learning in the Identification of Coding Regions in DNA Sequences of Filamentous Fungi

Gustavo Henrique Ferreira Cruz, Vinícius Menossi, Josiane Melchiori Pinheiro, Antônio Roberto dos Santos, Gustavo Luiz Furuhata Ferreira, Sarah Anduca de Oliveira
Published in: Anais do XIII Computer on the Beach - COTB'22, 2022-07-13
DOI: 10.14210/cotb.v13.p236-242 (https://doi.org/10.14210/cotb.v13.p236-242)
Citations: 0

Original title (Portuguese): Aprendizagem de Máquina na identificação de regiões codantes em sequências de DNA de fungos filamentosos

Abstract

Identifying intron and exon regions in genes is a complex task that requires recognizing certain nucleotide patterns in the gene sequence. It can be done manually or with software that most often relies on alignment techniques, which are not very effective for this purpose. Taking this as an opportunity for collaboration between biology and computer science, this work uses machine learning to predict the intron and exon regions in filamentous fungi genes and to translate the identified regions into protein codons. The problem was modeled as a supervised learning problem, with training on a set of genes obtained from GenBank whose intron and exon regions are already annotated. The machine learning model used was Conditional Random Fields (CRF). The values of the metrics applied to the model show that good precision can be achieved in identifying the intron and exon regions as well as the protein codons. Thus, although a greater diversity of database characteristics is needed to support the effectiveness of splice-site identification, this paper gives evidence that these splice sites can be predicted with good accuracy.
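
The abstract describes the pipeline only at a high level, so the following is a minimal sketch of how such a sequence-labeling setup might look, not the authors' implementation. It assumes one label per nucleotide ("E" for exon, "I" for intron), a small hypothetical feature window around each position, and the standard genetic code for translation. The toy gene, the function names, and the feature names are all illustrative assumptions. The feature dictionaries are in the shape that a CRF toolkit accepting per-position features could consume; no CRF is trained here.

```python
from itertools import product

# Standard genetic code, built from the conventional T, C, A, G ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def label_positions(length, exons):
    """Return one label per nucleotide: 'E' inside an exon, 'I' elsewhere.

    `exons` holds 0-based, end-exclusive (start, end) intervals, as one
    might derive from an annotated record (assumed input format).
    """
    labels = ["I"] * length
    for start, end in exons:
        for i in range(start, end):
            labels[i] = "E"
    return labels


def position_features(seq, i, window=2):
    """Hypothetical feature dict for position i: the base and its neighbours."""
    feats = {"base": seq[i]}
    for off in range(-window, window + 1):
        if off == 0:
            continue
        j = i + off
        feats[f"base[{off:+d}]"] = seq[j] if 0 <= j < len(seq) else "PAD"
    # Dinucleotide starting here; GT and AG commonly mark intron boundaries.
    feats["dimer"] = seq[i:i + 2]
    return feats


def spliced_exons(seq, labels):
    """Concatenate the bases labelled as exon, dropping intron bases."""
    return "".join(base for base, lab in zip(seq, labels) if lab == "E")


def translate(cds):
    """Translate a coding sequence codon by codon, stopping at a stop codon."""
    protein = []
    for k in range(0, len(cds) - 2, 3):
        aa = CODON_TABLE.get(cds[k:k + 3], "X")  # 'X' for unknown codons
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


# Toy gene (illustrative only): exon, intron starting with GT and ending with AG, exon.
GENE = "ATGGCC" + "GTAAGTTTTCAG" + "AAATAA"
EXONS = [(0, 6), (18, 24)]

labels = label_positions(len(GENE), EXONS)
features = [position_features(GENE, i) for i in range(len(GENE))]

print("".join(labels))
print(translate(spliced_exons(GENE, labels)))  # MAK
```

In a supervised setting, `features` and `labels` for many annotated genes would form the training pairs, and the labels predicted for a new sequence would be passed to `spliced_exons` and `translate` in the same way.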