Machine Learning for the Identification of Coding Regions in DNA Sequences of Filamentous Fungi

Anais do XIII Computer on the Beach - COTB'22. Pub Date: 2022-07-13. DOI: 10.14210/cotb.v13.p236-242

Gustavo Henrique Ferreira Cruz, Vinícius Menossi, Josiane Melchiori Pinheiro, Antônio Roberto dos Santos, Gustavo Luiz Furuhata Ferreira, Sarah Anduca de Oliveira

Abstract

Original title (Portuguese): Aprendizagem de Máquina na identificação de regiões codantes em sequências de DNA de fungos filamentosos

Identifying intron and exon regions in genes is a very complex task that requires recognizing certain nucleotide patterns in the gene sequence. It can be done manually or with software that most often relies on genetic alignment techniques, which are not very effective for this purpose. In this collaboration between biology and computer science, the objective was to use machine learning to predict the intron and exon regions in filamentous fungi genes and to translate the identified regions into protein codons. The problem was modeled as a supervised learning problem, trained on a set of genes obtained from GenBank whose intron and exon regions are already annotated. The machine learning model used in this work was Conditional Random Fields (CRF). The values of the metrics applied to the model show that good precision can be achieved in identifying the intron and exon regions as well as the protein codons. Thus, although a greater diversity of database characteristics is needed to support the effectiveness of identifying the splicing sites, this paper gives evidence that these splicing sites can be predicted with good accuracy.
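The abstract does not state the label scheme or the features fed to the CRF, so the following is only a minimal sketch of how the task might be framed as per-nucleotide sequence labeling. The `I`/`E` labels, the neighbour-window features, the toy gene, and all function names are assumptions made for illustration, not the authors' method. No CRF is trained here: the code only prepares labels and features of the kind a CRF implementation could consume, then splices the exons and translates them with the standard genetic code.

```python
from itertools import product

# Standard genetic code, built from the conventional TCAG codon ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def label_positions(length, exons):
    """Return one label per nucleotide: 'E' inside an exon, 'I' otherwise.

    exons is a list of 0-based, half-open (start, end) intervals
    (an assumed convention, chosen for this sketch).
    """
    labels = ["I"] * length
    for start, end in exons:
        for i in range(start, end):
            labels[i] = "E"
    return labels


def position_features(seq, i, window=2):
    """Hypothetical feature dict for position i: the base and its neighbours."""
    feats = {"base": seq[i]}
    for off in range(-window, window + 1):
        if off == 0:
            continue
        j = i + off
        feats[f"base[{off:+d}]"] = seq[j] if 0 <= j < len(seq) else "PAD"
    return feats


def exons_from_labels(labels):
    """Collapse a predicted label sequence back into half-open exon intervals."""
    exons, start = [], None
    for i, lab in enumerate(labels):
        if lab == "E" and start is None:
            start = i
        elif lab != "E" and start is not None:
            exons.append((start, i))
            start = None
    if start is not None:
        exons.append((start, len(labels)))
    return exons


def translate(seq, exons):
    """Join the exons and translate codon by codon ('X' for unknown codons)."""
    coding = "".join(seq[s:e] for s, e in exons)
    return "".join(
        CODON_TABLE.get(coding[i:i + 3], "X")
        for i in range(0, len(coding) - 2, 3)
    )


# Toy gene (made up): exon, a GT...AG intron, then a second exon.
gene = "ATGGCC" + "GTAAGTTTTCAG" + "AAATAA"
exons = [(0, 6), (18, 24)]

labels = label_positions(len(gene), exons)          # training targets
features = [position_features(gene, i) for i in range(len(gene))]
protein = translate(gene, exons_from_labels(labels))
print("".join(labels))
print(protein)
```

In a supervised setup like the one described, the feature and label sequences for each GenBank gene would form one training example, and `exons_from_labels` would be applied to the model's predicted labels before translation.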
