DeepAnnotator:基因组注释与深度学习

Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics Pub Date : 2018-08-15 DOI:10.1145/3233547.3233577

M. R. Amin, Alisa Yurovsky, Yingtao Tian, S. Skiena

{"title":"DeepAnnotator:基因组注释与深度学习","authors":"M. R. Amin, Alisa Yurovsky, Yingtao Tian, S. Skiena","doi":"10.1145/3233547.3233577","DOIUrl":null,"url":null,"abstract":"Genome annotation is the process of labeling DNA sequences of an organism with its biological features, and is one of the fundamental problems in Bioinformatics. Public annotation pipelines such as NCBI integrate a variety of algorithms and homology searches on public and private databases. However, they build on the information of varying consistency and quality, produced over the last two decades. We identified 12,415 errors in NCBI RNA gene annotations, demonstrating the need for improved annotation programs. We use Recurrent Neural Network (RNN) with Long Short-Term Memory (LSTM) to demonstrate the potential of deep learning networks to annotate genome sequences, and evaluate different approaches on prokaryotic sequences from NCBI database. Particularly, we evaluate DNA $K-$mer embeddings and the application of RNNs for genome annotation. We show how to improve the performance of our deep networks by incorporating intermediate objectives and downstream algorithms to achieve better accuracy. Our method, called DeepAnnotator, achieves an F-score of ~94%, and establishes a generalized computational approach for genome annotation using deep learning. Our results are very encouraging as our method eliminates the requirement of hand crafted features and motivates further research in application of deep learning to full genome annotation. DeepAnnotator algorithms and models can be accessed in Github: \\urlhttps://github.com/ruhulsbu/DeepAnnotator.","PeriodicalId":131906,"journal":{"name":"Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics","volume":"138 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2018-08-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"7","resultStr":"{\"title\":\"DeepAnnotator: Genome Annotation with Deep Learning\",\"authors\":\"M. R. Amin, Alisa Yurovsky, Yingtao Tian, S. Skiena\",\"doi\":\"10.1145/3233547.3233577\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Genome annotation is the process of labeling DNA sequences of an organism with its biological features, and is one of the fundamental problems in Bioinformatics. Public annotation pipelines such as NCBI integrate a variety of algorithms and homology searches on public and private databases. However, they build on the information of varying consistency and quality, produced over the last two decades. We identified 12,415 errors in NCBI RNA gene annotations, demonstrating the need for improved annotation programs. We use Recurrent Neural Network (RNN) with Long Short-Term Memory (LSTM) to demonstrate the potential of deep learning networks to annotate genome sequences, and evaluate different approaches on prokaryotic sequences from NCBI database. Particularly, we evaluate DNA $K-$mer embeddings and the application of RNNs for genome annotation. We show how to improve the performance of our deep networks by incorporating intermediate objectives and downstream algorithms to achieve better accuracy. Our method, called DeepAnnotator, achieves an F-score of ~94%, and establishes a generalized computational approach for genome annotation using deep learning. Our results are very encouraging as our method eliminates the requirement of hand crafted features and motivates further research in application of deep learning to full genome annotation. DeepAnnotator algorithms and models can be accessed in Github: \\\\urlhttps://github.com/ruhulsbu/DeepAnnotator.\",\"PeriodicalId\":131906,\"journal\":{\"name\":\"Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics\",\"volume\":\"138 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2018-08-15\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"7\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3233547.3233577\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3233547.3233577","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 7

摘要

基因组注释是将生物体的DNA序列标记为其生物学特征的过程，是生物信息学的基本问题之一。像NCBI这样的公共注释管道在公共和私有数据库上集成了各种算法和同源性搜索。然而，它们建立在过去二十年中产生的一致性和质量各不相同的信息基础上。我们在NCBI RNA基因注释中发现了12,415个错误，表明需要改进注释程序。本文利用长短期记忆递归神经网络(RNN)证明了深度学习网络在基因组序列注释方面的潜力，并对NCBI数据库中原核生物序列的不同方法进行了评价。特别地，我们评估了DNA $K-$mer嵌入和rnn在基因组注释中的应用。我们展示了如何通过结合中间目标和下游算法来提高深度网络的性能，以达到更好的准确性。我们的方法，称为DeepAnnotator，达到了~94%的f分，并建立了一种使用深度学习的基因组注释的通用计算方法。我们的结果非常令人鼓舞，因为我们的方法消除了手工制作特征的要求，并激发了深度学习在全基因组注释中的应用的进一步研究。DeepAnnotator算法和模型可以在Github中访问:\urlhttps://github.com/ruhulsbu/DeepAnnotator。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

DeepAnnotator: Genome Annotation with Deep Learning

Genome annotation is the process of labeling DNA sequences of an organism with its biological features, and is one of the fundamental problems in Bioinformatics. Public annotation pipelines such as NCBI integrate a variety of algorithms and homology searches on public and private databases. However, they build on the information of varying consistency and quality, produced over the last two decades. We identified 12,415 errors in NCBI RNA gene annotations, demonstrating the need for improved annotation programs. We use Recurrent Neural Network (RNN) with Long Short-Term Memory (LSTM) to demonstrate the potential of deep learning networks to annotate genome sequences, and evaluate different approaches on prokaryotic sequences from NCBI database. Particularly, we evaluate DNA $K-$mer embeddings and the application of RNNs for genome annotation. We show how to improve the performance of our deep networks by incorporating intermediate objectives and downstream algorithms to achieve better accuracy. Our method, called DeepAnnotator, achieves an F-score of ~94%, and establishes a generalized computational approach for genome annotation using deep learning. Our results are very encouraging as our method eliminates the requirement of hand crafted features and motivates further research in application of deep learning to full genome annotation. DeepAnnotator algorithms and models can be accessed in Github: \urlhttps://github.com/ruhulsbu/DeepAnnotator.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics

自引率

0.00%

发文量