A Novel Approach for Increasing Taxonomic Resolution in Protein-Based Alignments

Cooper J. Park, Keir J. Macartney, Junfu Shen, Kunpeng Xie, Xin Zhang, R. Bergeron, W. Thomas, Cheryl P. Andam, A. Westbrook
Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics. Published 2018-08-15. DOI: 10.1145/3233547.3233646 (https://doi.org/10.1145/3233547.3233646)

Abstract

Most current genome sequencing technologies require that genomes be sequenced in fragments. These fragments are then typically aligned with one of a variety of alignment programs. All alignment tools query a reference database to determine the most accurate reassembly of the original DNA strand's nucleotide sequence. Although these programs can align in either nucleotide or protein space, each approach has its own disadvantages. Protein aligners such as PALADIN consistently align a greater percentage of reads, do so faster, and provide greater insight into the functional capabilities of the aligned sequence. On the other hand, protein alignment reduces the sensitivity of taxonomic classification because of the degeneracy of the genetic code. Our program, Renuc, is a PALADIN plugin that addresses this issue by taking protein alignment results obtained with the UniProt database and identifying the most likely taxonomic origin for each nucleotide sequence associated with each detected protein. We validated our approach and its implementation in Renuc by successfully retrieving the nucleotide sequences and corresponding taxonomic IDs for all of the aligned proteins in our test dataset, which consists of a whole Escherichia coli genome. Our program aligns over 99 percent of the nucleotide reads, with 97 percent of them remaining in the same protein cluster as the original protein alignment. However, this dataset is exceptionally well studied and documented in UniProt; future work should evaluate a dataset with fewer annotations in the database. Renuc quickly identifies and visualizes the alignment's taxonomic data in a user-friendly way. The integration of SQLite into the program significantly reduces the time required to retrieve information from the UniProt database. Currently, we seek to improve the retrieval of nucleotide sequences by creating a local cache of the NCBI RefSeq database, and to visualize taxonomy with greater resolution using RAxML.
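
The loss of taxonomic sensitivity attributed above to the degeneracy of the genetic code can be illustrated with a small sketch. It is not taken from Renuc or PALADIN: the two reads, the partial codon table, and the helper names `translate` and `nucleotide_identity` are all hypothetical and chosen only for illustration. It shows two nucleotide sequences that differ at several positions yet translate to the same peptide, so an alignment in protein space alone cannot tell them apart.

```python
# Toy illustration (hypothetical data, not from the paper): two reads that
# differ in nucleotide space but are identical in protein space.

# Partial standard codon table covering only the codons used below.
CODON_TABLE = {
    "ATG": "M",
    "CTG": "L", "TTA": "L",
    "GCG": "A", "GCT": "A",
    "AAA": "K", "AAG": "K",
}


def translate(dna: str) -> str:
    """Translate a DNA string in reading frame 0 using CODON_TABLE."""
    if len(dna) % 3 != 0:
        raise ValueError("sequence length must be a multiple of 3")
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3))


def nucleotide_identity(a: str, b: str) -> float:
    """Fraction of positions at which two equal-length sequences agree."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


# Hypothetical reads from two different organisms encoding the same peptide.
read_a = "ATGCTGGCGAAA"
read_b = "ATGTTAGCTAAG"

peptide_a = translate(read_a)
peptide_b = translate(read_b)
identity = nucleotide_identity(read_a, read_b)

print(peptide_a, peptide_b, round(identity, 3))
```

Because synonymous codons collapse to one amino acid, the nucleotide-level differences that could separate closely related taxa are invisible after translation; recovering the nucleotide sequence behind each protein hit, as the abstract describes, is what makes them available again.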