A Novel Approach for Increasing Taxonomic Resolution in Protein-Based Alignments

Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics. Pub Date: 2018-08-15. DOI: 10.1145/3233547.3233646

Cooper J. Park, Keir J. Macartney, Junfu Shen, Kunpeng Xie, Xin Zhang, R. Bergeron, W. Thomas, Cheryl P. Andam, A. Westbrook


Abstract

Most of today's genome sequencing technology requires that genomes be sequenced in fragments. Typically, these fragments are then aligned using one of a variety of alignment programs. All alignment tools query a reference database to determine the most accurate reassembly of the original DNA strand's nucleotide sequence. Although these programs can align in either nucleotide or protein space, each method comes with its own disadvantages. Protein aligners such as PALADIN consistently align a greater percentage of reads, do so faster, and provide greater insight into the functional capabilities of the aligned sequence. On the other hand, this method reduces the sensitivity of taxonomic classification because of the degeneracy of the genetic code. Our program, Renuc, is a PALADIN plugin that addresses this issue by taking protein alignment results obtained with the UniProt database and identifying the most likely taxonomic origin for each nucleotide sequence associated with each detected protein. We have validated our approach and its implementation in Renuc by successfully retrieving the nucleotide sequences and corresponding taxonomic IDs for all of the aligned proteins in our test dataset, which consists of a whole Escherichia coli genome. Our program aligns over 99 percent of the nucleotide reads, with 97 percent of them remaining in the same protein cluster as the original protein alignment. However, this dataset is exceptionally well studied and documented in UniProt. Future work should evaluate a dataset with fewer annotations in the database. Renuc quickly identifies and visualizes the alignment's taxonomic data in a user-friendly way. The integration of SQLite into the program significantly reduces the time required to retrieve information from the UniProt database. Currently, we seek to improve the retrieval of nucleotide sequences by creating a local cache of the NCBI RefSeq database, and to visualize taxonomy with greater resolution using RAxML.
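The loss of taxonomic sensitivity in protein space follows from codon degeneracy: several codons encode the same amino acid, so reads that differ at the nucleotide level can translate to identical peptides. The short Python sketch below illustrates the effect with two made-up reads. The sequences, the partial codon table, and the function names are illustrative assumptions and are not taken from Renuc or PALADIN.

```python
# Illustrative sketch only: a small subset of the standard genetic code,
# just large enough for the two hypothetical reads below.
CODON_TABLE = {
    "ATG": "M",              # methionine
    "CTG": "L", "TTA": "L",  # two of the six leucine codons
    "AGC": "S", "TCT": "S",  # two of the six serine codons
    "AAA": "K", "AAG": "K",  # both lysine codons
}


def translate(dna: str) -> str:
    """Translate a DNA string in frame 0, ignoring any trailing partial codon."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def hamming(a: str, b: str) -> int:
    """Count mismatched positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


# Two hypothetical reads, e.g. from different taxa, encoding the same peptide.
read_a = "ATGCTGAGCAAA"
read_b = "ATGTTATCTAAG"

print(translate(read_a), translate(read_b))  # identical peptides
print(hamming(read_a, read_b))               # nucleotide-level mismatches
```

In this toy case the two reads differ at half of their nucleotide positions yet are indistinguishable after translation, which is the kind of signal a nucleotide-level step such as the one the abstract describes can recover.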
