Next generation sequence assembler mis-assembly of phage genomes with terminal redundancy

Julia D. Warnke-Sommer, I. Thapa, H. Ali
Published in: 2015 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), 2015-11-09
DOI: 10.1109/BIBM.2015.7359836 (https://doi.org/10.1109/BIBM.2015.7359836)
Citations: 1

Abstract

Next generation sequencing (NGS) has become the platform for numerous biomedical applications. The study of viral genomes using NGS technologies has led to the characterization of viral species in numerous environments, including the human gut microbiome and plant hosts. Many viral genomes are circular or have terminally redundant ends. Circular or linear viral genomes with indeterminate starting and ending points pose a challenge for NGS assemblers, which may erroneously duplicate sections of these genomes. The length of an assembly, often characterized by the N50 length, is frequently used as an indication of an assembly's completeness and even its quality. In this paper, we show that the longest contig produced by various assemblers is not always the best assembly for circular or terminally redundant phage genomes and may represent erroneously repeated genomic regions. Results demonstrate that assembly tools may even produce assembled genomes of different lengths for the same species, depending on which content is inaccurately repeated, leading to results that a researcher might find confusing or use incorrectly. To overcome this problem, we introduce strategies for using coverage depth to identify inaccurately repeated content in circular or terminally redundant phage genomes. We conclude the paper by providing the results of assembling two bacteriophage genomes and a bacteriophage metagenomics dataset, highlighting the impact of using the proposed strategies.
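The abstract does not spell out the coverage-depth strategies, so the following Python sketch is only an illustration of the general idea, not the authors' method. It assumes that per-base depth for a contig is already available as a list of integers (for example, exported from a read-mapping tool), and that an erroneously duplicated region shows roughly half the typical depth because reads from one true copy are split across two assembled copies. The function names, the 0.6 ratio, and the minimum run length are hypothetical choices. An N50 helper is included because the abstract names N50 as the usual length summary.

```python
from statistics import median
from typing import List, Tuple


def n50(contig_lengths: List[int]) -> int:
    """Return the N50: the contig length L such that contigs of
    length >= L together cover at least half of all assembled bases."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty input


def low_coverage_runs(
    depth: List[int], ratio: float = 0.6, min_len: int = 3
) -> List[Tuple[int, int]]:
    """Return half-open (start, end) runs where depth falls below
    ratio * median depth. Under the assumption stated above, such runs
    are candidates for erroneously duplicated content.
    ratio and min_len are illustrative defaults, not published values."""
    if not depth:
        return []
    threshold = ratio * median(depth)
    runs: List[Tuple[int, int]] = []
    start = None
    for i, d in enumerate(depth):
        if d < threshold:
            if start is None:
                start = i  # open a new low-coverage run
        elif start is not None:
            if i - start >= min_len:
                runs.append((start, i))
            start = None
    # Close a run that extends to the end of the contig.
    if start is not None and len(depth) - start >= min_len:
        runs.append((start, len(depth)))
    return runs


if __name__ == "__main__":
    # Toy contig: both ends at half depth, as if the same terminal
    # region had been assembled twice.
    toy_depth = [50] * 5 + [100] * 20 + [50] * 5
    print(n50([80, 70, 50]))
    print(low_coverage_runs(toy_depth))
```

Flagged runs at both ends of a contig would only be candidates for trimming; a low-depth region can also arise from sequencing bias, so the flagged sequence should be compared against the other end of the contig before anything is removed.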