A Probabilistic Analysis of Shotgun Sequencing for Metagenomics

Marlee Herring
Journal: SIAM Undergraduate Research Online
Published: 2022-01-13
DOI: 10.1137/22s1472437 (https://doi.org/10.1137/22s1472437)

Abstract

Genome sequencing is the basis for many modern biological and medical studies. With recent technological advances, metagenomics has become a problem of interest. This problem entails the analysis and reconstruction of multiple DNA sequences from different sources. Shotgun genome sequencing works by breaking up long DNA sequences into shorter segments called reads. Given this collection of reads, one would like to reconstruct the original collection of DNA sequences. For experimental design in metagenomics, it is important to understand how the minimal read length necessary for reliable reconstruction depends on the number and characteristics of the genomes involved. Using simple probabilistic models for each DNA sequence, we analyze the identifiability of collections of M genomes of length N in an asymptotic regime in which N tends to infinity and M may grow with N. Our first main result provides a threshold in terms of M and N such that, if the read length exceeds the threshold, a simple greedy algorithm reconstructs the full collection of genomes with probability tending to one. Our second main result establishes a lower threshold in terms of M and N such that, if the read length is shorter than the threshold, reconstruction of the full collection of genomes is impossible with probability tending to one.
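
To make the reconstruction task concrete, the following is a minimal sketch of one common greedy strategy: repeatedly merge the two fragments with the longest suffix-prefix overlap. It is not taken from the paper, whose exact algorithm and read model are not given in the abstract. The function names (`all_reads`, `overlap`, `greedy_assemble`), the `min_overlap` cutoff, the linear genomes, and the assumption that every length-L substring is observed as a read are all illustrative choices.

```python
def all_reads(genome: str, read_len: int) -> list[str]:
    """Every substring of length read_len (assumed 'dense' read model)."""
    return [genome[i:i + read_len] for i in range(len(genome) - read_len + 1)]


def overlap(a: str, b: str, min_overlap: int) -> int:
    """Longest k >= min_overlap such that the last k letters of a equal
    the first k letters of b; 0 if there is no such k."""
    for k in range(min(len(a), len(b)), min_overlap - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads: list[str], min_overlap: int) -> list[str]:
    """Merge the best-overlapping pair until no pair overlaps by at least
    min_overlap. Returns the remaining contigs (ideally one per genome)."""
    contigs = list(dict.fromkeys(reads))  # drop duplicate reads, keep order
    while True:
        best_k, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_overlap)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            return contigs
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)] + [merged]


# Two toy "genomes" (hypothetical data) whose 4-letter substrings are all distinct.
genomes = ["ACGTTGCAAGGC", "TTAGCCGATACC"]
reads = [r for g in genomes for r in all_reads(g, 5)]
contigs = greedy_assemble(reads, min_overlap=4)
```

In this toy case every 4-letter substring occurs only once across both genomes, so each overlap of length 4 identifies a unique neighboring read and the merging recovers both sequences. With shorter reads, repeated substrings make the overlaps ambiguous, which is the kind of failure the read-length thresholds in the abstract are about.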