On the causes, consequences, and avoidance of PCR duplicates: Towards a theory of library complexity

IF 5.5 1区 生物学 Q1 BIOCHEMISTRY & MOLECULAR BIOLOGY
Nicolas C. Rochette, Angel G. Rivera-Colón, Jessica Walsh, Thomas J. Sanger, Shane C. Campbell-Staton, Julian M. Catchen
{"title":"On the causes, consequences, and avoidance of PCR duplicates: Towards a theory of library complexity","authors":"Nicolas C. Rochette,&nbsp;Angel G. Rivera-Colón,&nbsp;Jessica Walsh,&nbsp;Thomas J. Sanger,&nbsp;Shane C. Campbell-Staton,&nbsp;Julian M. Catchen","doi":"10.1111/1755-0998.13800","DOIUrl":null,"url":null,"abstract":"<p>Library preparation protocols for most sequencing technologies involve PCR amplification of the template DNA, which open the possibility that a given template DNA molecule is sequenced multiple times. Reads arising from this phenomenon, known as PCR duplicates, inflate the cost of sequencing and can jeopardize the reliability of affected experiments. Despite the pervasiveness of this artefact, our understanding of its causes and of its impact on downstream statistical analyses remains essentially empirical. Here, we develop a general quantitative model of amplification distortions in sequencing data sets, which we leverage to investigate the factors controlling the occurrence of PCR duplicates. We show that the PCR duplicate rate is determined primarily by the ratio between library complexity and sequencing depth, and that amplification noise (including in its dependence on the number of PCR cycles) only plays a secondary role for this artefact. We confirm our predictions using new and published RAD-seq libraries and provide a method to estimate library complexity and amplification noise in any data set containing PCR duplicates. We discuss how amplification-related artefacts impact downstream analyses, and in particular genotyping accuracy. The proposed framework unites the numerous observations made on PCR duplicates and will be useful to experimenters of all sequencing technologies where DNA availability is a concern.</p>","PeriodicalId":211,"journal":{"name":"Molecular Ecology Resources","volume":"23 6","pages":"1299-1318"},"PeriodicalIF":5.5000,"publicationDate":"2023-04-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1111/1755-0998.13800","citationCount":"6","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Molecular Ecology Resources","FirstCategoryId":"99","ListUrlMain":"https://onlinelibrary.wiley.com/doi/10.1111/1755-0998.13800","RegionNum":1,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"BIOCHEMISTRY & MOLECULAR BIOLOGY","Score":null,"Total":0}
引用次数: 6

Abstract

Library preparation protocols for most sequencing technologies involve PCR amplification of the template DNA, which open the possibility that a given template DNA molecule is sequenced multiple times. Reads arising from this phenomenon, known as PCR duplicates, inflate the cost of sequencing and can jeopardize the reliability of affected experiments. Despite the pervasiveness of this artefact, our understanding of its causes and of its impact on downstream statistical analyses remains essentially empirical. Here, we develop a general quantitative model of amplification distortions in sequencing data sets, which we leverage to investigate the factors controlling the occurrence of PCR duplicates. We show that the PCR duplicate rate is determined primarily by the ratio between library complexity and sequencing depth, and that amplification noise (including in its dependence on the number of PCR cycles) only plays a secondary role for this artefact. We confirm our predictions using new and published RAD-seq libraries and provide a method to estimate library complexity and amplification noise in any data set containing PCR duplicates. We discuss how amplification-related artefacts impact downstream analyses, and in particular genotyping accuracy. The proposed framework unites the numerous observations made on PCR duplicates and will be useful to experimenters of all sequencing technologies where DNA availability is a concern.

PCR重复的原因、后果与避免:图书馆复杂性理论的探讨
大多数测序技术的文库制备方案都涉及模板DNA的PCR扩增,这使得对给定的模板DNA分子进行多次测序成为可能。由这种现象引起的读取,被称为PCR重复,增加了测序的成本,并可能危及受影响实验的可靠性。尽管这一人工产物无处不在,但我们对其原因及其对下游统计分析的影响的理解基本上仍然是经验主义的。在这里,我们开发了一个通用的定量模型,在测序数据集的扩增扭曲,我们利用它来研究控制PCR重复发生的因素。我们发现,PCR重复率主要由文库复杂性和测序深度之间的比率决定,而扩增噪声(包括其对PCR循环数的依赖)仅在该伪产物中起次要作用。我们使用新的和已发表的RAD-seq文库证实了我们的预测,并提供了一种在任何包含PCR重复的数据集中估计文库复杂性和扩增噪声的方法。我们讨论了扩增相关的伪影如何影响下游分析,特别是基因分型的准确性。提出的框架统一了对PCR重复进行的大量观察,并将对所有测序技术的实验者有用,其中DNA可用性是一个问题。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
Molecular Ecology Resources
Molecular Ecology Resources 生物-进化生物学
CiteScore
15.60
自引率
5.20%
发文量
170
审稿时长
3 months
期刊介绍: Molecular Ecology Resources promotes the creation of comprehensive resources for the scientific community, encompassing computer programs, statistical and molecular advancements, and a diverse array of molecular tools. Serving as a conduit for disseminating these resources, the journal targets a broad audience of researchers in the fields of evolution, ecology, and conservation. Articles in Molecular Ecology Resources are crafted to support investigations tackling significant questions within these disciplines. In addition to original resource articles, Molecular Ecology Resources features Reviews, Opinions, and Comments relevant to the field. The journal also periodically releases Special Issues focusing on resource development within specific areas.
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信