Analysis of Software Read Cross-Contamination in DNBSEQ Data

IF 3.6 · Zone 3, Biology · JCR Q1, BIOLOGY
Dmitry N Konanov, Vera Y Tereshchuk, Ignat V Sonets, Elena V Korneenko, Aleksandra V Lukina-Gronskaya, Anna S Speranskaya, Elena N Ilina
Journal: Biology-Basel, vol. 14, no. 6
Published: 2025-06-09 (Journal Article)
DOI: https://doi.org/10.3390/biology14060670
Full-text PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12189395/pdf/
Citations: 0

Abstract

DNA nanoball sequencing (DNBSEQ) is one of the most rapidly developing sequencing technologies and is widely applied in genomic and transcriptomic investigations. Recently, a new PE300 sequencing option, primarily recommended for amplicon analysis, was released for DNBSEQ-G99 and G400 devices. Given their unprecedentedly high data yield per flow cell, the new PE300 kits could be a great choice for various sequencing tasks, but we found that combining different types of DNA libraries in a single run can lead to undesired artifacts in the data. In this study, we investigate the occasional read cross-contamination that we first observed in our DNBSEQ PE300 run. The phenomenon, which we refer to as "software contamination", is not actual contamination; it manifests primarily as improper forward/reverse read pairing, improper demultiplexing, or "digital chimeric" reads. Although rare, these artifacts were found in all runs we analyzed, including several MGI demo datasets (both PE100 and PE150). We demonstrate that they arise primarily from the incorrect resolution of sequencing signals produced by neighboring DNA nanoballs, which leads to forward and reverse reads being mixed up or to improper demultiplexing. The artifacts occur most frequently in read pairs whose insert sequence is shorter than the read length. Based on a few external NA12878 human exome sequencing datasets, we conclude that the total improper pairing rate in DNBSEQ data is comparable to that in Illumina data. Overall, the problem affects analysis results only when simultaneously sequenced libraries have markedly different insert size distributions or flow cell loading. Additionally, we demonstrate that raw DNBSEQ data might contain ~2% optical duplicates, which result from the same effect of DNB sites lying close together in the flow cell.
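
As a rough illustration of the two quantities the abstract mentions (the improper pairing rate, and pairs whose insert is shorter than the read length), the sketch below works on SAM flag and TLEN values that are assumed to have been extracted from an alignment file already. The function names and the filtering rules are hypothetical choices made for this example; the abstract does not state how the authors defined or computed these values, so their procedure may well differ.

```python
from typing import Iterable

# SAM flag bits, as defined in the SAM format specification.
PAIRED = 0x1
PROPER_PAIR = 0x2
UNMAPPED = 0x4
MATE_UNMAPPED = 0x8
SECONDARY = 0x100
SUPPLEMENTARY = 0x800


def improper_pair_rate(flags: Iterable[int]) -> float:
    """Fraction of primary, paired, fully mapped records not flagged as a proper pair.

    Assumption for this sketch: unpaired, unmapped, secondary and
    supplementary records are excluded from the denominator.
    """
    considered = 0
    improper = 0
    for flag in flags:
        if not flag & PAIRED:
            continue
        if flag & (UNMAPPED | MATE_UNMAPPED | SECONDARY | SUPPLEMENTARY):
            continue
        considered += 1
        if not flag & PROPER_PAIR:
            improper += 1
    return improper / considered if considered else 0.0


def insert_shorter_than_read(tlen: int, read_len: int) -> bool:
    """True when the observed template length (SAM TLEN) is nonzero and below the read length."""
    return tlen != 0 and abs(tlen) < read_len
```

For a PE300 run, `insert_shorter_than_read(250, 300)` would flag a pair whose 250 bp insert is read through completely by both mates, which is the class of read pairs the abstract describes as most affected.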

Source journal
Biology-Basel (Biological Science)
CiteScore: 5.70
Self-citation rate: 4.80%
Articles published: 1618
Review time: 11 weeks
About the journal: Biology (ISSN 2079-7737) is an international, peer-reviewed, quick-refereeing open access journal of biological science published online by MDPI. It publishes reviews, research papers and communications in all areas of biology and at the interface of related disciplines. The journal aims to encourage scientists to publish their experimental and theoretical results in as much detail as possible. There is no restriction on the length of papers. Full experimental details must be provided so that the results can be reproduced. Electronic files with the full details of the experimental procedure, if they cannot be published in the normal way, can be deposited as supplementary material.