Analysis of Software Read Cross-Contamination in DNBSEQ Data.

IF 3.6 Zone 3 Biology Q1 BIOLOGY

Biology-Basel 14(6) Pub Date: 2025-06-09 DOI: 10.3390/biology14060670

Dmitry N Konanov, Vera Y Tereshchuk, Ignat V Sonets, Elena V Korneenko, Aleksandra V Lukina-Gronskaya, Anna S Speranskaya, Elena N Ilina


Citations: 0

Abstract




DNA nanoball sequencing (DNBSEQ) is one of the most rapidly developing sequencing technologies and is widely applied in genomic and transcriptomic investigations. Recently, a new PE300 sequencing option, primarily recommended for amplicon analysis, was released for DNBSEQ-G99 and G400 devices. Given their unprecedentedly high data yield per flow cell, the new PE300 kits could be a great choice for various sequencing tasks, but we found that combining different types of DNA libraries in a single run could lead to undesired artifacts in the data. In this study, we investigate the occasional read cross-contamination that we first observed in our DNBSEQ PE300 run. The phenomenon, which we refer to as "software contamination", is not actual contamination but primarily manifests as improper forward/reverse read pairing, improper demultiplexing, or "digital chimeric" reads. Although rare, these artifacts were found in all runs we analyzed, including several MGI demo datasets (both PE100 and PE150). We demonstrate that these artifacts arise primarily from the incorrect resolution of sequencing signals produced by neighboring DNA nanoballs, which leads to mixed-up forward and reverse reads or improper demultiplexing. The artifacts occur most frequently in read pairs where the insert is shorter than the read length. Based on a few external NA12878 human exome sequencing datasets, we conclude that the total improper pairing rate in DNBSEQ data is comparable to that in Illumina data. Overall, the problem affects analysis results only when simultaneously sequenced libraries have markedly different insert size distributions or flow cell loading. Additionally, we demonstrate that raw DNBSEQ data might contain ~2% optical duplicates, resulting from the same effect of closely neighboring DNB sites in the flow cell.
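The two quantities mentioned in the abstract, the share of read pairs whose insert is shorter than the read length and the share of optical duplicates from neighboring DNB sites, can be illustrated with a minimal Python sketch. This is not the authors' pipeline. The `Read` record, its `tile`, `x`, and `y` fields, the `max_dist` threshold, and the rule "identical sequence at an adjacent site" are all assumptions made for illustration; a real analysis would take coordinates from the instrument's read names and would tolerate sequencing errors.

```python
from collections import defaultdict
from typing import NamedTuple


class Read(NamedTuple):
    """Hypothetical record: flow-cell position plus read sequence."""
    name: str
    tile: str   # hypothetical identifier of a flow-cell region
    x: int      # hypothetical site coordinates within that region
    y: int
    seq: str


def short_insert_fraction(insert_sizes, read_length):
    """Fraction of pairs whose insert is shorter than the read length."""
    if not insert_sizes:
        return 0.0
    short = sum(1 for size in insert_sizes if size < read_length)
    return short / len(insert_sizes)


def optical_duplicates(reads, max_dist=1):
    """Return names of reads flagged as duplicates of a nearby site.

    Two reads form a candidate pair when they share tile and sequence
    and their sites lie within max_dist of each other on both axes.
    The later read of each such pair, in input order, is flagged.
    """
    groups = defaultdict(list)
    for read in reads:
        groups[(read.tile, read.seq)].append(read)

    flagged = set()
    for group in groups.values():
        for i, a in enumerate(group):
            for b in group[i + 1:]:
                if max(abs(a.x - b.x), abs(a.y - b.y)) <= max_dist:
                    flagged.add(b.name)
    return flagged


if __name__ == "__main__":
    reads = [
        Read("r1", "t1", 10, 10, "ACGT"),
        Read("r2", "t1", 11, 10, "ACGT"),  # adjacent to r1, same sequence
        Read("r3", "t1", 50, 50, "ACGT"),  # same sequence, far away
        Read("r4", "t2", 10, 11, "ACGT"),  # different tile
        Read("r5", "t1", 10, 11, "TTTT"),  # adjacent, different sequence
    ]
    print(sorted(optical_duplicates(reads)))
    print(short_insert_fraction([120, 250, 300, 450], read_length=300))
```

Grouping by tile and sequence first keeps the pairwise distance check confined to small groups, which is a convenience of this sketch and not a claim about how the study counted duplicates.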


Source journal

Biology-Basel Biological Science-Biological Science

CiteScore: 5.70

Self-citation rate: 4.80%

Articles published: 1618

Review time: 11 weeks

About the journal: Biology (ISSN 2079-7737) is an international, peer-reviewed, quick-refereeing open access journal of Biological Science published online by MDPI. It publishes reviews, research papers and communications in all areas of biology and at the interface of related disciplines. Our aim is to encourage scientists to publish their experimental and theoretical results in as much detail as possible. There is no restriction on the length of the papers. The full experimental details must be provided so that the results can be reproduced. Electronic files regarding the full details of the experimental procedure, if unable to be published in a normal way, can be deposited as supplementary material.