Beyond Blacklists: A Critical Assessment of Exclusion Set Generation Strategies and Alternative Approaches

bioRxiv: the preprint server for biology. Pub Date: 2025-09-30. DOI: 10.1101/2025.02.06.636968

Brydon P G Wall, Jonathan D Ogata, My Nguyen, Amy L Olex, Konstantinos V Floros, Anthony C Faber, Joseph L McClay, Chuck Harrell, Mikhail G Dozmorov



Abstract



Short-read sequencing data can be affected by alignment artifacts in certain genomic regions. Removing reads that overlap these exclusion regions, previously known as blacklists, can improve biological signal. Tools such as the widely used Blacklist software facilitate this process, but their algorithmic details and parameter choices are not always clearly documented, which affects reproducibility and biological relevance. We examined the Blacklist software and found that pre-generated exclusion sets were difficult to reproduce because of variability in input data, aligner choice, and read length. We also identified and addressed a coding issue that led to over-annotation of high-signal regions. We further explored the use of "sponge" sequences (unassembled genomic regions such as satellite DNA, ribosomal DNA, and mitochondrial DNA) as an alternative approach. Aligning reads to a genome that includes sponge sequences reduced signal correlation in ChIP-seq data comparably to Blacklist-derived exclusion sets while preserving biological signal. Sponge-based alignment also had minimal impact on RNA-seq gene counts, suggesting broader applicability beyond chromatin profiling. These results highlight the limitations of fixed exclusion sets and suggest that sponge sequences offer a flexible, alignment-guided strategy for reducing artifacts and improving functional genomics analyses.
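The filtering step the abstract refers to, removing reads that overlap exclusion regions, can be illustrated with a minimal sketch. This is not the Blacklist software's algorithm or the authors' pipeline. It assumes reads and regions are given as BED-style half-open `(chrom, start, end)` tuples, and the function names `build_index`, `overlaps_exclusion`, and `filter_reads` are hypothetical names chosen for illustration.

```python
from bisect import bisect_right
from collections import defaultdict


def build_index(exclusion_regions):
    """Group half-open (chrom, start, end) regions by chromosome, then sort and merge them."""
    by_chrom = defaultdict(list)
    for chrom, start, end in exclusion_regions:
        by_chrom[chrom].append((start, end))

    index = {}
    for chrom, regions in by_chrom.items():
        regions.sort()
        merged = [regions[0]]
        for start, end in regions[1:]:
            last_start, last_end = merged[-1]
            if start <= last_end:
                # Overlapping or directly adjacent regions collapse into one.
                merged[-1] = (last_start, max(last_end, end))
            else:
                merged.append((start, end))
        # Keep the start coordinates separately so they can be binary-searched.
        index[chrom] = ([s for s, _ in merged], merged)
    return index


def overlaps_exclusion(index, chrom, start, end):
    """Return True if the half-open interval [start, end) overlaps any exclusion region."""
    if chrom not in index:
        return False
    starts, merged = index[chrom]
    # Rightmost merged region that begins before the read ends.
    i = bisect_right(starts, end - 1) - 1
    # Merged regions are disjoint, so only this one can reach past the read start.
    return i >= 0 and merged[i][1] > start


def filter_reads(reads, exclusion_regions):
    """Keep only reads that do not overlap any exclusion region."""
    index = build_index(exclusion_regions)
    return [r for r in reads if not overlaps_exclusion(index, r[0], r[1], r[2])]


# Toy coordinates, invented for illustration only.
exclusion = [("chr1", 100, 200), ("chr1", 150, 300), ("chr2", 0, 50)]
reads = [
    ("chr1", 50, 100),   # ends exactly where the region starts: kept
    ("chr1", 90, 101),   # overlaps by one base: removed
    ("chr1", 299, 350),  # overlaps the last base of the merged region: removed
    ("chr1", 300, 350),  # starts exactly where the region ends: kept
    ("chr2", 49, 99),    # overlaps: removed
    ("chr3", 0, 36),     # chromosome has no exclusion regions: kept
]
kept = filter_reads(reads, exclusion)
```

Merging the regions first keeps each lookup to a single binary search per read. In practice this kind of filtering is usually done on alignment files with dedicated interval tools; the sketch only shows the overlap logic.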
