Martin Papenberg, Cheng Wang, Maïgane Diop, Syed Hassan Bukhari, Boris Oskotsky, Brittany R Davidson, Kim Chi Vo, Binya Liu, Juan C Irwin, Alexis J Combes, Brice Gaudilliere, Jingjing Li, David K Stevenson, Gunnar W Klau, Linda C Giudice, Marina Sirota, Tomiko T Oskotsky
Cell Reports Methods 5(8):101137. Published 2025-08-18. DOI: 10.1016/j.crmeth.2025.101137 (https://doi.org/10.1016/j.crmeth.2025.101137). Full text: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12461633/pdf/
Anticlustering for sample allocation to minimize batch effects.
High-throughput sequencing enables efficient processing of DNA and RNA samples in batches, but batch effects can obscure true biological signal. We propose using anticlustering as an automated method to assign samples to balanced batches, minimizing covariate imbalance and supporting user-defined constraints such as batch size, number of batches, and "must-link" assignments. In simulations, anticlustering outperforms existing methods in assigning balanced batches. We illustrate its utility using a real-life example from the University of California, San Francisco (UCSF)-Stanford Endometriosis Center for Discovery, Innovation, Training and Community Engagement (ENACT), where multiple samples per individual required processing within the same batch to avoid confounding. The Two-Phase Must-Link (2PML) anticlustering algorithm satisfied the must-link constraints while balancing disease stage, menstrual cycle phase, case vs. control, and clinical site. All methods are accessible via the free, open-source R package anticlust, with a companion RShiny web app for visualization and interactive batch assignment.
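To make the idea of balanced batch assignment under must-link constraints concrete, the following is a minimal Python sketch. It is not the anticlust package's API and not the 2PML algorithm from the paper; it is a simplified exchange heuristic written for illustration. All function names, the sample fields (`id`, `status`, `site`), and the toy data are assumptions. Samples that share an `id` are treated as one must-link unit, and only units of equal size are swapped, so batch sizes stay fixed.

```python
from itertools import combinations


def one_hot(samples, covariates):
    """Encode categorical covariates as 0/1 feature vectors."""
    levels = sorted({(c, s[c]) for s in samples for c in covariates})
    index = {level: i for i, level in enumerate(levels)}
    vectors = []
    for s in samples:
        v = [0] * len(levels)
        for c in covariates:
            v[index[(c, s[c])]] = 1
        vectors.append(v)
    return vectors


def must_link_groups(samples, key):
    """Collect the sample indices that must share a batch (same value of `key`)."""
    groups = {}
    for idx, s in enumerate(samples):
        groups.setdefault(s[key], []).append(idx)
    return groups


def imbalance(vectors, batch_of, k):
    """Sum of squared gaps between per-batch and overall level proportions."""
    n_feat = len(vectors[0])
    overall = [sum(v[j] for v in vectors) / len(vectors) for j in range(n_feat)]
    score = 0.0
    for b in range(k):
        members = [v for v, bb in zip(vectors, batch_of) if bb == b]
        for j in range(n_feat):
            prop = sum(v[j] for v in members) / len(members)
            score += (prop - overall[j]) ** 2
    return score


def greedy_start(groups, n_samples, k):
    """Place whole must-link groups, largest first, into the currently smallest batch."""
    batch_of = [None] * n_samples
    sizes = [0] * k
    for members in sorted(groups.values(), key=len, reverse=True):
        b = min(range(k), key=lambda i: sizes[i])
        for idx in members:
            batch_of[idx] = b
        sizes[b] += len(members)
    return batch_of


def exchange(vectors, groups, batch_of, k):
    """Swap equal-sized must-link groups between batches while imbalance decreases."""
    batch_of = list(batch_of)
    best = imbalance(vectors, batch_of, k)
    improved = True
    while improved:
        improved = False
        for g1, g2 in combinations(list(groups.values()), 2):
            b1, b2 = batch_of[g1[0]], batch_of[g2[0]]
            if b1 == b2 or len(g1) != len(g2):
                continue
            for idx in g1:
                batch_of[idx] = b2
            for idx in g2:
                batch_of[idx] = b1
            score = imbalance(vectors, batch_of, k)
            if score < best - 1e-12:
                best, improved = score, True
            else:
                # Swap did not help: restore the previous assignment.
                for idx in g1:
                    batch_of[idx] = b1
                for idx in g2:
                    batch_of[idx] = b2
    return batch_of


# Hypothetical toy data: 8 individuals with 2 samples each, 2 batches.
samples = [
    {"id": f"P{i}", "status": "case" if i % 2 else "control",
     "site": "UCSF" if i <= 4 else "Stanford"}
    for i in range(1, 9) for _ in range(2)
]
vectors = one_hot(samples, ["status", "site"])
groups = must_link_groups(samples, "id")
start = greedy_start(groups, len(samples), 2)
final = exchange(vectors, groups, start, 2)
print("imbalance before:", imbalance(vectors, start, 2))
print("imbalance after: ", imbalance(vectors, final, 2))
```

The sketch moves each individual's samples as a unit, so the must-link constraint holds by construction rather than being checked afterwards. Restricting swaps to equal-sized units is a simplifying choice made here to keep batch sizes constant; it would miss improvements that require trading, say, one two-sample unit for two single samples.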