SoMaCX: a complex generative genome modeling framework.
Timothy James Becker
BMC Genomics 26(1):853, published 2025-09-29. DOI: https://doi.org/10.1186/s12864-025-12023-9
Full text (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12482561/pdf/
Abstract
Background: Somatic structural variations (SVs) are commonly observed in cancer tissue but remain challenging to discover with short- and long-read sequencing because of tumor heterogeneity and other technical sequencing factors. Only SVs with a sufficient fraction of reads spanning the event are detectable, and phenomena such as chromothripsis significantly increase the complexity of the events and of their interpretation. Because structural variation is difficult to measure and reproduce in vivo, simulation frameworks are a logical way to determine realistic system limitations. Our generative modeling approach, soMaCX, uses distributions derived from data to drive simulations that approach real data.
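The detectability constraint described above can be illustrated with a back-of-the-envelope calculation. The sketch below is not taken from soMaCX; it assumes a deliberately simplified model in which the number of reads supporting an SV is Poisson-distributed with mean equal to coverage times the SV allele fraction, and all function and parameter names (`purity`, `clonal_fraction`, `min_reads`) are hypothetical.

```python
import math


def expected_supporting_reads(coverage, purity, clonal_fraction,
                              copies_with_sv=1, ploidy=2):
    """Expected number of reads spanning an SV under a simplified model.

    Assumption: every cell has the same ploidy at the locus, and the SV is
    present on `copies_with_sv` copies in the tumor cells that carry it.
    """
    allele_fraction = purity * clonal_fraction * copies_with_sv / ploidy
    return coverage * allele_fraction


def detection_probability(coverage, purity, clonal_fraction, min_reads=3):
    """P(at least `min_reads` supporting reads), assuming Poisson counts."""
    lam = expected_supporting_reads(coverage, purity, clonal_fraction)
    # Sum the Poisson probabilities of seeing fewer than min_reads reads.
    p_below = sum(math.exp(-lam) * lam ** k / math.factorial(k)
                  for k in range(min_reads))
    return 1.0 - p_below


if __name__ == "__main__":
    # 30x coverage, 50% tumor purity, SV present in 20% of tumor cells.
    print(expected_supporting_reads(30, 0.5, 0.2))
    print(detection_probability(30, 0.5, 0.2))
```

Under these assumptions, a heterozygous SV carried by 20% of the tumor cells in a 50% pure sample at 30x coverage yields only about 1.5 expected supporting reads, so a caller requiring three supporting reads would detect it in roughly one run out of five.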
Results: Our generative framework includes mechanisms for biological conservation in the germline and tissue composition in the somatic genome, along with regional distribution controls and complex SV generation that are not available in other systems. The system outputs FASTA files, which can be used as input to any downstream read simulator to produce Illumina, PacBio, 10x Genomics, Oxford Nanopore, and Bionano FASTQ data files; these are then processed into standard BAM files for SV calling.
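To make the FASTA hand-off concrete, here is a minimal sketch of applying simple SVs to a reference sequence and serializing the result for a read simulator. It is an illustration only and does not reflect the soMaCX API; the helper names `apply_deletion`, `apply_inversion`, and `format_fasta` are hypothetical, and coordinates are assumed to be 0-based, half-open.

```python
def apply_deletion(seq, start, end):
    """Remove the bases in [start, end) from the sequence."""
    return seq[:start] + seq[end:]


def apply_inversion(seq, start, end):
    """Replace the bases in [start, end) with their reverse complement."""
    complement = str.maketrans("ACGTacgt", "TGCAtgca")
    inverted = seq[start:end].translate(complement)[::-1]
    return seq[:start] + inverted + seq[end:]


def format_fasta(name, seq, width=60):
    """Return a FASTA record with the sequence wrapped at `width` columns."""
    lines = [">" + name]
    lines += [seq[i:i + width] for i in range(0, len(seq), width)]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    reference = "AAAACTTTGGGCCC"
    mutated = apply_inversion(apply_deletion(reference, 8, 11), 3, 5)
    # The record could be written to a file and passed to a read simulator.
    print(format_fasta("chr_example_somatic", mutated, width=10), end="")
```

The resulting FASTA text is what a downstream read simulator would consume; read simulation, alignment to BAM, and SV calling are separate steps outside this sketch.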
Conclusions: Compared with other simulation frameworks, the soMaCX framework provides superior generative modeling-based performance with respect to real data. Our open-source method introduces an important conceptual element to simulation: it uses biologically relevant regions (genes and regulatory elements) as distribution controls, together with biological modulation of known pathways (end-joining), which leads to more detailed and realistic simulated genomes. By designing a generative method to explore the most difficult genomic conditions, we provide a means to measure germline variation calling performance and to calibrate results for the rare variants needed in the clinical setting. We provide a Python 3 implementation at: https://github.com/timothyjamesbecker/somacx .
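One way to picture regions acting as distribution controls is weighted sampling of breakpoint positions by region class. The sketch below is an assumption-laden illustration, not the soMaCX implementation: the function `sample_breakpoints`, the region labels, and the per-base weights are all hypothetical.

```python
import random


def sample_breakpoints(regions, weights, n, seed=0):
    """Sample n breakpoint positions, biased by region class.

    regions: list of (label, start, end) with 0-based, half-open coordinates.
    weights: dict mapping label to a relative per-base weight
             (hypothetical values; 0 excludes a region class entirely).
    """
    rng = random.Random(seed)
    # A region's chance of being picked scales with its length and weight.
    region_weights = [weights[label] * (end - start)
                      for label, start, end in regions]
    picks = rng.choices(regions, weights=region_weights, k=n)
    return [(label, rng.randrange(start, end)) for label, start, end in picks]


if __name__ == "__main__":
    regions = [("gene", 0, 1000),
               ("regulatory", 1000, 1200),
               ("intergenic", 1200, 5000)]
    # Down-weight conserved classes to mimic germline conservation.
    weights = {"gene": 0.1, "regulatory": 0.2, "intergenic": 1.0}
    for label, position in sample_breakpoints(regions, weights, n=5):
        print(label, position)
```

Lowering the weight of a class reduces how often SVs land in it, which is one simple way to express conservation of genes and regulatory elements; raising it would instead model hotspots.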
About the journal:
BMC Genomics is an open access, peer-reviewed journal that considers articles on all aspects of genome-scale analysis, functional genomics, and proteomics.
BMC Genomics is part of the BMC series which publishes subject-specific journals focused on the needs of individual research communities across all areas of biology and medicine. We offer an efficient, fair and friendly peer review service, and are committed to publishing all sound science, provided that there is some advance in knowledge presented by the work.