SoMaCX: a complex generative genome modeling framework.

IF 3.7 | Zone 2 (Biology) | JCR Q2 (Biotechnology & Applied Microbiology)
Timothy James Becker
BMC Genomics 26(1): 853. Journal article, published 2025-09-29.
DOI: 10.1186/s12864-025-12023-9 (https://doi.org/10.1186/s12864-025-12023-9)
Full-text PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12482561/pdf/

Abstract

Background: Somatic structural variations (SVs) are commonly observed in cancer tissue but remain challenging to discover with short- and long-read sequencing because of tumor heterogeneity and other technical sequencing factors. Only SVs with a sufficient fraction of reads spanning the event are detectable, and phenomena such as chromothripsis significantly increase the complexity of the data and of its interpretation. Because structural variation is difficult to measure and reproduce in vivo, simulation frameworks are a logical way to determine realistic system limitations. Our generative modeling approach, soMaCX, uses distributions derived from data to drive simulations that approach real data.
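To make the detectability point concrete, the following is a minimal sketch and not part of soMaCX. It assumes a heterozygous SV in a diploid region, a simple binomial read-sampling model, and a hypothetical caller threshold of three supporting reads. The function names and parameters are illustrative only.

```python
from math import comb


def variant_allele_fraction(clone_fraction: float, purity: float = 1.0) -> float:
    """Expected fraction of reads carrying a heterozygous SV in a diploid region.

    clone_fraction: share of tumor cells that carry the SV (assumed).
    purity: share of sampled cells that are tumor cells (assumed).
    """
    return 0.5 * clone_fraction * purity


def detection_probability(depth: int, vaf: float, min_support: int = 3) -> float:
    """P(at least min_support of `depth` reads carry the SV), binomial model."""
    p_below = sum(
        comb(depth, k) * vaf**k * (1.0 - vaf) ** (depth - k)
        for k in range(min_support)
    )
    return 1.0 - p_below


if __name__ == "__main__":
    # A subclone present in 10% of tumor cells at 80% purity, sequenced at 30x.
    vaf = variant_allele_fraction(clone_fraction=0.1, purity=0.8)
    print(f"expected VAF: {vaf:.3f}")
    print(f"P(detect at 30x):  {detection_probability(30, vaf):.3f}")
    print(f"P(detect at 100x): {detection_probability(100, vaf):.3f}")
```

Under these assumptions, a low-frequency subclonal SV can fall below the support threshold at typical depths, which is the kind of limit a simulation framework lets one quantify.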

Results: Our generative framework includes mechanisms for biological conservation in the germline and tissue composition in the somatic genome, along with regional distribution controls and complex SV generation that is not available in other systems. The system outputs FASTA files, which can be used as input to any downstream read simulator to produce Illumina, PacBio, 10x Genomics, Oxford Nanopore, and Bionano FASTQ data files; these are then processed into standard BAM files for SV calling.
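As a rough illustration of what producing an SV-bearing FASTA involves, the sketch below applies three simple SV types to a toy sequence and writes the result in FASTA format. It is not the soMaCX implementation or its API. The helper names, the toy reference, and the 0-based half-open coordinates are all assumptions made for the example.

```python
import io

# Translation table for complementing DNA bases (upper and lower case).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def apply_deletion(seq: str, start: int, end: int) -> str:
    """Remove seq[start:end] (0-based, half-open)."""
    return seq[:start] + seq[end:]


def apply_inversion(seq: str, start: int, end: int) -> str:
    """Replace seq[start:end] with its reverse complement."""
    inverted = seq[start:end].translate(COMPLEMENT)[::-1]
    return seq[:start] + inverted + seq[end:]


def apply_tandem_duplication(seq: str, start: int, end: int) -> str:
    """Insert a second copy of seq[start:end] directly after the first."""
    return seq[:end] + seq[start:end] + seq[end:]


def write_fasta(handle, name: str, seq: str, width: int = 60) -> None:
    """Write one FASTA record, wrapping the sequence at `width` columns."""
    handle.write(f">{name}\n")
    for i in range(0, len(seq), width):
        handle.write(seq[i:i + width] + "\n")


if __name__ == "__main__":
    ref = "ACGTAACCGGTT"  # toy reference, hypothetical
    out = io.StringIO()
    write_fasta(out, "toy_del", apply_deletion(ref, 4, 8))
    write_fasta(out, "toy_inv", apply_inversion(ref, 4, 8))
    write_fasta(out, "toy_dup", apply_tandem_duplication(ref, 4, 8))
    print(out.getvalue(), end="")
```

A FASTA file written this way can be handed to a read simulator of choice; the simulator commands are omitted here because they depend on the tool and platform.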

Conclusions: When evaluated against real data, the soMaCX framework provides superior generative modeling-based performance compared with other simulation frameworks. Our open-source method introduces an important conceptual element to simulation: it uses biologically relevant regions (genes and regulatory elements) as distribution controls, together with the biological modulation of known pathways (end-joining), leading to more detailed and realistic simulated genomes. By designing a generative method to explore the most difficult genomic conditions, we provide a means to measure germline variation calling performance and to calibrate the results for rare variants needed in the clinical setting. A Python 3 implementation is available at: https://github.com/timothyjamesbecker/somacx
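One way to picture region-based distribution controls is as per-class weights on where breakpoints are placed. The sketch below is an assumption-laden illustration, not the authors' method. The Region type, the class names, and the weights are hypothetical, and it assumes each region's sampling weight is its length times a per-base weight for its class.

```python
import random
from typing import NamedTuple


class Region(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # 0-based, exclusive
    kind: str   # e.g. "gene", "regulatory", "intergenic" (hypothetical classes)


def sample_breakpoints(regions, class_weights, n, seed=0):
    """Draw n breakpoints; each region is weighted by length x class weight."""
    rng = random.Random(seed)
    weights = [(r.end - r.start) * class_weights.get(r.kind, 0.0) for r in regions]
    chosen = rng.choices(regions, weights=weights, k=n)
    # Pick a uniform position inside each chosen region.
    return [(r.chrom, rng.randrange(r.start, r.end), r.kind) for r in chosen]


if __name__ == "__main__":
    regions = [
        Region("chr1", 0, 1_000, "gene"),
        Region("chr1", 1_000, 1_200, "regulatory"),
        Region("chr1", 1_200, 5_000, "intergenic"),
    ]
    # Hypothetical germline-like setting: genes fully protected,
    # regulatory elements partly protected, intergenic sequence unconstrained.
    weights = {"gene": 0.0, "regulatory": 0.2, "intergenic": 1.0}
    for chrom, pos, kind in sample_breakpoints(regions, weights, n=5, seed=42):
        print(chrom, pos, kind)
```

Setting a class weight to zero models strict conservation of that class, while raising it above one would model a region class enriched for events.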

Source journal: BMC Genomics
Subject category: Biology, Biotechnology & Applied Microbiology
CiteScore: 7.40
Self-citation rate: 4.50%
Articles published: 769
Review time: 6.4 months
About the journal: BMC Genomics is an open access, peer-reviewed journal that considers articles on all aspects of genome-scale analysis, functional genomics, and proteomics. BMC Genomics is part of the BMC series, which publishes subject-specific journals focused on the needs of individual research communities across all areas of biology and medicine. We offer an efficient, fair and friendly peer review service, and are committed to publishing all sound science, provided that there is some advance in knowledge presented by the work.