Enhancing clinical outcome predictions through effective sample size evaluation in graph-based digital twin modeling.

IF 6.1 · Zone 3 (Biology) · Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY

Biodata Mining · Pub Date: 2025-04-15 · DOI: 10.1186/s13040-025-00446-9

Xi Li, Jui-Hsuan Chang, Mythreye Venkatesan, Zhiping Paul Wang, Jason H Moore


Citations: 0

Abstract

Digital twins in healthcare offer an innovative approach to precision diagnosis, prognosis, and treatment. SynTwin, a novel computational methodology to generate digital twins using synthetic data and network science, has previously shown promise for improving prediction of breast cancer mortality. In this study, we validate SynTwin using population-level data for different cancer types from the Surveillance, Epidemiology, and End Results (SEER) program of the National Cancer Institute (USA). We assess its predictive accuracy across cancer types with varying sample sizes (n = 1,000 to 30,000 records), mortality rates (35% to 60%), and study designs, revealing insights into the strengths and limitations of digital twins derived from synthetic data in mortality prediction. We also evaluate the effect of sample size (n = 1,000 to 70,000 records) on predictive accuracy for selected cancers (non-Hodgkin lymphoma, bladder, and colorectal cancers). Our results indicate that, for larger datasets (n > 10,000), including digital twins in the nearest network neighbor prediction model significantly improves performance compared to using real patients alone. Specifically, AUROCs ranged from 0.828 to 0.884 with digital twins for cancers such as cervix uteri and ovarian cancer, compared to 0.720 to 0.858 when using real patient data alone. Similarly, among the three selected cancers, AUROCs using digital twins exceeded AUROCs using real patients alone by at least 0.06, with the variance in performance narrowing as the sample size increased. These results highlight the benefit of network-based digital twins, while emphasizing the importance of considering effective sample size when developing predictive models like SynTwin.
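The kind of evaluation the abstract describes (scoring a nearest-neighbor mortality predictor by AUROC at several sample sizes) can be illustrated with a minimal sketch. This is not SynTwin: it generates no digital twins, uses no SEER data, and replaces the network-based neighborhood with plain Euclidean distance on two made-up features. All names (`auroc`, `neighbor_score`, `make_toy_cohort`, `auroc_by_sample_size`) and all parameters are hypothetical and chosen only for illustration.

```python
import math
import random


def auroc(labels, scores):
    """AUROC as the probability that a random positive case is scored
    higher than a random negative case (ties count as 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def neighbor_score(query, cohort, k):
    """Fraction of the k nearest cohort records with outcome 1.
    Assumption: Euclidean distance stands in for network proximity."""
    nearest = sorted(cohort, key=lambda rec: math.dist(query, rec[0]))[:k]
    return sum(y for _, y in nearest) / len(nearest)


def make_toy_cohort(n, rng):
    """Hypothetical two-feature records; risk rises with the first feature."""
    cohort = []
    for _ in range(n):
        x = (rng.gauss(0, 1), rng.gauss(0, 1))
        p = 1 / (1 + math.exp(-1.5 * x[0]))
        cohort.append((x, 1 if rng.random() < p else 0))
    return cohort


def auroc_by_sample_size(sizes, n_test=200, k=15, seed=0):
    """Score one fixed toy test set against training cohorts of each size."""
    rng = random.Random(seed)
    test = make_toy_cohort(n_test, rng)
    results = {}
    for n in sizes:
        train = make_toy_cohort(n, rng)
        scores = [neighbor_score(x, train, k) for x, _ in test]
        results[n] = auroc([y for _, y in test], scores)
    return results


if __name__ == "__main__":
    for n, auc in auroc_by_sample_size([50, 200, 1000]).items():
        print(f"n={n:5d}  AUROC={auc:.3f}")
```

Repeating the run over several seeds for each size would give a rough picture of how the spread of AUROC values changes with sample size, which is the quantity the abstract reports as narrowing. The toy numbers themselves carry no clinical meaning.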


Source journal

Biodata Mining · MATHEMATICAL & COMPUTATIONAL BIOLOGY

CiteScore: 7.90

Self-citation rate: 0.00%

Articles published:

Review time: 23 weeks

About the journal: BioData Mining is an open access, open peer-reviewed journal encompassing research on all aspects of data mining applied to high-dimensional biological and biomedical data, focusing on computational aspects of knowledge discovery from large-scale genetic, transcriptomic, genomic, proteomic, and metabolomic data. Topical areas include, but are not limited to:
- Development, evaluation, and application of novel data mining and machine learning algorithms.
- Adaptation, evaluation, and application of traditional data mining and machine learning algorithms.
- Open-source software for the application of data mining and machine learning algorithms.
- Design, development, and integration of databases, software, and web services for the storage, management, retrieval, and analysis of data from large-scale studies.
- Pre-processing, post-processing, modeling, and interpretation of data mining and machine learning results for biological interpretation and knowledge discovery.