Synthetic Data Improve Survival Status Prediction Models in Early-Onset Colorectal Cancer.

Impact factor 3.3, JCR Q2 (Oncology)
Hyunwook Kim, Won Seok Jang, Woo Seob Sim, Han Sang Kim, Jeong Eun Choi, Eun Sil Baek, Yu Rang Park, Sang Joon Shin
Journal: JCO Clinical Cancer Informatics
Publication date: 2024-01-01
Publication type: Journal Article
DOI: 10.1200/CCI.23.00201 (https://doi.org/10.1200/CCI.23.00201)
Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10830088/pdf/
Citations: 0

Abstract

Purpose: In artificial intelligence-based modeling, working with a limited number of patient groups is challenging. This retrospective study aimed to evaluate whether applying synthetic data generation methods to the clinical data of small patient groups can enhance the performance of prediction models.

Materials and methods: A data set collected by the Cancer Registry Library Project from the Yonsei Cancer Center (YCC), Severance Hospital, between January 2008 and October 2020 was reviewed. Patients with colorectal cancer younger than 50 years who started their initial treatment at YCC were included. A Bayesian network-based synthesizing model was used to generate a synthetic data set, combined with the differential privacy (DP) method.
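
The abstract does not give the details of the synthesizing model, so the following is only a minimal sketch of the general idea of pairing a generative step with differential privacy. It assumes a single categorical feature and the Laplace mechanism on a count histogram, which is a much simpler stand-in for a Bayesian network over 93 features. The function names, the toy stage labels, and their proportions are hypothetical and are not taken from the study.

```python
import numpy as np

def dp_marginal(values, categories, epsilon, rng):
    """Return a Laplace-noised marginal distribution for one categorical feature.

    Assumption: neighboring data sets differ by adding or removing one record,
    so the L1 sensitivity of the count histogram is 1 and the noise scale is 1/epsilon.
    """
    counts = np.array([np.sum(values == c) for c in categories], dtype=float)
    noisy = counts + rng.laplace(loc=0.0, scale=1.0 / epsilon, size=counts.size)
    noisy = np.clip(noisy, 0.0, None)  # negative counts are not meaningful
    if noisy.sum() == 0:
        # Degenerate case: fall back to a uniform distribution
        return np.full(counts.size, 1.0 / counts.size)
    return noisy / noisy.sum()

def sample_synthetic(probs, categories, n, rng):
    """Draw n synthetic values from the noised marginal."""
    return rng.choice(categories, size=n, p=probs)

rng = np.random.default_rng(0)
categories = np.array(["I", "II", "III", "IV"])  # toy stage labels
# Toy "original" column with made-up proportions, sized like the study cohort
original = rng.choice(categories, size=1253, p=[0.2, 0.3, 0.3, 0.2])

probs = dp_marginal(original, categories, epsilon=1.0, rng=rng)
synthetic = sample_synthetic(probs, categories, n=5005, rng=rng)
```

A smaller epsilon adds more noise to the counts, which strengthens the privacy guarantee at the cost of fidelity to the original distribution.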

Results: A synthetic population of 5,005 was generated from a data set of 1,253 patients with 93 clinical features. The Hellinger distance and correlation difference metric were below 0.3 and 0.5, respectively, indicating no statistical difference. The overall survival by disease stage did not differ between the synthetic and original populations. Training with the synthetic data and validating with the original data showed the highest performances of 0.850, 0.836, and 0.790 for the Decision Tree, Random Forest, and XGBoost models, respectively. Compared with the original data sets, synthetic data sets generated with different epsilon parameters showed performance improvements of more than 0.1%. For extremely small data sets, models using synthetic data outperformed those using only original data sets. The reidentification risk measures demonstrated that epsilon values between 0.1 and 100 fell below the baseline, indicating a preserved privacy state.
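
For readers unfamiliar with the Hellinger distance used above, the following is a minimal sketch of how it can be computed for one categorical feature. It assumes the feature is summarized as category counts in the original and synthetic data; the function name and the example counts are illustrative and do not come from the study.

```python
import numpy as np

def hellinger(p, q):
    """Hellinger distance between two discrete distributions.

    Inputs may be raw counts; they are normalized to probabilities first.
    The result lies between 0 (identical) and 1 (no overlap).
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)))

# Illustrative counts per category in the original and synthetic data
original_counts = [250, 380, 370, 253]
synthetic_counts = [1010, 1490, 1500, 1005]

distance = hellinger(original_counts, synthetic_counts)
# The abstract treats a per-feature distance below 0.3 as acceptable
is_similar = distance < 0.3
```

Because the inputs are normalized, the two data sets can differ in size, as the 1,253-patient original and the 5,005-record synthetic population do.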

Conclusion: The synthetic data generation approach enhances predictive modeling performance by maintaining statistical and clinical integrity, and simultaneously reduces privacy risks through the application of DP techniques.

Source journal metrics: CiteScore 6.20; self-citation rate 4.80%; articles published 190.