Genomics and integrative clinical data machine learning scoring model to ascertain likely Lynch syndrome patients.

Ramadhani Chambuso, Takudzwa Nyasha Musarurwa, Alessandro Pietro Aldera, Armin Deffur, Hayli Geffen, Douglas Perkins, Raj Ramesar
BJC reports 3(1):30. Published 2025-05-05. DOI: 10.1038/s44276-025-00140-7 (https://doi.org/10.1038/s44276-025-00140-7). Full text: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12053672/pdf/

Abstract

Background: Lynch syndrome (LS) screening methods include multistep molecular somatic tumor testing to distinguish likely-LS patients from sporadic cases, which can be costly and complex. Direct germline testing for LS in every patient diagnosed with a solid cancer is also a challenge in resource-limited settings. We developed a unique machine learning scoring model to ascertain likely-LS cases from a cohort of colorectal cancer (CRC) patients.

Methods: We used CRC patients from the cBioPortal database (TCGA studies) with complete clinicopathologic and somatic genomics data. Using a pre-designed bioinformatic annotation pipeline, we determined the rate of pathogenic/likely pathogenic variants in five LS genes (MLH1, MSH2, MSH6, PMS2, EPCAM) and of BRAF mutations. ANNOVAR, InterVar, Variant Effect Predictor (VEP), and OncoKB were used to functionally annotate and interpret the detected somatic variants; the OncoKB precision oncology knowledge base provided information on the effects of the identified variants. A machine learning model automatically scored the clinicopathologic and somatic genomics data to discriminate between likely-LS and sporadic CRC cases. The training and testing datasets comprised 80% and 20% of the total CRC patients, respectively. Group regularisation methods combined with 10-fold cross-validation were used for feature selection on the training data.
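
The 80%/20% split and the 10-fold cross-validation described in the Methods can be sketched in plain Python. This is only an illustration of the partitioning step, not the authors' code: the helper names `stratified_split` and `k_folds`, the use of stratification, the random seed, and the class counts in `labels` are all assumptions. Only the total of 524 patients comes from the abstract, and the group-regularised feature selection itself is not shown.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx), keeping the class ratio in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    train, test = [], []
    for idx in by_class.values():
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test.extend(idx[:n_test])
        train.extend(idx[n_test:])
    return sorted(train), sorted(test)


def k_folds(indices, k=10, seed=0):
    """Yield (fit_idx, validation_idx) pairs for k-fold cross-validation."""
    rng = random.Random(seed)
    shuffled = list(indices)
    rng.shuffle(shuffled)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        fit = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield fit, validation


# Hypothetical labels: 1 = likely-LS, 0 = sporadic. The 50/474 class counts
# are invented for illustration; only the total (524) is from the abstract.
labels = [1] * 50 + [0] * 474
train_idx, test_idx = stratified_split(labels)

# Cross-validation is run on the training portion only, so the held-out
# 20% never influences feature selection.
cv_folds = list(k_folds(train_idx, k=10))
```

Each `(fit, validation)` pair in `cv_folds` would be used to fit a candidate model and score it on the held-out fold; the regularisation strength that performs best across the ten folds would then determine which features are kept.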

Results: Of 4800 CRC patients from the TCGA datasets with clinicopathological and somatic genomics data, 524 had complete data. The scoring model using both clinicopathological and genetic characteristics for likely-LS showed a sensitivity and specificity of 100%, and both the training and testing models reached the maximum value of 1 for accuracy, area under the curve (AUC), and area under the precision-recall curve (AUCPR). In a similar analysis, the training and testing models that relied only on clinical or pathological characteristics had sensitivities of 0.88 and 0.50, specificities of 0.55 and 0.51, accuracies of 0.58 and 0.51, AUCs of 0.74 and 0.61, and AUCPRs of 0.21 and 0.19, respectively.
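
For readers less familiar with the reported metrics, the sketch below shows how sensitivity, specificity, accuracy, AUC, and AUCPR are commonly computed from true labels and model scores. It is a generic illustration, not the authors' evaluation code: the function names, the 0.5 decision threshold, and the toy `y_true` and `scores` values are assumptions made for the example, and AUCPR is estimated here with the step-wise average-precision formula, which may differ from the estimator used in the paper.

```python
def confusion_metrics(y_true, y_pred):
    """Sensitivity, specificity and accuracy from binary labels and predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "sensitivity": tp / (tp + fn),   # true-positive rate
        "specificity": tn / (tn + fp),   # true-negative rate
        "accuracy": (tp + tn) / len(y_true),
    }


def roc_auc(y_true, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(y_true, scores):
    """Step-wise estimate of the area under the precision-recall curve.

    Tied scores are ranked in input order, which is an arbitrary choice.
    """
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    tp, total = 0, 0.0
    for rank, (_, t) in enumerate(ranked, start=1):
        if t == 1:
            tp += 1
            total += tp / rank   # precision at each recalled positive
    return total / tp


# Toy data, invented for illustration (1 = likely-LS, 0 = sporadic).
y_true = [1, 0, 1, 0, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.35, 0.6, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]   # assumed 0.5 threshold

metrics = confusion_metrics(y_true, y_pred)
auc = roc_auc(y_true, scores)
aucpr = average_precision(y_true, scores)
```

One point worth keeping in mind when reading the figures above: a random classifier has an expected AUC of 0.5, whereas its expected AUCPR equals the prevalence of the positive class, so AUCPR values are best interpreted relative to how rare likely-LS cases are in the cohort.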

Conclusions: Simultaneous scoring of LS clinicopathological and somatic genomics data can improve the prediction and ascertainment of likely-LS cases among all CRC cases. This approach can increase accuracy while reducing reliance on expensive direct germline testing for all CRC patients, making LS screening more accessible and cost-effective, especially in resource-limited settings.
