Automated real-world data integration improves cancer outcome prediction

IF 50.5 | Zone 1, Multidisciplinary journals | Q1 MULTIDISCIPLINARY SCIENCES
Nature Pub Date : 2024-11-06 DOI:10.1038/s41586-024-08167-5
Justin Jee, Christopher Fong, Karl Pichotta, Thinh Ngoc Tran, Anisha Luthra, Michele Waters, Chenlian Fu, Mirella Altoe, Si-Yang Liu, Steven B. Maron, Mehnaj Ahmed, Susie Kim, Mono Pirun, Walid K. Chatila, Ino de Bruijn, Arfath Pasha, Ritika Kundra, Benjamin Gross, Brooke Mastrogiacomo, Tyler J. Aprati, David Liu, JianJiong Gao, Marzia Capelletti, Kelly Pekala, Lisa Loudon, Maria Perry, Chaitanya Bandlamudi, Mark Donoghue, Baby Anusha Satravada, Axel Martin, Ronglai Shen, Yuan Chen, A. Rose Brannon, Jason Chang, Lior Braunstein, Anyi Li, Anton Safonov, Aaron Stonestrom, Pablo Sanchez-Vela, Clare Wilhelm, Mark Robson, Howard Scher, Marc Ladanyi, Jorge S. Reis-Filho, David B. Solit, David R. Jones, Daniel Gomez, Helena Yu, Debyani Chakravarty, Rona Yaeger, Wassim Abida, Wungki Park, Eileen M. O’Reilly, Julio Garcia-Aguilar, Nicholas Socci, Francisco Sanchez-Vega, Jian Carrot-Zhang, Peter D. Stetson, Ross Levine, Charles M. Rudin, Michael F. Berger, Sohrab P. Shah, Deborah Schrag, Pedram Razavi, Kenneth L. Kehl, Bob T. Li, Gregory J. Riely, Nikolaus Schultz
Citations: 0

Abstract

The digitization of health records and growing availability of tumour DNA sequencing provide an opportunity to study the determinants of cancer outcomes with unprecedented richness. Patient data are often stored in unstructured text and siloed datasets. Here we combine natural language processing annotations [1,2] with structured medication, patient-reported demographic, tumour registry and tumour genomic data from 24,950 patients at Memorial Sloan Kettering Cancer Center to generate a clinicogenomic, harmonized oncologic real-world dataset (MSK-CHORD). MSK-CHORD includes data for non-small-cell lung (n = 7,809), breast (n = 5,368), colorectal (n = 5,543), prostate (n = 3,211) and pancreatic (n = 3,109) cancers and enables discovery of clinicogenomic relationships not apparent in smaller datasets. Leveraging MSK-CHORD to train machine learning models to predict overall survival, we find that models including features derived from natural language processing, such as sites of disease, outperform those based on genomic data or stage alone as tested by cross-validation and an external, multi-institution dataset. By annotating 705,241 radiology reports, MSK-CHORD also uncovers predictors of metastasis to specific organ sites, including a relationship between SETD2 mutation and lower metastatic potential in immunotherapy-treated lung adenocarcinoma corroborated in independent datasets. We demonstrate the feasibility of automated annotation from unstructured notes and its utility in predicting patient outcomes. The resulting data are provided as a public resource for real-world oncologic research.
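
The abstract compares survival models built on different feature sets but does not say which metric was used. As an illustration only, the sketch below assumes Harrell's concordance index (C-index), a common choice for ranking survival predictions, and applies it to two sets of risk scores. The function name, the toy follow-up times and both score lists are hypothetical and are not taken from MSK-CHORD.

```python
from itertools import combinations


def concordance_index(times, events, risks):
    """Harrell's C-index for right-censored survival data.

    times  -- observed follow-up time per patient
    events -- 1 if death was observed, 0 if the patient was censored
    risks  -- predicted risk score (higher means worse expected survival)
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        # Order the pair so that `a` is the patient with the shorter follow-up.
        a, b = (i, j) if times[i] < times[j] else (j, i)
        # A pair is comparable only when the times differ and the shorter
        # follow-up ended in an observed event rather than censoring.
        if times[a] == times[b] or not events[a]:
            continue
        comparable += 1
        if risks[a] > risks[b]:
            concordant += 1.0   # the earlier death was given the higher risk
        elif risks[a] == risks[b]:
            concordant += 0.5   # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy cohort (hypothetical values, for illustration only).
times = [5, 12, 20, 30, 42]           # months of follow-up
events = [1, 1, 0, 1, 0]              # 1 = death observed, 0 = censored
risk_stage_only = [2, 1, 1, 2, 1]     # e.g. a coarse stage-based score
risk_with_nlp = [0.9, 0.4, 0.6, 0.5, 0.1]  # e.g. a score using extra features

c_stage = concordance_index(times, events, risk_stage_only)
c_nlp = concordance_index(times, events, risk_with_nlp)
print(f"stage only: {c_stage:.4f}")
print(f"with NLP-derived features: {c_nlp:.4f}")
```

A C-index of 0.5 corresponds to random ordering and 1.0 to perfect ordering. In a cross-validated comparison of the kind the abstract describes, a score like this would be computed on each held-out fold and then summarized across folds for each feature set.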


Source journal: Nature
Category: Multidisciplinary journals
CiteScore: 90.00
Self-citation rate: 1.20%
Articles published: 3652
Review time: 3 months
Journal description: Nature is a prestigious international journal that publishes peer-reviewed research in various scientific and technological fields. The selection of articles is based on criteria such as originality, importance, interdisciplinary relevance, timeliness, accessibility, elegance, and surprising conclusions. In addition to showcasing significant scientific advances, Nature delivers rapid, authoritative, insightful news and interpretation of current and upcoming trends impacting science, scientists, and the broader public. The journal serves a dual purpose: firstly, to promptly share noteworthy scientific advances and foster discussions among scientists, and secondly, to ensure the swift dissemination of scientific results globally, emphasizing their significance for knowledge, culture, and daily life.