Bridging data gaps: Methodological advances in extracting and analyzing genetic information from unstructured clinical records in hereditary cancer

Impact factor 2.3 | Zone 3 (Medicine) | JCR Q3 (Oncology)
Danny Styvens Cardona, Juan Pablo Valencia-Arango, Juan Pablo Gallo, Catalina Andrea Bustamante, Richard Orlando Salazar, Carlos Andrés Carmona, Melisa Naranjo Vanegas, Carolina Jaramillo Jaramillo, Juliana Espinosa Moncada, Harvy Mauricio Velasco, Natalia Gallego Lopera
Cancer Epidemiology, volume 99, Article 102940. Published 2025-10-11.
DOI: 10.1016/j.canep.2025.102940
https://www.sciencedirect.com/science/article/pii/S1877782125002000

Abstract

Introduction

The global impact of cancer, driven by both acquired and hereditary mutations, underscores the need for extensive research. Despite the increasing volume of genetic data, significant gaps remain in data science research, particularly for Latino and admixed populations. This study uses advanced data science techniques to integrate genetic and clinical data, aiming to improve the understanding of hereditary cancer in Colombia and to demonstrate the transformative potential of data-driven approaches in cancer research.

Methods

This observational study analyzed healthcare databases from four regions and 11 cities in Colombia. Genetic data were extracted from PDF reports within the electronic health records of SURA Colombia (a Latin American health insurance provider) for individuals referred for hereditary cancer testing between October 2019 and November 2021. Variants in 30 genes, selected in line with NCCN guidelines, were examined using Next-Generation Sequencing (NGS). Data extraction was automated using Python and R, followed by integration and analysis of genetic, clinical, and sociodemographic data using advanced data science tools hosted on Azure infrastructure. These tools enabled predictive modeling and cross-referencing to explore correlations between genetic variants and clinical outcomes.
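The abstract does not describe the extraction code itself, so the following is only a minimal sketch of how variant rows might be pulled from report text and joined to clinical data. It assumes the PDF has already been converted to plain text, and it assumes a hypothetical one-line layout of gene, cDNA notation, and classification. The names `VARIANT_LINE`, `VariantRecord`, `extract_variants`, and `attach_clinical` are illustrative and do not come from the study.

```python
import re
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical report layout: "<GENE> <c. notation> <classification>" on one line.
# Real laboratory reports vary, so each report template needs its own pattern.
VARIANT_LINE = re.compile(
    r"\b(?P<gene>[A-Z][A-Z0-9]{1,9})[ \t]+"
    r"(?P<hgvs>c\.\S+)[ \t]+"
    r"(?P<classification>(?i:likely pathogenic|pathogenic|uncertain significance))"
)


@dataclass(frozen=True)
class VariantRecord:
    patient_id: str
    gene: str
    hgvs: str
    classification: str


def extract_variants(patient_id: str, report_text: str) -> List[VariantRecord]:
    """Return one record per line of report text that matches the assumed layout."""
    return [
        VariantRecord(
            patient_id=patient_id,
            gene=match["gene"],
            hgvs=match["hgvs"],
            classification=match["classification"].lower(),
        )
        for match in VARIANT_LINE.finditer(report_text)
    ]


def attach_clinical(
    variants: List[VariantRecord], clinical_by_patient: Dict[str, dict]
) -> List[dict]:
    """Join each variant with the clinical fields stored for the same patient."""
    rows = []
    for variant in variants:
        row = {
            "patient_id": variant.patient_id,
            "gene": variant.gene,
            "hgvs": variant.hgvs,
            "classification": variant.classification,
        }
        # Patients without a clinical entry keep only the genetic fields.
        row.update(clinical_by_patient.get(variant.patient_id, {}))
        rows.append(row)
    return rows
```

Calling `extract_variants("P001", text)` on a report's text and passing the result to `attach_clinical` with a dictionary keyed by patient identifier yields flat rows that can be loaded into a table for analysis. Keeping extraction and joining as separate steps lets the pattern be adjusted per report template without touching the integration logic.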

Results

The study included 1377 patients, with a predominance of women (92.81%) and 63% from the northwestern region of Colombia. The largest age group (40.37%) was between 31 and 44 years, and 95.35% had a personal cancer history, primarily breast cancer (75.86%). Hereditary cancer testing revealed 145 positive results and 587 uncertain outcomes. Data science-driven analysis identified higher positivity rates in patients aged 31–44 and over 50, particularly in the northeast and central regions. Among positive results, 42.6% included variants of uncertain significance, with 95.9% of these patients having a personal cancer history.
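The counts reported above can be turned into shares of the cohort with simple arithmetic. The sketch below assumes that each positive or uncertain result corresponds to one of the 1377 patients, which the abstract does not state explicitly; the helper `percent` and the constant names are illustrative.

```python
def percent(part: int, total: int) -> float:
    """Return part/total as a percentage rounded to two decimals."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100 * part / total, 2)


# Counts taken from the Results paragraph.
TOTAL_PATIENTS = 1377
POSITIVE_RESULTS = 145
UNCERTAIN_RESULTS = 587

positive_share = percent(POSITIVE_RESULTS, TOTAL_PATIENTS)    # 10.53
uncertain_share = percent(UNCERTAIN_RESULTS, TOTAL_PATIENTS)  # 42.63
```

Under that one-result-per-patient assumption, about 10.53% of the cohort had a positive result and about 42.63% had an uncertain one.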

Conclusion

This study highlights the significant role of data science in analyzing hereditary cancer data. Advanced computational techniques can aid in genetic variant reclassification, uncover patterns in underrepresented populations, and inform personalized interventions for hereditary cancer management in Latin America.
Source journal: Cancer Epidemiology (Medicine: Oncology)
CiteScore: 4.50
Self-citation rate: 3.80%
Publication volume: 200
Review time: 39 days
About the journal: Cancer Epidemiology is dedicated to increasing understanding about cancer causes, prevention and control. The scope of the journal embraces all aspects of cancer epidemiology, including descriptive epidemiology; studies of risk factors for disease initiation, development and prognosis; screening and early detection; prevention and control; and methodological issues. The journal publishes original research articles (full length and short reports), systematic reviews and meta-analyses, editorials, commentaries and letters to the editor commenting on previously published research.