Danny Styvens Cardona, Juan Pablo Valencia-Arango, Juan Pablo Gallo, Catalina Andrea Bustamante, Richard Orlando Salazar, Carlos Andrés Carmona, Melisa Naranjo Vanegas, Carolina Jaramillo Jaramillo, Juliana Espinosa Moncada, Harvy Mauricio Velasco, Natalia Gallego Lopera
Cancer Epidemiology, volume 99, Article 102940. Published 2025-10-11. DOI: 10.1016/j.canep.2025.102940
https://www.sciencedirect.com/science/article/pii/S1877782125002000
Bridging data gaps: Methodological advances in extracting and analyzing genetic information from unstructured clinical records in hereditary cancer
Introduction
The global impact of cancer, driven by both acquired and hereditary mutations, underscores the need for extensive research. Despite the increasing volume of genetic data, significant gaps remain in data science research, particularly for Latino and admixed populations. This study uses advanced data science techniques to integrate genetic and clinical data, aiming to improve the understanding of hereditary cancer in Colombia and to demonstrate the transformative potential of data-driven approaches in cancer research.
Methods
This observational study analyzed healthcare databases from four regions and 11 cities in Colombia. Genetic data were extracted from PDF reports within SURA Colombia's Electronic Health Records (a Latin American health insurance provider) for individuals referred for hereditary cancer testing between October 2019 and November 2021. Variants in 30 genes, aligned with NCCN guidelines, were examined using Next-Generation Sequencing (NGS). Data extraction was automated using Python and R, followed by integration and analysis of genetic, clinical, and sociodemographic data using advanced data science tools hosted on Azure infrastructure. These tools enabled predictive modeling and cross-referencing to explore correlations between genetic variants and clinical outcomes.
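The abstract does not describe how the automated extraction was implemented. As a rough illustration only, the sketch below pulls gene, variant, and classification fields out of report text that has already been converted from PDF to plain text. The report layout, the four-gene subset, the classification labels, and the names `Variant`, `VARIANT_RE`, `SAMPLE_REPORT`, and `extract_variants` are all assumptions made for this example, not the authors' pipeline.

```python
import re
from dataclasses import dataclass

# Hypothetical subset of genes; the study's actual 30-gene panel is not listed in the abstract.
GENES = ("BRCA1", "BRCA2", "TP53", "PALB2")

# Assumed single-line layout: gene name, then a coding-DNA variant, then a classification.
VARIANT_RE = re.compile(
    r"\b(?P<gene>" + "|".join(GENES) + r")\b"
    r".*?(?P<hgvs>c\.[0-9_+\-*]+(?:[ACGT]>[ACGT]|del|dup|ins)[ACGT]*)"
    r".*?(?P<cls>likely pathogenic|pathogenic|uncertain significance)",
    re.IGNORECASE,
)


@dataclass
class Variant:
    gene: str
    hgvs: str
    classification: str


# Invented report text, used only to exercise the pattern.
SAMPLE_REPORT = (
    "Gene: BRCA2  Variant: c.5946delT  Classification: Pathogenic\n"
    "Gene: TP53  Variant: c.743G>A  Classification: Uncertain significance\n"
    "No further findings."
)


def extract_variants(text: str) -> list[Variant]:
    """Return one Variant per line that contains a gene, a variant, and a classification."""
    return [
        Variant(m["gene"].upper(), m["hgvs"], m["cls"].lower())
        for m in VARIANT_RE.finditer(text)
    ]


if __name__ == "__main__":
    for variant in extract_variants(SAMPLE_REPORT):
        print(variant)
```

Because `.` does not match newlines, each match stays within a single line of the report. Real laboratory reports vary in layout, so a production extractor would likely need per-laboratory patterns and manual validation of a sample of records.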
Results
The study included 1377 patients, with a predominance of women (92.81 %) and 63 % from the northwestern region of Colombia. The largest age group (40.37 %) was between 31 and 44 years, and 95.35 % had a personal cancer history, primarily breast cancer (75.86 %). Hereditary cancer testing revealed 145 positive results and 587 uncertain outcomes. Data science-driven analysis identified higher positivity rates in patients aged 31–44 and over 50, particularly in the northeast and central regions. Among positive results, 42.6 % included variants of uncertain significance, with 95.9 % of these patients having a personal cancer history.
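The reported counts can be turned into cohort-level shares with simple arithmetic. The sketch below assumes that each positive or uncertain result corresponds to one patient among the 1377 included, which the abstract does not state explicitly; the variable and function names are illustrative.

```python
# Counts as reported in the abstract.
TOTAL_PATIENTS = 1377
POSITIVE_RESULTS = 145
UNCERTAIN_RESULTS = 587


def share_percent(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to two decimals."""
    return round(100 * part / whole, 2)


# Assumption: one result per patient, so the cohort size is the denominator.
positive_share = share_percent(POSITIVE_RESULTS, TOTAL_PATIENTS)
uncertain_share = share_percent(UNCERTAIN_RESULTS, TOTAL_PATIENTS)

if __name__ == "__main__":
    print(f"Positive results: {positive_share} % of the cohort")
    print(f"Uncertain results: {uncertain_share} % of the cohort")
```

Under that assumption, positive results account for roughly 10.5 % of the cohort and uncertain results for roughly 42.6 %.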
Conclusion
This study highlights the significant role of data science in analyzing hereditary cancer data. Advanced computational techniques can aid in genetic variant reclassification, uncover patterns in underrepresented populations, and inform personalized interventions for hereditary cancer management in Latin America.
About the journal:
Cancer Epidemiology is dedicated to increasing understanding about cancer causes, prevention and control. The scope of the journal embraces all aspects of cancer epidemiology including:
• Descriptive epidemiology
• Studies of risk factors for disease initiation, development and prognosis
• Screening and early detection
• Prevention and control
• Methodological issues
The journal publishes original research articles (full length and short reports), systematic reviews and meta-analyses, editorials, commentaries and letters to the editor commenting on previously published research.