Automated Extraction of Imaging and Pathology Data From Diverse Prostate Cancer Electronic Records
John M Culnan, Sergey D Goryachev, John R Bihn, Grace Lee, Daniel C R Chen, Oleg Soloviev, Robert Zwolinski, Karlynn N Dulberger, Nhan V Do, Channing J Paller, Matthew R Cooperberg, Nathanael R Fillmore
JCO Clinical Cancer Informatics, volume 9, e2500085. Published 2025-08-01 (Epub 2025-08-07). DOI: https://doi.org/10.1200/CCI-25-00085. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12352557/pdf/
Abstract
Purpose: To develop and validate an algorithm to extract clinically relevant data elements for prostate cancer (PCa) from prostate biopsy reports and magnetic resonance imaging (MRI) reports.
Patients and methods: MRI reports and biopsy pathology reports were extracted from a cohort of 1,360,866 patients with PCa in the VA Cancer Registry System or the VA Corporate Data Warehouse; 155,570 of these patients had the relevant reports for inclusion. We hand-annotated a sample of these reports and used the annotations to develop a rule-based natural language processing (NLP) algorithm. From biopsy pathology reports, the algorithm extracts Gleason score, positive cores, and total cores taken during biopsy. From MRI reports, it extracts Prostate Imaging Reporting and Data System (PI-RADS) score, prostate-specific antigen (PSA) density, prostate volume, and prostate dimensions. Our algorithm was validated on a set of 250 biopsy reports and 250 MRI reports representing 378 patients at 78 VA centers with procedures between 2004 and 2024.
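To give a sense of what rule-based extraction of this kind can look like, the following is a minimal sketch using regular expressions. The abstract does not describe the authors' actual rules, so the patterns, the function names (`extract_gleason`, `extract_pirads`), and the example phrasings are all assumptions made for illustration, not the published algorithm.

```python
import re
from typing import Dict, Optional

# Hypothetical patterns for illustration only; the paper's real rules
# are not given in the abstract and are likely far more extensive.
GLEASON_RE = re.compile(
    r"gleason\s*(?:score|grade)?\s*:?\s*([1-5])\s*\+\s*([1-5])",
    re.IGNORECASE,
)
PIRADS_RE = re.compile(
    r"pi-?rads\s*(?:v\s*\d(?:\.\d)?\s*)?(?:score|category|assessment)?\s*:?\s*([1-5])\b",
    re.IGNORECASE,
)


def extract_gleason(text: str) -> Optional[Dict[str, int]]:
    """Return the first 'primary + secondary' Gleason pattern found, or None."""
    match = GLEASON_RE.search(text)
    if match is None:
        return None
    primary, secondary = int(match.group(1)), int(match.group(2))
    return {"primary": primary, "secondary": secondary, "total": primary + secondary}


def extract_pirads(text: str) -> Optional[int]:
    """Return the first PI-RADS score (1 to 5) found, or None."""
    match = PIRADS_RE.search(text)
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    print(extract_gleason("Prostatic adenocarcinoma, Gleason score 3+4=7"))
    print(extract_pirads("PI-RADS v2.1 score: 4"))
```

A sketch this small only handles the most regular phrasings. The error analysis reported in the abstract points to unusual, vague, or complex wording as the main source of misses, which is the kind of text patterns like these would not cover.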
Results: Our algorithm achieved high F1 scores across all data elements: Gleason score (96.9), PI-RADS score (93.7), PSA density (99.5), prostate volume (95.7), and prostate dimensions (93.2). For classifying the percentage of positive cores as greater than or less than 34%, the F1 score was 88.4. Error analysis showed that items missed by our algorithm were often explained by unusual, vague, or especially complex wording within the notes.
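For readers unfamiliar with the metric, the sketch below shows one common way to compute an F1 score by comparing extracted values against hand annotations, along with a helper that bins the percentage of positive cores around the 34% threshold. The abstract does not say how true positives, false positives, and false negatives were counted, nor how a value exactly at 34% is handled, so the counting scheme, the tie handling, and the names `core_category` and `f1_score` are assumptions.

```python
from typing import Optional, Sequence


def core_category(positive: int, total: int, threshold_pct: float = 34.0) -> Optional[str]:
    """Bin the percentage of positive cores relative to a threshold.

    Assumption: a value exactly at the threshold is treated as "not_above".
    Returns None when the counts are not usable.
    """
    if total <= 0 or positive < 0 or positive > total:
        return None
    pct = 100.0 * positive / total
    return "above" if pct > threshold_pct else "not_above"


def f1_score(gold: Sequence[Optional[str]], pred: Sequence[Optional[str]]) -> float:
    """F1 over paired annotated (gold) and extracted (pred) values.

    Assumed counting scheme: a correct extraction is a true positive,
    a wrong or spurious extraction is a false positive, and an annotated
    value that was missed or extracted incorrectly is a false negative.
    """
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        if p is not None and p == g:
            tp += 1
            continue
        if p is not None:
            fp += 1
        if g is not None:
            fn += 1
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    gold = ["7", "6", None, "8"]
    pred = ["7", "7", None, None]
    print(round(100 * f1_score(gold, pred), 1))
    print(core_category(5, 12))
```

The abstract reports F1 on a 0 to 100 scale, which corresponds to multiplying the value returned here by 100.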
Conclusion: We developed an NLP algorithm and validated that it successfully captures salient information about data elements of interest in PCa research. Reliable extraction of these key data elements will have numerous uses for downstream research in this field.