Automated Extraction of Imaging and Pathology Data From Diverse Prostate Cancer Electronic Records.

Impact factor: 2.8 (JCR Q2, Oncology)

JCO Clinical Cancer Informatics. Pub Date: 2025-08-01. Epub Date: 2025-08-07. DOI: 10.1200/CCI-25-00085

John M Culnan, Sergey D Goryachev, John R Bihn, Grace Lee, Daniel C R Chen, Oleg Soloviev, Robert Zwolinski, Karlynn N Dulberger, Nhan V Do, Channing J Paller, Matthew R Cooperberg, Nathanael R Fillmore

Citation: JCO Clinical Cancer Informatics, volume 9, e2500085 (Journal Article). Full text (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12352557/pdf/

Citations: 0

Abstract

Purpose: To develop and validate an algorithm to extract clinically relevant data elements for prostate cancer (PCa) from prostate biopsy reports and magnetic resonance imaging (MRI) reports.

Patients and methods: MRI reports and biopsy pathology reports were extracted from a cohort of 1,360,866 patients with PCa in the VA Cancer Registry System or the VA Corporate Data Warehouse, with 155,570 patients having the relevant reports for inclusion. We hand-annotated a sample of these reports, which were used to develop a rule-based natural language processing (NLP) algorithm for extracting Gleason score, positive cores, and total cores taken during biopsy from biopsy pathology reports and Prostate Imaging Reporting and Data System (PI-RADS) score, prostate-specific antigen (PSA) density, prostate volume, and prostate dimensions from MRI reports. Our algorithm was validated on a set of 250 biopsy reports and 250 MRI reports representing 378 patients at 78 VA centers with procedures between 2004 and 2024.
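As a rough illustration of what rule-based extraction of these elements can look like, the sketch below uses a handful of regular expressions. It is not the authors' algorithm: the patterns, the function names `extract_biopsy` and `extract_mri`, and the output field names are all hypothetical, and a production rule set would need to cover many more phrasings across sites and years.

```python
import re
from typing import Dict, Optional

# Hypothetical patterns for illustration only; the published rule set is not
# described in the abstract and is certainly more extensive than this.
GLEASON_RE = re.compile(
    r"gleason\s*(?:score|grade|pattern)?\s*[:=]?\s*\(?\s*([1-5])\s*\+\s*([1-5])",
    re.IGNORECASE,
)
CORES_RE = re.compile(
    r"\b(\d{1,2})\s*(?:out\s+of|of|/)\s*(\d{1,2})\s*cores\b", re.IGNORECASE
)
PIRADS_RE = re.compile(
    r"pi[\s-]?rads\s*(?:v?\s*2(?:\.[01])?)?\s*(?:score|category|assessment)?"
    r"\s*[:=]?\s*([1-5])(?!\d|\.\d)",
    re.IGNORECASE,
)
VOLUME_RE = re.compile(
    r"volume\s*(?:is|of|[:=])?\s*(\d+(?:\.\d+)?)\s*(?:cc|ml|cm3)\b", re.IGNORECASE
)
PSAD_RE = re.compile(
    r"psa\s*density\s*(?:is|of|[:=])?\s*(\d+(?:\.\d+)?)", re.IGNORECASE
)


def extract_biopsy(text: str) -> Dict[str, Optional[float]]:
    """Pull Gleason score and core counts from one biopsy report (sketch)."""
    out: Dict[str, Optional[float]] = {
        "gleason_primary": None, "gleason_secondary": None, "gleason_total": None,
        "positive_cores": None, "total_cores": None, "percent_positive": None,
    }
    m = GLEASON_RE.search(text)
    if m:
        primary, secondary = int(m.group(1)), int(m.group(2))
        out.update(gleason_primary=primary, gleason_secondary=secondary,
                   gleason_total=primary + secondary)
    m = CORES_RE.search(text)
    if m:
        positive, total = int(m.group(1)), int(m.group(2))
        if 0 < total and positive <= total:  # discard implausible counts
            out.update(positive_cores=positive, total_cores=total,
                       percent_positive=100 * positive / total)
    return out


def extract_mri(text: str) -> Dict[str, Optional[float]]:
    """Pull PI-RADS score, prostate volume, and PSA density from one MRI report (sketch)."""
    out: Dict[str, Optional[float]] = {
        "pirads": None, "volume_cc": None, "psa_density": None,
    }
    m = PIRADS_RE.search(text)
    if m:
        out["pirads"] = int(m.group(1))
    m = VOLUME_RE.search(text)
    if m:
        out["volume_cc"] = float(m.group(1))
    m = PSAD_RE.search(text)
    if m:
        out["psa_density"] = float(m.group(1))
    return out


if __name__ == "__main__":
    # Invented example sentences, not taken from any real report.
    print(extract_biopsy("Adenocarcinoma, Gleason score 3+4=7, involving 3 of 12 cores."))
    print(extract_mri("Prostate volume: 45 cc. PSA density 0.15. PI-RADS v2.1 category 4."))
```

Each function returns only the first match in a report. Real reports often describe several lesions or specimens, so a fuller implementation would need a rule for choosing among them (for example, the highest score); the abstract does not say how the authors handle this.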

Results: Our algorithm performed well across all data elements, with high F1 scores: Gleason (96.9), PI-RADS (93.7), PSA density (99.5), prostate volume (95.7), and prostate dimensions (93.2). Classifying the percentage of positive cores as greater than or less than 34% yielded an F1 score of 88.4. Error analysis showed that items missed by the algorithm were often explained by unusual or vague wording in the notes or by especially complex language.
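For readers unfamiliar with the metric, the sketch below shows one common way to compute an F1 score (on the 0 to 100 scale used above) by comparing extracted values against hand annotations, along with a binarization of the positive-core percentage at 34%. The scoring convention (a wrong value counts as both a false positive and a false negative) and the treatment of a value of exactly 34% are assumptions; the abstract does not specify either, and the names `f1_score` and `core_category` are hypothetical.

```python
from typing import Optional, Sequence


def f1_score(gold: Sequence[Optional[str]], pred: Sequence[Optional[str]]) -> float:
    """F1 on a 0-100 scale; None means the element is absent from the report."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        if p is not None and p == g:
            tp += 1          # extracted value matches the annotation
        else:
            if p is not None:
                fp += 1      # spurious or incorrect extraction
            if g is not None:
                fn += 1      # annotated value missed or extracted incorrectly
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 100 * 2 * precision * recall / (precision + recall)


def core_category(positive: int, total: int, threshold: float = 34.0) -> str:
    """Label the percentage of positive cores relative to the threshold.

    Assumption: strictly greater than the threshold counts as "above".
    """
    if total <= 0:
        raise ValueError("total cores must be positive")
    return "above" if 100 * positive / total > threshold else "below"


if __name__ == "__main__":
    # Invented toy annotations, not data from the study.
    gold = ["3+4", "4+3", None, "3+3"]
    pred = ["3+4", "4+4", None, None]
    print(f1_score(gold, pred))
    print(core_category(3, 12), core_category(5, 12))
```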

Conclusion: We developed an NLP algorithm and validated that it successfully captures salient information about data elements of interest in PCa research. Reliable extraction of these key data elements will have numerous uses for downstream research in this field.

Source journal: JCO Clinical Cancer Informatics (Oncology)

CiteScore: 6.20

Self-citation rate: 4.80%

Articles published: 190