Development and Validation of a Natural Language Processing Algorithm for Extracting Clinical and Pathological Features of Breast Cancer From Pathology Reports.

Impact factor: 3.3 (JCR Q2, Oncology)

JCO Clinical Cancer Informatics. Pub Date: 2024-08-01. DOI: 10.1200/CCI.24.00034

Elisabetta Munzone, Antonio Marra, Federico Comotto, Lorenzo Guercio, Claudia Anna Sangalli, Martina Lo Cascio, Eleonora Pagan, Davide Sangalli, Ilaria Bigoni, Francesca Maria Porta, Marianna D'Ercole, Fabiana Ritorti, Vincenzo Bagnardi, Nicola Fusco, Giuseppe Curigliano


Abstract

Purpose: Electronic health records (EHRs) are valuable information repositories that offer insights for enhancing clinical research on breast cancer (BC) using real-world data. The objective of this study was to develop a natural language processing (NLP) model specifically designed to extract structured data from BC pathology reports written in natural language.

Methods: In the initial phase, the development cohort comprised 193 pathology reports from 116 patients with BC, dated 2012 to 2016. A rule-based NLP algorithm extracted 26 variables, and its output was compared with manual extraction performed by a data entry specialist and by an oncologist. In the second phase, the data set was expanded to 513 reports, and a Named Entity Recognition (NER)-NLP model was trained and evaluated using k-fold cross-validation.
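The two steps named in the Methods can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the extraction rules, the report language, or the number of folds, so the regular expressions, the English keywords, the names `PATTERNS`, `extract_features`, and `k_fold_splits`, and the choice of 5 folds are all assumptions made for illustration.

```python
import re

# Hypothetical rules for four of the 26 variables. The study's actual rules
# are not given in the abstract, and the real reports may not be in English.
PATTERNS = {
    "ER": re.compile(r"\b(?:ER|estrogen receptor)\b[^0-9\n]{0,20}(\d{1,3})\s*%", re.I),
    "PR": re.compile(r"\b(?:PR|PgR|progesterone receptor)\b[^0-9\n]{0,20}(\d{1,3})\s*%", re.I),
    "Ki-67": re.compile(r"\bKi-?67\b[^0-9\n]{0,20}(\d{1,3})\s*%", re.I),
    "HER2": re.compile(r"\bHER-?2\b[^0-9\n]{0,20}([0-3]\+?)(?!\d)", re.I),
}


def extract_features(report: str) -> dict:
    """Return the first match per variable, or None when a rule does not fire."""
    found = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(report)
        found[name] = match.group(1) if match else None
    return found


def k_fold_splits(n_items: int, k: int):
    """Yield (train_indices, test_indices) for k contiguous, near-equal folds.

    No shuffling is done here; a real evaluation would usually shuffle first
    and keep reports from the same patient in the same fold.
    """
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_items) if i < start or i >= start + size]
        yield train, test
        start += size


if __name__ == "__main__":
    sample = "ER: 90%\nPR 75 %\nKi-67: 25%\nHER2 score 2+"
    print(extract_features(sample))
    # 513 reports as in the abstract; k=5 is an assumed value.
    for train, test in k_fold_splits(513, 5):
        print(len(train), len(test))
```

A rule-based extractor like this is transparent but brittle to wording changes, which is one plausible motivation for moving to a trained NER model in the second phase.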

Results: In the concordance analysis for the first approach, agreement between the algorithm and the oncologist was 82.9%, whereas agreement between the data entry specialist and the oncologist was 90.8%. In the second approach, the NER-NLP model reached an accuracy of 97.8%. The model performed especially well for parameters such as estrogen receptor, progesterone receptor, human epidermal growth factor receptor 2, and Ki-67 (F1-score 1.0).
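For readers unfamiliar with the metrics reported in the Results, the following Python sketch shows one common way to compute them. The abstract does not state how concordance or F1 were calculated, so simple percent agreement and entity-level exact-match F1 are assumed here, and the function names and example values are hypothetical.

```python
def percent_agreement(a, b):
    """Share of positions where two annotators give the same value, in percent."""
    if len(a) != len(b) or not a:
        raise ValueError("inputs must be non-empty and of equal length")
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)


def precision_recall_f1(gold: set, predicted: set):
    """Entity-level scores; an entity counts only if it matches exactly."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Made-up values for one variable across four reports.
    oncologist = ["pos", "neg", "pos", "neg"]
    algorithm = ["pos", "neg", "neg", "neg"]
    print(percent_agreement(oncologist, algorithm))

    # Made-up entities as (start, end, label) spans.
    gold = {(0, 2, "ER"), (5, 7, "PR")}
    predicted = {(0, 2, "ER"), (9, 11, "HER2")}
    print(precision_recall_f1(gold, predicted))
```

An F1-score of 1.0 for a variable means that, under the matching rule used, every predicted entity was correct and no gold entity was missed.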

Conclusion: The present study aligns with the rapidly evolving field of artificial intelligence (AI) applications in oncology, seeking to expedite the development of complex cancer databases and registries. The model outputs are currently being postprocessed into tabular structures to facilitate their use in real-world clinical and research settings.

