Classifying Tumor Reportability Status From Unstructured Electronic Pathology Reports Using Language Models in a Population-Based Cancer Registry Setting.

Impact factor: 2.8; JCR quartile: Q2 (Oncology)

JCO Clinical Cancer Informatics. Pub Date: 2024-11-01. Epub Date: 2024-11-19. DOI: 10.1200/CCI.24.00110

Lovedeep Gondara, Jonathan Simkin, Gregory Arbour, Shebnum Devji, Raymond Ng

JCO Clinical Cancer Informatics, volume 8, e2400110. Publication type: Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11593994/pdf/

Citations: 0

Abstract

Purpose: Population-based cancer registries (PBCRs) collect data on all new cancer diagnoses in a defined population. Data are sourced from pathology reports, and the PBCRs rely on manual and rule-based solutions. This study presents a state-of-the-art natural language processing (NLP) pipeline, built by fine-tuning pretrained language models (LMs). The pipeline is deployed at the British Columbia Cancer Registry (BCCR) to detect reportable tumors from a population-based feed of electronic pathology.

Methods: We fine-tune two publicly available LMs, GatorTron and BlueBERT, both pretrained on clinical text, using BCCR's pathology reports. For the final decision, we combine the two models' outputs using an OR approach. The fine-tuning data set consisted of 40,000 reports from diagnosis year 2021; the test data sets consisted of 10,000 reports from diagnosis year 2021, 20,000 reports from diagnosis year 2022, and 400 reports from diagnosis year 2023.
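
As an illustration of the OR approach, the following minimal sketch flags a report as reportable if either model flags it. It is not the authors' implementation: the function name `or_ensemble` and the example flag lists are hypothetical, and the per-model decisions are assumed to have already been reduced to booleans.

```python
from typing import List, Sequence


def or_ensemble(flags_a: Sequence[bool], flags_b: Sequence[bool]) -> List[bool]:
    """Combine two models' per-report decisions with a logical OR.

    A report is marked reportable when at least one model marks it reportable.
    """
    if len(flags_a) != len(flags_b):
        raise ValueError("Both models must score the same set of reports.")
    return [a or b for a, b in zip(flags_a, flags_b)]


# Hypothetical decisions for four reports (True = reportable).
gatortron_flags = [True, False, False, True]
bluebert_flags = [False, False, True, True]

combined = or_ensemble(gatortron_flags, bluebert_flags)
print(combined)  # [True, False, True, True]
```

An OR rule can only add reportable flags relative to either model alone, so it favors catching reportable tumors at the cost of potentially passing more non-reportable reports through.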

Results: The retrospective evaluation of our proposed approach showed improved reportable accuracy while maintaining the true reportable threshold of 98%.
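
The abstract does not define the true reportable threshold. The sketch below assumes it refers to the share of truly reportable reports that the pipeline flags as reportable (recall on the reportable class); that reading, the function name `true_reportable_rate`, and the toy labels are all assumptions for illustration, not results from the study.

```python
from typing import Sequence

THRESHOLD = 0.98  # the 98% figure quoted in the abstract


def true_reportable_rate(y_true: Sequence[bool], y_pred: Sequence[bool]) -> float:
    """Fraction of truly reportable reports that were flagged as reportable.

    Assumption: this recall-style quantity is what the threshold constrains.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("Labels and predictions must have the same length.")
    flagged = [pred for truth, pred in zip(y_true, y_pred) if truth]
    if not flagged:
        raise ValueError("No truly reportable reports in the sample.")
    return sum(flagged) / len(flagged)


# Toy data, not from the study (True = reportable).
y_true = [True, True, True, True, False, False]
y_pred = [True, True, True, False, True, False]

rate = true_reportable_rate(y_true, y_pred)
print(rate)               # 0.75
print(rate >= THRESHOLD)  # False
```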

Conclusion: Rule-based NLP in cancer surveillance has clear disadvantages, including the manual effort of rule design and sensitivity to changes in language. For PBCRs, which must determine the reportability status of incoming electronic cancer pathology reports, deep learning approaches demonstrate superior classification performance and provide significant advantages over rule-based NLP.


Source journal: JCO Clinical Cancer Informatics (Oncology)
CiteScore: 6.20
Self-citation rate: 4.80%
Articles published: 190