Large multimodal model-based standardisation of pathology reports with confidence and its prognostic significance

IF 3.4, Zone 2 (Medicine), Q1 PATHOLOGY

Journal of Pathology Clinical Research Pub Date : 2024-11-15 DOI:10.1002/2056-4538.70010

Ethar Alzaid, Gabriele Pergola, Harriet Evans, David Snead, Fayyaz Minhas


Citations: 0

Abstract

Despite the existence of established standards and guidelines for pathology reporting, many pathology reports are still written in unstructured free text. Extracting information from these reports and formatting it according to a standard is crucial for consistent interpretation. Automated information extraction from unstructured pathology reports is challenging because it requires accurate interpretation of medical terminology and context-dependent details. In this work, we present a practical approach for automatically extracting information from unstructured pathology reports or scanned paper reports using a large multimodal model. The framework uses context-aware prompting strategies to extract the values of individual fields, such as grade and size, from pathology reports. A unique feature of the proposed approach is that it assigns each field a confidence value indicating the correctness of the model's extraction, and generates a structured report in line with national pathology guidelines in both human-readable and machine-readable formats. We have analysed the extraction performance in terms of accuracy and kappa scores, as well as the quality of the confidence scores assigned by the model. We have also evaluated the prognostic value of the extracted fields and of feature embeddings of the raw text. Results showed that the model can accurately extract information, with accuracy and kappa scores of up to 0.99 and 0.98, respectively. Our results indicate that confidence scores are an effective indicator of the correctness of the extracted information, achieving an area under the receiver operating characteristic curve of up to 0.93 and thus enabling automatic flagging of extraction errors. Our analysis further reveals that, as expected, information extracted from pathology reports is highly prognostically relevant. The framework demo is available at: https://labieb.dcs.warwick.ac.uk/. Information extracted with the proposed approach from pathology reports of colorectal cancer cases in The Cancer Genome Atlas, together with the code, is available at: https://github.com/EtharZaid/Labieb.
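The evaluation described in the abstract (accuracy and kappa for the extracted fields, and the area under the ROC curve for how well confidence separates correct from incorrect extractions) can be illustrated with a minimal sketch. This is not the authors' code: the field values, confidence scores, and the 0.6 flagging threshold below are hypothetical and chosen only to show how the three quantities are computed and how low-confidence extractions might be flagged for review.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of fields where the extracted value matches the reference."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: agreement between two labelings, corrected for chance."""
    n = len(y_true)
    p_observed = accuracy(y_true, y_pred)
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    # Chance agreement from the marginal label frequencies of each labeling.
    p_expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if p_expected == 1.0:
        return 1.0  # degenerate case: both labelings use a single identical class
    return (p_observed - p_expected) / (1.0 - p_expected)


def auroc(scores, is_correct):
    """Probability that a correct extraction gets a higher confidence than an
    incorrect one (ties count as one half)."""
    pos = [s for s, ok in zip(scores, is_correct) if ok]
    neg = [s for s, ok in zip(scores, is_correct) if not ok]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect extraction")
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical data for a single field ("grade") across six reports.
reference = ["G2", "G2", "G3", "G1", "G2", "G3"]   # manually annotated values
extracted = ["G2", "G2", "G3", "G2", "G2", "G1"]   # values returned by the model
confidence = [0.95, 0.90, 0.85, 0.40, 0.80, 0.55]  # model-assigned confidence

is_correct = [r == e for r, e in zip(reference, extracted)]
acc = accuracy(reference, extracted)
kappa = cohen_kappa(reference, extracted)
auc = auroc(confidence, is_correct)

# Flag extractions whose confidence falls below an assumed review threshold.
THRESHOLD = 0.6  # hypothetical; the paper does not state a threshold here
flagged = [i for i, c in enumerate(confidence) if c < THRESHOLD]

print(f"accuracy={acc:.3f} kappa={kappa:.3f} auroc={auc:.3f} flagged={flagged}")
```

In this toy example the two incorrect extractions also carry the lowest confidence, so they are the ones flagged. With real data the separation is usually imperfect, which is what the AUROC quantifies.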




Source journal

Journal of Pathology Clinical Research Medicine-Pathology and Forensic Medicine

CiteScore

7.40

Self-citation rate

2.40%

Publication volume

Review time

20 weeks

About the journal: The Journal of Pathology: Clinical Research and The Journal of Pathology serve as translational bridges between basic biomedical science and clinical medicine, with particular emphasis on, but not restricted to, tissue-based studies. The focus of The Journal of Pathology: Clinical Research is the publication of studies that illuminate the clinical relevance of research in the broad area of the study of disease. Appropriately powered and validated studies with novel diagnostic, prognostic and predictive significance, and biomarker discovery and validation, will be welcomed. Studies with a predominantly mechanistic basis will be more appropriate for the companion Journal of Pathology.