Using a Longformer Large Language Model for Segmenting Unstructured Cancer Pathology Reports.

IF 2.8 Q2 ONCOLOGY

JCO Clinical Cancer Informatics Pub Date : 2025-03-01 Epub Date: 2025-03-04 DOI:10.1200/CCI-24-00143

Damien Fung, Gregory Arbour, Krisha Malik, Kaitlin Muzio, Raymond Ng


Citations: 0

Abstract


Using a Longformer Large Language Model for Segmenting Unstructured Cancer Pathology Reports.

Purpose: Many natural language processing (NLP) methods perform better when the input text is preprocessed to remove extraneous text. Text segmentation can facilitate this step by isolating key sections of a document. Given that transformer models such as Bidirectional Encoder Representations from Transformers (BERT) have demonstrated state-of-the-art performance on many NLP tasks, it is desirable to leverage such models for segmentation. However, standard transformer models are typically limited to 512 input tokens and are not well suited to lengthy documents such as cancer pathology reports. The Longformer is a modified transformer model designed to accept longer documents while retaining the strengths of standard transformers. This study presents a Longformer model fine-tuned for cancer pathology report segmentation.
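The abstract does not explain how the Longformer accommodates longer inputs. As background drawn from the general Longformer design rather than from this study, it replaces full self-attention with a sliding-window pattern (plus a small number of globally attending tokens), so the number of attention scores grows roughly linearly with document length instead of quadratically. The sketch below counts query-key pairs under both schemes; the sequence length of 4096 and window of 512 are illustrative assumptions, not settings reported here, and global attention is left out for simplicity.

```python
def full_attention_pairs(n_tokens: int) -> int:
    """Query-key pairs scored by standard (full) self-attention."""
    return n_tokens * n_tokens


def sliding_window_pairs(n_tokens: int, window: int) -> int:
    """Query-key pairs when each token attends only to positions within
    window // 2 tokens on either side of itself (itself included)."""
    half = window // 2
    total = 0
    for i in range(n_tokens):
        lo = max(0, i - half)             # clip the window at the document start
        hi = min(n_tokens - 1, i + half)  # clip the window at the document end
        total += hi - lo + 1
    return total


if __name__ == "__main__":
    # Illustrative sizes only: a BERT-length input and a longer report.
    for n in (512, 4096):
        print(n, full_attention_pairs(n), sliding_window_pairs(n, window=512))
```

For a fixed window, the windowed count grows in proportion to the number of tokens, which is what makes long pathology reports tractable in a single pass.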

Methods: We fine-tuned a Longformer Question-Answer (QA) model on 504 manually annotated pathology reports to isolate sections such as diagnosis, addenda, and clinical history. We compared it with baseline methods, including regular expressions (regex) and BERT QA, which may fail to identify section boundaries correctly. Model performance was evaluated using sequence recall, precision, and F1 score.
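To make the QA framing concrete, here is a minimal sketch of how one annotated section might be expressed as an extractive QA training example, alongside a simple regex baseline. Everything specific in it is an assumption for illustration: the question wording, the SQuAD-style record layout, the header pattern, and the sample report are hypothetical and are not taken from the study.

```python
import re
from typing import Optional, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive

# Hypothetical question wording; the study's actual prompts are not given.
QUESTIONS = {
    "diagnosis": "What is the diagnosis?",
    "addenda": "What are the addenda?",
    "clinical_history": "What is the clinical history?",
}


def make_qa_example(report: str, section: str, span: Span) -> dict:
    """Turn one annotated section into a SQuAD-style record (assumed layout)."""
    start, end = span
    return {
        "question": QUESTIONS[section],
        "context": report,
        "answers": {"text": [report[start:end]], "answer_start": [start]},
    }


def regex_section(report: str, header: str) -> Optional[Span]:
    """Baseline: text after 'HEADER:' up to the next all-caps header or the end."""
    pattern = re.compile(
        rf"{re.escape(header)}:\s*(.*?)(?=\n[A-Z][A-Z ]+:|\Z)", re.DOTALL
    )
    match = pattern.search(report)
    return match.span(1) if match else None


if __name__ == "__main__":
    # Made-up report text, for illustration only.
    report = (
        "CLINICAL HISTORY: Breast mass.\n"
        "DIAGNOSIS: Invasive ductal carcinoma.\n"
        "ADDENDUM: ER positive."
    )
    span = regex_section(report, "DIAGNOSIS")
    if span is not None:
        print(make_qa_example(report, "diagnosis", span))
    # A report without the expected header defeats the regex baseline.
    print(regex_section("Final dx - invasive ductal carcinoma.", "DIAGNOSIS"))
```

The regex only works when a report uses the exact header it expects, which is one way a rule-based baseline can miss section boundaries; a QA model instead predicts the answer span from the report text itself.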

Results: Final test results were obtained on a hold-out test set of 304 cancer pathology reports. We report sequence F1 scores for the following sections: diagnosis (0.77), addenda (0.48), clinical history (0.89), and overall (0.68).
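The abstract does not define "sequence" precision, recall, and F1. One common reading, assumed here purely for illustration, scores the overlap between the predicted and annotated token spans for a section. The function below is a sketch under that assumption and is not the authors' evaluation code.

```python
from typing import Optional, Tuple

Span = Tuple[int, int]  # (start, end) token indices, end exclusive


def span_prf(pred: Optional[Span], gold: Span) -> Tuple[float, float, float]:
    """Overlap-based precision, recall, and F1 for one predicted section span."""
    if pred is None:
        return 0.0, 0.0, 0.0
    overlap = max(0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / (pred[1] - pred[0])  # share of predicted tokens that are correct
    recall = overlap / (gold[1] - gold[0])     # share of annotated tokens that were found
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical spans: the prediction starts on time but stops halfway.
    print(span_prf((10, 20), (10, 30)))
    # Unweighted mean of the three section F1 scores reported above.
    print((0.77 + 0.48 + 0.89) / 3)
```

The unweighted mean of the three reported section scores is about 0.71 rather than 0.68, so the overall figure is presumably aggregated some other way (for example, across additional sections or with weighting); the abstract does not say which.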

Conclusion: We present a fine-tuned Longformer model that isolates key sections from cancer pathology reports for downstream analyses. Our model performs segmentation with greater accuracy than the baseline methods.


Source journal: JCO Clinical Cancer Informatics (Oncology)

CiteScore: 6.20

Self-citation rate: 4.80%

Articles published: 190