Using a Longformer Large Language Model for Segmenting Unstructured Cancer Pathology Reports
Damien Fung, Gregory Arbour, Krisha Malik, Kaitlin Muzio, Raymond Ng
JCO Clinical Cancer Informatics, volume 9, e2400143. Published 2025-03-01 (Epub 2025-03-04). DOI: 10.1200/CCI-24-00143 (https://doi.org/10.1200/CCI-24-00143)
Abstract
Purpose: Many natural language processing (NLP) methods perform better when the input text is preprocessed to remove extraneous text. Text segmentation can facilitate this step by isolating key sections from a document. Given that transformer models, such as Bidirectional Encoder Representations from Transformers (BERT), have demonstrated state-of-the-art performance on many NLP tasks, it is desirable to leverage such models for segmentation. However, standard transformer models are typically limited to 512 input tokens and are not well suited to lengthy documents such as cancer pathology reports. The Longformer is a modified transformer model designed to accept longer documents while retaining the strengths of standard transformers. This study presents a Longformer model fine-tuned for cancer pathology report segmentation.
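To make the idea of text segmentation concrete, the following is a minimal sketch of a header-based splitter built on regular expressions. It is not the method of this study: the header names, the `segment_by_headers` function, and the toy report are all invented for illustration, and real pathology reports vary far more in wording and layout.

```python
import re

# Hypothetical section headers; real reports use many more variants.
HEADER_PATTERN = re.compile(
    r"^(DIAGNOSIS|ADDENDUM|CLINICAL HISTORY)[ \t]*:?[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)


def segment_by_headers(report: str) -> dict[str, str]:
    """Split a report into sections keyed by lower-cased header name."""
    matches = list(HEADER_PATTERN.finditer(report))
    sections = {}
    for i, match in enumerate(matches):
        # A section runs from the end of its header to the next header
        # (or to the end of the report for the last section).
        start = match.end()
        end = matches[i + 1].start() if i + 1 < len(matches) else len(report)
        sections[match.group(1).lower()] = report[start:end].strip()
    return sections


# Invented toy report, for illustration only.
toy_report = """CLINICAL HISTORY:
Lump in left breast.
DIAGNOSIS:
Invasive ductal carcinoma, grade 2.
ADDENDUM:
Receptor studies to follow.
"""

sections = segment_by_headers(toy_report)
```

A splitter like this only works when every header appears on its own line in an expected form, so any header it does not recognize is silently folded into the preceding section.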
Methods: We fine-tuned a Longformer Question-Answer (QA) model on 504 manually annotated pathology reports to isolate sections such as diagnosis, addenda, and clinical history. We compared it against baseline methods including regular expressions (regex) and BERT QA, which may fail to correctly identify section boundaries. Model performance was evaluated using sequence recall, precision, and F1 score.
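Extractive QA models commonly score every token as a candidate start and a candidate end of the answer, then return the best-scoring span. The sketch below shows that decoding step in plain Python. The abstract does not describe the decoding used in the study, so this is an assumption about the general technique; `best_span`, `max_len`, the tokens, and the scores are all hypothetical.

```python
def best_span(
    start_scores: list[float], end_scores: list[float], max_len: int
) -> tuple[int, int]:
    """Return inclusive (start, end) token indices maximizing start + end score.

    Only spans with start <= end and at most max_len tokens are considered.
    """
    best = (0, 0)
    best_score = float("-inf")
    for s, s_score in enumerate(start_scores):
        for e in range(s, min(s + max_len, len(end_scores))):
            score = s_score + end_scores[e]
            if score > best_score:
                best_score = score
                best = (s, e)
    return best


# Invented tokens and scores for a question such as "What is the diagnosis?"
tokens = ["Clinical", "history", ":", "lump", ".", "Diagnosis", ":", "carcinoma", "."]
start_scores = [0.1, 0.0, 0.0, 0.2, 0.0, 0.3, 0.1, 2.5, 0.0]
end_scores = [0.0, 0.0, 0.0, 0.1, 0.4, 0.0, 0.0, 0.6, 2.2]

start, end = best_span(start_scores, end_scores, max_len=4)
answer = " ".join(tokens[start : end + 1])
```

Framing segmentation this way means each section is retrieved by asking one question per section type, and the section boundaries come from the predicted start and end positions rather than from fixed header patterns.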
Results: Final test results were obtained on a hold-out test set of 304 cancer pathology reports. We report sequence F1 scores for the following sections: diagnosis (0.77), addenda (0.48), clinical history (0.89), and overall (0.68).
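The abstract does not define how the sequence metrics are computed. One plausible reading, assumed here, is token-level overlap between the predicted span and the annotated span. The function name `span_prf` and the half-open span convention are illustrative choices, not taken from the study.

```python
def span_prf(
    pred: tuple[int, int], gold: tuple[int, int]
) -> tuple[float, float, float]:
    """Precision, recall, and F1 for half-open token spans [start, end)."""
    pred_len = max(0, pred[1] - pred[0])
    gold_len = max(0, gold[1] - gold[0])
    # Number of token positions shared by the two spans.
    overlap = max(0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    precision = overlap / pred_len if pred_len else 0.0
    recall = overlap / gold_len if gold_len else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1


# Predicted span covers tokens 10..29, annotated span covers tokens 15..34.
precision, recall, f1 = span_prf((10, 30), (15, 35))
```

Under this definition, a prediction that starts too early loses precision and one that stops too early loses recall, so the F1 score penalizes misplaced section boundaries in either direction.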
Conclusion: We present a fine-tuned Longformer model that isolates key sections from cancer pathology reports for downstream analyses. Our model performs segmentation with greater accuracy than the baseline methods we compared.