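Supporting cancer research on real-world data: extracting colorectal cancer status and explicitly written TNM stages from free-text imaging and histopathology reports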

Impact factor: 4.4. JCR quartile: Q1 (Health Care Sciences & Services)

BMJ Health & Care Informatics. Publication date: 2025-09-21. DOI: 10.1136/bmjhci-2025-101521

Andres Tamm, Helen J S Jones, Neel Doshi, William Perry, Jaimie Withers, Hizni Salih, Theresa Noble, Kinga Anna Varnai, Stephanie Little, Gail Roadknight, Des Campell, Sheila Matharu, Naureen Starling, Marion Teare, Algirdas Galdikas, Ben Glampson, Luca Mercuri, Dimitri Papadimitriou, Harpreet Wasan, Lauren A Scanlon, Lee Malcomson, Catherine O'Hara, Andrew Renehan, Brian D Nicholson, Jim Davies, Eva J A Morris, Kerrie Woods, Chris Cunningham


Citations: 0

Abstract



Supporting cancer research on real-world data: extracting colorectal cancer status and explicitly written TNM stages from free-text imaging and histopathology reports.

Objectives: The 'tumour, node, metastasis' (TNM) classification of colorectal cancer (CRC) predicts prognosis and so is vital to consider in analyses of patterns and outcomes of care when using electronic health records. Unfortunately, it is often only available in free-text reports. This study aimed to develop regex-based text-processing algorithms that identify the reports describing CRC and extract the TNM staging at a low computational cost.

Methods: The CRC and TNM extraction algorithms were iteratively developed using 58 634 imaging and pathology reports of patients with CRC from the Oxford University Hospitals (OUH) and Royal Marsden (RMH) NHS Foundation Trusts (FT), with additional input from Imperial College Healthcare and Christie NHS FTs. The algorithms were evaluated on a stratified random sample of 400 OUH development data reports and 400 newer 'unseen' OUH reports. The reports were annotated with the help of two clinicians.
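The evaluation draws a stratified random sample of reports. A minimal sketch of that idea follows. The abstract does not state which strata were used, so stratifying by report type is an assumption here, and the function name `stratified_sample`, the `type` field and the per-stratum size are all hypothetical.

```python
import random
from collections import defaultdict


def stratified_sample(reports, key, per_stratum, seed=0):
    """Draw up to `per_stratum` reports at random from each stratum.

    `key` maps a report to its stratum label (an assumption: the study's
    actual strata are not given in the abstract).
    """
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    groups = defaultdict(list)
    for report in reports:
        groups[key(report)].append(report)

    sample = []
    for label in sorted(groups):  # sorted for a deterministic stratum order
        items = groups[label]
        k = min(per_stratum, len(items))  # never ask for more than exist
        sample.extend(rng.sample(items, k))
    return sample


# Illustrative data only, not the study's reports.
reports = [
    {"id": i, "type": "imaging" if i % 2 else "pathology"} for i in range(100)
]
sample = stratified_sample(reports, key=lambda r: r["type"], per_stratum=10)
```

Sampling within each stratum keeps rare report types represented in the annotated set, which a simple random draw over all reports would not guarantee.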

Results: The CRC algorithm achieved at least 93.0% positive predictive value (PPV), 72.1% sensitivity, 64.0% negative predictive value (NPV) and 90.1% specificity for primary CRC on pathology reports. On imaging reports, it demonstrated at least 78.0% PPV, 91.8% sensitivity, 93.0% NPV and 80.9% specificity. For the main T/N/M categories, the TNM algorithm achieved PPVs of at least 93.9% (T), 97.7% (N) and 97.2% (M), and sensitivities of 63.6% (T), 89.6% (N) and 64.8% (M). NPVs were at least 45.0% (T), 91.1% (N), 88.4% (M), and specificities 95.7% (T), 98.1% (N), 99.3% (M). Reductions in performance were mostly due to implicit staging. For extracting explicit TNM stages, current or historical, the algorithm made no errors on 400 pathology reports and six errors on 400 imaging reports.
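The four measures reported above are standard functions of a 2x2 confusion matrix. The sketch below shows how they are computed. The function name `diagnostic_metrics` is hypothetical, and the counts in the usage lines are invented for illustration; they are not the study's data.

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Return PPV, sensitivity, NPV and specificity from confusion-matrix counts.

    tp/fp/fn/tn are true positives, false positives, false negatives and
    true negatives. A ratio with a zero denominator is returned as NaN.
    """
    def ratio(numerator, denominator):
        return numerator / denominator if denominator else float("nan")

    return {
        "ppv": ratio(tp, tp + fp),          # of flagged reports, how many are correct
        "sensitivity": ratio(tp, tp + fn),  # of true cases, how many are found
        "npv": ratio(tn, tn + fn),          # of unflagged reports, how many are truly negative
        "specificity": ratio(tn, tn + fp),  # of true negatives, how many are left unflagged
    }


# Illustrative counts only (not taken from the paper).
metrics = diagnostic_metrics(tp=80, fp=5, fn=20, tn=95)
```

With these invented counts, PPV is 80/85 and sensitivity is 80/100, the same shape of result as the abstract reports for the T category: a rule-based extractor that rarely produces a wrong stage can still miss stages that are not written out explicitly.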

Conclusion: The TNM algorithm accurately extracts explicit TNM staging, but other methods are needed for retrieving implicit stages. The CRC algorithm is accurate on non-supplementary reports, but outputs need additional review if higher precision is required.
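To make the distinction between explicit and implicit staging concrete, here is a minimal regex sketch in Python. It is not the authors' algorithm: the pattern, the accepted prefixes and suffixes, the "weighted" exclusion and the function name `extract_explicit_tnm` are all assumptions chosen for illustration.

```python
import re

# Hypothetical pattern for explicitly written stages such as "pT3 N1 M0".
# Assumed notation: optional lowercase prefixes (c, p, y, r, a), T values
# 0-4 / is / X, N values 0-3 / X, M values 0-1 / X, with optional letter suffixes.
TNM_PATTERN = re.compile(
    r"""
    (?<![A-Za-z0-9])                       # do not start inside a word
    (?P<prefix>[cpyra]{0,3})               # e.g. "p", "yp"
    T(?P<T>[0-4][a-d]?|is|[Xx])
    \s*
    (?:[cpyra]{0,3}N(?P<N>[0-3][a-c]?|[Xx]))?
    \s*
    (?:[cpyra]{0,3}M(?P<M>[01][a-c]?|[Xx]))?
    (?![A-Za-z0-9])                        # do not end inside a word
    (?!\s*-?\s*weighted)                   # skip MRI sequence names like "T2-weighted"
    """,
    re.VERBOSE,
)


def extract_explicit_tnm(text):
    """Return one dict per explicit TNM mention; missing categories are None."""
    return [
        {
            "prefix": m.group("prefix"),
            "T": m.group("T"),
            "N": m.group("N"),
            "M": m.group("M"),
        }
        for m in TNM_PATTERN.finditer(text)
    ]


print(extract_explicit_tnm("Staging: pT3 N1 M0."))
# [{'prefix': 'p', 'T': '3', 'N': '1', 'M': '0'}]
print(extract_explicit_tnm("Tumour invades through the muscularis propria."))
# []
```

The second call returns nothing because the stage is only described, not written, which is the kind of implicit staging the conclusion says needs other methods. The "weighted" exclusion shows why context rules matter on imaging reports: without it, a pattern like this would read the MRI term "T2-weighted" as a T2 stage.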


Source journal: BMJ Health & Care Informatics

CiteScore: 6.10

Self-citation rate: 4.90%

Review time: 18 weeks