Andres Tamm, Helen J S Jones, Neel Doshi, William Perry, Jaimie Withers, Hizni Salih, Theresa Noble, Kinga Anna Varnai, Stephanie Little, Gail Roadknight, Des Campell, Sheila Matharu, Naureen Starling, Marion Teare, Algirdas Galdikas, Ben Glampson, Luca Mercuri, Dimitri Papadimitriou, Harpreet Wasan, Lauren A Scanlon, Lee Malcomson, Catherine O'Hara, Andrew Renehan, Brian D Nicholson, Jim Davies, Eva J A Morris, Kerrie Woods, Chris Cunningham
BMJ Health & Care Informatics 32(1). Journal Article. Published 2025-09-21. DOI: https://doi.org/10.1136/bmjhci-2025-101521. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12458752/pdf/
Supporting cancer research on real-world data: extracting colorectal cancer status and explicitly written TNM stages from free-text imaging and histopathology reports.
Objectives: The 'tumour, node, metastasis' (TNM) classification of colorectal cancer (CRC) predicts prognosis and so is vital to consider in analyses of patterns and outcomes of care when using electronic health records. Unfortunately, it is often only available in free-text reports. This study aimed to develop regex-based text-processing algorithms that identify the reports describing CRC and extract the TNM staging at a low computational cost.
Methods: The CRC and TNM extraction algorithms were iteratively developed using 58 634 imaging and pathology reports of patients with CRC from the Oxford University Hospitals (OUH) and Royal Marsden (RMH) NHS Foundation Trusts (FT), with additional input from Imperial College Healthcare and Christie NHS FTs. The algorithms were evaluated on a stratified random sample of 400 OUH development data reports and 400 newer 'unseen' OUH reports. The reports were annotated with the help of two clinicians.
Results: The CRC algorithm achieved at least 93.0% positive predictive value (PPV), 72.1% sensitivity, 64.0% negative predictive value (NPV) and 90.1% specificity for primary CRC on pathology reports. On imaging reports, it demonstrated at least 78.0% PPV, 91.8% sensitivity, 93.0% NPV and 80.9% specificity. For the main T/N/M categories, the TNM algorithm achieved PPVs of at least 93.9% (T), 97.7% (N) and 97.2% (M), and sensitivities of 63.6% (T), 89.6% (N) and 64.8% (M). NPVs were at least 45.0% (T), 91.1% (N), 88.4% (M), and specificities 95.7% (T), 98.1% (N), 99.3% (M). Reductions in performance were mostly due to implicit staging. For extracting explicit TNM stages, current or historical, the algorithm made no errors on 400 pathology reports and six errors on 400 imaging reports.
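The four measures reported above follow from a 2×2 table of algorithm output against clinician annotation. The sketch below shows the arithmetic only. The class and function names are hypothetical, and the counts in the usage note are invented for illustration; they are not taken from the study.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Counts from comparing algorithm output with the reference annotation."""
    tp: int  # algorithm positive, annotation positive
    fp: int  # algorithm positive, annotation negative
    fn: int  # algorithm negative, annotation positive
    tn: int  # algorithm negative, annotation negative


def _ratio(numerator: int, denominator: int) -> float:
    # Return NaN rather than raising when a margin of the table is empty.
    return numerator / denominator if denominator else math.nan


def diagnostic_metrics(c: ConfusionCounts) -> dict[str, float]:
    """Compute PPV, sensitivity, NPV and specificity as proportions."""
    return {
        "ppv": _ratio(c.tp, c.tp + c.fp),
        "sensitivity": _ratio(c.tp, c.tp + c.fn),
        "npv": _ratio(c.tn, c.tn + c.fn),
        "specificity": _ratio(c.tn, c.tn + c.fp),
    }


if __name__ == "__main__":
    # Invented counts, for illustration only.
    example = ConfusionCounts(tp=90, fp=10, fn=30, tn=70)
    for name, value in diagnostic_metrics(example).items():
        print(f"{name}: {value:.1%}")
```

With these invented counts, the 30 false negatives lower sensitivity (75.0%) and NPV (70.0%), while PPV (90.0%) and specificity (87.5%) stay high. A stage that the report implies but never writes out would count as a false negative for a pattern-matching extractor, which would be consistent with the statement that performance reductions were mostly due to implicit staging.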
Conclusion: The TNM algorithm accurately extracts explicit TNM staging, but other methods are needed for retrieving implicit stages. The CRC algorithm is accurate on non-supplementary reports, but outputs need additional review if higher precision is required.
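To make the distinction between explicit and implicit staging concrete, the following is a minimal sketch of regex-based extraction of explicitly written TNM stages. It is not the authors' algorithm: the pattern, the accepted prefixes and subcategories, and the function name `extract_explicit_tnm` are all assumptions chosen for illustration.

```python
import re
from typing import Optional

# Hypothetical pattern (an assumption, not the published algorithm).
# Optional prefix such as "p", "c", "yp" or "yc" before each category letter.
_PREFIX = r"(?:y?[cp])?"

TNM_RE = re.compile(
    rf"\b{_PREFIX}T(?P<T>is|[0-4][a-d]?|[xX])"          # T category, e.g. T3, T4a, Tis, Tx
    rf"(?:[\s,/]*{_PREFIX}N(?P<N>[0-3][a-c]?|[xX]))?"   # optional N category, e.g. N1b
    rf"(?:[\s,/]*{_PREFIX}M(?P<M>[01][a-c]?|[xX]))?"    # optional M category, e.g. M0
    r"(?![A-Za-z0-9])"                                  # do not stop in the middle of a token
)


def extract_explicit_tnm(report: str) -> list[dict[str, Optional[str]]]:
    """Return one dict per explicitly written stage found in the report text.

    Categories that are not written next to the T category are returned as None.
    """
    return [match.groupdict() for match in TNM_RE.finditer(report)]


if __name__ == "__main__":
    print(extract_explicit_tnm("Staging: pT3 N1b M0."))
    print(extract_explicit_tnm("ypT2N0"))
    # Implicit staging: the stage is described in words, so nothing is matched.
    print(extract_explicit_tnm("Tumour invades through the muscularis propria."))
```

The last call returns an empty list because the report describes the depth of invasion without writing a stage, which is the kind of implicit staging the conclusion says needs other methods. A pattern this simple would also match unrelated tokens such as "T2" in "T2-weighted", so a practical extractor would need additional context rules.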