Natural Language Clinical Pathways for Automated Coding of Topography and Morphology in Cancer Registries: Leveraging Healthcare Dataflows through the LN-PDTA Algorithm.
Impact Factor 1.5; Zone 4 (Medicine); JCR Q4, Public, Environmental & Occupational Health
Adele Zanfino, Carlotta Buzzoni, Antonio Giampiero Russo
Epidemiologia & Prevenzione, vol. 50, no. 2 (in press). Publication date: 2026-03-01. Journal Article. DOI: 10.19191/EP26.2.A955.030 (https://doi.org/10.19191/EP26.2.A955.030)
Abstract
Background: Cancer registries (CRs) are essential tools for oncological surveillance. Accurate coding of topography and morphology with ICD-O-3, traditionally performed manually, is complex and time-consuming. Artificial Intelligence (AI) offers new opportunities to automate this process and to overcome the limitations of existing algorithms, which often focus only on topography.
Objectives: To develop an AI-based algorithm that automatically assigns the combined topography-morphology (topo-morpho) code from a synthetic clinical pathway expressed in natural language (LN-PDTA).
Design: Retrospective observational study based on integrated registry and healthcare administrative data. Deterministic record linkage was performed among the CR, administrative databases, and pathology reports (AP), considering clinical events within ±180 days of the incidence date. Clinical information (diagnoses, pharmacological therapies, surgical procedures, causes of death, pathology codes) was transformed into chronological clinical tokens, which were concatenated into a single string per case. The target variable was the combined topo-morpho code assigned by registry coders. An LSTM neural network (embedding=64, hidden=128) was trained on the token sequences.
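The token-building step described in the Design can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `SOURCE_CODE` token layout, the source labels (`DX`, `SURG`, `AP`, `ATC`, `DEATH`), the function name `build_pathway`, and the example codes are all assumptions made for this sketch. Only the ±180-day window and the chronological concatenation come from the abstract.

```python
from datetime import date, timedelta

WINDOW = timedelta(days=180)  # ±180 days around the incidence date (from the abstract)


def build_pathway(incidence_date, events):
    """Return one chronological token string for a single case.

    `events` is an iterable of (event_date, source, code) tuples.
    The SOURCE_CODE token layout is a hypothetical choice for this sketch.
    """
    in_window = [
        (event_date, source, code)
        for event_date, source, code in events
        if abs(event_date - incidence_date) <= WINDOW
    ]
    # Sort by date only; Python's sort is stable, so same-day events keep input order.
    in_window.sort(key=lambda event: event[0])
    return " ".join(f"{source}_{code}" for _, source, code in in_window)


# Illustrative events for one hypothetical case (values are placeholders).
events = [
    (date(2017, 5, 2), "SURG", "8521"),
    (date(2017, 3, 10), "DX", "1749"),
    (date(2017, 4, 1), "AP", "T04000M85003"),
    (date(2018, 6, 30), "DEATH", "C509"),  # more than 180 days away, so dropped
    (date(2017, 6, 15), "ATC", "L02BA01"),
]
pathway = build_pathway(date(2017, 4, 1), events)
print(pathway)
```

A string of this kind would then be split back into tokens and mapped to integer ids before being fed to a sequence model such as the LSTM mentioned in the Design.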
Setting and participants: Incident cancer cases recorded by the Cancer Registry of the Agency for Health Protection of the Metropolitan Area of Milan in 2017-2018; multiple, benign, and uncertain tumors were excluded.
Main outcome measures: Accuracy in predicting topography, morphology, and the combined topo-morpho code; precision, recall, and F1 score at different confidence thresholds. Secondary analyses covered high-incidence cancer sites and identified the most predictive tokens and information sources.
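One way to read "precision, recall, and F1 score at different confidence thresholds" is sketched below. The abstract does not define these measures, so the definitions here are assumptions: a case is auto-coded only when the model's top confidence reaches the threshold, precision is the share of auto-coded cases that are correct, and recall is the share of all cases that are auto-coded correctly. The function name and the sample data are hypothetical.

```python
def metrics_at_threshold(predictions, threshold):
    """Compute (precision, recall, f1) for one confidence threshold.

    `predictions` is a list of (confidence, is_correct) pairs, one per case.
    Definitions are an assumed reading of the abstract, not the authors' own.
    """
    accepted = [is_correct for confidence, is_correct in predictions
                if confidence >= threshold]
    true_positives = sum(accepted)  # True counts as 1
    precision = true_positives / len(accepted) if accepted else 0.0
    recall = true_positives / len(predictions) if predictions else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Made-up predictions: (top confidence, whether the predicted code was correct).
preds = [(0.95, True), (0.90, True), (0.80, False), (0.60, True), (0.40, False)]
for threshold in (0.5, 0.85):
    p, r, f = metrics_at_threshold(preds, threshold)
    print(f"threshold={threshold:.2f} precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

Under this reading, raising the threshold trades coverage for reliability: fewer cases are coded automatically, but a larger share of them is coded correctly.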
Results: The dataset included 34,168 cases, split 80:20 into training and test sets. On the test set, the model achieved 89% accuracy for topography prediction, 59% for morphology, and 56% for the combined topo-morpho classification. Performance was better for high-frequency sites (breast 73%; colorectal 61%). For lung and prostate cancers, topography accuracy reached 94% and 98%, respectively. The most predictive tokens and information sources were pathology reports, mortality data, and surgical procedures for topography, and pathology reports, hospital discharge diagnoses, and mortality data for morphology.
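The three accuracy figures are related: a combined topo-morpho code is correct only when both of its parts are correct, so combined accuracy cannot exceed either component accuracy (56% is at most 59%, which is at most 89%). The sketch below shows how the three values could be derived from predicted and reference codes. The `TOPO|MORPHO` label layout, the function names, and the four example cases are assumptions for illustration, not data from the study.

```python
def split_code(code):
    """Split a hypothetical 'TOPO|MORPHO' label into its two parts."""
    topo, morpho = code.split("|")
    return topo, morpho


def accuracies(true_codes, predicted_codes):
    """Return (topography, morphology, combined) accuracy as fractions."""
    n = len(true_codes)
    topo_hits = morpho_hits = combined_hits = 0
    for truth, pred in zip(true_codes, predicted_codes):
        true_topo, true_morpho = split_code(truth)
        pred_topo, pred_morpho = split_code(pred)
        topo_hits += true_topo == pred_topo
        morpho_hits += true_morpho == pred_morpho
        combined_hits += truth == pred  # both parts must match
    return topo_hits / n, morpho_hits / n, combined_hits / n


# Four made-up cases: one morphology error and one topography error.
truth = ["C50.9|8500/3", "C61.9|8140/3", "C34.1|8070/3", "C18.7|8140/3"]
pred = ["C50.9|8500/3", "C61.9|8140/3", "C34.1|8140/3", "C18.2|8140/3"]
topo_acc, morpho_acc, combined_acc = accuracies(truth, pred)
print(topo_acc, morpho_acc, combined_acc)
```

In this toy example each component is right in three of four cases, but the errors fall on different cases, so only two of four combined codes are fully correct.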
Conclusions: The LN-PDTA-based neural network approach shows promising results for the most frequent topographies and morphologies, enabling automatic coding of a fair number of cases, reducing manual coding time, and supporting more efficient cancer registry operations.
About the journal:
Epidemiologia & Prevenzione, now the journal of the Italian Association of Epidemiology (Associazione italiana di epidemiologia), collects a large share of the best original Italian work in epidemiological research and in the study of interventions for prevention and public health.
The journal, indexed in Medline and assigned an Impact Factor, is also an important channel for bringing to an international audience contributions that would otherwise circulate only in Italy.
Over recent decades E&P has served as a point of reference for public health, and also for citizens and their various forms of association. The principle that inspired it was, and remains, that epidemiology makes sense if it serves prevention and public health, and that prevention has very little chance of succeeding unless it rests on sound scientific foundations and involves all interested parties.
Up-to-date modes of communication, rigorous statistical and epidemiological methodology, valid studies, and sound interpretation of results are the solid foundation on which E&P is built. This is accompanied by a strong ethical responsibility toward public health, whose horizon has now broadened irreversibly and includes, ever more consciously, not only human beings but the entire planet and the changes that humankind brings to the world it lives in.
The ambition is that offering new tools for communication, information, and training, especially through the internet, will make the journal not only a traditional vehicle for scientific content and analysis, but also a powerful tool at the disposal of a community of interests and values that cares about public health.