Natural Language Clinical Pathways for Automated Coding of Topography and Morphology in Cancer Registries: Leveraging Healthcare Dataflows through the LN-PDTA Algorithm.

Adele Zanfino, Carlotta Buzzoni, Antonio Giampiero Russo
DOI: 10.19191/EP26.2.A955.030 (https://doi.org/10.19191/EP26.2.A955.030)
Epidemiologia & Prevenzione, vol. 50, no. 2, in press. Publication date: 2026-03-01. Journal Article.

Abstract

Background: cancer registries (CRs) are essential tools for oncological surveillance. The accurate coding of topography and morphology (through ICD-O-3 coding), traditionally performed manually, is complex and time-consuming. Artificial Intelligence (AI) offers new opportunities to automate this process, overcoming the limitations of existing algorithms, which often focus only on topography.

Objectives: to develop an AI-based algorithm capable of automatically assigning the combined topography-morphology (topo-morpho) code from a synthetic clinical pathway expressed in natural language (LN-PDTA).

Design: retrospective observational study based on integrated registry and healthcare administrative data. Deterministic record linkage was performed among the CR, administrative databases, and pathology reports (AP), considering clinical events within ±180 days from incidence date. Clinical information (diagnoses, pharmacological therapies, surgical procedures, causes of death, pathology codes) was transformed into chronological clinical tokens concatenated into a single string. The target variable was the combined topo-morpho code assigned by registry coders. An LSTM neural network (embedding=64, hidden=128) was trained to learn token sequences.
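The tokenization step in the Design can be sketched as follows. This is a minimal illustration, not the authors' implementation: only the ±180-day window and the chronological concatenation into a single string come from the abstract. The `ClinicalEvent` fields, the source labels (`DX`, `SURG`, `AP`), the `SOURCE_CODE` token format, and the example codes are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# From the abstract: events within ±180 days of the incidence date are kept.
WINDOW = timedelta(days=180)


@dataclass(frozen=True)
class ClinicalEvent:
    when: date
    source: str  # hypothetical source label, e.g. "DX", "DRUG", "SURG", "DEATH", "AP"
    code: str    # code as recorded in that source (placeholder values below)


def build_pathway(events, incidence_date):
    """Return a chronological, space-separated token string for one case."""
    in_window = [e for e in events if abs(e.when - incidence_date) <= WINDOW]
    # sorted() is stable, so same-day events keep their input order.
    ordered = sorted(in_window, key=lambda e: e.when)
    return " ".join(f"{e.source}_{e.code}" for e in ordered)


events = [
    ClinicalEvent(date(2017, 6, 1), "SURG", "85.21"),
    ClinicalEvent(date(2017, 5, 10), "AP", "T04000"),
    ClinicalEvent(date(2016, 1, 1), "DX", "174.9"),  # outside the window, dropped
]
pathway = build_pathway(events, incidence_date=date(2017, 5, 20))
print(pathway)  # AP_T04000 SURG_85.21
```

A string of this kind would then be split into tokens and fed to the sequence model; the abstract gives the LSTM's embedding and hidden sizes (64 and 128) but no further architectural detail, so the model itself is not sketched here.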

Setting and participants: incident cancer cases recorded by the Cancer Registry of the Agency for Health Protection of the Metropolitan Area of Milan in 2017-2018; multiple, benign, and uncertain tumors were excluded.

Main outcome measures: accuracy in predicting topography, morphology, and the combined topo-morpho code; precision, recall, and F1 score at different confidence thresholds. Secondary analyses covered high-incidence cancer sites and the identification of the most predictive tokens and information sources.
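One plausible reading of "precision, recall, and F1 score at different confidence thresholds" is sketched below. The abstract does not define these metrics, so the definitions here are assumptions: a case is auto-coded only when the model's confidence meets the threshold, precision is the share of auto-coded cases that are correct, and recall is the share of all cases that are auto-coded correctly. The `topography-morphology` string format and the example values are illustrative only.

```python
def metrics_at_threshold(predictions, threshold):
    """Compute (precision, recall, F1) for one confidence threshold.

    predictions: list of (predicted_code, confidence, true_code) tuples.
    Assumed definitions: cases below the threshold are left for manual coding.
    """
    accepted = [(pred, true) for pred, conf, true in predictions if conf >= threshold]
    correct = sum(1 for pred, true in accepted if pred == true)
    precision = correct / len(accepted) if accepted else 0.0
    recall = correct / len(predictions) if predictions else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Illustrative predictions, not data from the study.
preds = [
    ("C50.9-8500/3", 0.95, "C50.9-8500/3"),
    ("C18.7-8140/3", 0.80, "C18.7-8140/3"),
    ("C34.1-8070/3", 0.60, "C34.1-8140/3"),  # confident but wrong
    ("C61.9-8140/3", 0.40, "C61.9-8140/3"),  # right but low confidence
]

for threshold in (0.5, 0.9):
    p, r, f1 = metrics_at_threshold(preds, threshold)
    print(f"threshold={threshold}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Under these definitions, raising the threshold trades recall for precision: fewer cases are coded automatically, but a larger share of them is correct.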

Results: the dataset included 34,168 cases, split 80:20 into training and test sets. On the test set, the model achieved 89% accuracy for topography, 59% for morphology, and 56% for the combined topo-morpho classification. Performance was better for the most frequent sites (breast 73%; colorectal 61%). For lung and prostate cancers, topography accuracy reached 94% and 98%, respectively. The most predictive tokens and information sources were identified: pathology reports, mortality data, and surgical procedures for topography; pathology reports, hospital discharge diagnoses, and mortality data for morphology.
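The three accuracy figures can be related with a short sketch. It assumes, as the abstract suggests but does not state, that topography and morphology accuracy are obtained by splitting the predicted combined code into its two parts. The `topography|morphology` string format and the example pairs are hypothetical.

```python
def split_code(combined):
    """Split an assumed 'topography|morphology' string into its two parts."""
    topo, morpho = combined.split("|")
    return topo, morpho


def component_accuracy(pairs):
    """pairs: non-empty list of (predicted_combined, true_combined) strings."""
    topo_ok = morpho_ok = both_ok = 0
    for predicted, actual in pairs:
        pred_topo, pred_morpho = split_code(predicted)
        true_topo, true_morpho = split_code(actual)
        topo_ok += int(pred_topo == true_topo)
        morpho_ok += int(pred_morpho == true_morpho)
        both_ok += int(predicted == actual)
    n = len(pairs)
    return {
        "topography": topo_ok / n,
        "morphology": morpho_ok / n,
        "combined": both_ok / n,
    }


# Illustrative pairs, not data from the study.
pairs = [
    ("C50.9|8500/3", "C50.9|8500/3"),  # both parts correct
    ("C50.9|8520/3", "C50.9|8500/3"),  # topography only
    ("C34.1|8140/3", "C34.9|8140/3"),  # morphology only
    ("C61.9|8140/3", "C61.9|8140/3"),  # both parts correct
]
print(component_accuracy(pairs))
```

Because a combined code is correct only when both parts are, combined accuracy computed this way can never exceed either component, which matches the ordering of the reported figures (56% for combined, 59% for morphology, 89% for topography).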

Conclusions: the LN-PDTA-based neural network approach shows promising results for the most frequent topographies and morphologies, thus enabling automatic coding of a fair number of cases, reducing manual coding time and supporting more efficient cancer registry operations.
