Data-centric artificial intelligence and cancer research: construction of a real-world head and neck treatment data repository

V. Butterworth, T. Young, H. Drake, I. Palmer, T. Avgoulea, E. Ivy, J. Andriolo, C. Creppy, C. Routledge, D. Adjogatse, A. Kong, I. Petkar, M. Reis Ferreira, D. Eaton, M. Lei, S. Misson, D. Vilic, T. Guerrero Urbano
ESMO Real World Data and Digital Oncology, volume 9, Article 100162. Published 2025-07-17.
DOI: 10.1016/j.esmorw.2025.100162
https://www.sciencedirect.com/science/article/pii/S2949820125000517

Abstract

Background and purpose

The performance and generalisability of machine learning (ML) models rely on high-quality data. Collecting such data for research, both retrospectively and prospectively, while respecting data protection and patient privacy remains a challenge in the clinical environment. Currently, months of laborious extraction and clinical annotation are often necessary before data analysis can begin. We present a novel institutional federated data lake, built on open-source software, to facilitate efficient production of ML models from head and neck cancer (HNC) imaging and radiotherapy (RT) data. This structured pipeline dramatically reduces the time needed to produce ML models and generate real-world evidence. This paper describes our governance-compliant processes and provides a framework for establishing similar databases.

Materials and methods

The Extensible Neuroimaging Archive Toolkit (XNAT) is a powerful open-source imaging platform. Within our department, it forms part of the local secure enclave used for federated learning in artificial intelligence projects, and it provides import, archiving, processing, search and secure distribution facilities for imaging and RT data.
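
As an illustration of the search facility mentioned above, the following minimal sketch builds an XNAT REST request that lists the subjects of one project and reads subject labels out of the JSON listing. It is not the authors' pipeline: the host name, the project ID and the sample payload are hypothetical, and the HTTP call itself (with whatever authentication the local enclave requires) is deliberately left out.

```python
from urllib.parse import quote, urlencode


def subjects_url(base_url: str, project_id: str) -> str:
    """Build the XNAT REST URL that lists the subjects of one project."""
    path = f"/data/projects/{quote(project_id, safe='')}/subjects"
    return f"{base_url.rstrip('/')}{path}?{urlencode({'format': 'json'})}"


def subject_labels(payload: dict) -> list[str]:
    """Extract sorted subject labels from an XNAT JSON listing."""
    # XNAT listings wrap their rows as {"ResultSet": {"Result": [...]}}.
    rows = payload.get("ResultSet", {}).get("Result", [])
    return sorted(row["label"] for row in rows if "label" in row)


# Hypothetical host and project ID, for illustration only.
url = subjects_url("https://xnat.example.org/", "HNC_DEMO")

# Hand-written sample in the listing shape; not real patient data.
sample = {
    "ResultSet": {
        "Result": [
            {"ID": "XNAT_S00002", "label": "HNC-0002"},
            {"ID": "XNAT_S00001", "label": "HNC-0001"},
        ]
    }
}
labels = subject_labels(sample)
```

Keeping URL construction and response parsing separate from the network call means both can be checked offline, inside or outside the secure enclave.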

Results

We have created a clinically annotated, carefully curated data lake covering 2895 consenting HNC patients and containing 22 170 relevant diagnostic, staging, treatment and monitoring imaging sets. Key recommendations for replication include infrastructure planning, robust patient and data selection criteria, and prioritising patient consent and privacy.
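
The recommendation to combine explicit selection criteria with patient consent can be sketched as a simple cohort filter. This is a toy example under stated assumptions: the record fields, the diagnosis code and the inclusion rule are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PatientRecord:
    pseudonym: str      # research identifier, never a hospital number
    consented: bool     # patient agreed to research use of their data
    diagnosis: str      # hypothetical coded diagnosis group
    imaging_sets: int   # number of imaging sets available


def eligible(record: PatientRecord) -> bool:
    """Hypothetical inclusion rule: consent, HNC diagnosis, some imaging."""
    return (
        record.consented
        and record.diagnosis == "HNC"
        and record.imaging_sets > 0
    )


def select_cohort(records):
    """Return the eligible records and the imaging sets they contribute."""
    cohort = [r for r in records if eligible(r)]
    return cohort, sum(r.imaging_sets for r in cohort)


# Invented records for illustration; not real patient data.
records = [
    PatientRecord("P001", True, "HNC", 8),
    PatientRecord("P002", False, "HNC", 5),   # excluded: no consent
    PatientRecord("P003", True, "LUNG", 4),   # excluded: wrong diagnosis
    PatientRecord("P004", True, "HNC", 0),    # excluded: no imaging
    PatientRecord("P005", True, "HNC", 7),
]
cohort, total_sets = select_cohort(records)
```

Placing consent inside the inclusion rule itself, rather than checking it later, means that non-consenting patients never enter the curated set.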

Conclusions

This secure and extensible HNC imaging and RT database promises to be an exceedingly useful research tool, sharply reducing the time and cost of producing ML models and making the process safer, faster and more efficient.