V. Butterworth, T. Young, H. Drake, I. Palmer, T. Avgoulea, E. Ivy, J. Andriolo, C. Creppy, C. Routledge, D. Adjogatse, A. Kong, I. Petkar, M. Reis Ferreira, D. Eaton, M. Lei, S. Misson, D. Vilic, T. Guerrero Urbano
ESMO Real World Data and Digital Oncology, volume 9, Article 100162. Published 2025-07-17. DOI: 10.1016/j.esmorw.2025.100162. https://www.sciencedirect.com/science/article/pii/S2949820125000517
Data-centric artificial intelligence and cancer research: construction of a real-world head and neck treatment data repository
Background and purpose
The performance and generalisability of machine learning (ML) models rely on high-quality data. Collecting such data for research, retrospectively or prospectively, while respecting data protection and patient privacy remains a challenge in the clinical environment. Currently, months of laborious extraction and clinical annotation are often necessary before data analysis can begin. We present a novel institutional federated data lake, built on open-source software, to facilitate efficient production of ML models from head and neck cancer (HNC) imaging and radiotherapy (RT) data. This structured pipeline dramatically reduces the time needed to produce ML models and generate real-world evidence. This paper describes our governance-compliant processes and provides a framework for establishing similar databases.
Materials and methods
The Extensible Neuroimaging Archive Toolkit (XNAT) is a powerful open-source imaging platform. Within our department, it forms part of the local secure enclave used for federated learning in artificial intelligence projects, and it provides import, archiving, processing, search and secure distribution facilities for imaging and RT data.
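As an illustration of the search facility mentioned above, the following is a minimal sketch of how a client might list the subjects in one XNAT project through XNAT's REST interface. The server address, project identifier and sample response are hypothetical and are not taken from the paper; the sketch only builds the request URL and parses a response in the ResultSet/Result shape that XNAT listings return, and it leaves out authentication and the network call entirely.

```python
import json
from urllib.parse import quote, urlencode, urljoin


def build_subjects_url(base_url: str, project_id: str) -> str:
    """Build the XNAT REST URL that lists the subjects of one project as JSON."""
    path = f"data/projects/{quote(project_id, safe='')}/subjects"
    query = urlencode({"format": "json"})
    return urljoin(base_url.rstrip("/") + "/", path) + "?" + query


def parse_result_set(payload: str) -> list[dict]:
    """Return the rows of an XNAT JSON listing (ResultSet -> Result)."""
    return json.loads(payload)["ResultSet"]["Result"]


# Hypothetical server and project names, for illustration only.
url = build_subjects_url("https://xnat.example.org", "HNC_RT")

# Hand-written sample in the shape of an XNAT listing; not real patient data.
sample = '{"ResultSet": {"Result": [{"ID": "XNAT_S00001", "label": "anon-0001"}]}}'
labels = [row["label"] for row in parse_result_set(sample)]
```

In a real deployment the URL would be fetched over an authenticated session inside the secure enclave, and only pseudonymised labels would leave the archive.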
Results
We have created a clinically annotated, carefully curated data lake of 2895 consenting HNC patients containing 22 170 relevant diagnostic, staging, treatment and monitoring imaging sets. Key recommendations for replication include infrastructure planning, robust patient and data selection criteria, and prioritising patient consent and privacy.
Conclusions
This secure and extensible HNC imaging and RT database promises to be a highly useful research tool, reducing the time and cost of producing ML models and making the process safer, faster and more efficient.