Pretrained Language Models for Semantics-Aware Data Harmonisation of Observational Clinical Studies in the Era of Big Data

Jakub Jan Dylag, Zlatko Zlatev, Michael Boniface
{"title":"Pretrained Language Models for Semantics-Aware Data Harmonisation of Observational Clinical Studies in the Era of Big Data","authors":"Jakub Jan Dylag, Zlatko Zlatev, Michael Boniface","doi":"10.1101/2024.07.12.24310136","DOIUrl":null,"url":null,"abstract":"In clinical research, there is a strong drive to leverage big data from population cohort studies and routine electronic healthcare records to design new interventions, improve health outcomes and increase efficiency of healthcare delivery. Yet, realising these potential demands requires substantial efforts in harmonising source datasets and curating study data, which currently relies on costly and time-consuming manual and labour-intensive methods. We evaluate the applicability of AI methods for natural language processing (NLP) and unsupervised machine learning (ML) to the challenges of big data semantic harmonisation and curation. Our aim is to establish an efficient and robust technological foundation for the development of automated tools supporting data curation of large clinical datasets. We assess NLP and unsupervised ML algorithms and propose two pipelines for automated semantic harmonisation: a pipeline for semantics-aware search for domain relevant variables and a pipeline for clustering of semantically similar variables. We evaluate pipeline performance using 94,037 textual variable descriptions from the English Longitudinal Study of Ageing (ELSA) database. We observe high accuracy of our Semantic Search pipeline with an AUC of 0.899 (SD=0.056). Our Semantic Clustering pipeline achieves a V-measure of 0.237 (SD=0.157), which is on par with leading implementations in other relevant domains. Automation can significantly accelerate the process of dataset harmonization. Manual labelling was performed at a speed of 2.1 descriptions per minute, with our automated labelling increasing speed to 245 descriptions per minute. Our study findings underscore the potential of AI technologies, such as NLP and unsupervised ML, in automating the harmonisation and curation of big data for clinical research. By establishing a robust technological foundation, we pave the way for the development of automated tools that streamline the process, enabling health data scientists to leverage big data more efficiently and effectively in their studies, accelerating insights from data for clinical benefit.","PeriodicalId":501454,"journal":{"name":"medRxiv - Health Informatics","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-07-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"medRxiv - Health Informatics","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1101/2024.07.12.24310136","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

In clinical research, there is a strong drive to leverage big data from population cohort studies and routine electronic healthcare records to design new interventions, improve health outcomes and increase efficiency of healthcare delivery. Yet, realising these potential demands requires substantial efforts in harmonising source datasets and curating study data, which currently relies on costly and time-consuming manual and labour-intensive methods. We evaluate the applicability of AI methods for natural language processing (NLP) and unsupervised machine learning (ML) to the challenges of big data semantic harmonisation and curation. Our aim is to establish an efficient and robust technological foundation for the development of automated tools supporting data curation of large clinical datasets. We assess NLP and unsupervised ML algorithms and propose two pipelines for automated semantic harmonisation: a pipeline for semantics-aware search for domain relevant variables and a pipeline for clustering of semantically similar variables. We evaluate pipeline performance using 94,037 textual variable descriptions from the English Longitudinal Study of Ageing (ELSA) database. We observe high accuracy of our Semantic Search pipeline with an AUC of 0.899 (SD=0.056). Our Semantic Clustering pipeline achieves a V-measure of 0.237 (SD=0.157), which is on par with leading implementations in other relevant domains. Automation can significantly accelerate the process of dataset harmonization. Manual labelling was performed at a speed of 2.1 descriptions per minute, with our automated labelling increasing speed to 245 descriptions per minute. Our study findings underscore the potential of AI technologies, such as NLP and unsupervised ML, in automating the harmonisation and curation of big data for clinical research. By establishing a robust technological foundation, we pave the way for the development of automated tools that streamline the process, enabling health data scientists to leverage big data more efficiently and effectively in their studies, accelerating insights from data for clinical benefit.
用于大数据时代临床观察研究语义感知数据协调的预训练语言模型
在临床研究中,人们强烈希望利用来自人群队列研究和常规电子医疗记录的大数据来设计新的干预措施、改善健康结果并提高医疗服务效率。然而,要实现这些潜在需求,需要在协调源数据集和整理研究数据方面付出巨大努力,而这目前依赖于成本高、耗时长的人工和劳动密集型方法。我们评估了自然语言处理(NLP)和无监督机器学习(ML)的人工智能方法在应对大数据语义协调和整理挑战方面的适用性。我们的目标是为开发支持大型临床数据集数据整理的自动化工具奠定高效稳健的技术基础。我们对 NLP 算法和无监督 ML 算法进行了评估,并提出了两个用于自动语义协调的管道:一个用于对领域相关变量进行语义感知搜索的管道和一个用于对语义相似变量进行聚类的管道。我们使用英语老龄化纵向研究(ELSA)数据库中的 94,037 个文本变量描述来评估管道性能。我们发现语义搜索管道的准确度很高,AUC为0.899(SD=0.056)。我们的语义聚类管道实现了0.237(SD=0.157)的V-measure,与其他相关领域的领先实现相当。自动化可以大大加快数据集协调过程。人工标注的速度为每分钟 2.1 条描述,而我们的自动标注速度提高到了每分钟 245 条描述。我们的研究结果凸显了 NLP 和无监督 ML 等人工智能技术在临床研究大数据自动协调和整理方面的潜力。通过建立强大的技术基础,我们为开发简化流程的自动化工具铺平了道路,使健康数据科学家能够在研究中更高效、更有效地利用大数据,加快从数据中获得临床益处的洞察力。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信