Pretrained Language Models for Semantics-Aware Data Harmonisation of Observational Clinical Studies in the Era of Big Data

medRxiv - Health Informatics | Pub Date: 2024-07-12 | DOI: 10.1101/2024.07.12.24310136

Jakub Jan Dylag, Zlatko Zlatev, Michael Boniface



Abstract

In clinical research, there is a strong drive to leverage big data from population cohort studies and routine electronic healthcare records to design new interventions, improve health outcomes and increase the efficiency of healthcare delivery. Yet realising this potential requires substantial effort in harmonising source datasets and curating study data, work that currently relies on costly, time-consuming and labour-intensive manual methods. We evaluate the applicability of AI methods for natural language processing (NLP) and unsupervised machine learning (ML) to the challenges of big data semantic harmonisation and curation. Our aim is to establish an efficient and robust technological foundation for the development of automated tools supporting data curation of large clinical datasets. We assess NLP and unsupervised ML algorithms and propose two pipelines for automated semantic harmonisation: a pipeline for semantics-aware search for domain-relevant variables and a pipeline for clustering semantically similar variables. We evaluate pipeline performance using 94,037 textual variable descriptions from the English Longitudinal Study of Ageing (ELSA) database. We observe high accuracy of our Semantic Search pipeline, with an AUC of 0.899 (SD=0.056). Our Semantic Clustering pipeline achieves a V-measure of 0.237 (SD=0.157), which is on par with leading implementations in other relevant domains. Automation can significantly accelerate the process of dataset harmonisation: manual labelling was performed at a speed of 2.1 descriptions per minute, whereas our automated labelling reached 245 descriptions per minute. Our findings underscore the potential of AI technologies, such as NLP and unsupervised ML, in automating the harmonisation and curation of big data for clinical research. By establishing a robust technological foundation, we pave the way for the development of automated tools that streamline the process, enabling health data scientists to leverage big data more efficiently and effectively in their studies and accelerating insights from data for clinical benefit.
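
The abstract reports clustering quality as a V-measure but does not say how it was computed. As a rough illustration only, the sketch below implements the standard definition of V-measure (the weighted harmonic mean of homogeneity and completeness) in plain Python. The function name `v_measure`, the `beta` parameter and the toy domain labels are assumptions made for this example, not details taken from the study.

```python
from collections import Counter
from math import log


def v_measure(labels_true, labels_pred, beta=1.0):
    """Return (homogeneity, completeness, V-measure) for two labelings.

    labels_true: reference domain label per variable description (hypothetical).
    labels_pred: cluster id assigned to each description by some clustering step.
    """
    n = len(labels_true)
    if n == 0 or n != len(labels_pred):
        raise ValueError("labelings must be non-empty and of equal length")

    class_counts = Counter(labels_true)
    cluster_counts = Counter(labels_pred)
    joint_counts = Counter(zip(labels_true, labels_pred))

    def entropy(counts):
        # Shannon entropy of the empirical distribution given by the counts.
        return -sum((c / n) * log(c / n) for c in counts.values())

    h_c = entropy(class_counts)    # H(C): entropy of the reference classes
    h_k = entropy(cluster_counts)  # H(K): entropy of the predicted clusters

    # Conditional entropies H(C|K) and H(K|C) from the contingency counts.
    h_c_given_k = -sum(
        (n_ck / n) * log(n_ck / cluster_counts[k])
        for (c, k), n_ck in joint_counts.items()
    )
    h_k_given_c = -sum(
        (n_ck / n) * log(n_ck / class_counts[c])
        for (c, k), n_ck in joint_counts.items()
    )

    # Homogeneity: each cluster contains members of a single class.
    homogeneity = 1.0 if h_c == 0 else max(0.0, 1.0 - h_c_given_k / h_c)
    # Completeness: all members of a class land in the same cluster.
    completeness = 1.0 if h_k == 0 else max(0.0, 1.0 - h_k_given_c / h_k)

    if homogeneity + completeness == 0:
        return homogeneity, completeness, 0.0
    v = (1 + beta) * homogeneity * completeness / (beta * homogeneity + completeness)
    return homogeneity, completeness, v


if __name__ == "__main__":
    # Toy example: four variable descriptions, two made-up domains.
    reference = ["sleep", "sleep", "income", "income"]
    clusters = [0, 0, 1, 2]
    print(v_measure(reference, clusters))
```

A score of 1 means the clusters reproduce the reference labels exactly, up to renaming of cluster ids. Putting every description in its own cluster keeps homogeneity at 1 but lowers completeness, which is one reason the two components are combined into a single score.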

