Raquel Paradinha, Vicente Barros, João Rafael Almeida, José Luís Oliveira
Studies in health technology and informatics, vol. 332, pp. 190-194. Journal Article, published 2025-10-02. DOI: 10.3233/SHTI251524 (https://doi.org/10.3233/SHTI251524)
A Semantic-Driven for Cohort Data Harmonisation into OMOP CDM Schema.
Clinical research often requires integrating data from diverse sources, which differ not only in structure but also in semantics and language. Traditional extract-transform-load (ETL) pipelines struggle to handle semantic variability and lack built-in support for multilingual or ontology-driven harmonisation. This fragmentation limits the interoperability and reuse of clinical datasets in large-scale analyses. In this paper, we propose an integrated framework that combines an embedding-based concept mapping engine with an automated ETL pipeline built on Apache Airflow. The mapping engine uses transformer-based embeddings to align clinical terms with standard concepts, producing outputs in White Rabbit- and Usagi-compatible formats to ensure backward interoperability. We validated the system on multilingual real-world datasets, demonstrating its ability to handle heterogeneous inputs and maintain end-to-end reproducibility.
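The concept-mapping step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not name the transformer model, the vocabulary, or the output columns, so everything below is an assumption made for illustration. In particular, `embed` is a character n-gram stand-in for a transformer encoder (so it will not capture the multilingual behaviour the paper reports), the concept IDs in `CONCEPTS` are placeholders rather than real OMOP identifiers, and the field names returned by `map_term` are hypothetical rather than the actual Usagi export schema.

```python
import math
from collections import Counter

# Hypothetical mini-vocabulary of (concept_id, concept_name) pairs.
# The IDs are illustrative placeholders, not real OMOP concept IDs.
CONCEPTS = [
    (1, "body weight"),
    (2, "body height"),
    (3, "systolic blood pressure"),
    (4, "date of birth"),
]


def embed(text, n=3):
    """Stand-in for a transformer encoder: a bag of character n-grams."""
    padded = f" {text.lower().strip()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def map_term(source_term, concepts=CONCEPTS):
    """Return the best-matching concept and its score for one source term."""
    query = embed(source_term)
    scored = [(cosine(query, embed(name)), cid, name) for cid, name in concepts]
    score, cid, name = max(scored)
    # Hypothetical output fields, loosely modelled on a term-to-concept mapping row.
    return {
        "source_term": source_term,
        "concept_id": cid,
        "concept_name": name,
        "match_score": round(score, 3),
    }


if __name__ == "__main__":
    for term in ["Body Weight (kg)", "blood pressure systolic"]:
        print(map_term(term))
```

Swapping `embed` for a real sentence encoder and `CONCEPTS` for a standard vocabulary would leave the ranking logic unchanged, which is the part of the pipeline this sketch is meant to show. In practice, a score threshold would likely be used to route low-confidence matches to manual review.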